Link Harvesting on the Dark Web

Title: Link Harvesting on the Dark Web
Publication Type: Conference Paper
Year of Publication: 2021
Authors: Dalvi, Ashwini; Siddavatam, Irfan; Thakkar, Viraj; Jain, Apoorva; Kazi, Faruk; Bhirud, Sunil
Conference Name: 2021 IEEE Bombay Section Signature Conference (IBSSC)
Keywords: Crawlers, dark web, Data collection, Human Behavior, Hyperlink Extraction, IEEE Sections, Information age, Link Harvesting, pubcrawl, Text recognition, Uniform resource locators, Web pages, Web scraping
Abstract: In this information age, web crawling on the internet is a prime source for data collection. And with the surface web already being dominated by giants like Google and Microsoft, much attention has been on the Dark Web. While research on crawling approaches is generally available, a considerable gap is present for URL extraction on the dark web. With most literature using the regular expressions methodology or built-in parsers, the problem with these methods is the higher number of false positives generated with the Dark Web, which makes the crawler less efficient. This paper proposes the dedicated parsers methodology for extracting URLs from the dark web, which when compared proves to be better than the regular expression methodology. Factors that make link harvesting on the Dark Web a challenge are discussed in the paper.
DOI: 10.1109/IBSSC53889.2021.9673428
Citation Key: dalvi_link_2021
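
Illustrative note: the abstract contrasts regular-expression URL extraction with parser-based extraction. The Python sketch below shows that general contrast only; it is not the paper's implementation, and the record does not describe the authors' "dedicated parsers" in enough detail to reproduce them. The names `regex_links`, `parsed_onion_links`, `HrefCollector`, the loose pattern, and the sample page are all assumptions made for illustration, and the onion hostname is a placeholder rather than a real service.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Loose pattern of the kind used for quick harvesting (an assumption, not the
# paper's pattern): anything in the raw page text that looks like a URL.
LOOSE_URL = re.compile(r"https?://[^\s\"'<>]+")

# v3 onion hostnames are 56 base32 characters (a-z, 2-7) followed by ".onion",
# optionally preceded by subdomain labels.
ONION_HOST = re.compile(r"^(?:[a-z0-9-]+\.)*[a-z2-7]{56}\.onion$")


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def regex_links(html):
    """Regex approach: match URL-shaped strings anywhere in the raw text."""
    return LOOSE_URL.findall(html)


def parsed_onion_links(html, base_url):
    """Parser approach: keep only anchor targets with a well-formed onion host."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    kept = []
    for link in collector.links:
        host = urlsplit(link).hostname or ""
        if ONION_HOST.match(host):
            kept.append(link)
    return kept


# Placeholder hostname (hypothetical): syntactically valid, not a real service.
HOST = "a" * 56 + ".onion"
BASE = f"http://{HOST}/index.html"
SAMPLE = f"""
<html><body>
<a href="/market">Market</a>
<a href="http://{HOST}/forum">Forum</a>
<p>Mirror (down): http://short.onion/old</p>
<!-- http://example.com/tracker -->
</body></html>
"""

if __name__ == "__main__":
    print("regex :", regex_links(SAMPLE))
    print("parser:", parsed_onion_links(SAMPLE, BASE))
```

On this sample the loose regex returns three matches, including a malformed onion address in body text and a surface-web URL inside an HTML comment, and it misses the relative link `/market`. The parser-based function returns the two anchor targets on the placeholder host. This only illustrates the kind of false positive the abstract mentions; it says nothing about the accuracy figures reported in the paper.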