Title | Link Harvesting on the Dark Web |
Publication Type | Conference Paper |
Year of Publication | 2021 |
Authors | Dalvi, Ashwini; Siddavatam, Irfan; Thakkar, Viraj; Jain, Apoorva; Kazi, Faruk; Bhirud, Sunil |
Conference Name | 2021 IEEE Bombay Section Signature Conference (IBSSC) |
Keywords | Crawlers, dark web, Data collection, Human Behavior, Hyperlink Extraction, IEEE Sections, Information age, Link Harvesting, pubcrawl, Text recognition, Uniform resource locators, Web pages, Web scraping |
Abstract | In this information age, web crawling is a prime source of data collection. With the surface web already dominated by giants such as Google and Microsoft, much attention has turned to the Dark Web. While research on crawling approaches is generally available, a considerable gap remains for URL extraction on the Dark Web. Most of the literature relies on regular expressions or built-in parsers, but these methods generate a higher number of false positives on the Dark Web, which makes the crawler less efficient. This paper proposes a dedicated-parser methodology for extracting URLs from the Dark Web and shows by comparison that it performs better than the regular-expression methodology. The paper also discusses the factors that make link harvesting on the Dark Web a challenge. |
DOI | 10.1109/IBSSC53889.2021.9673428 |
Citation Key | dalvi_link_2021 |
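
The abstract's point about false positives can be illustrated with a minimal sketch. It is not the paper's dedicated parser, which this record does not describe. It assumes that only `href` values of `<a>` tags count as harvestable links, and it uses Python's standard-library `HTMLParser` purely as a stand-in for a structure-aware extractor. The sample page, the placeholder onion hostnames, and the names `extract_with_regex`, `extract_with_parser`, and `AnchorCollector` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; the 56-character onion hostnames are placeholders.
ONION_A = "a" * 56 + ".onion"
ONION_B = "b" * 56 + ".onion"
SAMPLE_HTML = f"""
<html><body>
<a href="http://{ONION_A}/market">Market</a>
<p>Mirror (plain text, not a link): http://{ONION_B}/old</p>
<!-- <a href="http://{ONION_B}/removed">dead link</a> -->
</body></html>
"""

# Naive pattern (an assumption for illustration): any http(s) URL on a .onion host.
ONION_URL_RE = re.compile(r"https?://[a-z2-7]{16,56}\.onion[^\s\"'<>]*")


def extract_with_regex(html):
    """Scan the raw markup, so text and HTML comments are matched too."""
    return sorted(set(ONION_URL_RE.findall(html)))


class AnchorCollector(HTMLParser):
    """Collect .onion href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and ".onion" in value:
                self.links.append(value)


def extract_with_parser(html):
    """Parse the document structure; commented-out markup is never visited."""
    collector = AnchorCollector()
    collector.feed(html)
    return sorted(set(collector.links))


if __name__ == "__main__":
    print("regex :", extract_with_regex(SAMPLE_HTML))
    print("parser:", extract_with_parser(SAMPLE_HTML))
```

On this sample, the regular expression returns three candidates, including the plain-text mirror and the commented-out link, while the anchor-based extractor returns only the live hyperlink. Whether the extra candidates count as false positives depends on the crawler's goals, which is an assumption of this sketch rather than a result from the paper.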