Filters: Keyword is Web crawler

Bolbol, Noor, Barhoom, Tawfiq.  2021.  Mitigating Web Scrapers using Markup Randomization. 2021 Palestinian International Conference on Information and Communication Technology (PICICT). :157—162.

Web scraping is the technique of extracting desired data in an automated way by scanning the internal links and content of a website, an activity usually performed by systematically programmed bots. This paper explains our proposed solution for protecting blog content from theft and from being copied to other destinations by mitigating scraping bots. To achieve this we applied two steps at two levels: first, at the main blog page level, we mitigate the work of crawler bots by adding extra empty article anchors among the real articles; second, at the article page level, we add a random number of empty, hidden spans containing randomly generated text within the article's body. To assess this solution we applied it to a local project developed in PHP with the Laravel framework and defined four criteria to measure its effectiveness. The results show that file size is essentially unchanged before and after applying the solution, and that processing time increases by only a few milliseconds, which remains within the acceptable range. Using an HTML-similarity tool we obtained very good results showing that the style is preserved, with only slight changes to the structure. Finally, to assess the effect on bots, a scraper bot was reused and received the expected results from the programmed middleware. These results show that the solution is feasible to adopt and use to protect blog content.
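
The abstract gives no implementation details beyond the two randomization steps, so the following is a minimal Python sketch of the article-level step: hidden spans of randomly generated text interleaved with the real paragraphs. The function names and decoy parameters are illustrative assumptions, not the authors' Laravel middleware.

```python
import random
import string

def random_text(length=12):
    """Generate filler text that only a scraper would ever read."""
    return "".join(random.choices(string.ascii_lowercase + " ", k=length))

def randomize_markup(paragraphs, max_decoys=5):
    """Interleave hidden decoy spans among the real paragraphs of an article.

    The spans are styled display:none, so browsers never render them,
    but a naive scraper that concatenates text nodes will pick them up.
    """
    out = []
    for p in paragraphs:
        out.append(f"<p>{p}</p>")
        for _ in range(random.randint(0, max_decoys)):
            out.append(f'<span style="display:none">{random_text()}</span>')
    return "\n".join(out)

if __name__ == "__main__":
    print(randomize_markup(["First real paragraph.", "Second real paragraph."]))
```

Because the decoys are hidden, a browser renders the article unchanged, which is consistent with the paper's finding that the style stays the same while the structure changes slightly.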

Kim, Donghoon, Sample, Luke.  2019.  Search Prevention with Captcha Against Web Indexing: A Proof of Concept. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). :219—224.
A website appears in search results based on web indexing conducted by a search engine bot (e.g., a web crawler). Some webpages do not want to be found easily because they include sensitive information. There are several methods to prevent web crawlers from indexing pages in a search engine's database; however, such webpages can still be indexed by malicious web crawlers. In this study, we explore a paradoxical new use of captchas: search prevention. Captchas are used to prevent web crawlers from indexing content by converting sensitive words to captchas. We have implemented a web-based captcha conversion tool based on our search prevention algorithm. We also describe a proof of concept with a web-based chat application modified to use our algorithm. We conducted an experiment to evaluate the idea on the Google search engine with two versions of webpages, one containing plain text and another with the sensitive words converted to captchas. The results show that the sensitive words on the captcha version of the webpages cannot be found by Google's search engine, while those on the plain-text versions can.
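
As an illustration of the word-to-captcha idea, the sketch below (assuming the Pillow imaging library) replaces each sensitive word in an HTML fragment with an inline PNG rendering, so the word never appears as indexable text. It omits the visual distortion a real captcha would add, and the function names are assumptions rather than the authors' conversion tool.

```python
import base64
import io
import re

from PIL import Image, ImageDraw  # assumes Pillow is installed

def word_to_captcha_img(word):
    """Render a word as an inline <img> so crawlers see no indexable text."""
    img = Image.new("RGB", (12 * len(word) + 10, 28), "white")
    ImageDraw.Draw(img).text((5, 5), word, fill="black")  # default font, no distortion
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    data = base64.b64encode(buf.getvalue()).decode("ascii")
    return f'<img alt="" src="data:image/png;base64,{data}">'

def protect(html, sensitive_words):
    """Replace every whole-word occurrence of a sensitive word with its image."""
    for word in sensitive_words:
        html = re.sub(rf"\b{re.escape(word)}\b", word_to_captcha_img(word), html)
    return html

print(protect("<p>The project codename is Aurora.</p>", ["Aurora"]))
```
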
Nathezhtha, T., Sangeetha, D., Vaidehi, V..  2019.  WC-PAD: Web Crawling based Phishing Attack Detection. 2019 International Carnahan Conference on Security Technology (ICCST). :1–6.
Phishing is a criminal offense involving the theft of users' sensitive data. Phishing websites target individuals, organizations, cloud storage and file hosting sites, and government websites. Currently, hardware-based anti-phishing approaches are widely used, but due to cost and operational factors software-based approaches are preferred. Existing phishing detection approaches fail to address problems such as zero-day phishing website attacks. To overcome these issues and precisely detect phishing, a three-phase attack detector named the Web Crawler based Phishing Attack Detector (WC-PAD) is proposed. It takes web traffic, web content, and the Uniform Resource Locator (URL) as input features and, based on these features, classifies websites as phishing or non-phishing. The experimental analysis of the proposed WC-PAD was done with datasets collected from real phishing cases. The experimental results show that WC-PAD achieves 98.9% accuracy in detecting both phishing and zero-day phishing attacks.
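
The abstract does not describe the classifier itself, so the following Python sketch only illustrates the URL-feature phase with a toy rule-based decision; the feature set, thresholds, and scoring are hypothetical and not taken from WC-PAD.

```python
from urllib.parse import urlparse

def url_features(url):
    """Extract a few lexical URL features commonly used in phishing detection."""
    host = urlparse(url).netloc
    return {
        "length": len(url),
        "num_dots": host.count("."),
        "has_at": "@" in url,
        "has_ip_host": host.replace(".", "").isdigit(),
        "has_https": url.startswith("https://"),
    }

def looks_phishy(url):
    """Toy rule-based decision; WC-PAD itself combines URL, content and traffic phases."""
    f = url_features(url)
    score = (
        (f["length"] > 75) + (f["num_dots"] > 3) + f["has_at"]
        + f["has_ip_host"] + (not f["has_https"])
    )
    return score >= 2

print(looks_phishy("http://192.168.10.5/secure-login@paypal.example"))
```
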
Park, A. J., Quadari, R. N., Tsang, H. H..  2017.  Phishing website detection framework through web scraping and data mining. 2017 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). :680–684.

Phishers often exploit users' trust in the appearance of a site by using webpages that are visually similar to an authentic site. In the past, various research studies have tried to identify and classify the factors contributing to the detection of phishing websites. The focus of this research is to establish a strong relationship between identified content-based heuristics and the legitimacy of a website by analyzing training sets of websites (both phishing and legitimate) and, in the process, to analyze new patterns and report findings. Many existing phishing detection tools are often not very accurate because they depend mostly on old databases of previously identified phishing websites; however, thousands of new phishing websites appear every year targeting financial institutions, cloud storage/file hosting sites, government websites, and others. This paper presents a framework called Phishing-Detective that detects phishing websites based on existing and newly found heuristics. For this framework, a web crawler was developed to scrape the contents of phishing and legitimate websites. These contents were analyzed to rate the heuristics and the scale factor of their contribution to the illegitimacy of a website. The data set collected by the web scraper was then analyzed using a data mining tool to find patterns and report findings. A case study shows how the framework can be used to detect a phishing website. This research is still in progress but shows a new way of finding and using heuristics, and the sum of their contributing weights, to effectively and accurately detect phishing websites. Further development of this framework is discussed at the end of the paper.
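
A minimal sketch of the weighted content-heuristic idea is shown below, assuming the requests and BeautifulSoup libraries; the heuristics and weights are illustrative placeholders, since the paper derives its own contribution scale factors from training data.

```python
import requests
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# Illustrative heuristic weights; the paper rates its own heuristics from analyzed data.
WEIGHTS = {"password_field": 0.4, "external_form_action": 0.4, "no_favicon": 0.2}

def score_page(url):
    """Scrape a page and compute a weighted content-based phishing score in [0, 1]."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    hits = {
        "password_field": bool(soup.find("input", {"type": "password"})),
        # Crude check: a form that posts to an absolute URL other than this page.
        "external_form_action": any(
            f.get("action", "").startswith("http") and url not in f.get("action", "")
            for f in soup.find_all("form")
        ),
        "no_favicon": soup.find("link", rel="icon") is None,
    }
    return sum(WEIGHTS[k] for k, hit in hits.items() if hit)

# A score close to 1 suggests the page shares traits with known phishing pages.
```
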

Zulkarnine, A. T., Frank, R., Monk, B., Mitchell, J., Davies, G..  2016.  Surfacing collaborated networks in dark web to find illicit and criminal content. 2016 IEEE Conference on Intelligence and Security Informatics (ISI). :109–114.
The Tor network, a hidden part of the Internet, is becoming an ideal hosting ground for illegal activities and services, including large drug markets, financial fraud, espionage, and child sexual abuse. Researchers and law enforcement rely on manual investigations, which are both time-consuming and ultimately inefficient. The first part of this paper explores illicit and criminal content in the dark web identified by prominent researchers. We previously developed a web crawler that automatically searched websites on the internet based on pre-defined keywords and followed hyperlinks in order to create a map of the network. This crawler has demonstrated success in locating and extracting data on child exploitation images, videos, keywords, and linkages on the public internet. However, because Tor functions differently at the TCP level and uses socket connections, crawling Tor poses further technical challenges. Other inherent challenges for advanced Tor crawling include scalability, content selection tradeoffs, and social obligation. We discuss these challenges and the measures taken to meet them. Our modified web crawler for Tor, termed the "Dark Crawler", has been able to access Tor while simultaneously accessing the public internet. We present initial findings regarding what extremist and terrorist content is present in Tor and how this content is connected in a mapped network that facilitates dark web crimes. Our results so far indicate that the most popular websites in the dark web act as catalysts for dark web expansion by providing the necessary knowledge base, support, and services to build Tor hidden services and onion websites.
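
As an illustration of crawling Tor and the public web from the same crawler, the sketch below routes .onion requests through Tor's default local SOCKS proxy (127.0.0.1:9050) while fetching clear-web pages directly, assuming the requests library with SOCKS support and BeautifulSoup; it is a generic keyword crawler, not the Dark Crawler's implementation.

```python
import requests
from bs4 import BeautifulSoup

# Tor's default SOCKS port; "socks5h" resolves DNS inside Tor, which .onion
# hostnames require. Needs the requests[socks] extra and a running Tor client.
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}

def crawl(seed_url, keywords, max_pages=20):
    """Breadth-first crawl that records pages containing any of the keywords."""
    queue, seen, matches = [seed_url], set(), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        proxies = TOR_PROXIES if ".onion" in url else None  # hidden service vs. clear web
        try:
            html = requests.get(url, proxies=proxies, timeout=30).text
        except requests.RequestException:
            continue
        if any(k.lower() in html.lower() for k in keywords):
            matches.append(url)
        soup = BeautifulSoup(html, "html.parser")
        queue.extend(a["href"] for a in soup.find_all("a", href=True)
                     if a["href"].startswith("http"))
    return matches
```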