Biblio

Filters: Keyword is Crawlers
2023-09-01
Chen, Guangxuan, Chen, Guangxiao, Wu, Di, Liu, Qiang, Zhang, Lei.  2022.  A Crawler-based Digital Forensics Method Oriented to Illegal Website. 2022 IEEE 5th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC). 5:1883–1887.
There are many illegal websites on the Internet, such as pornographic websites, gambling websites, online fraud websites, and online pyramid-selling websites. This paper studies the use of crawler technology for digital forensics on illegal websites. First, a crawler-based forensics program for illegal websites is designed and developed; it can detect peripheral information about an illegal website, such as its domain name, IP address, and network topology, and crawl key information such as website text, pictures, and scripts. Then, comprehensive analysis of the obtained data, such as word-cloud analysis, word-frequency analysis, and statistics, helps judge whether a website is illegal.
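To make the word-frequency step concrete, here is a minimal sketch in Python; the suspect-term list and the density scoring are invented for illustration, not the authors' actual pipeline.

```python
# Sketch of a word-frequency check over crawled page text; the keyword
# list and scoring are illustrative assumptions, not the paper's values.
import re
from collections import Counter

SUSPECT_TERMS = {"casino", "jackpot", "bet", "escort", "pyramid"}  # hypothetical

def word_frequency_score(page_text: str) -> float:
    """Return the fraction of word occurrences that match suspect terms."""
    words = re.findall(r"[a-z]+", page_text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    hits = sum(counts[t] for t in SUSPECT_TERMS)
    return hits / len(words)

if __name__ == "__main__":
    sample = "Win the jackpot now! Our casino lets you bet on anything."
    print(f"suspect-term density: {word_frequency_score(sample):.2%}")
```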
2023-06-02
Dalvi, Ashwini, Patil, Gunjan, Bhirud, S G.  2022.  Dark Web Marketplace Monitoring - The Emerging Business Trend of Cybersecurity. 2022 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT). :1–6.

Cyber threat intelligence (CTI) is vital for enabling effective cybersecurity decisions by providing timely, relevant, and actionable information about emerging threats. Monitoring the dark web to generate CTI is one of the upcoming trends in cybersecurity. As a result, developing CTI capabilities with dark web investigation is a significant focus for cybersecurity companies like Deepwatch, DarkOwl, SixGill, ThreatConnect, CyLance, ZeroFox, and many others. In addition, dark web marketplace (DWM) monitoring tools are of much interest to law enforcement agencies (LEAs). The fact that darknet market participants operate anonymously and online transactions are pseudo-anonymous makes it challenging to identify and investigate them. Therefore, keeping up with the DWMs poses significant challenges for LEAs today. Nevertheless, the offerings on the DWMs give LEAs insights into the dark web economy. The present work is one such attempt to describe and analyze dark web market data collected for CTI using a dark web crawler. After processing and labeling, the authors obtained 53 DWMs with their product listings and pricing.

2022-12-20
Hassanshahi, Behnaz, Lee, Hyunjun, Krishnan, Paddy.  2022.  Gelato: Feedback-driven and Guided Security Analysis of Client-side Web Applications. 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). :618–629.
Modern web applications are getting more sophisticated by using frameworks that make development easy but pose challenges for security analysis tools. New analysis techniques are needed to handle such frameworks, which grow in number and popularity. In this paper, we describe Gelato, which addresses the most crucial challenges for a security-aware client-side analysis of highly dynamic web applications. In particular, we use a feedback-driven, state-aware crawler that is able to analyze complex framework-based applications automatically and is guided to maximize coverage of security-sensitive parts of the program. Moreover, we propose a new lightweight client-side taint analysis that outperforms state-of-the-art tools, requires no modification to browsers, and reports non-trivial taint flows on modern JavaScript applications. Gelato reports vulnerabilities with higher accuracy than existing tools and achieves significantly better coverage on 12 applications, of which three are used in production.
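Gelato instruments JavaScript in the browser; purely as an illustration of the source-to-sink idea behind taint analysis, here is a toy Python sketch (the source and sink functions are hypothetical).

```python
# Toy dynamic taint tracking: a mark attached at the source survives
# concatenation and is detected at the sink. A real tracker must cover
# every string operation; here plain_str + Tainted would drop the mark.
class Tainted(str):
    """A string carrying a taint mark that survives left-concatenation."""
    def __add__(self, other):
        return Tainted(str(self) + str(other))

def read_url_parameter() -> Tainted:          # taint source (hypothetical)
    return Tainted("<script>alert(1)</script>")

def write_to_dom(fragment: str) -> None:      # security-sensitive sink
    if isinstance(fragment, Tainted):
        print("taint flow: untrusted data reached the DOM sink")

payload = read_url_parameter()
fragment = payload + "</div>"                 # Tainted.__add__ keeps the mark
write_to_dom(fragment)
```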
ISSN: 1534-5351
2022-05-23
Wen, Kaiyuan, Gang, Su, Li, Zhifeng, Zou, Zhexiang.  2021.  Design of Remote Control Intelligent Vehicle System with Three-dimensional Immersion. 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE). :287–290.
The project applies virtual reality technology to the monitoring field through 3D immersive technology and proposes the concept and technical route of remote 3D immersive intelligent control. A design scheme for a three-dimensional immersive remote somatosensory intelligent controller is proposed and applied to the remote three-dimensional immersive control of a crawler mobile robot, and the testing and analysis of a principle prototype are completed.
2022-04-12
Dalvi, Ashwini, Siddavatam, Irfan, Thakkar, Viraj, Jain, Apoorva, Kazi, Faruk, Bhirud, Sunil.  2021.  Link Harvesting on the Dark Web. 2021 IEEE Bombay Section Signature Conference (IBSSC). :1–5.
In this information age, web crawling is a prime source of data collection on the internet. With the surface web already dominated by giants like Google and Microsoft, much attention has turned to the Dark Web. While research on crawling approaches is generally available, a considerable gap exists for URL extraction on the dark web. Most of the literature uses the regular-expression methodology or built-in parsers; the problem with these methods is the higher number of false positives generated on the Dark Web, which makes the crawler less efficient. This paper proposes a dedicated-parser methodology for extracting URLs from the dark web, which, when compared, proves better than the regular-expression methodology. Factors that make link harvesting on the Dark Web challenging are also discussed.
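A minimal sketch of the contrast between the two extraction styles, assuming v3 onion addresses (56 base32 characters); the example markup is invented.

```python
# Regex over raw markup vs. a dedicated parser that only harvests hrefs.
import re
from html.parser import HTMLParser

ONION_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")

def regex_extract(raw_html: str) -> set[str]:
    """Naive scan of the raw markup: also matches addresses appearing in
    scripts, comments, and plain text, inflating false positives."""
    return set(ONION_RE.findall(raw_html))

class AnchorExtractor(HTMLParser):
    """Dedicated-parser approach: only keep href attributes of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links: set[str] = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and ONION_RE.search(value):
                    self.links.add(value)

html = '<a href="http://' + "a" * 56 + '.onion/">market</a>'
parser = AnchorExtractor()
parser.feed(html)
print(regex_extract(html), parser.links)
```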
Furumoto, Keisuke, Umizaki, Mitsuhiro, Fujita, Akira, Nagata, Takahiko, Takahashi, Takeshi, Inoue, Daisuke.  2021.  Extracting Threat Intelligence Related IoT Botnet From Latest Dark Web Data Collection. 2021 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing Communications (GreenCom) and IEEE Cyber, Physical Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics). :138–145.
As it is easy to ensure the confidentiality of users on the Dark Web, malware and exploit kits are sold on its markets, and attack methods are discussed in its forums. Some services provide IoT botnets to perform distributed denial-of-service attacks (DDoS as a Service: DaaS), and it is speculated that these services are purchased on the Dark Web. By crawling such information and storing it in a database, threat intelligence can be obtained that cannot otherwise be obtained from information on the Surface Web. However, crawling sites on the Dark Web presents technical challenges. For this paper, we implemented a crawler that can solve these challenges. We also collected information on markets and forums on the Dark Web by operating the implemented crawler. Results confirmed that the dataset collected by crawling contains threat intelligence that is useful for analyzing cyber attacks, particularly those related to IoT botnets and DaaS. Moreover, by uncovering the relationship with security reports, we demonstrated that the use of data collected from the Dark Web can provide more extensive threat intelligence than using information collected only on the Surface Web.
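As a hedged sketch of how crawled forum posts might be tagged with threat categories, consider the following; the keyword lists are invented stand-ins, not the paper's taxonomy.

```python
# Tag crawled posts with coarse threat categories via keyword matching;
# the categories and terms below are illustrative assumptions.
IOC_KEYWORDS = {
    "iot_botnet": ("mirai", "botnet", "telnet scanner"),
    "daas": ("ddos for hire", "booter", "stresser"),
}

def tag_post(text: str) -> list[str]:
    lowered = text.lower()
    return [tag for tag, words in IOC_KEYWORDS.items()
            if any(w in lowered for w in words)]

print(tag_post("Selling Mirai-based booter, cheap stresser plans"))
# -> ['iot_botnet', 'daas']
```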
Dalvi, Ashwini, Ankamwar, Lukesh, Sargar, Omkar, Kazi, Faruk, Bhirud, S.G..  2021.  From Hidden Wiki 2020 to Hidden Wiki 2021: What Dark Web Researchers Comprehend with Tor Directory Services? 2021 5th International Conference on Information Systems and Computer Networks (ISCON). :1–4.
The dark web search mechanism is unlike surface web search. On one popular dark web, the Tor dark web, search is often directed by directory-like services such as the Hidden Wiki. Numerous dark web data collection mechanisms are discussed and implemented via crawling. A dark web crawler assumes a seed link, i.e., a hidden service from which the crawling begins; one popular Tor directory service is the Hidden Wiki. Most of the hidden services listed on the Hidden Wiki 2020 page became unreachable with a recent upgrade of the Tor version, and the Hidden Wiki 2021 page has a limited listing of services compared to the 2020 page. This motivated the authors of the present work to examine the role of Hidden Wiki-like services in dark web research and to propose the hypothesis that the dark web can be reached better through customized harvested links than through Hidden Wiki-like services. The work collects unique hidden services (onion links) using the open-source crawler TorBot and runs similarity analysis on the collected pages to map them to corresponding categories.
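A rough, standard-library-only sketch of the similarity-grouping step; the real work uses TorBot for collection and its own analysis, so the greedy grouping and threshold here are assumptions.

```python
# Greedily group onion pages whose text similarity exceeds a threshold.
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def group_pages(pages: dict[str, str], threshold: float = 0.6) -> list[set[str]]:
    """pages maps onion URL -> page text; returns clusters of URLs."""
    groups: list[set[str]] = []
    reps: list[str] = []          # one representative text per group
    for url, text in pages.items():
        for group, rep in zip(groups, reps):
            if similar(text, rep) >= threshold:
                group.add(url)
                break
        else:
            groups.append({url})
            reps.append(text)
    return groups

print(group_pages({"a.onion": "drug market listing", 
                   "b.onion": "drugs market listings",
                   "c.onion": "political discussion forum"}))
```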
2022-02-25
Bolbol, Noor, Barhoom, Tawfiq.  2021.  Mitigating Web Scrapers using Markup Randomization. 2021 Palestinian International Conference on Information and Communication Technology (PICICT). :157–162.

Web scraping is the technique of extracting desired data in an automated way by scanning the internal links and content of a website; this activity is usually performed by systematically programmed bots. This paper explains our proposed solution to protect blog content from theft and from being copied to other destinations by mitigating scraping bots. To achieve this, we applied two steps at two levels: first, at the main blog page level, we mitigated crawler bots by adding extra empty article anchors among the real articles; second, at the article page level, we added a random number of empty, hidden spans containing randomly generated text within the article's body. To assess this solution, we applied it to a local project developed in PHP with the Laravel framework and defined four criteria to measure effectiveness. The results show that the file size before and after applying the solution is essentially unaffected, and the processing time increases by only a few milliseconds, which is still within the acceptable range. Using an HTML-similarity tool, we obtained very good results showing symmetry in style, with only slight changes in structure. Finally, to assess the effect on bots, a scraper bot was reused and the programmed middleware produced the expected results. These results show that the solution is feasible to adopt and use to protect blog content.
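The paper's implementation is PHP/Laravel middleware; the following Python sketch only mirrors the article-level idea of interleaving hidden, randomly filled spans among real paragraphs.

```python
# Illustrative markup randomization: scraped copies differ from the
# rendered article because hidden spans carry random filler text.
import random
import string

def random_text(n: int) -> str:
    return "".join(random.choices(string.ascii_lowercase + " ", k=n))

def randomize_article(paragraphs: list[str]) -> str:
    chunks = []
    for p in paragraphs:
        chunks.append(f"<p>{p}</p>")
        for _ in range(random.randint(1, 3)):          # random span count
            chunks.append(
                f'<span style="display:none">{random_text(24)}</span>')
    return "\n".join(chunks)

print(randomize_article(["First real paragraph.", "Second real paragraph."]))
```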

2021-03-09
Badawi, E., Jourdan, G.-V., Bochmann, G., Onut, I.-V..  2020.  An Automatic Detection and Analysis of the Bitcoin Generator Scam. 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS PW). :407–416.

We investigate what we call the "Bitcoin Generator Scam" (BGS), a simple system in which the scammers promise to "generate" new bitcoins using the ones that were sent to them. A typical offer will suggest that, for a small fee, one could receive within minutes twice the amount of bitcoins submitted. BGS is clearly not a very sophisticated attack. The modus operandi is simply to put up some web page on which to find the address to send the money and wait for the payback. The pages are then indexed by search engines, ready to be found by victims looking for free bitcoins. We describe here a generic system to find and analyze scams such as BGS. We have trained a classifier to detect these pages, and we have a crawler searching for instances using a series of search engines. We then monitor the instances that we find to trace payments and bitcoin addresses that are being used over time. Unlike most bitcoin-based scam monitoring systems, we do not rely on analyzing transactions on the blockchain to find scam instances. Instead, we proactively find these instances through the web pages advertising the scam. Thus our system is able to find addresses with very few transactions, or even none at all. Indeed, over half of the addresses that have eventually received funds were detected before receiving any transactions. The data for this paper was collected over four months, from November 2019 to February 2020. We have found more than 1,300 addresses directly associated with the scam, hosted on over 500 domains. Overall, these addresses have received (at least) over 5 million USD through the scam, with an average of 47.3 USD per transaction.
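A minimal sketch of harvesting candidate payment addresses from fetched scam pages; the base58 pattern below covers legacy addresses only and is an approximation, and the first-seen bookkeeping is an assumed design.

```python
# Extract bitcoin-address-like strings and record when each was first seen.
import re
from datetime import datetime, timezone

ADDR_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")

first_seen: dict[str, str] = {}

def record_addresses(page_html: str) -> set[str]:
    found = set(ADDR_RE.findall(page_html))
    for addr in found:
        first_seen.setdefault(addr, datetime.now(timezone.utc).isoformat())
    return found

page = "Send BTC to 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 and get double back!"
print(record_addresses(page), first_seen)
```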

2020-12-11
Huang, S., Chuang, T., Huang, S., Ban, T..  2019.  Malicious URL Linkage Analysis and Common Pattern Discovery. 2019 IEEE International Conference on Big Data (Big Data). :3172–3179.

Malicious domain names are constantly changing. It is challenging to keep blacklists of malicious domain names up to date because of the time lag between their creation and detection. Even if a website is clean itself, it does not necessarily mean that it won't be used as a pivot point to redirect users to malicious destinations. To address this issue, this paper demonstrates how to use linkage analysis and open-source threat intelligence to visualize the relationships of malicious domain names while verifying their categories, e.g., drive-by download, unwanted software, etc. Featuring a graph-based model that can present the inter-connectivity of malicious domain names in a dynamic fashion, the proposed approach proved helpful for revealing the group patterns of different kinds of malicious domain names. When applied to analyze a blacklisted set of URLs in a real enterprise network, it showed better effectiveness than traditional methods and yielded a clearer view of the common patterns in the data.
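A plain-dictionary sketch of the linkage idea: edges record observed redirects or embedded links between domains, and a traversal finds everything reachable from a pivot. The domain names are made up.

```python
# BFS over a domain-link graph to surface destinations reachable from a
# seemingly clean pivot site.
from collections import deque

links: dict[str, set[str]] = {
    "clean-blog.example": {"ad-hub.example"},
    "ad-hub.example":     {"dropper.example", "tracker.example"},
    "dropper.example":    set(),
    "tracker.example":    set(),
}

def reachable_from(start: str) -> set[str]:
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in links.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(reachable_from("clean-blog.example"))
```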

2020-11-09
Pflanzner, T., Feher, Z., Kertesz, A..  2019.  A Crawling Approach to Facilitate Open IoT Data Archiving and Reuse. 2019 Sixth International Conference on Internet of Things: Systems, Management and Security (IOTSMS). :235–242.
Several cloud providers have started to offer specific data management services in response to the new trend called the Internet of Things (IoT). In recent years, we have already seen that cloud computing has managed to serve IoT needs for data retrieval, processing, and visualization transparently to the user. IoT-Cloud systems for smart cities and smart regions can be very complex; therefore, their design and analysis should be supported by means of simulation. Nevertheless, the models used in simulation environments should be as close as possible to real-world utilization to provide reliable results. To facilitate such simulations, in earlier works we proposed an IoT trace archiving service called SUMMON that can be used to gather real-world datasets and reuse them for simulation experiments. In this paper we provide an extension to SUMMON with an automated web crawling service that gathers IoT and sensor data from publicly available websites. We introduce the architecture and operation of this approach, and exemplify its utilization with three use cases. The provided archiving solution can be used by simulators to perform realistic evaluations.
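A minimal sketch of the fetch-and-archive step such a crawling service performs; the endpoint URL is a placeholder and SUMMON's actual interface is not shown here.

```python
# Fetch a public JSON sensor feed and append it to a trace file with a
# capture timestamp, so simulators can replay it later.
import json
import time
import urllib.request

def archive_sensor_feed(url: str, out_path: str) -> None:
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    record = {"captured_at": time.time(), "source": url, "data": payload}
    with open(out_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# archive_sensor_feed("https://example.org/city/sensors.json", "trace.jsonl")
```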
2020-09-11
Kim, Donghoon, Sample, Luke.  2019.  Search Prevention with Captcha Against Web Indexing: A Proof of Concept. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). :219–224.
A website appears in search results based on web indexing conducted by a search engine bot (e.g., a web crawler). Some webpages do not want to be found easily because they include sensitive information. There are several methods to prevent web crawlers from indexing pages in a search engine database; however, such webpages can still be indexed by malicious web crawlers. Through this study, we explore a paradoxical new use of captchas for search prevention: captchas are used to prevent web crawlers from indexing by converting sensitive words to captchas. We have implemented a web-based captcha conversion tool based on our search prevention algorithm. We also describe our proof of concept with a web-based chat application modified to utilize our algorithm. We conducted an experiment to evaluate our idea on the Google search engine with two versions of webpages, one containing plain text and another containing sensitive words converted to captchas. The experiment results show that the sensitive words on the captcha version of the webpages cannot be found by Google's search engine, while those on the plain-text versions can.
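A sketch of the conversion idea: each sensitive word is swapped for an image tag whose picture renders the word, so text-indexing crawlers never see it. Captcha-image generation itself is elided, and the endpoint name is hypothetical.

```python
# Replace sensitive words in page text with <img> references; the image
# id derived from hash() is illustrative only (it varies across runs).
import re

SENSITIVE = {"password", "classified"}

def captchaize(html_text: str) -> str:
    def swap(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in SENSITIVE:
            return f'<img src="/captcha?id={hash(word) & 0xffff}" alt="">'
        return word
    return re.sub(r"[A-Za-z]+", swap, html_text)

print(captchaize("The classified report password is stored offline."))
```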
2020-07-10
Koloveas, Paris, Chantzios, Thanasis, Tryfonopoulos, Christos, Skiadopoulos, Spiros.  2019.  A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. 2019 IEEE World Congress on Services (SERVICES). 2642-939X:3–8.

The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information-gathering task and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially, a machine-learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
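A compressed sketch of the two phases; the phase-one classifier is replaced here by a trivial keyword score, and phase two uses TF-IDF in place of the paper's latent language models, so both are stand-ins.

```python
# Phase 1: prioritize the crawl frontier; Phase 2: rank harvested docs.
import heapq
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def frontier_priority(url: str) -> float:       # phase-1 stand-in
    return sum(kw in url for kw in ("exploit", "forum", "market"))

def rank_harvested(docs: list[str], topic: str) -> list[int]:
    """Order harvested documents by similarity to a topic description."""
    vec = TfidfVectorizer().fit(docs + [topic])
    scores = cosine_similarity(vec.transform([topic]), vec.transform(docs))[0]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

frontier = [(-frontier_priority(u), u) for u in
            ["http://a.example/forum", "http://b.example/about"]]
heapq.heapify(frontier)
print(heapq.heappop(frontier)[1])   # highest-priority URL is crawled first
```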

2020-03-09
Nathezhtha, T., Sangeetha, D., Vaidehi, V..  2019.  WC-PAD: Web Crawling based Phishing Attack Detection. 2019 International Carnahan Conference on Security Technology (ICCST). :1–6.
Phishing is a criminal offense that involves the theft of users' sensitive data. Phishing websites target individuals, organizations, cloud storage hosting sites, and government websites. Currently, hardware-based approaches to anti-phishing are widely used, but due to cost and operational factors, software-based approaches are preferred. Existing phishing detection approaches fail to solve problems such as zero-day phishing website attacks. To overcome these issues and precisely detect phishing occurrences, a three-phase attack detector named the Web Crawler based Phishing Attack Detector (WC-PAD) has been proposed. It takes web traffic, web content, and the Uniform Resource Locator (URL) as input features; based on these features, classification of phishing and non-phishing websites is done. The experimental analysis of the proposed WC-PAD was done with datasets collected from real phishing cases. From the experimental results, it is found that the proposed WC-PAD gives 98.9% accuracy in both phishing and zero-day phishing attack detection.
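The abstract does not spell out WC-PAD's exact features, so the following URL-feature extraction is an illustrative assumption based on common phishing lexical cues.

```python
# Lexical URL features often used in phishing classification.
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict[str, int]:
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "length": len(url),
        "has_at_sign": int("@" in url),
        "has_ip_host": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
        "subdomain_count": max(host.count(".") - 1, 0),
        "hyphen_in_host": int("-" in host),
    }

print(url_features("http://paypal.com.secure-login.example@198.51.100.7/x"))
```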
2019-12-16
Hou, Xin-Yu, Zhao, Xiao-Lin, Wu, Mei-Jing, Ma, Rui, Chen, Yu-Peng.  2018.  A Dynamic Detection Technique for XSS Vulnerabilities. 2018 4th Annual International Conference on Network and Information Systems for Computers (ICNISC). :34–43.

This paper studies the principle of vulnerability generation and the mechanism of cross-site scripting attacks, and designs a dynamic cross-site scripting vulnerability detection technique based on existing theories of black-box vulnerability detection. The dynamic detection process contains five steps: crawling, feature construction, attack simulation, result detection, and report generation. The crawling strategy in the crawler module and the constructing algorithm in the feature-construction module are the key points of this detection process. Finally, according to the detection technique proposed in this paper, a detection tool was implemented on Linux in Python to detect web applications. Experiments were launched to verify the results and compare them with the test results of other existing tools, to analyze the usability, advantages, and disadvantages of the detection method, and to confirm the feasibility of applying the dynamic detection technique to cross-site scripting vulnerability detection.
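A minimal sketch of the attack-simulation and result-detection steps only; the target URL and parameter name are placeholders, and a real scanner would also handle forms, encodings, and DOM sinks.

```python
# Probe a query parameter for unescaped reflection of an XSS marker.
import requests

MARKER = "x5s<script>probe()</script>"

def probe_reflected_xss(url: str, param: str) -> bool:
    resp = requests.get(url, params={param: MARKER}, timeout=10)
    return MARKER in resp.text   # reflected unescaped -> likely vulnerable

# if probe_reflected_xss("http://testsite.example/search", "q"):
#     print("parameter 'q' reflects input unescaped")
```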

2019-02-25
Vyamajala, S., Mohd, T. K., Javaid, A..  2018.  A Real-World Implementation of SQL Injection Attack Using Open Source Tools for Enhanced Cybersecurity Learning. 2018 IEEE International Conference on Electro/Information Technology (EIT). :0198–0202.

SQL injection is a well-known method of executing SQL queries and retrieving sensitive information from a website's connected database. This process poses a threat to applications that are poorly coded in today's world. SQL injection is still considered one of the top 10 vulnerabilities, even in 2018. To keep track of the vulnerabilities that each website faces, we employ a tool called Acunetix, which allows us to find the vulnerabilities of a specific website; the tool also suggests preventive measures. Using this implementation, we discover vulnerabilities in an actual website. Such a real-world implementation would be useful for instructional use in a foundational cybersecurity course.
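For classroom purposes, the flaw itself fits in a few lines; this self-contained sketch contrasts a query built by string concatenation with a parameterized one (it is an illustration, not the paper's lab setup).

```python
# In-memory demo: the same lookup, concatenated vs. parameterized.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")

payload = "nobody' OR '1'='1"

# Vulnerable: attacker-controlled input is spliced into the SQL text.
rows = conn.execute(
    f"SELECT * FROM users WHERE name = '{payload}'").fetchall()
print("concatenated query leaks:", rows)      # returns every row

# Safe: placeholders keep the payload as data, never as SQL.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (payload,)).fetchall()
print("parameterized query leaks:", rows)     # returns nothing
```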

2018-09-12
Rahayuda, I. G. S., Santiari, N. P. L..  2017.  Crawling and cluster hidden web using crawler framework and fuzzy-KNN. 2017 5th International Conference on Cyber and IT Service Management (CITSM). :1–7.
Today almost everyone uses the internet for daily activities, whether social, academic, work, or business. But only a few of us are aware that we generally access only a small part of the overall internet. The internet, or world wide web, is divided into several levels, such as the surface web, deep web, and dark web, and accessing the deep or dark web can be dangerous. This research examines web content and deep content. For faster and safer searching, a crawler framework is used. The search process yields various kinds of data to be stored in a database, and a classification process is applied to determine the level of each website. The classification is done using the fuzzy-KNN method, which classifies the results of the crawling framework contained in the database. The crawling framework generates data in the form of URL addresses, page info, and more, and the crawled data are compared with predefined sample data. The fuzzy-KNN classification yields the web level of the data based on the values of the words specified in the sample data. From the research conducted on several test datasets, about 20% of the sites were surface web, 7.5% Bergie web, 20% deep web, 22.5% Charter web, and 30% dark web. The research was done only on some test data; more data must be added to obtain better results. Better crawler frameworks can speed up crawling, especially at certain web levels, because not all crawler frameworks can work at a particular web level; the Tor browser can be used, but the crawler framework sometimes cannot work there.
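For reference, fuzzy-KNN (Keller et al., 1985) weights each of the k nearest neighbors' class memberships by inverse distance; the sketch below uses toy two-class stand-ins for the paper's crawled-page features and five web levels.

```python
# Compact fuzzy-KNN membership computation.
import numpy as np

def fuzzy_knn(x, X, memberships, k=3, m=2.0):
    """Return the class-membership vector for sample x.

    X           : (n, d) training feature matrix
    memberships : (n, c) per-sample class memberships (crisp rows OK)
    m           : fuzzifier; weights are d^(-2/(m-1))
    """
    d = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[idx], 1e-12) ** (2.0 / (m - 1.0))
    return (w[:, None] * memberships[idx]).sum(axis=0) / w.sum()

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # two toy levels
print(fuzzy_knn(np.array([0.15, 0.85]), X, U))  # high membership in class 0
```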
2017-12-20
Park, A. J., Quadari, R. N., Tsang, H. H..  2017.  Phishing website detection framework through web scraping and data mining. 2017 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). :680–684.

Phishers often exploit users' trust in the appearance of a site by using webpages that are visually similar to an authentic site. In the past, various research studies have tried to identify and classify the factors contributing towards the detection of phishing websites. The focus of this research is to establish a strong relationship between those identified heuristics (content-based) and the legitimacy of a website by analyzing training sets of websites (both phishing and legitimate) and, in the process, to analyze new patterns and report findings. Many existing phishing detection tools are often not very accurate, as they depend mostly on old databases of previously identified phishing websites. However, thousands of new phishing websites appear every year, targeting financial institutions, cloud storage/file hosting sites, government websites, and others. This paper presents a framework called Phishing-Detective that detects phishing websites based on existing and newly found heuristics. For this framework, a web crawler was developed to scrape the contents of phishing and legitimate websites. These contents were analyzed to rate the heuristics and their contribution scale factors towards the illegitimacy of a website. The data set collected by the web scraper was then analyzed using a data mining tool to find patterns and report findings. A case study shows how this framework can be used to detect a phishing website. This research is still in progress but shows a new way of finding and using heuristics and the sum of their contributing weights to effectively and accurately detect phishing websites. Further development of this framework is discussed at the end of the paper.
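The "sum of contributing weights" idea can be sketched as follows; the heuristic names and weights are invented for illustration and are not the paper's rated values.

```python
# Score a page by summing the weights of content heuristics that fired.
HEURISTIC_WEIGHTS = {
    "login_form_present":  0.30,
    "external_favicon":    0.20,
    "mismatched_anchors":  0.35,
    "copied_brand_text":   0.15,
}

def phishing_score(observed: dict[str, bool]) -> float:
    """Returns 0.0 (no heuristic fired) up to 1.0 (all fired)."""
    return sum(w for name, w in HEURISTIC_WEIGHTS.items()
               if observed.get(name, False))

signals = {"login_form_present": True, "mismatched_anchors": True}
print(f"score = {phishing_score(signals):.2f}")   # flag above a threshold
```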

2017-11-27
Kuze, N., Ishikura, S., Yagi, T., Chiba, D., Murata, M..  2016.  Detection of vulnerability scanning using features of collective accesses based on information collected from multiple honeypots. NOMS 2016 - 2016 IEEE/IFIP Network Operations and Management Symposium. :1067–1072.

Attacks against websites are increasing rapidly with the expansion of web services. An increasing number of diversified web services make it difficult to prevent such attacks due to the many known vulnerabilities in websites. To overcome this problem, it is necessary to collect the most recent attacks using decoy web honeypots and to implement countermeasures against malicious threats. Web honeypots collect not only malicious accesses by attackers but also benign accesses such as those by web search crawlers. Thus, it is essential to develop a means of automatically identifying malicious accesses in mixed collected data that includes both malicious and benign accesses. Specifically, detecting vulnerability scanning, which is a preliminary process of an attack, is important for preventing attacks. In this study, we focused on classifying accesses from web crawlers and vulnerability scanners, since these accesses are too similar to distinguish individually. We propose a feature vector including features of collective accesses, e.g., the intervals of request arrivals and the dispersion of source port numbers, obtained with multiple honeypots deployed in different networks. Through evaluation using data collected from 37 honeypots in a real network, we show that features of collective accesses are advantageous for classifying vulnerability scanning and crawling.
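Two of the named collective features can be computed as below from (timestamp, source port) pairs aggregated across honeypots; the event data is made up, and the paper's full feature vector is larger.

```python
# Mean inter-arrival interval and source-port dispersion for one source.
from statistics import mean, pstdev

def collective_features(events: list[tuple[float, int]]) -> dict[str, float]:
    times = sorted(t for t, _ in events)
    ports = [p for _, p in events]
    intervals = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "mean_arrival_interval": mean(intervals),
        "source_port_dispersion": pstdev(ports) if len(ports) > 1 else 0.0,
    }

# Scanner-like burst: rapid arrivals, widely scattered source ports.
events = [(0.0, 40211), (0.1, 51873), (0.2, 34520), (0.3, 60994)]
print(collective_features(events))
```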

2017-11-03
Zulkarnine, A. T., Frank, R., Monk, B., Mitchell, J., Davies, G..  2016.  Surfacing collaborated networks in dark web to find illicit and criminal content. 2016 IEEE Conference on Intelligence and Security Informatics (ISI). :109–114.
The Tor Network, a hidden part of the Internet, is becoming an ideal hosting ground for illegal activities and services, including large drug markets, financial fraud, espionage, and child sexual abuse. Researchers and law enforcement rely on manual investigations, which are both time-consuming and ultimately inefficient. The first part of this paper explores illicit and criminal content identified by prominent researchers on the dark web. We previously developed a web crawler that automatically searched websites on the internet based on pre-defined keywords and followed the hyperlinks in order to create a map of the network. This crawler has demonstrated previous success in locating and extracting data on child exploitation images, videos, keywords, and linkages on the public internet. However, as Tor functions differently at the TCP level and uses socket connections, further technical challenges are faced when crawling Tor. Other inherent challenges for advanced Tor crawling include scalability, content selection tradeoffs, and social obligation. We discuss these challenges and the measures taken to meet them. Our modified web crawler for Tor, termed the "Dark Crawler", has been able to access Tor while simultaneously accessing the public internet. We present initial findings regarding what extremist and terrorist content is present in Tor and how this content is connected in a mapped network that facilitates dark web crimes. Our results so far indicate that the most popular websites on the dark web act as catalysts for dark web expansion by providing the necessary knowledge base, support, and services to build Tor hidden services and onion websites.
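The socket-level detail is typically handled by routing HTTP through Tor's local SOCKS proxy; a minimal sketch follows, assuming a local Tor client on port 9050 and the requests[socks] extra installed. The onion address is a placeholder.

```python
# Fetch a hidden service through Tor's SOCKS proxy.
import requests

TOR_PROXIES = {
    "http":  "socks5h://127.0.0.1:9050",   # socks5h resolves .onion inside Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion(url: str) -> str:
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    resp.raise_for_status()
    return resp.text

# html = fetch_onion("http://examplehiddenservice.onion/")
```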
Iliou, C., Kalpakis, G., Tsikrika, T., Vrochidis, S., Kompatsiaris, I..  2016.  Hybrid Focused Crawling for Homemade Explosives Discovery on Surface and Dark Web. 2016 11th International Conference on Availability, Reliability and Security (ARES). :229–234.
This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly traverse the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the network type. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed approach both for the Surface and the Dark Web.
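The network-adaptive behavior can be sketched as a dispatch on the address format; the proxy ports are common local defaults (Tor SOCKS 9050, I2P HTTP proxy 4444, Freenet fproxy 8888) and may differ from the paper's actual setup.

```python
# Pick proxy settings per network type before fetching a URL.
def network_of(url: str) -> str:
    host = url.split("/")[2] if "//" in url else url
    if host.endswith(".onion"):
        return "tor"
    if host.endswith(".i2p"):
        return "i2p"
    if "127.0.0.1:8888" in url:      # Freenet keys are fetched via fproxy
        return "freenet"
    return "surface"

PROXIES = {
    "tor":     {"http": "socks5h://127.0.0.1:9050"},
    "i2p":     {"http": "http://127.0.0.1:4444"},
    "freenet": None,   # fproxy is addressed directly; no extra proxy needed
    "surface": None,
}

print(network_of("http://examplehiddenservice.onion/"))   # -> tor
```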