Biblio
To adapt to the rapidly evolving landscape of cyber threats, security professionals are actively exchanging Indicators of Compromise (IOC) (e.g., malware signatures, botnet IPs) through public sources (e.g. blogs, forums, tweets, etc.). Such information, often presented in articles, posts, white papers etc., can be converted into a machine-readable OpenIOC format for automatic analysis and quick deployment to various security mechanisms like an intrusion detection system. With hundreds of thousands of sources in the wild, the IOC data are produced at a high volume and velocity today, which becomes increasingly hard to manage by humans. Efforts to automatically gather such information from unstructured text, however, is impeded by the limitations of today's Natural Language Processing (NLP) techniques, which cannot meet the high standard (in terms of accuracy and coverage) expected from the IOCs that could serve as direct input to a defense system. In this paper, we present iACE, an innovation solution for fully automated IOC extraction. Our approach is based upon the observation that the IOCs in technical articles are often described in a predictable way: being connected to a set of context terms (e.g., "download") through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., "malware", "download") within the sentences in a technical article, and further analyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that the IOC is commonly presented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., download from an external source). Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates a remarkable performance: it generated 900K OpenIOC items with a precision of 95% and a coverage over 90%, which is way beyond what the state-of-the-art NLP technique and industry IOC tool can achieve, at a speed of thousands of articles per hour. Further, by correlating the IOCs mined from the articles published over a 13-year span, our study sheds new light on the links across hundreds of seemingly unrelated attack instances, particularly their shared infrastructure resources, as well as the impacts of such open-source threat intelligence on security protection and evolution of attack strategies.
To adapt to the rapidly evolving landscape of cyber threats, security professionals are actively exchanging Indicators of Compromise (IOC) (e.g., malware signatures, botnet IPs) through public sources (e.g. blogs, forums, tweets, etc.). Such information, often presented in articles, posts, white papers etc., can be converted into a machine-readable OpenIOC format for automatic analysis and quick deployment to various security mechanisms like an intrusion detection system. With hundreds of thousands of sources in the wild, the IOC data are produced at a high volume and velocity today, which becomes increasingly hard to manage by humans. Efforts to automatically gather such information from unstructured text, however, is impeded by the limitations of today's Natural Language Processing (NLP) techniques, which cannot meet the high standard (in terms of accuracy and coverage) expected from the IOCs that could serve as direct input to a defense system. In this paper, we present iACE, an innovation solution for fully automated IOC extraction. Our approach is based upon the observation that the IOCs in technical articles are often described in a predictable way: being connected to a set of context terms (e.g., "download") through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., "malware", "download") within the sentences in a technical article, and further analyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that the IOC is commonly presented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., download from an external source). Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates a remarkable performance: it generated 900K OpenIOC items with a precision of 95% and a coverage over 90%, which is way beyond what the state-of-the-art NLP technique and industry IOC tool can achieve, at a speed of thousands of articles per hour. Further, by correlating the IOCs mined from the articles published over a 13-year span, our study sheds new light on the links across hundreds of seemingly unrelated attack instances, particularly their shared infrastructure resources, as well as the impacts of such open-source threat intelligence on security protection and evolution of attack strategies.
Cyber Threat Intelligence (CTI) sharing facilitates a comprehensive understanding of adversary activity and enables enterprise networks to prioritize their cyber defense technologies. To that end, we introduce HogMap, a novel software-defined infrastructure that simplifies and incentivizes collaborative measurement and monitoring of cyber-threat activity. HogMap proposes to transform the cyber-threat monitoring landscape by integrating several novel SDN-enabled capabilities: (i) intelligent in-place filtering of malicious traffic, (ii) dynamic migration of interesting and extraordinary traffic and (iii) a software-defined marketplace where various parties can opportunistically subscribe to and publish cyber-threat intelligence services in a flexible manner. We present the architectural vision and summarize our preliminary experience in developing and operating an SDN-based HoneyGrid, which spans three enterprises and implements several of the enabling capabilities (e.g., traffic filtering, traffic forwarding and connection migration). We find that SDN technologies greatly simplify the design and deployment of such globally distributed and elastic HoneyGrids.
The significant dependence on cyberspace has indeed brought new risks that often compromise, exploit and damage invaluable data and systems. Thus, the capability to proactively infer malicious activities is of paramount importance. In this context, inferring probing events, which are commonly the first stage of any cyber attack, render a promising tactic to achieve that task. We have been receiving for the past three years 12 GB of daily malicious real darknet data (i.e., Internet traffic destined to half a million routable yet unallocated IP addresses) from more than 12 countries. This paper exploits such data to propose a novel approach that aims at capturing the behavior of the probing sources in an attempt to infer their orchestration (i.e., coordination) pattern. The latter defines a recently discovered characteristic of a new phenomenon of probing events that could be ominously leveraged to cause drastic Internet-wide and enterprise impacts as precursors of various cyber attacks. To accomplish its goals, the proposed approach leverages various signal and statistical techniques, information theoretical metrics, fuzzy approaches with real malware traffic and data mining methods. The approach is validated through one use case that arguably proves that a previously analyzed orchestrated probing event from last year is indeed still active, yet operating in a stealthy, very low rate mode. We envision that the proposed approach that is tailored towards darknet data, which is frequently, abundantly and effectively used to generate cyber threat intelligence, could be used by network security analysts, emergency response teams and/or observers of cyber events to infer large-scale orchestrated probing events for early cyber attack warning and notification.