Biblio
Telephone spam has become an increasingly prevalent problem in many countries all over the world. For example, the US Federal Trade Commission's (FTC) National Do Not Call Registry's number of cumulative complaints of spam/scam calls reached 30.9 million submissions in 2016. Naturally, telephone carriers can play an important role in the fight against spam. However, due to the extremely large volume of calls that transit across large carrier networks, it is challenging to mine their vast amounts of call detail records (CDRs) to accurately detect and block spam phone calls. This is because CDRs only contain high-level metadata (e.g., source and destination numbers, call start time, call duration, etc.) related to each phone calls. In addition, ground truth about both benign and spam-related phone numbers is often very scarce (only a tiny fraction of all phone numbers can be labeled). More importantly, telephone carriers are extremely sensitive to false positives, as they need to avoid blocking any non-spam calls, making the detection of spam-related numbers even more challenging. In this paper, we present a novel detection system that aims to discover telephone numbers involved in spam campaigns. Given a small seed of known spam phone numbers, our system uses a combination of unsupervised and supervised machine learning methods to mine new, previously unknown spam numbers from large datasets of call detail records (CDRs). Our objective is not to detect all possible spam phone calls crossing a carrier's network, but rather to expand the list of known spam numbers while aiming for zero false positives, so that the newly discovered numbers may be added to a phone blacklist, for example. To evaluate our system, we have conducted experiments over a large dataset of real-world CDRs provided by a leading telephony provider in China, while tuning the system to produce no false positives. The experimental results show that our system is able to greatly expand on the initial seed of known spam numbers by up to about 250%.
In this paper we propose Mastino, a novel defense system to detect malware download events. A download event is a 3-tuple that identifies the action of downloading a file from a URL that was triggered by a client (machine). Mastino utilizes global situation awareness and continuously monitors various network- and system-level events of the clients' machines across the Internet and provides real time classification of both files and URLs to the clients upon submission of a new, unknown file or URL to the system. To enable detection of the download events, Mastino builds a large download graph that captures the subtle relationships among the entities of download events, i.e. files, URLs, and machines. We implemented a prototype version of Mastino and evaluated it in a large-scale real-world deployment. Our experimental evaluation shows that Mastino can accurately classify malware download events with an average of 95.5% true positive (TP), while incurring less than 0.5% false positives (FP). In addition, we show the Mastino can classify a new download event as either benign or malware in just a fraction of a second, and is therefore suitable as a real time defense system.
In an attempt to coerce useful information about the behavior of new malware families, threat analysts commonly force newly collected malicious software samples to run within a sandboxed environment. The main goal is to gather intelligence that can later be leveraged to detect and enumerate new malware infections within a network. Currently, most analysis environments "blindly" execute each newly collected malware sample for a predetermined amount of time (e.g., four to five minutes). However, a large majority of malware samples that are forced through sandbox execution are simply repackaged versions of previously seen (and already analyzed) malware. Consequently, a significant amount of time may be wasted in analyzing samples that do not generate new intelligence. In this paper, we propose MAXS, a novel probabilistic multi-hypothesis testing framework for scaling execution in malware analysis environments, including bare-metal execution environments. Our main goal is to automatically recognize whether a malware sample that is undergoing dynamic analysis has likely been seen before (e.g., in a "differently packed" form), and determine if we could therefore stop its execution early while avoiding loss of valuable malware intelligence (e.g., without missing DNS queries to never-before-seen malware command-and-control domains). We have tested our prototype implementation of MAXS over two large collections of malware execution traces obtained from two distinct production-level analysis environments. Our experimental results show that using MAXS we are able to reduce malware execution time by up to 50% in average, with less than 0.3% information loss. This roughly translates into the ability to double the capacity of malware sandbox environments, thus significantly optimizing the resources dedicated to malware execution and analysis. Our results are particularly important for bare-metal execution environments, in which it is not easy to leverage the economies of scale that characterize virtual-machine or emulation based malware sandboxes. For example, MAXS could be used to significantly cut the cost of bare-metal analysis environments by reducing the hardware resources needed to analyze a predetermined daily number of new malware samples.