Biblio
The advanced persistent threat (APT) landscape has been studied without quantifiable data, for which indicators of compromise (IoC) may be uniformly analyzed, replicated, or used to support security mechanisms. This work culminates extensive academic and industry APT analysis, not as an incremental step in existing approaches to APT detection, but as a new benchmark of APT related opportunity. We collect 15,259 APT IoC hashes, retrieving subsequent sandbox execution logs across 41 different file types. This work forms an initial focus on Windows-based threat detection. We present a novel Windows APT executable (APT-EXE) dataset, made available to the research community. Manual and statistical analysis of the APT-EXE dataset is conducted, along with supporting feature analysis. We draw upon repeat and common APT paths access, file types, and operations within the APT-EXE dataset to generalize APT execution footprints. A baseline case analysis successfully identifies a majority of 117 of 152 live APT samples from campaigns across 2018 and 2019.
The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures.
Anonymity networks provide privacy to the users by relaying their data to multiple destinations in order to reach the final destination anonymously. Multilayer of encryption is used to protect the users' privacy from attacks or even from the operators of the stations. In this research, we showed how flow analysis could be used to identify encrypted anonymity network traffic under four scenarios: (i) Identifying anonymity networks compared to normal background traffic; (ii) Identifying the type of applications used on the anonymity networks; (iii) Identifying traffic flow behaviors of the anonymity network users; and (iv) Identifying / profiling the users on an anonymity network based on the traffic flow behavior. In order to study these, we employ a machine learning based flow analysis approach and explore how far we can push such an approach.
Event discovery from single pictures is a challenging problem that has raised significant interest in the last decade. During this time, a number of interesting solutions have been proposed to tackle event discovery in still images. However, a large scale benchmarking image dataset for the evaluation and comparison of event discovery algorithms from single images is still lagging behind. To this aim, in this paper we provide a large-scale properly annotated and balanced dataset of 490,000 images, covering every aspect of 14 different types of social events, selected among the most shared ones in the social network. Such a large scale collection of event-related images is intended to become a powerful support tool for the research community in multimedia analysis by providing a common benchmark for training, testing, validation and comparison of existing and novel algorithms. In this paper, we provide a detailed description of how the dataset is collected, organized and how it can be beneficial for the researchers in the multimedia analysis domain. Moreover, a deep learning based approach is introduced into event discovery from single images as one of the possible applications of this dataset with a belief that deep learning can prove to be a breakthrough also in this research area. By providing this dataset, we hope to gather research community in the multimedia and signal processing domains to advance this application.
A new dataset is presented composed of music identification matches from Gracenote, a leading global music metadata company. Matches from January 1, 2014 to December 31, 2014 have been curated and made available as a public dataset called Gracenote Music Identification 2014, or GNMID14, at the following address: https://developer.gracenote.com/mid2014. This collection is the first significant music identification dataset and one of the largest music related datasets available containing more than 110M matches in 224 countries for 3M unique tracks, and 509K unique artists. It features geotemporal information (i.e. country and match date), genre and mood metadata. In this paper, we characterize the dataset and demonstrate its utility for Information Retrieval (IR) research.