Biblio
Network based attacks on ecommerce websites can have serious economic consequences. Hence, anomaly detection in dynamic network traffic has become an increasingly important research topic in recent years. This paper proposes a novel dynamic Graph and sparse Autoencoder based Anomaly Detection algorithm named GAAD. In GAAD, the network traffic over contiguous time intervals is first modelled as a series of dynamic bipartite graph increments. One mode projection is performed on each bipartite graph increment and the adjacency matrix derived. Columns of the resultant adjacency matrix are then used to train a sparse autoencoder to reconstruct it. The sum of squared errors between the reconstructed approximation and original adjacency matrix is then calculated. An online learning algorithm is then used to estimate a Gaussian distribution that models the error distribution. Outlier error values are deemed to represent anomalous traffic flows corresponding to possible attacks. In the experiment, a network emulator was used to generate representative ecommerce traffic flows over a time period of 225 minutes with five attacks injected, including SYN scans, host emulation and DDoS attacks. ROC curves were generated to investigate the influence of the autoencoder hyper-parameters. It was found that increasing the number of hidden nodes and their activation level, and increasing sparseness resulted in improved performance. Analysis showed that the sparse autoencoder was unable to encode the highly structured adjacency matrix structures associated with attacks, hence they were detected as anomalies. In contrast, SVD and variants, such as the compact matrix decomposition, were found to accurately encode the attack matrices, hence they went undetected.
Anomaly detection on security logs is receiving more and more attention. Authentication events are an important component of security logs, and being able to produce trustful and accurate predictions minimizes the effort of cyber-experts to stop false attacks. Observed events are classified into Normal, for legitimate user behavior, and Malicious, for malevolent actions. These classes are consistently excessively imbalanced which makes the classification problem harder; in the commonly used Los Alamos dataset, the malicious class comprises only 0.00033% of the total. This work proposes a novel method to extract advanced composite features, and a supervised learning technique for classifying authentication logs trustfully; the models are Random Forest, LogitBoost, Logistic Regression, and ultimately Majority Voting which leverages the predictions of the previous models and gives the final prediction for each authentication event. We measure the performance of our experiments by using the False Negative Rate and False Positive Rate. In overall we achieve 0 False Negative Rate (i.e. no attack was missed), and on average a False Positive Rate of 0.0019.
Resilient control systems should efficiently restore control into physical systems not only after the sabotage of themselves, but also after breaking physical systems. To enhance resilience of control systems, given an originally minimal-input controlled linear-time invariant(LTI) physical system, we address the problem of efficient control recovery into it after removing a known system vertex by finding the minimum number of inputs. According to the minimum input theorem, given a digraph embedded into LTI model and involving a precomputed maximum matching, this problem is modeled into recovering controllability of it after removing a known network vertex. Then, we recover controllability of the residual network by efficiently finding a maximum matching rather than recomputation. As a result, except for precomputing a maximum matching and the following removed vertex, the worst-case execution time of control recovery into the residual LTI physical system is linear.
Ubiquitous deployment of low-cost mobile positioning devices and the widespread use of high-speed wireless networks enable massive collection of large-scale trajectory data of individuals moving on road networks. Trajectory data mining finds numerous applications including understanding users' historical travel preferences and recommending places of interest to new visitors. Privacy-preserving trajectory mining is an important and challenging problem as exposure of sensitive location information in the trajectories can directly invade the location privacy of the users associated with the trajectories. In this paper, we propose a differentially private trajectory analysis algorithm for points-of-interest recommendation to users that aims at maximizing the accuracy of the recommendation results while protecting the privacy of the exposed trajectories with differential privacy guarantees. Our algorithm first transforms the raw trajectory dataset into a bipartite graph with nodes representing the users and the points-of-interest and the edges representing the visits made by the users to the locations, and then extracts the association matrix representing the bipartite graph to inject carefully calibrated noise to meet έ-differential privacy guarantees. A post-processing of the perturbed association matrix is performed to suppress noise prior to performing a Hyperlink-Induced Topic Search (HITS) on the transformed data that generates an ordered list of recommended points-of-interest. Extensive experiments on a real trajectory dataset show that our algorithm is efficient, scalable and demonstrates high recommendation accuracy while meeting the required differential privacy guarantees.
Traditional privacy-preserving data disclosure solutions have focused on protecting the privacy of individual's information with the assumption that all aggregate (statistical) information about individuals is safe for disclosure. Such schemes fail to support group privacy where aggregate information about a group of individuals may also be sensitive and users of the published data may have different levels of access privileges entitled to them. We propose the notion ofεg-Group Differential Privacy that protects sensitive information of groups of individuals at various defined privacy levels, enabling data users to obtain the level of access entitled to them. We present a preliminary evaluation of the proposed notion of group privacy through experiments on real association graph data that demonstrate the guarantees on group privacy on the disclosed data.
Detection of insider threats relies on monitoring individuals and their interactions with organizational resources. Identification of anomalous insiders typically relies on supervised learning models that use labeled data. However, such labeled data is not easily obtainable. The labeled data that does exist is also limited by current insider threat detection methods and undetected insiders would not be included. These models also inherently assume that the insider threat is not rapidly evolving between model generation and use of the model in detection. Yet there is a large body of research that illustrates that the insider threat changes significantly after some types of precipitating events, such as layoffs, significant restructuring, and plant or facility closure. To capture this temporal evolution of user-system interactions, we use an unsupervised learning framework to evaluate whether potential insider threat events are triggered following precipitating events. The analysis leverages a bipartite graph of user and system interactions. The approach shows a clear correlation between precipitating events and the number of apparent anomalies. The results of our empirical analysis show a clear shift in behaviors after events which have previously been shown to increase insider activity, specifically precipitating events. We argue that this metadata about the level of insider threat behaviors validates the potential of the approach. We apply our method to a dataset that comprises interactions between engineers and software components in an enterprise version control system spanning more than 22 years. We use this unlabeled dataset and automatically detect statistically significant events. We show that there is statistically significant evidence that a subset of users diversify their committing behavior after precipitating events have been announced. Although these findings do not constitute detection of insider threat events per se, they do identify patterns of potentially malicious high-risk insider behavior. They reinforce the idea that insider operations can be motivated by the insiders' environment. Our proposed framework outperforms algorithms based on naive random approaches and algorithms using volume dependent statistics. This graph mining technique has potential for early detection of insider threat behavior in user-system interactions independent of the volume of interactions. The proposed method also enables organizations without a corpus of identified insider threats to train its own anomaly detection system.
Data from cyber logs can often be represented as a bipartite graph (e.g. internal IP-external IP, user-application, or client-server). State-of-the-art graph based anomaly detection often generalizes across all types of graphs — namely bipartite and non-bipartite. This confounds the interpretation and use of specific graph features such as degree, page rank, and eigencentrality that can provide a security analyst with rapid situational awareness of their network. Furthermore, graph algorithms applied to data collected from large, distributed enterprise scale networks require accompanying methods that allow them to scale to the data collected. In this paper, we provide a novel, scalable, directional graph projection framework that operates on cyber logs that can be represented as bipartite graphs. This framework computes directional graph projections and identifies a set of interpretable graph features that describe anomalies within each partite.
In the age of Big Data, we are witnessing a huge proliferation of digital data capturing our lives and our surroundings. Data privacy is a critical barrier to data analytics and privacy-preserving data disclosure becomes a key aspect to leveraging large-scale data analytics due to serious privacy risks. Traditional privacy-preserving data publishing solutions have focused on protecting individual's private information while considering all aggregate information about individuals as safe for disclosure. This paper presents a new privacy-aware data disclosure scheme that considers group privacy requirements of individuals in bipartite association graph datasets (e.g., graphs that represent associations between entities such as customers and products bought from a pharmacy store) where even aggregate information about groups of individuals may be sensitive and need protection. We propose the notion of $ε$g-Group Differential Privacy that protects sensitive information of groups of individuals at various defined group protection levels, enabling data users to obtain the level of information entitled to them. Based on the notion of group privacy, we develop a suite of differentially private mechanisms that protect group privacy in bipartite association graphs at different group privacy levels based on specialization hierarchies. We evaluate our proposed techniques through extensive experiments on three real-world association graph datasets and our results demonstrate that the proposed techniques are effective, efficient and provide the required guarantees on group privacy.
Graph analysis can capture relationships between network entities and can be used to identify and rank anomalous hosts, users, or applications from various types of cyber logs. It is often the case that the data in the logs can be represented as a bipartite graph (e.g. internal IP-external IP, user-application, or client-server). State-of-the-art graph based anomaly detection often generalizes across all types of graphs — namely bipartite and non-bipartite. This confounds the interpretation and use of specific graph features such as degree, page rank, and eigencentrality that can provide a security analyst with situational awareness and even insights to potential attacks on enterprise scale networks. Furthermore, graph algorithms applied to data collected from large, distributed enterprise scale networks require accompanying methods that allow them to scale to the data collected. In this paper, we provide a novel, scalable, directional graph projection framework that operates on cyber logs that can be represented as bipartite graphs. We also present methodologies to further narrow returned results to anomalous/outlier cases that may be indicative of a cyber security event. This framework computes directional graph projections and identifies a set of interpretable graph features that describe anomalies within each partite.
While most organizations continue to invest in traditional network defences, a formidable security challenge has been brewing within their own boundaries. Malicious insiders with privileged access in the guise of a trusted source have carried out many attacks causing far reaching damage to financial stability, national security and brand reputation for both public and private sector organizations. Growing exposure and impact of the whistleblower community and concerns about job security with changing organizational dynamics has further aggravated this situation. The unpredictability of malicious attackers, as well as the complexity of malicious actions, necessitates the careful analysis of network, system and user parameters correlated with insider threat problem. Thus it creates a high dimensional, heterogeneous data analysis problem in isolating suspicious users. This research work proposes an insider threat detection framework, which utilizes the attributed graph clustering techniques and outlier ranking mechanism for enterprise users. Empirical results also confirm the effectiveness of the method by achieving the best area under curve value of 0.7648 for the receiver operating characteristic curve.
Address clustering tries to construct the one-to-many mapping from entities to addresses in the Bitcoin system. Simple heuristics based on the micro-structure of transactions have proved very effective in practice. In this paper we describe the primary reasons behind this effectiveness: address reuse, avoidable merging, super-clusters with high centrality,, the incremental growth of address clusters. We quantify their impact during Bitcoin's first seven years of existence.
Hash tables form a core component of many algorithms as well as network devices. Because of their large size, they often require a combined memory model, in which some of the elements are stored in a fast memory (for example, cache or on-chip SRAM) while others are stored in much slower memory (namely, the main memory or off-chip DRAM). This makes the implementation of real-life hash tables particularly delicate, as a suboptimal choice of the hashing scheme parameters may result in a higher average query time, and therefore in a lower throughput. In this paper, we focus on multiple-choice hash tables. Given the number of choices, we study the tradeoff between the load of a hash table and its average lookup time. The problem is solved by analyzing an equivalent problem: the expected maximum matching size of a random bipartite graph with a fixed left-side vertex degree. Given two choices, we provide exact results for any finite system, and also deduce asymptotic results as the fast memory size increases. In addition, we further consider other variants of this problem and model the impact of several parameters. Finally, we evaluate the performance of our models on Internet backbone traces, and illustrate the impact of the memories speed difference on the choice of parameters. In particular, we show that the common intuition of entirely avoiding slow memory accesses by using highly efficient schemes (namely, with many fast-memory choices) is not always optimal.