Visible to the public Biblio

Filters: Keyword is active learning  [Clear All Filters]
2023-01-20
Kim, Yeongwoo, Dán, György.  2022.  An Active Learning Approach to Dynamic Alert Prioritization for Real-time Situational Awareness. 2022 IEEE Conference on Communications and Network Security (CNS). :154–162.

Real-time situational awareness (SA) plays an essential role in accurate and timely incident response. Maintaining SA is, however, extremely costly due to excessive false alerts generated by intrusion detection systems, which require prioritization and manual investigation by security analysts. In this paper, we propose a novel approach to prioritizing alerts so as to maximize SA, by formulating the problem as that of active learning in a hidden Markov model (HMM). We propose to use the entropy of the belief of the security state as a proxy for the mean squared error (MSE) of the belief, and we develop two computationally tractable policies for choosing alerts to investigate that minimize the entropy, taking into account the potential uncertainty of the investigations' results. We use simulations to compare our policies to a variety of baseline policies. We find that our policies reduce the MSE of the belief of the security state by up to 50% compared to static baseline policies, and they are robust to high false alert rates and to the investigation errors.

2022-10-20
Boukela, Lynda, Zhang, Gongxuan, Yacoub, Meziane, Bouzefrane, Samia.  2021.  A near-autonomous and incremental intrusion detection system through active learning of known and unknown attacks. 2021 International Conference on Security, Pattern Analysis, and Cybernetics(SPAC). :374—379.
Intrusion detection is a traditional practice of security experts, however, there are several issues which still need to be tackled. Therefore, in this paper, after highlighting these issues, we present an architecture for a hybrid Intrusion Detection System (IDS) for an adaptive and incremental detection of both known and unknown attacks. The IDS is composed of supervised and unsupervised modules, namely, a Deep Neural Network (DNN) and the K-Nearest Neighbors (KNN) algorithm, respectively. The proposed system is near-autonomous since the intervention of the expert is minimized through the active learning (AL) approach. A query strategy for the labeling process is presented, it aims at teaching the supervised module to detect unknown attacks and improve the detection of the already-known attacks. This teaching is achieved through sliding windows (SW) in an incremental fashion where the DNN is retrained when the data is available over time, thus rendering the IDS adaptive to cope with the evolutionary aspect of the network traffic. A set of experiments was conducted on the CICIDS2017 dataset in order to evaluate the performance of the IDS, promising results were obtained.
2022-08-12
Berman, Maxwell, Adams, Stephen, Sherburne, Tim, Fleming, Cody, Beling, Peter.  2019.  Active Learning to Improve Static Analysis. 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). :1322–1327.
Static analysis tools are programs that run on source code prior to their compilation to binary executables and attempt to find flaws or defects in the code during the early stages of development. If left unresolved, these flaws could pose security risks. While numerous static analysis tools exist, there is no single tool that is optimal. Therefore, many static analysis tools are often used to analyze code. Further, some of the alerts generated by the static analysis tools are low-priority or false alarms. Machine learning algorithms have been developed to distinguish between true alerts and false alarms, however significant man hours need to be dedicated to labeling data sets for training. This study investigates the use of active learning to reduce the number of labeled alerts needed to adequately train a classifier. The numerical experiments demonstrate that a query by committee active learning algorithm can be utilized to significantly reduce the number of labeled alerts needed to achieve similar performance as a classifier trained on a data set of nearly 60,000 labeled alerts.
2022-07-13
Mennecozzi, Gian Marco, Hageman, Kaspar, Panum, Thomas Kobber, Türkmen, Ahmet, Mahmoud, Rasmi-Vlad, Pedersen, Jens Myrup.  2021.  Bridging the Gap: Adapting a Security Education Platform to a New Audience. 2021 IEEE Global Engineering Education Conference (EDUCON). :153—159.
The current supply of a highly specialized cyber security professionals cannot meet the demands for societies seeking digitization. To close the skill gap, there is a need for introducing students in higher education to cyber security, and to combine theoretical knowledge with practical skills. This paper presents how the cyber security training platform Haaukins, initially developed to increase interest and knowledge of cyber security among high school students, was further developed to support the need for training in higher education. Based on the differences between the existing and new target audiences, a set of design principles were derived which shaped the technical adjustments required to provide a suitable platform - mainly related to dynamic tooling, centralized access to exercises, and scalability of the platform to support courses running over longer periods of time. The implementation of these adjustments has led to a series of teaching sessions in various institutions of higher education, demonstrating the viability for Haaukins for the new target audience.
2022-05-03
Wang, Tingting, Zhao, Xufeng, Lv, Qiujian, Hu, Bo, Sun, Degang.  2021.  Density Weighted Diversity Based Query Strategy for Active Learning. 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD). :156—161.

Deep learning has made remarkable achievements in various domains. Active learning, which aims to reduce the budget for training a machine-learning model, is especially useful for the Deep learning tasks with the demand of a large number of labeled samples. Unfortunately, our empirical study finds that many of the active learning heuristics are not effective when applied to Deep learning models in batch settings. To tackle these limitations, we propose a density weighted diversity based query strategy (DWDS), which makes use of the geometry of the samples. Within a limited labeling budget, DWDS enhances model performance by querying labels for the new training samples with the maximum informativeness and representativeness. Furthermore, we propose a beam-search based method to obtain a good approximation to the optimum of such samples. Our experiments show that DWDS outperforms existing algorithms in Deep learning tasks.

2021-09-21
Chen, Chin-Wei, Su, Ching-Hung, Lee, Kun-Wei, Bair, Ping-Hao.  2020.  Malware Family Classification Using Active Learning by Learning. 2020 22nd International Conference on Advanced Communication Technology (ICACT). :590–595.
In the past few years, the malware industry has been thriving. Malware variants among the same malware family shared similar behavioural patterns or signatures reflecting their purpose. We propose an approach that combines support vector machine (SVM) classifiers and active learning by learning (ALBL) techniques to deal with insufficient labeled data in terms of the malware classification tasks. The proposed approach is evaluated with the malware family dataset from Microsoft Malware Classification Challenge (BIG 2015) on Kaggle. The results show that ALBL techniques can effectively boost the performance of our machine learning models and improve the quality of labeled samples.
2021-03-04
Wang, Y., Wang, Z., Xie, Z., Zhao, N., Chen, J., Zhang, W., Sui, K., Pei, D..  2020.  Practical and White-Box Anomaly Detection through Unsupervised and Active Learning. 2020 29th International Conference on Computer Communications and Networks (ICCCN). :1—9.

To ensure quality of service and user experience, large Internet companies often monitor various Key Performance Indicators (KPIs) of their systems so that they can detect anomalies and identify failure in real time. However, due to a large number of various KPIs and the lack of high-quality labels, existing KPI anomaly detection approaches either perform well only on certain types of KPIs or consume excessive resources. Therefore, to realize generic and practical KPI anomaly detection in the real world, we propose a KPI anomaly detection framework named iRRCF-Active, which contains an unsupervised and white-box anomaly detector based on Robust Random Cut Forest (RRCF), and an active learning component. Specifically, we novelly propose an improved RRCF (iRRCF) algorithm to overcome the drawbacks of applying original RRCF in KPI anomaly detection. Besides, we also incorporate the idea of active learning to make our model benefit from high-quality labels given by experienced operators. We conduct extensive experiments on a large-scale public dataset and a private dataset collected from a large commercial bank. The experimental resulta demonstrate that iRRCF-Active performs better than existing traditional statistical methods, unsupervised learning methods and supervised learning methods. Besides, each component in iRRCF-Active has also been demonstrated to be effective and indispensable.

2020-11-04
Švábenský, V., Vykopal, J..  2018.  Gathering Insights from Teenagers’ Hacking Experience with Authentic Cybersecurity Tools. 2018 IEEE Frontiers in Education Conference (FIE). :1—4.

This Work-In-Progress Paper for the Innovative Practice Category presents a novel experiment in active learning of cybersecurity. We introduced a new workshop on hacking for an existing science-popularizing program at our university. The workshop participants, 28 teenagers, played a cybersecurity game designed for training undergraduates and professionals in penetration testing. Unlike in learning environments that are simplified for young learners, the game features a realistic virtual network infrastructure. This allows exploring security tools in an authentic scenario, which is complemented by a background story. Our research aim is to examine how young players approach using cybersecurity tools by interacting with the professional game. A preliminary analysis of the game session showed several challenges that the workshop participants faced. Nevertheless, they reported learning about security tools and exploits, and 61% of them reported wanting to learn more about cybersecurity after the workshop. Our results support the notion that young learners should be allowed more hands-on experience with security topics, both in formal education and informal extracurricular events.

Thomas, L. J., Balders, M., Countney, Z., Zhong, C., Yao, J., Xu, C..  2019.  Cybersecurity Education: From Beginners to Advanced Players in Cybersecurity Competitions. 2019 IEEE International Conference on Intelligence and Security Informatics (ISI). :149—151.

Cybersecurity competitions have been shown to be an effective approach for promoting student engagement through active learning in cybersecurity. Players can gain hands-on experience in puzzle-based or capture-the-flag type tasks that promote learning. However, novice players with limited prior knowledge in cybersecurity usually found difficult to have a clue to solve a problem and get frustrated at the early stage. To enhance student engagement, it is important to study the experiences of novices to better understand their learning needs. To achieve this goal, we conducted a 4-month longitudinal case study which involves 11 undergraduate students participating in a college-level cybersecurity competition, National Cyber League (NCL) competition. The competition includes two individual games and one team game. Questionnaires and in-person interviews were conducted before and after each game to collect the players' feedback on their experience, learning challenges and needs, and information about their motivation, interests and confidence level. The collected data demonstrate that the primary concern going into these competitions stemmed from a lack of knowledge regarding cybersecurity concepts and tools. Players' interests and confidence can be increased by going through systematic training.

2020-07-20
Pengcheng, Li, Yi, Jinfeng, Zhang, Lijun.  2018.  Query-Efficient Black-Box Attack by Active Learning. 2018 IEEE International Conference on Data Mining (ICDM). :1200–1205.
Deep neural network (DNN) as a popular machine learning model is found to be vulnerable to adversarial attack. This attack constructs adversarial examples by adding small perturbations to the raw input, while appearing unmodified to human eyes but will be misclassified by a well-trained classifier. In this paper, we focus on the black-box attack setting where attackers have almost no access to the underlying models. To conduct black-box attack, a popular approach aims to train a substitute model based on the information queried from the target DNN. The substitute model can then be attacked using existing white-box attack approaches, and the generated adversarial examples will be used to attack the target DNN. Despite its encouraging results, this approach suffers from poor query efficiency, i.e., attackers usually needs to query a huge amount of times to collect enough information for training an accurate substitute model. To this end, we first utilize state-of-the-art white-box attack methods to generate samples for querying, and then introduce an active learning strategy to significantly reduce the number of queries needed. Besides, we also propose a diversity criterion to avoid the sampling bias. Our extensive experimental results on MNIST and CIFAR-10 show that the proposed method can reduce more than 90% of queries while preserve attacking success rates and obtain an accurate substitute model which is more than 85% similar with the target oracle.
2018-05-09
Abate, Alessandro.  2017.  Formal Verification of Complex Systems: Model-Based and Data-Driven Methods. Proceedings of the 15th ACM-IEEE International Conference on Formal Methods and Models for System Design. :91–93.

Two known shortcomings of standard techniques in formal verification are the limited capability to provide system-level assertions, and the scalability to large, complex models, such as those needed in Cyber-Physical Systems (CPS) applications. Leveraging data, which nowadays is becoming ever more accessible, has the potential to mitigate such limitations. However, this leads to a lack of formal proofs that are needed for modern safety-critical systems. This contribution presents a research initiative that addresses these shortcomings by bringing model-based techniques and data-driven methods together, which can help pushing the envelope of existing algorithms and tools in formal verification and thus expanding their applicability to complex engineering systems, such as CPS. In the first part of the contribution, we discuss a new, formal, measurement-driven and model-based automated technique, for the quantitative verification of physical systems with partly unknown dynamics. We formulate this setup as a data-driven Bayesian inference problem, formally embedded within a quantitative, model-based verification procedure. We argue that the approach can be applied to complex physical systems that are key for CPS applications, dealing with spatially continuous variables, evolving under complex dynamics, driven by external inputs, and accessed under noisy measurements. In the second part of the contribution, we concentrate on systems represented by models that evolve under probabilistic and heterogeneous (continuous/discrete - that is "hybrid" - as well as nonlinear) dynamics. Such stochastic hybrid models (also known as SHS) are a natural mathematical framework for CPS. With focus on model-based verification procedures, we provide algorithms for quantitative model checking of temporal specifications on SHS with formal guarantees. This is attained via the development of formal abstraction techniques that are based on quantitative approximations. Theory is complemented by algorithms, all packaged in software tools that are available to users, and which are applied here in the domain of Smart Energy.

2018-05-01
Wang, Weiyu, Zhu, Quanyan.  2017.  On the Detection of Adversarial Attacks Against Deep Neural Networks. Proceedings of the 2017 Workshop on Automated Decision Making for Active Cyber Defense. :27–30.

Deep learning model has been widely studied and proven to achieve high accuracy in various pattern recognition tasks, especially in image recognition. However, due to its non-linear architecture and high-dimensional inputs, its ill-posedness [1] towards adversarial perturbations-small deliberately crafted perturbations on the input will lead to completely different outputs, has also attracted researchers' attention. This work takes the traffic sign recognition system on the self-driving car as an example, and aims at designing an additional mechanism to improve the robustness of the recognition system. It uses a machine learning model which learns the results of the deep learning model's predictions, with human feedback as labels and provides the credibility of current prediction. The mechanism makes use of both the input image and the recognition result as sample space, querying a human user the True/False of current classification result the least number of times, and completing the task of detecting adversarial attacks.

2018-02-06
Huang, Lulu, Matwin, Stan, de Carvalho, Eder J., Minghim, Rosane.  2017.  Active Learning with Visualization for Text Data. Proceedings of the 2017 ACM Workshop on Exploratory Search and Interactive Data Analytics. :69–74.

Labeled datasets are always limited, and oftentimes the quantity of labeled data is a bottleneck for data analytics. This especially affects supervised machine learning methods, which require labels for models to learn from the labeled data. Active learning algorithms have been proposed to help achieve good analytic models with limited labeling efforts, by determining which additional instance labels will be most beneficial for learning for a given model. Active learning is consistent with interactive analytics as it proceeds in a cycle in which the unlabeled data is automatically explored. However, in active learning users have no control of the instances to be labeled, and for text data, the annotation interface is usually document only. Both of these constraints seem to affect the performance of an active learning model. We hypothesize that visualization techniques, particularly interactive ones, will help to address these constraints. In this paper, we implement a pilot study of visualization in active learning for text classification, with an interactive labeling interface. We compare the results of three experiments. Early results indicate that visualization improves high-performance machine learning model building with an active learning algorithm.

2018-01-10
Bhattacharjee, S. Das, Talukder, A., Al-Shaer, E., Doshi, P..  2017.  Prioritized active learning for malicious URL detection using weighted text-based features. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). :107–112.

Data analytics is being increasingly used in cyber-security problems, and found to be useful in cases where data volumes and heterogeneity make it cumbersome for manual assessment by security experts. In practical cyber-security scenarios involving data-driven analytics, obtaining data with annotations (i.e. ground-truth labels) is a challenging and known limiting factor for many supervised security analytics task. Significant portions of the large datasets typically remain unlabelled, as the task of annotation is extensively manual and requires a huge amount of expert intervention. In this paper, we propose an effective active learning approach that can efficiently address this limitation in a practical cyber-security problem of Phishing categorization, whereby we use a human-machine collaborative approach to design a semi-supervised solution. An initial classifier is learnt on a small amount of the annotated data which in an iterative manner, is then gradually updated by shortlisting only relevant samples from the large pool of unlabelled data that are most likely to influence the classifier performance fast. Prioritized Active Learning shows a significant promise to achieve faster convergence in terms of the classification performance in a batch learning framework, and thus requiring even lesser effort for human annotation. An useful feature weight update technique combined with active learning shows promising classification performance for categorizing Phishing/malicious URLs without requiring a large amount of annotated training samples to be available during training. In experiments with several collections of PhishMonger's Targeted Brand dataset, the proposed method shows significant improvement over the baseline by as much as 12%.

2017-09-15
Ghassemi, Mohsen, Sarwate, Anand D., Wright, Rebecca N..  2016.  Differentially Private Online Active Learning with Applications to Anomaly Detection. Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. :117–128.

In settings where data instances are generated sequentially or in streaming fashion, online learning algorithms can learn predictors using incremental training algorithms such as stochastic gradient descent. In some security applications such as training anomaly detectors, the data streams may consist of private information or transactions and the output of the learning algorithms may reveal information about the training data. Differential privacy is a framework for quantifying the privacy risk in such settings. This paper proposes two differentially private strategies to mitigate privacy risk when training a classifier for anomaly detection in an online setting. The first is to use a randomized active learning heuristic to screen out uninformative data points in the stream. The second is to use mini-batching to improve classifier performance. Experimental results show how these two strategies can trade off privacy, label complexity, and generalization performance.

Silva, Rodrigo M., Gomes, Guilherme C.M., Alvim, Mário S., Gonçalves, Marcos A..  2016.  Compression-Based Selective Sampling for Learning to Rank. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :247–256.

Learning to rank (L2R) algorithms use a labeled training set to generate a ranking model that can be later used to rank new query results. These training sets are very costly and laborious to produce, requiring human annotators to assess the relevance or order of the documents in relation to a query. Active learning (AL) algorithms are able to reduce the labeling effort by actively sampling an unlabeled set and choosing data instances that maximize the effectiveness of a learning function. But AL methods require constant supervision, as documents have to be labeled at each round of the process. In this paper, we propose that certain characteristics of unlabeled L2R datasets allow for an unsupervised, compression-based selection process to be used to create small and yet highly informative and effective initial sets that can later be labeled and used to bootstrap a L2R system. We implement our ideas through a novel unsupervised selective sampling method, which we call Cover, that has several advantages over AL methods tailored to L2R. First, it does not need an initial labeled seed set and can select documents from scratch. Second, selected documents do not need to be labeled as the iterations of the method progress since it is unsupervised (i.e., no learning model needs to be updated). Thus, an arbitrarily sized training set can be selected without human intervention depending on the available budget. Third, the method is efficient and can be run on unlabeled collections containing millions of query-document instances. We run various experiments with two important L2R benchmarking collections to show that the proposed method allows for the creation of small, yet very effective training sets. It achieves full training-like performance with less than 10% of the original sets selected, outperforming the baselines in both effectiveness and scalability.

2017-07-24
Ghassemi, Mohsen, Sarwate, Anand D., Wright, Rebecca N..  2016.  Differentially Private Online Active Learning with Applications to Anomaly Detection. Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. :117–128.

In settings where data instances are generated sequentially or in streaming fashion, online learning algorithms can learn predictors using incremental training algorithms such as stochastic gradient descent. In some security applications such as training anomaly detectors, the data streams may consist of private information or transactions and the output of the learning algorithms may reveal information about the training data. Differential privacy is a framework for quantifying the privacy risk in such settings. This paper proposes two differentially private strategies to mitigate privacy risk when training a classifier for anomaly detection in an online setting. The first is to use a randomized active learning heuristic to screen out uninformative data points in the stream. The second is to use mini-batching to improve classifier performance. Experimental results show how these two strategies can trade off privacy, label complexity, and generalization performance.