Biblio

Found 213 results

Filters: Keyword is static analysis

2019-06-10
Kim, H. M., Song, H. M., Seo, J. W., Kim, H. K..  2018.  Andro-Simnet: Android Malware Family Classification Using Social Network Analysis. 2018 16th Annual Conference on Privacy, Security and Trust (PST). :1-8.

While the rapid adoption of mobile devices makes our daily lives more convenient, the threat from malware has also increased. There is a great deal of research on detecting malware to protect mobile devices, but most of it adopts only signature-based detection methods that can be easily bypassed by polymorphic and metamorphic malware. To detect malware and its variants, it is essential to adopt behavior-based detection for efficient malware classification. This paper presents a system that classifies malware by using common behavioral characteristics along with malware families. We measure the similarity between malware families with carefully chosen features that commonly appear in the same family. With the proposed similarity measure, we can classify malware by its attack behavior pattern and tactical characteristics. Also, we apply a community detection algorithm to increase the modularity within each malware family network aggregation. To maintain high classification accuracy, we propose a process to derive the optimal weights of the selected features in the proposed similarity measure. During this process, we find out which features are significant for representing the similarity between malware samples. Finally, we provide an intuitive graph visualization of malware samples, which helps in understanding the distribution and likeness of the malware networks. In the experiment, the proposed system achieved 97% accuracy for malware classification and 95% accuracy for prediction by K-fold cross-validation using a real malware dataset.

Jiang, H., Turki, T., Wang, J. T. L..  2018.  DLGraph: Malware Detection Using Deep Learning and Graph Embedding. 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). :1029-1033.

In this paper we present a new approach, named DLGraph, for malware detection using deep learning and graph embedding. DLGraph employs two stacked denoising autoencoders (SDAs) for representation learning, taking into consideration computer programs' function-call graphs and Windows application programming interface (API) calls. Given a program, we first use a graph embedding technique that maps the program's function-call graph to a vector in a low-dimensional feature space. One SDA in our deep learning model is used to learn a latent representation of the embedded vector of the function-call graph. The other SDA in our model is used to learn a latent representation of the given program's Windows API calls. The two learned latent representations are then merged to form a combined feature vector. Finally, we use softmax regression to classify the combined feature vector for predicting whether the given program is malware or not. Experimental results based on different datasets demonstrate the effectiveness of the proposed approach and its superiority over a related method.

Jain, D., Khemani, S., Prasad, G..  2018.  Identification of Distributed Malware. 2018 IEEE 3rd International Conference on Communication and Information Systems (ICCIS). :242-246.

Smartphones have evolved over the years from simple communication devices into fully functional portable computers that, although comparatively less powerful, hold multiple applications. With the smartphone revolution, the value of personal data has increased. As technological complexities increase, so do the vulnerabilities in the system. Smartphones are the latest target for attacks. Android, being an open source platform and also the most widely used smartphone OS, draws the attention of many malware writers seeking to exploit its vulnerabilities. Attackers try to take advantage of these vulnerabilities to fool users and misuse their data. Malware has come a long way from simple worms to sophisticated DDoS attacks using botnets; the latest trends in computer malware tend toward distribution, to evade the multiple anti-virus apps developed to counter generic viruses and Trojans. However, the recent trend in the Android system is to have a combination of applications that act as malware. The applications are benign individually, but when grouped they may result in malicious activity. This paper proposes a new category of distributed malware in the Android system, and shows how it can be used to evade current security and how it can be detected with the help of a graph matching algorithm.

2019-03-04
Laverdière, M., Merlo, E..  2018.  Detection of protection-impacting changes during software evolution. 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). :434–444.

Role-Based Access Control (RBAC) is often used in web applications to restrict operations and protect security sensitive information and resources. Web applications regularly undergo maintenance and evolution and their security may be affected by source code changes between releases. To prevent security regression and vulnerabilities, developers have to take re-validation actions before deploying new releases. This may become a significant undertaking, especially when quick and repeated releases are sought. We define protection-impacting changes as those changed statements during evolution that alter privilege protection of some code. We propose an automated method that identifies protection-impacting changes within all changed statements between two versions. The proposed approach compares statically computed security protection models and repository information corresponding to different releases of a system to identify protection-impacting changes. Results of experiments present the occurrence of protection-impacting changes over 210 release pairs of WordPress, a PHP content management web application. First, we show that only 41% of the release pairs present protection-impacting changes. Second, for these affected release pairs, protection-impacting changes can be identified and represent a median of 47.00 lines of code, that is 27.41% of the total changed lines of code. Over all investigated releases in WordPress, protection-impacting changes amounted to 10.89% of changed lines of code. Conversely, an average of about 89% of changed source code have no impact on RBAC security and thus need no re-validation nor investigation. The proposed method reduces the amount of candidate causes of protection changes that developers need to investigate. This information could help developers re-validate application security, identify causes of negative security changes, and perform repairs in a more effective way.

Krishnamurthy, R., Meinel, M., Haupt, C., Schreiber, A., Mader, P..  2018.  DLR Secure Software Engineering. 2018 IEEE/ACM 1st International Workshop on Security Awareness from Design to Deployment (SEAD). :49–50.

DLR, as a research organization, increasingly faces the task of sharing its self-developed software with partners or publishing it openly. Hence, it is very important to harden the software to avoid opening attack vectors, especially since DLR software is typically not developed by software engineering or security experts. In this paper we describe the data-oriented approach of our newly founded secure software engineering group to improve the software development process towards more secure software. To that end, we look at the automated security evaluation of software as well as the possibilities for capturing information about the development process. Our aim is to use our information sources to improve software development processes to produce high quality secure software.

2019-02-22
Bakour, K., Ünver, H. M., Ghanem, R..  2018.  The Android Malware Static Analysis: Techniques, Limitations, and Open Challenges. 2018 3rd International Conference on Computer Science and Engineering (UBMK). :586-593.

This paper aims to explain static analysis techniques in detail, and to highlight the weaknesses and challenges they face. To this end, more than 80 static analysis-based frameworks have been studied, and in their light, the process of detecting malicious applications has been divided into four phases that are explained schematically. Also, the features that are used in static analysis are discussed in detail by dividing them into four categories, namely manifest-based features, code-based features, semantic features and app metadata-based features. The challenges facing methods based on static analysis are also discussed in detail. Finally, a case study was conducted to test the strength of some known commercial antivirus products and one of the state-of-the-art academic static analysis frameworks against obfuscation techniques used by developers of malicious applications. The results showed a significant impact on the performance of most of the tested antivirus products and frameworks, which reflects the urgent need for more accurate tools.

Gauthier, F., Keynes, N., Allen, N., Corney, D., Krishnan, P..  2018.  Scalable Static Analysis to Detect Security Vulnerabilities: Challenges and Solutions. 2018 IEEE Cybersecurity Development (SecDev). :134-134.

Parfait [1] is a static analysis tool originally developed to find implementation defects in C/C++ systems code. Parfait's focus is on providing both high precision (low false positives) and scalability to systems with millions of lines of code (typically requiring 10 minutes of analysis time per million lines). Parfait has since been extended to detect security vulnerabilities in applications code, supporting the Java EE and PL/SQL server stack. In this abstract we describe some of the challenges we encountered in this process, including some of the differences seen between the applications code being analysed, our solutions that enable us to analyse a variety of applications, and a summary of the challenges that remain.

Novikov, A. S., Ivutin, A. N., Troshina, A. G., Vasiliev, S. N..  2018.  Detecting the Use of Unsafe Data in Software of Embedded Systems by Means of Static Analysis Methodology. 2018 7th Mediterranean Conference on Embedded Computing (MECO). :1-4.

The article considers an approach to identifying potentially unsafe data in the program code of embedded systems, which can lead to errors and failures in the functioning of equipment. The sources of invalid data are identified, and the way the status of this data changes during static code analysis is shown. A mechanism for annotating functions that operate on unsafe data is described, which makes it possible to control the entire process of using such data and thus improves the quality of the output code.

Gao, Qing, Ma, Sen, Shao, Sihao, Sui, Yulei, Zhao, Guoliang, Ma, Luyao, Ma, Xiao, Duan, Fuyao, Deng, Xiao, Zhang, Shikun et al..  2018.  CoBOT: Static C/C++ Bug Detection in the Presence of Incomplete Code. Proceedings of the 26th Conference on Program Comprehension. :385-388.

To obtain precise and sound results, most existing static analyzers require whole program analysis with complete source code. However, in reality, the source code of an application always interacts with many third-party libraries, which are often not easily accessible to static analyzers. Worse still, more than 30% of legacy projects [1] cannot be compiled easily due to complicated configuration environments (e.g., third-party libraries, compiler options and macros), making ideal "whole-program analysis" unavailable in practice. This paper presents CoBOT [2], a static analysis tool that can detect bugs in the presence of incomplete code. It analyzes function APIs unavailable in application code by either using function summarization or automatically downloading and analyzing the corresponding library code as inferred from the application code and its configuration files. The experiments show that CoBOT is not only easy to use, but also effective in detecting bugs in real-world programs with incomplete code. Our demonstration video is at: https://youtu.be/bhjJp3e7LPM.

Querel, Louis-Philippe, Rigby, Peter C..  2018.  WarningsGuru: Integrating Statistical Bug Models with Static Analysis to Provide Timely and Specific Bug Warnings. Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. :892-895.

The detection of bugs in software systems has been divided into two research areas: static code analysis and statistical modeling of historical data. Static analysis indicates precise problems on line numbers but has the disadvantage of suggesting many warnings, which are often false positives. In contrast, statistical models use the history of the system to suggest which files or commits are likely to contain bugs. These coarse-grained predictions do not indicate to the developer the precise reasons for the bug prediction. We combine static analysis with statistical bug models to limit the number of warnings and provide specific warning information at the line level. Previous research was able to process only a limited number of releases; our tool, WarningsGuru, can analyze all commits in a source code repository, and we have currently processed thousands of commits and warnings. Since we process every commit, we present developers with more precise information about when a warning is introduced, allowing us to show recent warnings that are introduced in statistically risky commits. Results from two OSS projects show that CommitGuru's statistical model flags 25% and 29% of all commits as risky. When we combine this with static analysis in WarningsGuru, the number of risky commits with warnings is 20% for both projects and the number of commits with new warnings is only 3% and 6%. We can drastically reduce the number of commits and warnings developers have to examine. The tool, source code, and demo are available at https://github.com/louisq/warningsguru.

2019-02-14
Facon, A., Guilley, S., Lec'Hvien, M., Schaub, A., Souissi, Y..  2018.  Detecting Cache-Timing Vulnerabilities in Post-Quantum Cryptography Algorithms. 2018 IEEE 3rd International Verification and Security Workshop (IVSW). :7-12.

When implemented on real systems, cryptographic algorithms are vulnerable to attacks observing their execution behavior, such as cache-timing attacks. Designing protected implementations must be done with knowledge and validation tools as early as possible in the development cycle. In this article we propose a methodology to assess the robustness of the candidates for the NIST post-quantum standardization project to cache-timing attacks. To this end we have developed a dedicated vulnerability research tool. It performs a static analysis with taint propagation of sensitive variables across the source code and detects leakage patterns. We use it to assess the security of the NIST post-quantum cryptography project submissions. Our results show that more than 80% of the analyzed implementations have at least one potential flaw, and three submissions total more than 1000 reported flaws each. Finally, this comprehensive study of the competitors' security allows us to identify the most frequent weaknesses amongst candidates and how they might be fixed.

Raghothaman, Mukund, Kulkarni, Sulekha, Heo, Kihong, Naik, Mayur.  2018.  User-Guided Program Reasoning Using Bayesian Inference. Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. :722-735.

Program analyses necessarily make approximations that often lead them to report true alarms interspersed with many false alarms. We propose a new approach to leverage user feedback to guide program analyses towards true alarms and away from false alarms. Our approach associates each alarm with a confidence value by performing Bayesian inference on a probabilistic model derived from the analysis rules. In each iteration, the user inspects the alarm with the highest confidence and labels its ground truth, and the approach recomputes the confidences of the remaining alarms given this feedback. It thereby maximizes the return on the effort by the user in inspecting each alarm. We have implemented our approach in a tool named Bingo for program analyses expressed in Datalog. Experiments with real users and two sophisticated analyses (a static datarace analysis for Java programs and a static taint analysis for Android apps) show significant improvements on a range of metrics, including false alarm rates and number of bugs found.

2019-01-31
Muslukhov, Ildar, Boshmaf, Yazan, Beznosov, Konstantin.  2018.  Source Attribution of Cryptographic API Misuse in Android Applications. Proceedings of the 2018 on Asia Conference on Computer and Communications Security. :133–146.

Recent research suggests that 88% of Android applications that use Java cryptographic APIs make at least one mistake, which results in an insecure implementation. It is unclear, however, if these mistakes originate from code written by application or third-party library developers. Understanding the responsible party for a misuse case is important for vulnerability disclosure. In this paper, we bridge this knowledge gap and introduce source attribution to the analysis of cryptographic API misuse. We developed BinSight, a static program analyzer that supports source attribution, and we analyzed 132K Android applications collected in years 2012, 2015, and 2016. Our results suggest that third-party libraries are the main source of cryptographic API misuse. In particular, 90% of the violating applications, which contain at least one call-site to Java cryptographic API, originate from libraries. When compared to 2012, we found the use of ECB mode for symmetric ciphers has significantly decreased in 2016, for both application and third-party library code. Unlike application code, however, third-party libraries have significantly increased their reliance on static encryption keys for symmetric ciphers and static IVs for CBC mode ciphers. Finally, we found that the insecure RC4 and DES ciphers were the second and the third most used ciphers in 2016.

2019-01-21
Chernis, Boris, Verma, Rakesh.  2018.  Machine Learning Methods for Software Vulnerability Detection. Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics. :31–39.

Software vulnerabilities are a primary concern in the IT security industry, as malicious hackers who discover these vulnerabilities can often exploit them for nefarious purposes. However, complex programs, particularly those written in a relatively low-level language like C, are difficult to fully scan for bugs, even when both manual and automated techniques are used. Since analyzing code and making sure it is securely written is proven to be a non-trivial task, both static analysis and dynamic analysis techniques have been heavily investigated, and this work focuses on the former. The contribution of this paper is a demonstration of how it is possible to catch a large percentage of bugs by extracting text features from functions in C source code and analyzing them with a machine learning classifier. Relatively simple features (character count, character diversity, entropy, maximum nesting depth, arrow count, "if" count, "if" complexity, "while" count, and "for" count) were extracted from these functions, and so were complex features (character n-grams, word n-grams, and suffix trees). The simple features performed unexpectedly better compared to the complex features (74% accuracy compared to 69% accuracy).
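
The simple text features listed in this abstract can be illustrated with a minimal Python sketch. The abstract does not define the features precisely, so the definitions here are assumptions: nesting depth is counted from braces, entropy is Shannon entropy over characters, and "if" complexity is left out because its definition is not given. The function name `simple_features` is hypothetical.

```python
import math
import re
from collections import Counter


def simple_features(func_src: str) -> dict:
    """Compute simple text features for the source of one C function.

    Feature definitions are illustrative assumptions, not the paper's.
    """
    counts = Counter(func_src)
    total = len(func_src)

    # Shannon entropy (bits per character) of the character distribution.
    entropy = 0.0
    if total:
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

    # Maximum brace nesting depth (ignores braces inside strings/comments).
    depth = max_depth = 0
    for ch in func_src:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)

    return {
        "char_count": total,
        "char_diversity": len(counts),  # number of distinct characters
        "entropy": entropy,
        "max_nesting_depth": max_depth,
        "arrow_count": func_src.count("->"),
        "if_count": len(re.findall(r"\bif\b", func_src)),
        "while_count": len(re.findall(r"\bwhile\b", func_src)),
        "for_count": len(re.findall(r"\bfor\b", func_src)),
    }
```

Feature dictionaries of this kind could then be fed to any standard classifier; the abstract does not say which classifier the authors used.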

Umar, K., Sultan, A. B., Zulzalil, H., Admodisastro, N., Abdullah, M. T..  2018.  Formulation of SQL Injection Vulnerability Detection as Grammar Reachability Problem. 2018 International Conference on Information and Communication Technology for the Muslim World (ICT4M). :179–184.

Data dependency flow has been reformulated as a Context Free Grammar (CFG) reachability problem, and the idea was explored in the detection of some web vulnerabilities, particularly Cross Site Scripting (XSS) and Access Control. However, reformulation of SQL Injection Vulnerability (SQLIV) detection as a grammar reachability problem has not been investigated. In this paper, concepts of data dependency flow were used to reformulate SQLIV detection as a CFG reachability problem. The paper consequently defines a reachability analysis strategy for SQLIV detection.

2018-06-20
Lee, Y., Choi, S. S., Choi, J., Song, J..  2017.  A Lightweight Malware Classification Method Based on Detection Results of Anti-Virus Software. 2017 12th Asia Joint Conference on Information Security (AsiaJCIS). :5–9.

With the development of cyber threats on the Internet, the amount of malware, especially unknown malware, is also dramatically increasing. Since analysts cannot analyze all malware, it is very important to identify new malware that they should analyze. To cope with this issue, existing approaches have focused on malware classification using static or dynamic analysis results. However, static and dynamic analyses are themselves costly, and it is not easy to build isolated, secure and Internet-like analysis environments such as sandboxes. In this paper, we propose a lightweight malware classification method based on detection results of anti-virus software. Since the proposed method can reduce the volume of malware that should be analyzed by analysts, it can be used as a preprocessing step for in-depth analysis of malware. The experiment showed that the proposed method succeeded in classifying 1,000 malware samples into 187 unique groups. This means that 81% of the original malware samples do not need to be analyzed by analysts.
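
The reduction figure in this abstract follows from the group count: if one representative per group is analyzed, (1000 - 187) / 1000 ≈ 81% of samples can be skipped. The sketch below shows that arithmetic alongside one possible grouping rule. The abstract does not state how groups are formed, so grouping samples whose anti-virus labels are identical is purely an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def group_by_detections(detections: dict) -> list:
    """Group samples that received identical labels from every engine.

    detections maps sample id -> {engine name: detection label}.
    Identical-label grouping is an illustrative assumption.
    """
    groups = defaultdict(list)
    for sample, labels in detections.items():
        key = tuple(sorted(labels.items()))
        groups[key].append(sample)
    return list(groups.values())


def reduction_ratio(n_samples: int, n_groups: int) -> float:
    """Fraction of samples skipped if one sample per group is analyzed."""
    return (n_samples - n_groups) / n_samples
```

With the numbers reported in the abstract, `reduction_ratio(1000, 187)` evaluates to 0.813, which matches the stated 81%.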

Hassen, M., Carvalho, M. M., Chan, P. K..  2017.  Malware classification using static analysis based features. 2017 IEEE Symposium Series on Computational Intelligence (SSCI). :1–7.

Anti-virus vendors receive hundreds of thousands of malware to be analysed each day. Some are new malware while others are variations or evolutions of existing malware. Because analyzing each malware sample by hand is impossible, automated techniques to analyse and categorize incoming samples are needed. In this work, we explore various machine learning features extracted from malware samples through static analysis for classification of malware binaries into already known malware families. We present a new feature based on control statement shingling that has a comparable accuracy to ordinary opcode n-gram based features while requiring smaller dimensions. This, in turn, results in a shorter training time.

Aslanyan, H., Avetisyan, A., Arutunian, M., Keropyan, G., Kurmangaleev, S., Vardanyan, V..  2017.  Scalable Framework for Accurate Binary Code Comparison. 2017 Ivannikov ISPRAS Open Conference (ISPRAS). :34–38.

Comparison of two binary files has many practical applications: the ability to detect programmatic changes between two versions, the ability to find old versions of statically linked libraries to prevent the use of well-known bugs, malware analysis, etc. In this article, a framework for comparison of binary files is presented. The framework uses the IdaPro [1] disassembler and the Binnavi [2] platform to recover the structure of the target program and represent it as a call graph (CG). A program dependence graph (PDG) corresponds to each vertex of the CG. The proposed comparison algorithm consists of two main stages. At the first stage, several heuristics are applied to find the exact matches. Two functions are matched if at least one of the calculated heuristics is the same and unique in both binaries. At the second stage, backward and forward slicing is applied on matched vertices of the CG to find further matches. According to empirical results, the heuristic method is effective and has high matching quality for unchanged or slightly modified functions. In contrast, to match heavily modified functions, binary code clone detection is used, based on finding the maximum common subgraph for a pair of PDGs. To achieve high performance on extensive binaries, the whole matching process is parallelized. The framework is tested on a number of real-world libraries, such as python, openssh, openssl, libxml2, rsync, php, etc. Results show that in most cases more than 95% of functions are correctly matched. The tool is scalable due to parallelization of the function matching process and of the generation of PDGs and CGs.
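
The first-stage rule described in this abstract (two functions match if some heuristic value is the same and unique in both binaries) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the heuristic names, the input layout and the function `match_unique` are hypothetical, and the slicing and PDG stages are not shown.

```python
from collections import defaultdict


def _index(funcs: dict, heuristic: str) -> dict:
    """Map each value of one heuristic to the functions that have it."""
    index = defaultdict(list)
    for name, values in funcs.items():
        if heuristic in values:
            index[values[heuristic]].append(name)
    return index


def match_unique(funcs_a: dict, funcs_b: dict) -> dict:
    """Match functions whose heuristic value is unique in both binaries.

    funcs_a / funcs_b map function name -> {heuristic name: value}.
    Returns a dict from functions in A to their matches in B.
    """
    matches = {}
    used_b = set()
    heuristics = sorted({h for values in funcs_a.values() for h in values})
    for h in heuristics:
        index_a = _index(funcs_a, h)
        index_b = _index(funcs_b, h)
        for value, names_a in index_a.items():
            names_b = index_b.get(value, [])
            # The value must identify exactly one function on each side.
            if len(names_a) == 1 and len(names_b) == 1:
                a, b = names_a[0], names_b[0]
                if a not in matches and b not in used_b:
                    matches[a] = b
                    used_b.add(b)
    return matches
```

Functions left unmatched by this pass would be candidates for the later stages the abstract mentions.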

2018-06-07
Obster, M., Kowalewski, S..  2017.  A live static code analysis architecture for PLC software. 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA). :1–4.

Static code analysis is a convenient technique to support the development of software. Without prior test setup, information about a later runtime behavior can be inferred and errors in the code can be found before using a regular compiler. Solutions to apply static code analysis to PLC software following the IEC 61131-3 already exist, but using these separate tools usually creates a gap in the development process. In this paper we introduce an architecture to use static analysis directly in a development environment and give instant feedback to the developer while he is still editing the PLC software.

Kübler, Florian, Müller, Patrick, Hermann, Ben.  2017.  SootKeeper: Runtime Reusability for Modular Static Analysis. Proceedings of the 6th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis. :19–24.

In order to achieve higher reusability and testability, static analyses are increasingly being built as modular pipelines of analysis components. However, to build, debug, test, and evaluate these components, the complete pipeline has to be executed every time. This process recomputes intermediate results which have already been computed in a previous run but are lost because the preceding process ended and removed them from memory. We propose to leverage runtime reusability for static analysis pipelines and introduce SootKeeper, a framework to modularize static analyses into OSGi (Open Service Gateway initiative) bundles, which takes care of the automatic caching of intermediate results. Little to no change to the original analysis is necessary to use SootKeeper, while the execution of code-build-debug cycles or evaluation pipelines is sped up significantly.

Tymchuk, Yuriy, Ghafari, Mohammad, Nierstrasz, Oscar.  2017.  Renraku: The One Static Analysis Model to Rule Them All. Proceedings of the 12th Edition of the International Workshop on Smalltalk Technologies. :13:1–13:10.

Most static analyzers are monolithic applications that define their own ways to analyze source code and present the results. Therefore, aggregating multiple static analyzers into a single tool or integrating a new analyzer into existing tools requires a significant amount of effort. Over the last few years, we cultivated Renraku, a static analysis model that acts as a mediator between the static analyzers and the tools that present the reports. When used by both analysis and tool developers, this single quality model can reduce the cost to both introduce a new type of analysis to existing tools and create a tool that relies on existing analyzers.

Nashaat, M., Ali, K., Miller, J..  2017.  Detecting Security Vulnerabilities in Object-Oriented PHP Programs. 2017 IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM). :159–164.

PHP is one of the most popular web development tools in use today. A major concern though is the improper and insecure uses of the language by application developers, motivating the development of various static analyses that detect security vulnerabilities in PHP programs. However, many of these approaches do not handle recent, important PHP features such as object orientation, which greatly limits the use of such approaches in practice. In this paper, we present OOPIXY, a security analysis tool that extends the PHP security analyzer PIXY to support reasoning about object-oriented features in PHP applications. Our empirical evaluation shows that OOPIXY detects 88% of security vulnerabilities found in micro benchmarks. When used on real-world PHP applications, OOPIXY detects security vulnerabilities that could not be detected using state-of-the-art tools, retaining a high level of precision. We have contacted the maintainers of those applications, and two applications' development teams verified the correctness of our findings. They are currently working on fixing the bugs that lead to those vulnerabilities.

2018-05-24
Kwon, Y., Kim, H. K., Koumadi, K. M., Lim, Y. H., Lim, J. I..  2017.  Automated Vulnerability Analysis Technique for Smart Grid Infrastructure. 2017 IEEE Power Energy Society Innovative Smart Grid Technologies Conference (ISGT). :1–5.

A smart grid is a fully automated electric power network that operates, protects and controls all the physical environments of the power infrastructure so that it can supply energy in an efficient and reliable way. As the importance of cyber-physical system (CPS) security is growing, various vulnerability analysis methodologies for general systems have been suggested, whereas there has been little practical research targeting the smart grid infrastructure. In this paper, we highlight the significance of security vulnerability analysis in the smart grid environment. Then we introduce various automated techniques for vulnerability analysis of executable files. In our approach, we propose a novel binary-based vulnerability discovery method for AMI and EV charging systems to automatically extract security-related features from the embedded software. Finally, we present the test results of vulnerability discovery applied to AMI and EV charging systems in the Korean smart grid environment.

Johnson, Claiborne, MacGahan, Thomas, Heaps, John, Baldor, Kevin, von Ronne, Jeffery, Niu, Jianwei.  2017.  Verifiable Assume-Guarantee Privacy Specifications for Actor Component Architectures. Proceedings of the 22nd ACM on Symposium on Access Control Models and Technologies. :167–178.

Many organizations process personal information in the course of normal operations. Improper disclosure of this information can be damaging, so organizations must obey privacy laws and regulations that impose restrictions on its release or risk penalties. Since electronic management of personal information must be held in strict compliance with the law, software systems designed for such purposes must have some guarantee of compliance. To support this, we develop a general methodology for designing and implementing verifiable information systems. This paper develops the design of the History Aware Programming Language into a framework for creating systems that can be mechanically checked against privacy specifications. We apply this framework to create and verify a prototypical Electronic Medical Record System (EMRS) expressed as a set of actor components and first-order linear temporal logic specifications in assume-guarantee form. We then show that the implementation of the EMRS provably enforces a formalized Health Insurance Portability and Accountability Act (HIPAA) policy using a combination of model checking and static analysis techniques.

2018-05-09
Zeng, Y. G..  2017.  Identifying Email Threats Using Predictive Analysis. 2017 International Conference on Cyber Security And Protection Of Digital Services (Cyber Security). :1–2.

Malicious emails pose substantial threats to businesses. Whether it is a malware attachment or a URL leading to malware, exploitation or phishing, attackers have been employing emails as an effective way to gain a foothold inside organizations of all kinds. To combat email threats, especially targeted attacks, traditional signature- and rule-based email filtering as well as advanced sandboxing technology both have their own weaknesses. In this paper, we propose a predictive analysis approach that learns the differences between legit and malicious emails through static analysis, creates a machine learning model and makes detection and prediction on unseen emails effectively and efficiently. By comparing three different machine learning algorithms, our preliminary evaluation reveals that a Random Forests model performs the best.