Biblio
Filters: Keyword is Tools [Clear All Filters]
A New Approach to Use Big Data Tools to Substitute Unstructured Data Warehouse. 2020 IEEE Conference on Big Data and Analytics (ICBDA). :26–31.
.
2020. Data warehouse and big data have become the trend to help organise data effectively. Business data are originating in various kinds of sources with different forms from conventional structured data to unstructured data, it is the input for producing useful information essential for business sustainability. This research will navigate through the complicated designs of the common big data and data warehousing technologies to propose an effective approach to use these technologies for designing and building an unstructured textual data warehouse, a crucial and essential tool for most enterprises nowadays for decision making and gaining business competitive advantages. In this research, we utilised the IBM BigInsights Text Analytics, PostgreSQL, and Pentaho tools, an unstructured data warehouse is implemented and worked excellently with the unstructured text from Amazon review datasets, the new proposed approach creates a practical solution for building an unstructured data warehouse.
A Blockchain-Based Testing Approach for Collaborative Software Development. 2020 IEEE International Conference on Blockchain (Blockchain). :98–105.
.
2020. Development of large-scale and complex software systems requires multiple teams, including software development teams, domain experts, user representatives, and other project stakeholders, to work collaboratively to achieve software development goals. These teams rely on the use of agreed software development processes, knowledge management tools, and communication channels collaboratively in the software development project. Software testing is an important and complicated process due to reasons such as difficulties in achieving testing goals with the given time constraint, absence of efficient data sharing policies, vague testing acceptance criteria at various levels of testing, and lack of trusted coordination among the teams involved in software testing. The efficiency of the software testing relies on efficient, reliable, and trusted information sharing among these teams. Existing approaches to software testing for collaborative software development use centralized or decentralize tools for software testing, knowledge management, and communication channels. Existing approaches have the limitations of centralized authority, a single point of failure/compromise, lack of automatic requirement compliance checking and transparency in information sharing, and lack of unified data sharing policy, and reliable knowledge management repositories for sharing and storing past software testing artifacts and data. In this paper, a software testing approach for collaborative software development using private blockchain is presented, and the desirable properties of private blockchain, such as distributed data management, tamper-resistance, auditability and automatic requirement compliance checking, are incorporated to greatly improve the quality of software testing for collaborative software development.
Identifying Vulnerable IoT Applications Using Deep Learning. 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). :582–586.
.
2020. This paper presents an approach for the identification of vulnerable IoT applications using deep learning algorithms. The approach focuses on a category of vulnerabilities that leads to sensitive information leakage which can be identified using taint flow analysis. First, we analyze the source code of IoT apps in order to recover tokens along their frequencies and tainted flows. Second, we develop, Token2Vec, which transforms the source code tokens into vectors. We have also developed Flow2Vec, which transforms the identified tainted flows into vectors. Third, we use the recovered vectors to train a deep learning algorithm to build a model for the identification of tainted apps. We have evaluated the approach on two datasets and the experiments show that the proposed approach of combining tainted flows features with the base benchmark that uses token frequencies only, has improved the accuracy of the prediction models from 77.78% to 92.59% for Corpus1 and 61.11% to 87.03% for Corpus2.
Analyzing Cryptographic API Usages for Android Applications Using HMM and N-Gram. 2020 International Symposium on Theoretical Aspects of Software Engineering (TASE). :153–160.
.
2020. A recent research shows that 88 % of Android applications that use cryptographic APIs make at least one mistake. For this reason, several tools have been proposed to detect crypto API misuses, such as CryptoLint, CMA, and CogniCryptSAsT. However, these tools depend heavily on manually designed rules, which require much cryptographic knowledge and could be error-prone. In this paper, we propose an approach based on probabilistic models, namely, hidden Markov model and n-gram model, to analyzing crypto API usages in Android applications. The difficulty lies in that crypto APIs are sensitive to not only API orders, but also their arguments. To address this, we have created a dataset consisting of crypto API sequences with arguments, wherein symbolic execution is performed. Finally, we have also conducted some experiments on our models, which shows that ( i) our models are effective in capturing the usages, detecting and locating the misuses; (ii) our models perform better than the ones without symbolic execution, especially in misuse detection; and (iii) compared with CogniCryptSAsT, our models can detect several new misuses.
Compositional Taint Analysis of Native Codes for Security Vetting of Android Applications. 2020 10th International Conference on Computer and Knowledge Engineering (ICCKE). :567–572.
.
2020. Security vetting of Android applications is one of the crucial aspects of the Android ecosystem. Regarding the state of the art tools for this goal, most of them doesn't consider analyzing native codes and only analyze the Java code. However, Android concedes its developers to implement a part or all of their applications using C or C++ code. Thus, applying conservative manners for analyzing Android applications while ignoring native codes would lead to less precision in results. Few works have tried to analyze Android native codes, but only JN-SAF has applied taint analysis using static techniques such as symbolic execution. However, symbolic execution has some problems when is used in large programs. One of these problems is the exponential growth of program paths that would raise the path explosion issue. In this work, we have tried to alleviate this issue by introducing our new tool named CTAN. CTAN applies new symbolic execution methods to angr in a particular way that it can make JN-SAF more efficient and faster. We have introduced compositional taint analysis in CTAN by combining satisfiability modulo theories with symbolic execution. Our experiments show that CTAN is 26 percent faster than its previous work JN-SAF and it also leads to more precision by detecting more data-leakage in large Android native codes.
Scaling Application-Level Dynamic Taint Analysis to Enterprise-Scale Distributed Systems. 2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). :270–271.
.
2020. With the increasing deployment of enterprise-scale distributed systems, effective and practical defenses for such systems against various security vulnerabilities such as sensitive data leaks are urgently needed. However, most existing solutions are limited to centralized programs. For real-world distributed systems which are of large scales, current solutions commonly face one or more of scalability, applicability, and portability challenges. To overcome these challenges, we develop a novel dynamic taint analysis for enterprise-scale distributed systems. To achieve scalability, we use a multi-phase analysis strategy to reduce the overall cost. We infer implicit dependencies via partial-ordering method events in distributed programs to address the applicability challenge. To achieve greater portability, the analysis is designed to work at an application level without customizing platforms. Empirical results have shown promising scalability and capabilities of our approach.
Neutaint: Efficient Dynamic Taint Analysis with Neural Networks. 2020 IEEE Symposium on Security and Privacy (SP). :1527–1543.
.
2020. Dynamic taint analysis (DTA) is widely used by various applications to track information flow during runtime execution. Existing DTA techniques use rule-based taint-propagation, which is neither accurate (i.e., high false positive rate) nor efficient (i.e., large runtime overhead). It is hard to specify taint rules for each operation while covering all corner cases correctly. Moreover, the overtaint and undertaint errors can accumulate during the propagation of taint information across multiple operations. Finally, rule-based propagation requires each operation to be inspected before applying the appropriate rules resulting in prohibitive performance overhead on large real-world applications.In this work, we propose Neutaint, a novel end-to-end approach to track information flow using neural program embeddings. The neural program embeddings model the target's programs computations taking place between taint sources and sinks, which automatically learns the information flow by observing a diverse set of execution traces. To perform lightweight and precise information flow analysis, we utilize saliency maps to reason about most influential sources for different sinks. Neutaint constructs two saliency maps, a popular machine learning approach to influence analysis, to summarize both coarse-grained and fine-grained information flow in the neural program embeddings.We compare Neutaint with 3 state-of-the-art dynamic taint analysis tools. The evaluation results show that Neutaint can achieve 68% accuracy, on average, which is 10% improvement while reducing 40× runtime overhead over the second-best taint tool Libdft on 6 real world programs. Neutaint also achieves 61% more edge coverage when used for taint-guided fuzzing indicating the effectiveness of the identified influential bytes. We also evaluate Neutaint's ability to detect real world software attacks. The results show that Neutaint can successfully detect different types of vulnerabilities including buffer/heap/integer overflows, division by zero, etc. Lastly, Neutaint can detect 98.7% of total flows, the highest among all taint analysis tools.
Situational Awareness in Distribution Grid Using Micro-PMU Data: A Machine Learning Approach. 2020 IEEE Power Energy Society General Meeting (PESGM). :1–1.
.
2020. The recent development of distribution-level phasor measurement units, a.k.a. micro-PMUs, has been an important step towards achieving situational awareness in power distribution networks. The challenge however is to transform the large amount of data that is generated by micro-PMUs to actionable information and then match the information to use cases with practical value to system operators. This open problem is addressed in this paper. First, we introduce a novel data-driven event detection technique to pick out valuable portion of data from extremely large raw micro-PMU data. Subsequently, a datadriven event classifier is developed to effectively classify power quality events. Importantly, we use field expert knowledge and utility records to conduct an extensive data-driven event labeling. Moreover, certain aspects from event detection analysis are adopted as additional features to be fed into the classifier model. In this regard, a multi-class support vector machine (multi-SVM) classifier is trained and tested over 15 days of real-world data from two micro-PMUs on a distribution feeder in Riverside, CA. In total, we analyze 1.2 billion measurement points, and 10,700 events. The effectiveness of the developed event classifier is compared with prevalent multi-class classification methods, including k-nearest neighbor method as well as decision-tree method. Importantly, two real-world use-cases are presented for the proposed data analytics tools, including remote asset monitoring and distribution-level oscillation analysis.
Identification and Mitigation Tool For Cross-Site Request Forgery (CSRF). 2020 IEEE 8th R10 Humanitarian Technology Conference (R10-HTC). :1–5.
.
2020. Most organizations use web applications for sharing resources and communication via the internet and information security is one of the biggest concerns in most organizations. Web applications are becoming vulnerable to threats and malicious attacks every day, which lead to violation of confidentiality, integrity, and availability of information assets.We have proposed and implemented a new automated tool for the identification and mitigation of Cross-Site Request Forgery (CSRF) vulnerability. A secret token pattern based has been used in the automated tool, which applies effective security mechanism on PHP based web applications, without damaging the content and its functionalities, where the authenticated users can perform web activities securely.
Technical Threat Intelligence Analytics: What and How to Visualize for Analytic Process. 2020 24th International Conference Electronics. :1–4.
.
2020. Visual Analytics uses data visualization techniques for enabling compelling data analysis by engaging graphical and visual portrayal. In the domain of cybersecurity, convincing visual representation of data enables to ascertain valuable observations that allow the domain experts to construct efficient cyberattack mitigation strategies and provide useful decision support. We present a survey of visual analytics tools and methods in the domain of cybersecurity. We explore and discuss Technical Threat Intelligence visualization tools using the Five Question Method. We conclude the analysis of the works using Moody's Physics of Notations, and VIS4ML ontology as a methodological background of visual analytics process. We summarize our analysis as a high-level model of visual analytics for cybersecurity threat analysis.
Tactical Provenance Analysis for Endpoint Detection and Response Systems. 2020 IEEE Symposium on Security and Privacy (SP). :1172–1189.
.
2020. Endpoint Detection and Response (EDR) tools provide visibility into sophisticated intrusions by matching system events against known adversarial behaviors. However, current solutions suffer from three challenges: 1) EDR tools generate a high volume of false alarms, creating backlogs of investigation tasks for analysts; 2) determining the veracity of these threat alerts requires tedious manual labor due to the overwhelming amount of low-level system logs, creating a "needle-in-a-haystack" problem; and 3) due to the tremendous resource burden of log retention, in practice the system logs describing long-lived attack campaigns are often deleted before an investigation is ever initiated.This paper describes an effort to bring the benefits of data provenance to commercial EDR tools. We introduce the notion of Tactical Provenance Graphs (TPGs) that, rather than encoding low-level system event dependencies, reason about causal dependencies between EDR-generated threat alerts. TPGs provide compact visualization of multi-stage attacks to analysts, accelerating investigation. To address EDR's false alarm problem, we introduce a threat scoring methodology that assesses risk based on the temporal ordering between individual threat alerts present in the TPG. In contrast to the retention of unwieldy system logs, we maintain a minimally-sufficient skeleton graph that can provide linkability between existing and future threat alerts. We evaluate our system, RapSheet, using the Symantec EDR tool in an enterprise environment. Results show that our approach can rank truly malicious TPGs higher than false alarm TPGs. Moreover, our skeleton graph reduces the long-term burden of log retention by up to 87%.
Petri Nets Based Verification of Epistemic Logic and Its Application on Protocols of Privacy and Security. 2020 IEEE World Congress on Services (SERVICES). :25–28.
.
2020. Epistemic logic can specify many design requirements of privacy and security of multi-agent systems (MAS). The existing model checkers of epistemic logic use some programming languages to describe MAS, induce Kripke models as the behavioral representation of MAS, apply Ordered Binary Decision Diagrams (OBDD) to encode Kripke models to solve their state explosion problem and verify epistemic logic based on the encoded Kripke models. However, these programming languages are usually non-intuitive. More seriously, their OBDD-based model checking processes are often time-consuming due to their dynamic variable ordering for OBDD. Therefore, we define Knowledge-oriented Petri Nets (KPN) to intuitively describe MAS, induce similar reachability graphs as the behavioral representation of KPN, apply OBDD to encode all reachable states, and finally verify epistemic logic. Although we also use OBDD, we adopt a heuristic method for the computation of a static variable order instead of dynamic variable ordering. More importantly, while verifying an epistemic formula, we dynamically generate its needed similar relations, which makes our model checking process much more efficient. In this paper, we introduce our work.
Tamarin software – the tool for protocols verification security. 2020 Baltic URSI Symposium (URSI). :118–123.
.
2020. In order to develop safety-reliable standards for IoT (Internet of Things) networks, appropriate tools for their verification are needed. Among them there is a group of tools based on automated symbolic analysis. Such a tool is Tamarin software. Its usage for creating formal proofs of security protocols correctness has been presented in this paper using the simple example of an exchange of messages with asynchronous encryption between two agents. This model can be used in sensor networks or IoT e.g. in TLS protocol to provide a mechanism for secure cryptographic key exchange.
PrivacyCheck's Machine Learning to Digest Privacy Policies: Competitor Analysis and Usage Patterns. 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT). :291–298.
.
2020. Online privacy policies are lengthy and hard to comprehend. To address this problem, researchers have utilized machine learning (ML) to devise tools that automatically summarize online privacy policies for web users. One such tool is our free and publicly available browser extension, PrivacyCheck. In this paper, we enhance PrivacyCheck by adding a competitor analysis component-a part of PrivacyCheck that recommends other organizations in the same market sector with better privacy policies. We also monitored the usage patterns of about a thousand actual PrivacyCheck users, the first work to track the usage and traffic of an ML-based privacy analysis tool. Results show: (1) there is a good number of privacy policy URLs checked repeatedly by the user base; (2) the users are particularly interested in privacy policies of software services; and (3) PrivacyCheck increased the number of times a user consults privacy policies by 80%. Our work demonstrates the potential of ML-based privacy analysis tools and also sheds light on how these tools are used in practice to give users actionable knowledge they can use to pro-actively protect their privacy.
Examining the Relationship of Code and Architectural Smells with Software Vulnerabilities. 2020 27th Asia-Pacific Software Engineering Conference (APSEC). :31–40.
.
2020. Context: Security is vital to software developed for commercial or personal use. Although more organizations are realizing the importance of applying secure coding practices, in many of them, security concerns are not known or addressed until a security failure occurs. The root cause of security failures is vulnerable code. While metrics have been used to predict software vulnerabilities, we explore the relationship between code and architectural smells with security weaknesses. As smells are surface indicators of a deeper problem in software, determining the relationship between smells and software vulnerabilities can play a significant role in vulnerability prediction models. Objective: This study explores the relationship between smells and software vulnerabilities to identify the smells. Method: We extracted the class, method, file, and package level smells for three systems: Apache Tomcat, Apache CXF, and Android. We then compared their occurrences in the vulnerable classes which were reported to contain vulnerable code and in the neutral classes (non-vulnerable classes where no vulnerability had yet been reported). Results: We found that a vulnerable class is more likely to have certain smells compared to a non-vulnerable class. God Class, Complex Class, Large Class, Data Class, Feature Envy, Brain Class have a statistically significant relationship with software vulnerabilities. We found no significant relationship between architectural smells and software vulnerabilities. Conclusion: We can conclude that for all the systems examined, there is a statistically significant correlation between software vulnerabilities and some smells.
SIDE: Security-Aware Integrated Development Environment. 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). :149–150.
.
2020. An effective way for building secure software is to embed security into software in the early stages of software development. Thus, we aim to study several evidences of code anomalies introduced during the software development phase, that may be indicators of security issues in software, such as code smells, structural complexity represented by diverse software metrics, the issues detected by static code analysers, and finally missing security best practices. To use such evidences for vulnerability prediction and removal, we first need to understand how they are correlated with security issues. Then, we need to discover how these imperfect raw data can be integrated to achieve a reliable, accurate and valuable decision about a portion of code. Finally, we need to construct a security actuator providing suggestions to the developers to remove or fix the detected issues from the code. All of these will lead to the construction of a framework, including security monitoring, security analyzer, and security actuator platforms, that are necessary for a security-aware integrated development environment (SIDE).
A Representation of Business Oriented Cyber Threat Intelligence and the Objects Assembly. 2020 10th International Conference on Information Science and Technology (ICIST). :105–113.
.
2020. Cyber threat intelligence (CTI) is an effective approach to improving cyber security of businesses. CTI provides information of business contexts affected by cyber threats and the corresponding countermeasures. If businesses can identify relevant CTI, they can take defensive actions before the threats, described in the relevant CTI, take place. However, businesses still lack knowledge to help identify relevant CTI. Furthermore, information in real-world systems is usually vague, imprecise, inconsistent and incomplete. This paper defines a business object that is a business context surrounded by CTI. A business object models the connection knowledge for CTI onto the business. To assemble the business objects, this paper proposes a novel representation of business oriented CTI and a system used for constructing and extracting the business objects. Generalised grey numbers, fuzzy sets and rough sets are used for the representation, and set approximations are used for the extraction of the business objects. We develop a prototype of the system and use a case study to demonstrate how the system works. We then conclude the paper together with the future research directions.
Digraph Modeling of Information Security Systems. 2020 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon). :1–4.
.
2020. When modeling information security systems (ISS), the vast majority of works offer various models of threats to the object of protection (threat trees, Petri nets, etc.). However, ISS is not only a mean to prevent threats or reduce damage from their implementation, but also other components - the qualifications of employees responsible for IS, the internal climate in the team, the company's position on the market, and many others. The article considers the cognitive model of the state of the information security system of an average organization. The model is a weighted oriented graph, its' vertices are standard elements of the organization's information security system. The most significant factors affecting the condition of information security of the organization are identified based on the model. Influencing these factors is providing the most effect if IS level.
ATPG-Guided Fault Injection Attacks on Logic Locking. 2020 IEEE Physical Assurance and Inspection of Electronics (PAINE). :1–6.
.
2020. Logic Locking is a well-accepted protection technique to enable trust in the outsourced design and fabrication processes of integrated circuits (ICs) where the original design is modified by incorporating additional key gates in the netlist, resulting in a key-dependent functional circuit. The original functionality of the chip is recovered once it is programmed with the secret key, otherwise, it produces incorrect results for some input patterns. Over the past decade, different attacks have been proposed to break logic locking, simultaneously motivating researchers to develop more secure countermeasures. In this paper, we propose a novel stuck-at fault-based differential fault analysis (DFA) attack, which can be used to break logic locking that relies on a stored secret key. This proposed attack is based on self-referencing, where the secret key is determined by injecting faults in the key lines and comparing the response with its fault-free counterpart. A commercial ATPG tool can be used to generate test patterns that detect these faults, which will be used in DFA to determine the secret key. One test pattern is sufficient to determine one key bit, which results in at most \textbackslashtextbarK\textbackslashtextbar test patterns to determine the entire secret key of size \textbackslashtextbarK\textbackslashtextbar. The proposed attack is generic and can be extended to break any logic locked circuits.
Countermeasures Optimization in Multiple Fault-Injection Context. 2020 Workshop on Fault Detection and Tolerance in Cryptography (FDTC). :26–34.
.
2020. Fault attacks consist in changing the program behavior by injecting faults at run-time, either at hardware or at software level. Their goal is to change the correct progress of the algorithm and hence, either to allow gaining some privilege access or to allow retrieving some secret information based on an analysis of the deviation of the corrupted behavior with respect to the original one. Countermeasures have been proposed to protect embedded systems by adding spatial, temporal or information redundancy at hardware or software level. First we define Countermeasures Check Point (CCP) and CCPs-based countermeasures as an important subclass of countermeasures. Then we propose a methodology to generate an optimal protection scheme for CCPs-based countermeasure. Finally we evaluate our work on a benchmark of code examples with respect to several Control Flow Integrity (CFI) oriented existing protection schemes.
Representing Gate-Level SET Faults by Multiple SEU Faults at RTL. 2020 IEEE 26th International Symposium on On-Line Testing and Robust System Design (IOLTS). :1–6.
.
2020. The advanced complex electronic systems increasingly demand safer and more secure hardware parts. Correspondingly, fault injection became a major verification milestone for both safety- and security-critical applications. However, fault injection campaigns for gate-level designs suffer from huge execution times. Therefore, designers need to apply early design evaluation techniques to reduce the execution time of fault injection campaigns. In this work, we propose a method to represent gate-level Single-Event Transient (SET) faults by multiple Single-Event Upset (SEU) faults at the Register-Transfer Level. Introduced approach is to identify true and false logic paths for each SET in the flip-flops' fan-in logic cones to obtain more accurate sets of flip-flops for multiple SEUs injections at RTL. Experimental results demonstrate the feasibility of the proposed method to successfully reduce the fault space and also its advantage with respect to state of the art. It was shown that the approach is able to reduce the fault space, and therefore the fault-injection effort, by up to tens to hundreds of times.
A Secure Network Interface for on-Chip Systems. 2020 20th International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA). :90–94.
.
2020. This paper presents a self-securing decentralized on-chip network interface (NI) architecture to Multicore System-on-Chip (McSoC) platforms. To protect intra-chip communication within McSoC, security framework proposal resides in initiator and target NIs. A comparison between block cipher and lightweight cryptographic algorithms is then given, so we can figure out the most suitable cipher for network-on-chip (NoC) architectures. AES and LED security algorithms was a subject of this comparison. The designs are developed in Xilinx ISE 14.7 tool using VHDL language.
Real-Time Operating Systems for Cyber-Physical Systems: Current Status and Future Research. 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics). :419–425.
.
2020. This paper studies the current status and future directions of RTOS (Real-Time Operating Systems) for time-sensitive CPS (Cyber-Physical Systems). GPOS (General Purpose Operating Systems) existed before RTOS but did not meet performance requirements for time sensitive CPS. Many GPOS have put forward adaptations to meet the requirements of real-time performance, and this paper compares RTOS and GPOS and shows their pros and cons for CPS applications. Furthermore, comparisons among select RTOS such as VxWorks, RTLinux, and FreeRTOS have been conducted in terms of scheduling, kernel, and priority inversion. Various tools for WCET (Worst-Case Execution Time) estimation are discussed. This paper also presents a CPS use case of RTOS, i.e. JetOS for avionics, and future advancements in RTOS such as multi-core RTOS, new RTOS architecture and RTOS security for CPS.
RAT: A Lightweight System-Level Soft Error Mitigation Technique. 2020 IFIP/IEEE 28th International Conference on Very Large Scale Integration (VLSI-SOC). :165–170.
.
2020. To achieve a substantial reliability and safety level, it is imperative to provide electronic computing systems with appropriate mechanisms to tackle soft errors. This paper proposes a low-cost system-level soft error mitigation technique, which allocates the critical application function to a pool of specific general-purpose processor registers. Both the critical function and the register pool are automatically selected by a developed profiling tool. The proposed technique was validated through more than 320K fault injections considering a Linux kernel, different benchmarks and two multicore ARM processors. Results show that our technique significantly reduces the code size and performance overheads while providing reliability improvement, w.r.t. the Triple Modular Redundancy (TMR) technique.
AltCC: Alternating Clustering and Classification for Batch Analysis of Malware Behavior. 2020 International Symposium on Networks, Computers and Communications (ISNCC). :1–6.
.
2020. The most common goal of malware analysis is to determine if a given binary is malware or benign. Another objective is similarity analysis of malware binaries to understand how new samples differ from known ones. Similarity analysis helps to analyze the malware with respect to those already analyzed and guides the discovery of novel aspects that should be analyzed more in depth. In this work, we are concerned with similarities and differences detection of malware binaries. Thousands of malware are created every day and machine learning is an indispensable tool for its analysis. Previous work has studied clustering and classification as competing paradigms. However, in this work, a malware similarity analysis technique (AltCC) is proposed that alternates the use of clustering and classification. In addition it assumes the malware are not available all at once but processed in batches. Initially, clustering is applied to the first batch to group similar binaries into novel malware classes. Then, the discovered classes are used to train a classifier. For the following batches, the classifier is used to decide if a new binary classifies to a known class or otherwise unclassified. The unclassified binaries are clustered and the process repeats. Malware clustering (i.e. labeling) may entail further human expert analysis but dramatically reduces the effort. The effectiveness of AltCC is studied using a dataset of 29,661 malware binaries that represent malware received in six consecutive days/batches. When KMeans is used to label the dataset all at once and its labeling is compared to AltCC's, the adjusted-rand-index scores 0.71.