Bibliography
Detecting malicious code by exact matching against collected datasets is becoming a large-scale identification problem due to the constant emergence of new malware variants. Being able to promptly and accurately identify new attacks enables security experts to respond effectively. My proposal is to develop an automated framework for the identification of unknown vulnerabilities by leveraging current neural network techniques. This has significant and immediate value for the security field, as current anti-virus software is typically able to recognize a malware type only after infection, and preventive measures are limited. Artificial intelligence plays a major role in automatic malware classification: numerous machine-learning methods, both supervised and unsupervised, have been researched to classify malware into families based on features acquired by static and dynamic analysis. The value of automated identification is clear, as feature engineering is both a time-consuming and time-sensitive task, with new malware having to be studied while it is being observed in the wild.
A Mobile Ad-hoc Network (MANET) is a prominent technology in the wireless networking field in which mobile nodes operate in a distributed manner and collaborate with each other to provide multi-hop communication between source and destination nodes. Generally, the main assumption in a MANET is that each node is a trusted node. In reality, however, there are unreliable nodes that perform the black hole attack, in which a misbehaving node attracts all traffic towards itself by falsely advertising a minimum-cost path to the destination with a very high destination sequence number, and then drops all the data packets. In this paper, we present different categories of black hole attack mitigation techniques and summarize various techniques along with their drawbacks, which need to be considered while designing an efficient protocol.
Secure routing in the field of mobile ad hoc networks (MANETs) is one of the most flourishing areas of research. Devising a trustworthy security protocol for ad hoc routing is a challenging task due to unique network characteristics such as the lack of a central authority, rapid node mobility, frequent topology changes, an insecure operational environment, and confined availability of resources. Owing to low configuration effort and quick deployment, MANETs are well-suited for emergency situations such as natural disasters and military applications; data transfer between two nodes must therefore be secured. A black-hole attack in a MANET is an offense committed by malicious nodes, which attract data packets by falsely advertising a fresh route to the destination. A clustering-based approach in the AODV routing protocol for the detection and prevention of black-hole attacks in MANETs is put forward. Each cluster member pings the cluster head once, so that the head can detect any exceptional difference between the number of data packets received and forwarded by a particular node. If a fault is perceived, all nodes isolate the malicious node from the network. System performance is evaluated in terms of packet delivery ratio (PDR), end-to-end delay (ETD), throughput, and energy; simulation results are recorded using the ns2 simulator.
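A minimal sketch of the cluster-head check described in this abstract, under assumed names and an assumed flagging threshold (the paper's exact protocol details may differ): each member reports its received/forwarded packet counts, and nodes that forward suspiciously few of the packets they receive are blacklisted.

```python
# Hypothetical cluster-head check: flag nodes whose forwarded/received
# packet ratio is suspiciously low (threshold is an assumption).
def detect_black_holes(reports, ratio_threshold=0.5):
    """reports: dict node_id -> (packets_received, packets_forwarded),
    as pinged back to the cluster head by each member."""
    blacklist = set()
    for node_id, (received, forwarded) in reports.items():
        if received == 0:
            continue  # no traffic routed through this node yet
        if forwarded / received < ratio_threshold:
            blacklist.add(node_id)  # node drops most packets it receives
    return blacklist

# Example: node "n3" receives 120 packets but forwards only 4.
reports = {"n1": (100, 97), "n2": (80, 78), "n3": (120, 4)}
print(detect_black_holes(reports))  # {'n3'}
```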
The k-means algorithm has been widely used in machine learning and data mining due to its simplicity and good performance. However, standard k-means is quite slow when clustering millions of data points into thousands or even tens of thousands of clusters. In this paper, we propose a fast k-means algorithm named multi-stage k-means (MKM), which uses a multi-stage filtering approach. The multi-stage filtering approach greatly accelerates k-means via a coarse-to-fine search strategy. To further speed up the algorithm, hashing is introduced to accelerate the assignment step, which is the most time-consuming part of k-means. Extensive experiments on several massive datasets show that the proposed algorithm can obtain up to a 600X speed-up over standard k-means with comparable accuracy.
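A sketch of one coarse-to-fine assignment step in the spirit of this abstract (not the authors' MKM implementation; the hashing stage and further filtering are omitted, and group counts are assumed): the centroids themselves are grouped, each point first finds its nearest group, then searches only the centroids inside that group.

```python
# Coarse-to-fine k-means assignment sketch (assumes len(centroids) >= n_groups).
import numpy as np

def coarse_to_fine_assign(X, centroids, n_groups=10, seed=0):
    rng = np.random.default_rng(seed)
    # Coarse stage: group the centroids with a few k-means steps.
    group_centers = centroids[rng.choice(len(centroids), n_groups, replace=False)]
    for _ in range(5):
        g = np.argmin(((centroids[:, None] - group_centers[None]) ** 2).sum(-1), axis=1)
        for j in range(n_groups):
            if np.any(g == j):
                group_centers[j] = centroids[g == j].mean(axis=0)
    # Fine stage: for each point, search only centroids in its nearest group.
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        gi = np.argmin(((group_centers - x) ** 2).sum(-1))
        members = np.where(g == gi)[0]
        if members.size == 0:                     # empty group: fall back to full search
            members = np.arange(len(centroids))
        labels[i] = members[np.argmin(((centroids[members] - x) ** 2).sum(-1))]
    return labels

X = np.random.default_rng(1).normal(size=(1000, 8))
C = X[:50]                                        # pretend these are current centroids
print(coarse_to_fine_assign(X, C, n_groups=5)[:10])
```

The trade-off is the usual one for approximate assignment: points near a group boundary may be mapped to a slightly suboptimal centroid in exchange for searching far fewer candidates.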
Data clustering is an important topic in data science in general, but also in user modeling and recommendation systems. Some clustering algorithms like K-means require the adjustment of many parameters, and force the clustering without considering the clusterability of the dataset. Others, like DBSCAN, are adjusted to a fixed density threshold, so can't detect clusters with different densities. In this paper we propose a new clustering algorithm based on the mutual vote, which adjusts itself automatically to the dataset, demands a minimum of parameterizing, and is able to detect clusters with different densities in the same dataset. We test our algorithm and compare it to other clustering algorithms for clustering users, and predict their purchases in the context of recommendation systems.
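The abstract does not spell out the mutual-vote rule, so the sketch below only illustrates one common reading of the idea, purely as an assumption: two points "vote" for each other when each is among the other's k nearest neighbors, and connected components of mutual votes become clusters.

```python
# Mutual k-nearest-neighbor "voting" sketch (an assumed reading, not the
# paper's algorithm): mutual-kNN edges, clustered via union-find.
import numpy as np

def mutual_knn_clusters(X, k=5):
    d = ((X[:, None] - X[None]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # k nearest neighbors, self excluded
    mutual = {(i, j) for i in range(len(X)) for j in nn[i] if i in nn[j]}
    parent = list(range(len(X)))                    # union-find over mutual edges
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j in mutual:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(X))]

X = np.concatenate([np.random.default_rng(0).normal(0, .3, (30, 2)),
                    np.random.default_rng(1).normal(5, .3, (30, 2))])
print(len(set(mutual_knn_clusters(X))))            # typically 2 dense groups
```

Because the neighborhood is relative rather than a fixed radius, a scheme like this can separate clusters of different densities, which matches the property the abstract claims.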
Classifying users according to their behaviors is a complex problem due to the high volume of data and the unclear association between distinct data points. Although behavioral research over the past years has mainly focused on Massively Multiplayer Online Role-Playing Games (MMORPGs) such as World of Warcraft (WoW), which have predefined player classes, little has been applied to Open World Sandbox Games (OWSGs). Some OWSGs have no player classes or structured linear gameplay mechanics, as players are given the freedom to wander and interact with the virtual world at will. This research focuses on identifying the different play styles that exist within the non-structured gameplay sessions of OWSGs. This paper uses the OWSG TUG as a case study: over a period of forty-five days, a database stored selected gameplay events happening on the research server. The study applied k-means clustering to this dataset and evaluated the resulting distinct behavioral profiles to classify player sessions in an open world sandbox game.
Using mobile sinks to collect sensed data in WSNs (Wireless Sensor Networks) is an effective technique for significantly improving the network lifetime. We investigate the problem of collecting sensed data using a mobile sink in a WSN with unreachable regions such that the network lifetime is maximized and the total tour length is minimized. We propose a polynomial-time heuristic, an ILP-based (Integer Linear Programming) heuristic, and an MINLP-based (Mixed-Integer Non-Linear Programming) algorithm for constructing a shortest-path routing forest for the sensor nodes in unreachable regions; two energy-efficient heuristics for partitioning the sensor nodes in reachable regions into disjoint clusters; and an efficient approach for converting the tour construction problem into a TSP (Travelling Salesman Problem). We have performed extensive simulations on 100 instances with 100, 150, 200, 250 and 300 sensor nodes in an urban area and a forest area. The simulation results show that the average lifetime across all network instances achieved by the polynomial-time heuristic is 74% of that achieved by the ILP-based heuristic and 65% of that obtained by the MINLP-based algorithm, and that our tour construction heuristic significantly outperforms the state-of-the-art tour construction heuristic EMPS.
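To make the final step concrete, here is a minimal sketch of turning one rendezvous point per cluster into a sink tour by solving a small TSP heuristically; the nearest-neighbor heuristic and the coordinates are illustrative assumptions, not the paper's construction.

```python
# Toy TSP tour over assumed per-cluster rendezvous points.
import math

def tour_length(points, order):
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def nearest_neighbor_tour(points, start=0):
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: math.dist(points[last], points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

rendezvous = [(0, 0), (4, 1), (5, 5), (1, 4)]      # one point per cluster
t = nearest_neighbor_tour(rendezvous)
print(t, round(tour_length(rendezvous, t), 2))
```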
Organizations face the issue of how best to allocate their security resources. They therefore need an accurate method for assessing how many new vulnerabilities will be reported for the operating systems (OSs) they use in a given time period. Our approach consists of clustering vulnerabilities by leveraging the text information within vulnerability records, and then modeling the mean value function of vulnerabilities by relaxing the monotonic intensity function assumption that is prevalent among studies using software reliability models (SRMs) and the nonhomogeneous Poisson process (NHPP). We applied our approach to the vulnerabilities of four OSs: Windows, Mac, iOS, and Linux. In terms of both curve fitting and prediction capability, our results are more accurate in all analyzed cases than those of a power-law model without clustering drawn from a family of SRMs.
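For reference, the power-law baseline mentioned above is an NHPP whose mean value function is m(t) = a·t^b; a minimal fitting sketch on synthetic cumulative counts (the data and parameters here are made up for illustration):

```python
# Fit the power-law NHPP mean value function m(t) = a * t**b by least squares.
import numpy as np
from scipy.optimize import curve_fit

def mean_value(t, a, b):
    return a * np.power(t, b)

t = np.arange(1, 25)                               # months since release
cumulative = 12 * np.power(t, 0.7) + np.random.default_rng(0).normal(0, 2, t.size)

(a, b), _ = curve_fit(mean_value, t, cumulative, p0=(1.0, 1.0))
print(f"fitted m(t) = {a:.2f} * t^{b:.2f}")
print("expected new vulnerabilities in month 25:",
      round(mean_value(25, a, b) - mean_value(24, a, b), 2))
```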
Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components – (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.
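As a toy illustration of keyword-driven categorization in the ATOLClassify spirit (the keyword sets and scoring here are invented, not ATOL's actual models): score a page against per-category keyword lists and pick the best-scoring category.

```python
# Hypothetical keyword-list classifier: count category keyword hits in a page.
def classify(text, category_keywords):
    tokens = text.lower().split()
    scores = {cat: sum(tokens.count(kw) for kw in kws)
              for cat, kws in category_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

keywords = {
    "marketplace": ["vendor", "escrow", "listing"],
    "forum": ["thread", "reply", "member"],
}
print(classify("trusted vendor escrow accepted new listing", keywords))
```

ATOLKeyword's contribution is precisely to grow such analyst-provided lists with new descriptive and discriminative terms, which this sketch takes as given.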
Software-defined networking (SDN) separates the control plane from the underlying devices and allows it to control the data plane from a global view. While SDN brings convenience to management, it also introduces new security threats. Knowing the reactive rules, attackers can launch denial-of-service (DoS) attacks by sending numerous rule-matched packets that trigger packet-in messages and overburden the controller. In this work, we present a novel method, "INferring SDN by Probing and Rule Extraction" (INSPIRE), to discover the flow rules in SDN from probing packets. We evaluate the delay times of probing packets, classify them into defined classes, and infer the rules. The method involves three steps: probing, clustering, and rule inference. First, forged packets with various header fields are sent to measure processing and propagation time along the path. Second, the packets are classified into multiple classes using k-means clustering based on packet delay time. Finally, the Apriori algorithm finds common header fields in the classes to infer the rules. We show via simulation that INSPIRE is able to infer flow rules with an accuracy of up to 98.41% at very low false-positive rates.
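A compressed sketch of the clustering and rule-inference steps (probe fields, delays, and a degenerate "keep fields common to the whole class" stand-in for full Apriori are all assumptions for illustration):

```python
# Group probes by delay with 1-D k-means, then intersect header fields per class.
import numpy as np

probes = [
    ({"dst_port=80", "proto=tcp"}, 0.9),
    ({"dst_port=80", "proto=tcp", "src=10.0.0.1"}, 1.1),
    ({"dst_port=22", "proto=tcp"}, 5.2),
    ({"dst_port=22", "proto=udp"}, 5.0),
]

def kmeans_1d(values, k=2, iters=20):
    centers = np.linspace(min(values), max(values), k)
    for _ in range(iters):
        labels = [int(np.argmin([abs(v - c) for c in centers])) for v in values]
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

labels = kmeans_1d([delay for _, delay in probes])
for j in set(labels):
    fields = [f for (f, _), l in zip(probes, labels) if l == j]
    print("delay class", j, "-> inferred rule fields:", set.intersection(*fields))
```

Probes hitting the same flow rule share a delay profile, so the intersection of their header fields approximates the rule's match conditions.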
A Distributed Denial of Service (DDoS) attack is a congestion-based attack that makes both network and host-based resources unavailable to legitimate users by sending flooding attack packets to the victim's resources. The absence of predefined rules for correctly identifying genuine network flows makes DDoS attack detection very difficult. In this paper, a combination of unsupervised data mining techniques is introduced as an intrusion detection system. The entropy concept, applied over windows of incoming packets, is combined with Clustering Using Representatives (CURE) as the cluster analysis to detect DDoS attacks in network flows. The data is mainly collected from the DARPA2000, CAIDA2007 and CAIDA2008 datasets. The proposed approach has been evaluated and compared with several existing approaches in terms of accuracy, false alarm rate, detection rate, F-measure and Phi coefficient. Results indicate the superiority of the proposed approach, detecting four out of five attack phases with an accuracy rate above 99%, a detection rate of 96.29%, a false alarm rate of around 0%, an F-measure of 97.98%, and a Phi coefficient of 97.98%.
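A minimal sketch of the entropy-windowing step (window size, header field, and traffic are illustrative assumptions; the CURE stage that would consume these values is omitted): compute Shannon entropy of a header field over fixed-size packet windows and watch for a collapse.

```python
# Shannon entropy of a packet header field over fixed-size windows.
import math
from collections import Counter

def window_entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_series(packets, window=100, field="dst"):
    return [round(window_entropy([p[field] for p in packets[i:i + window]]), 2)
            for i in range(0, len(packets) - window + 1, window)]

normal = [{"dst": f"10.0.0.{i % 40}"} for i in range(200)]   # diverse destinations
attack = [{"dst": "10.0.0.7"} for _ in range(200)]           # flood towards one victim
print(entropy_series(normal + attack))                       # entropy drops to 0 under attack
```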
Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
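A toy version of the "sketching" idea (the real system's sketches and Spark pipeline are far more elaborate; file names here are invented): compress each set of file identifiers into a small MinHash signature, so that signature agreement approximates the Jaccard similarity between the original sets.

```python
# MinHash sketching: signature agreement estimates Jaccard similarity.
import hashlib

def minhash(items, num_perm=32):
    return [min(int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

app_a = {"chrome.exe", "chrome.dll", "libcef.dll"}
app_b = {"chrome.exe", "chrome.dll", "updater.exe"}
print(estimated_jaccard(minhash(app_a), minhash(app_b)))   # approx 2/4 = 0.5
```

Sketches like this are what make approximate k-nearest-neighbor clustering feasible at the billions-of-files scale the paper targets, since full pairwise set comparison would be intractable.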
Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution and schema matching. Their discovery in an unknown dataset is at the core of any data analysis effort. Therefore, several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches is appropriate for applications on dynamic datasets, such as transactional datasets, scientific applications, and social networks. In these cases, discovery techniques should be able to efficiently update the inclusion dependencies after an update to the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset; updating the clusters does not require access to the dataset, thanks to special data structures designed to efficiently support the updating process. We perform an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116,200,000 tuples. The results show that incremental discovery significantly reduces the runtime needed by static discovery; the reduction in runtime is up to 99.9996% for both inserts and deletes.
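A small sketch of the attribute-clustering view of unary INDs (the incremental maintenance of these clusters under inserts and deletes, which is the paper's contribution, is omitted; column names and values are invented): for every value, record the set of attributes it occurs in; then A ⊆ B holds iff B appears in every such cluster that contains A.

```python
# Derive unary inclusion dependencies from per-value attribute clusters.
from collections import defaultdict

columns = {
    "orders.customer_id": {1, 2, 3},
    "customers.id":       {1, 2, 3, 4, 5},
    "archive.id":         {4, 5},
}

clusters = defaultdict(set)            # value -> attributes containing it
for attr, values in columns.items():
    for v in values:
        clusters[v].add(attr)

def unary_inds(columns, clusters):
    inds = []
    for a in columns:
        candidates = set(columns) - {a}
        for v in columns[a]:
            candidates &= clusters[v]  # keep attrs seen with every value of a
        inds += [(a, b) for b in candidates]
    return inds

# orders.customer_id ⊆ customers.id and archive.id ⊆ customers.id
print(unary_inds(columns, clusters))
```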
In this paper, a model of a secure wireless sensor network (WSN) is developed. The model is able to defend against most known network attacks without significantly reducing the energy resources of the sensor nodes (SNs). We propose clustering as the form of network organization, which allows energy consumption to be reduced. Network protection is based on trust level calculation and the establishment of trusted relationships between trusted nodes. The primary purpose of the hierarchical trust management system (HTMS) is to protect the WSN from the malicious actions of an attacker. The developed system should combine the properties of energy efficiency and reliability. To achieve this goal, the following tasks are performed: detection of an intruder's illegal actions; blocking of malicious nodes; avoidance of malicious attacks; determination of node authenticity; establishment of trusted connections between authentic nodes; and detection and blocking of defective nodes. HTMS operation is based on Bayes' theorem and the calculation of direct and centralized trust values.
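A sketch of Bayesian trust in the beta-reputation style often used in WSN trust systems (the paper's exact formulas may differ; the weighting and threshold below are assumptions): direct trust is the posterior mean (s+1)/(s+f+2) of a Beta distribution over observed successes s and failures f, blended with a centralized (recommended) value.

```python
# Beta-reputation style trust: Beta(1,1) prior updated by observed behavior.
def direct_trust(successes, failures):
    return (successes + 1) / (successes + failures + 2)   # Beta posterior mean

def combined_trust(successes, failures, centralized, weight=0.7):
    # Blend direct observations with a centralized recommendation.
    return weight * direct_trust(successes, failures) + (1 - weight) * centralized

# A node seen forwarding 18 of 20 packets, with a 0.6 centralized rating,
# scores well above a hypothetical blocking threshold of 0.5.
print(round(combined_trust(18, 2, centralized=0.6), 3))
```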
Traditional authentication techniques such as static passwords are vulnerable to replay and guessing attacks. Recently, many studies have investigated keystroke dynamics as a promising behavioral biometric for strengthening user authentication; however, current keystroke-based solutions suffer from a large number of features combined with an insufficient number of samples, which leads to high verification error rates and long verification times. In this paper, a keystroke dynamics based authentication system is proposed for cloud environments that supports both fixed and free text samples. The proposed system utilizes the ReliefF dimensionality reduction method as a preprocessing step to minimize the feature space dimensionality, and applies clustering to users' profile templates to reduce the verification time. The system is evaluated on two different benchmark datasets, and experimental results demonstrate its effectiveness and efficiency.
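For intuition, here is a minimal Relief-style feature-weighting pass (the binary-class ancestor of the ReliefF method named above, on made-up keystroke timing vectors): features whose values agree with same-class neighbors but differ from other-class neighbors receive high weights and are kept.

```python
# Relief-style feature weighting: reward features that separate the classes.
import numpy as np

def relief_weights(X, y):
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        same, diff = (y == y[i]), (y != y[i])
        same[i] = False                             # exclude the sample itself
        dist = np.abs(X - X[i]).sum(axis=1)
        hit = X[same][np.argmin(dist[same])]        # nearest same-class sample
        miss = X[diff][np.argmin(dist[diff])]       # nearest other-class sample
        w += (np.abs(X[i] - miss) - np.abs(X[i] - hit)) / n
    return w

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (40, 5))                       # 5 synthetic timing features
y = np.array([0] * 20 + [1] * 20)                   # two users
X[y == 1, 0] += 3.0                                 # feature 0 separates the users
w = relief_weights(X, y)
print("weights:", w.round(2), "-> keep:", np.argsort(w)[::-1][:2])
```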
Intrusion detection using multiple security devices has received much attention recently. The large volume of information generated by these tools, however, increases the burden on both computing resources and security administrators. Moreover, attack detection does not improve as expected if these tools work without any coordination. In this work, we propose a simple method to join information generated by security monitors with diverse data formats. We present a novel intrusion detection technique that uses unsupervised clustering algorithms to identify malicious behavior within large volumes of diverse security monitor data. First, we extract a set of features from network-level and host-level security logs that aid in detecting malicious host behavior and flooding-based network attacks in an enterprise network system. We then apply clustering algorithms to the separate and joined logs and use statistical tools to identify anomalous usage behaviors captured by the logs. We evaluate our approach on an enterprise network data set, which contains network and host activity logs. Our approach correctly identifies and prioritizes anomalous behaviors in the logs by their likelihood of maliciousness. By combining network and host logs, we are able to detect malicious behavior that cannot be detected by either log alone.
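A sketch of the "cluster, then inspect with statistical tools" step on synthetic joined-log features (the feature columns, cluster count, and small-cluster criterion are all assumptions, not the paper's pipeline): hosts that land in unusually small k-means clusters are prioritized as anomalous.

```python
# Flag hosts falling into tiny k-means clusters of joined-log features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Columns might represent, e.g., login failures, bytes out, distinct dst ports.
hosts = rng.normal(0, 1, (200, 3))
hosts[:3] += 6                                     # three anomalous hosts

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(hosts)
sizes = np.bincount(km.labels_, minlength=8)
tiny = np.where(sizes <= 0.02 * len(hosts))[0]     # clusters with <=2% of hosts
print("flagged hosts:", np.where(np.isin(km.labels_, tiny))[0])
```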
Behavior-based tracking is an unobtrusive technique that allows observers to monitor user activities on the Internet over long periods of time – in spite of changing IP addresses. Previous work has employed supervised classifiers in order to link the sessions of individual users. However, classifiers need labeled training sessions, which are difficult to obtain for observers. In this paper we show how this limitation can be overcome with an unsupervised learning technique. We present a modified k-means algorithm and evaluate it on a realistic dataset that contains the Domain Name System (DNS) queries of 3,862 users. For this purpose, we simulate an observer that tries to track all users, and an Internet Service Provider that assigns a different IP address to every user on every day. The highest tracking accuracy is achieved within the subgroup of highly active users. Almost all sessions of 73% of the users in this subgroup can be linked over a period of 56 days. 19% of the highly active users can be traced completely, i.e., all their sessions are assigned to a single cluster. This fraction increases to 40% for shorter periods of seven days. As service providers may engage in behavior-based tracking to complement their existing profiling efforts, it constitutes a severe privacy threat for users of online services. Users can defend against behavior-based tracking by changing their IP address frequently, but this is cumbersome at the moment.
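The paper's specific k-means modification is not detailed above, so the sketch below only illustrates the underlying linking idea, under assumed data and a made-up similarity threshold: treat each daily session as a bag of queried domains and chain each new session to the most similar existing profile.

```python
# Chain daily DNS sessions to user profiles by cosine similarity of domain counts.
from collections import Counter
import math

def cosine(a, b):
    num = sum(a[d] * b[d] for d in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def link_sessions(daily_sessions, threshold=0.5):
    profiles, assignments = [], []                 # one Counter per tracked user
    for session in daily_sessions:
        s = Counter(session)
        sims = [cosine(s, p) for p in profiles]
        if sims and max(sims) >= threshold:
            best = max(range(len(sims)), key=sims.__getitem__)
            profiles[best] += s                    # merge into existing profile
            assignments.append(best)
        else:
            profiles.append(s)                     # start a new profile
            assignments.append(len(profiles) - 1)
    return assignments

day1 = ["news.example", "mail.example", "blog.example"]
day2 = ["news.example", "mail.example", "shop.example"]
print(link_sessions([day1, day2]))                 # [0, 0]: both days, one user
```

The countermeasure the abstract mentions follows directly: frequent IP changes shorten each observable session, shrinking the domain bags until profiles no longer match reliably.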
Understanding the behavior of complex financial supply chains is usually difficult due to a lack of data capturing the interactions between financial institutions (FIs) and the roles that they play in financial contracts (FCs). resMBS is an example supply chain corresponding to the US residential mortgage-backed securities that were critical in the 2008 US financial crisis. In this paper, we describe the process of creating the resMBS graph dataset from financial prospectuses. We use the SystemT rule-based text extraction platform to develop two tools, ORG NER and Dict NER, for named entity recognition of financial institution (FI) names. The resMBS graph comprises a set of FC nodes (one per prospectus) and the corresponding FI nodes extracted from each prospectus. A Role-FI extractor matches a role keyword, such as originator, sponsor or servicer, with FI names. We study the performance of the Role-FI extractor, ORG NER, and Dict NER in constructing the resMBS dataset. We also present preliminary results of a clustering-based analysis to identify financial communities and their evolution in the resMBS financial supply chain.
The detection of obstacles is a fundamental issue in autonomous navigation, as it is the main key to collision prevention. This paper presents a method for the segmentation of general obstacles by stereo vision, with no need for dense disparity maps or assumptions about the scenario. A sparse set of points is selected according to a local spatial condition and then clustered as a function of each point's neighborhood, disparity values, and a cost associated with the possibility of the point being part of an obstacle. The method was evaluated on hand-labeled images from the KITTI object detection benchmark, and precision and recall metrics were calculated. The quantitative and qualitative results were satisfactory in scenarios with different types of objects.
This paper begins to describe a new kind of database, one that explores a diverse range of movement in the field of dance through capture of different bodies and different backgrounds - or what we are terming movement vernaculars. We re-purpose Ivan Illich's concept of 'vernacular work' [11] here to refer to those everyday forms of dance and organized movement that are informal, refractory (resistant to formal analysis), yet are socially reproduced and derived from a commons. The project investigates the notion of vernaculars in movement that is intentional and aesthetic through the development of a computational approach that highlights both similarities and differences, thereby revealing the specificities of each individual mover. This paper presents an example of how this movement database is used as a research tool, and how the fruits of that research can be added back to the database, thus adding a novel layer of annotation and further enriching the collection. Future researchers can then benefit from this layer, further refining and building upon these techniques. The creation of a robust, open source, movement lexicon repository will allow for observation, speculation, and contextualization - along with the provision of clean and complex data sets for new forms of creative expression.
In the new era of information and communication technology (ICT), everyone wants to store and share their data or information through online media, such as cloud databases, mobile databases, grid databases, and online drives. When data is stored in online media, the main problem that arises is privacy, because various hackers, attackers, and crackers want to disclose private information publicly. Security is a continuous process of protecting data or information from attacks. To secure this information from such unauthorized parties, we propose and implement a technique based on the data modification concept, using the iris dataset in the Weka tool. The approach provides a high level of privacy in distributed clustered database environments.
Mobile ad-hoc networks are highly susceptible to security attacks due to their dynamic topology, resource and energy constrained operation, limited physical security, and lack of infrastructure. A misleading routing attack (MIRA) in a MANET intends to delay packets as long as possible, generating timeouts at the source because packets do not arrive in time. Its main objective is to increase delay and network overhead, and it is a variation of the sinkhole attack. In this paper, we propose a detection scheme that detects malicious nodes both at route discovery and during packet transmission. Simulation results show that although the MIRA attack increases delay by 91.30%, throughput is unaffected, which makes the misleading routing attack difficult to detect. When applied to the misleading routing attack, the proposed detection scheme yields a significant decrease in delay.
Nearest neighbor search is a fundamental problem in research fields such as machine learning, data mining and pattern recognition. Recently, hashing-based approaches, for example locality sensitive hashing (LSH), have proved effective for scalable high-dimensional nearest neighbor search. Many hashing algorithms have their theoretical roots in random projection. Since these algorithms generate the hash tables (projections) randomly, a large number of hash tables (i.e., long codewords) is required to achieve both high precision and recall. To address this limitation, we propose a novel hashing algorithm called density sensitive hashing (DSH). DSH can be regarded as an extension of LSH: by exploring the geometric structure of the data, it avoids purely random projection selection and uses those projective functions that best agree with the distribution of the data. Extensive experimental results on real-world datasets show that the proposed method achieves better performance than state-of-the-art hashing approaches.
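For context, a minimal random-projection LSH baseline of the kind DSH extends (synthetic data; DSH would replace the random hyperplanes below with data-dependent projections): hash each vector to a binary code via signs of random hyperplane projections, so similar vectors tend to receive nearby codes.

```python
# Random hyperplane LSH: sign of projection onto random planes -> binary code.
import numpy as np

rng = np.random.default_rng(0)

def lsh_codes(X, n_bits=16, planes=None):
    if planes is None:
        planes = rng.normal(size=(X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8), planes

X = rng.normal(size=(1000, 32))
codes, planes = lsh_codes(X)
q_code, _ = lsh_codes(X[:1], planes=planes)        # query = first data point
hamming = (codes ^ q_code).sum(axis=1)             # code distance to the query
print("nearest by code:", np.argsort(hamming)[:5]) # index 0 should rank first
```

The abstract's critique maps directly onto this sketch: because the planes are data-oblivious, many bits (or many such tables) are needed before code distance tracks true distance well, which is the gap DSH aims to close.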