Biblio
Poison message failure is a mechanism that has been responsible for large-scale failures in both telecommunications and IP networks. A poison message failure can propagate through the network and destabilize it. We apply machine learning and data mining techniques to network fault management. We use the k-nearest neighbor method to identify the poison message failure. We also propose a "probabilistic" k-nearest neighbor method that outputs a probability distribution over the possible poison message types. Through extensive simulations, we show that the k-nearest neighbor method is very effective in identifying the responsible message type.
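A minimal sketch of the "probabilistic" k-nearest neighbor idea described above, assuming Euclidean distance over numeric symptom features and hypothetical message-type labels: the output is the empirical distribution of message types among the k closest historical fault records rather than a single label.

```python
import numpy as np
from collections import Counter

def probabilistic_knn(query, X, y, k=5):
    """query: 1-D feature vector; X: (n, d) historical feature matrix; y: list of labels."""
    dist = np.linalg.norm(X - query, axis=1)          # Euclidean distance to each record
    nearest = np.argsort(dist)[:k]                    # indices of the k closest records
    counts = Counter(y[i] for i in nearest)
    return {label: c / k for label, c in counts.items()}  # distribution over message types

# Hypothetical usage: rows are symptom vectors observed during past failures.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.85, 0.15]])
y = ["SETUP", "SETUP", "RELEASE", "RELEASE", "SETUP"]
print(probabilistic_knn(np.array([0.7, 0.3]), X, y, k=3))  # {'SETUP': 1.0}
```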
Edge detection of the bottle opening is a primary component of a machine-vision-based bottle opening detection system. This paper uses the Balloon Snake to extract the opening from PET (polyethylene terephthalate) bottle images sampled on the production lines of rotary bottle-blowing machines. It first uses the grayscale-weighted average method to calculate the centroid as the initial position of the Snake, and then extracts the opening based on energy minimization. Experiments show that, compared with conventional edge detection and center location methods, the Balloon Snake is robust and can easily step over weak noise points. The edge extracted by the Balloon Snake is more complete and continuous, which provides a guarantee for correctly judging the opening.
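As a rough illustration of the initialization step, the sketch below computes a grayscale-weighted centroid on a synthetic image; the actual Balloon Snake evolution (energy minimization with an inflation force) is not shown.

```python
import numpy as np

def weighted_centroid(img):
    """Centroid of an image with pixel intensities used as weights."""
    img = img.astype(float)
    ys, xs = np.indices(img.shape)
    total = img.sum()
    return (ys * img).sum() / total, (xs * img).sum() / total

img = np.zeros((50, 50))
img[20:30, 15:35] = 200            # hypothetical bright opening region
print(weighted_centroid(img))      # approximately (24.5, 24.5) -> Snake seed point
```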
This paper proposes a steganography method using digital images, embedding the data to be secured into a digital image. Studies of the human visual system show that changes at image edges are relatively insensitive to the human eye. We therefore use edge detection in steganography to increase the data hiding capacity by embedding more data in edge pixels. If we can increase the number of edge pixels, we can increase the amount of data that can be hidden in the image. To increase the number of edge pixels, multiple edge detection is employed. Edge detection is carried out using a sophisticated operator such as the Canny operator. To compensate for the decrease in PSNR caused by the increased amount of hidden data, the Minimum Error Replacement (MER) method is used. Thus, the main goals of image steganography, namely security with high embedding capacity and good visual quality, are achieved. To extract the data, we need the original image and the embedding ratio. Extraction is done by performing multiple edge detection on the original image and extracting the data according to the embedding ratio.
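The sketch below illustrates edge-adaptive LSB embedding in a toy form: edge pixels carry more bits than smooth pixels. The gradient-threshold edge mask and the bit allocations are stand-ins for the paper's Canny-based multiple edge detection and MER step, so treat every parameter here as an assumption.

```python
import numpy as np

def embed(cover, bits, edge_thresh=30, edge_bits=3, smooth_bits=1):
    """Hide a bit string in the LSBs of a grayscale image, using more bits at edge pixels."""
    gy, gx = np.gradient(cover.astype(float))
    edges = np.hypot(gx, gy) > edge_thresh            # boolean edge mask (toy edge detector)
    stego, i = cover.copy(), 0
    for idx in np.ndindex(cover.shape):
        n = edge_bits if edges[idx] else smooth_bits   # edge pixels carry more payload
        if i >= len(bits):
            break
        chunk = bits[i:i + n].ljust(n, "0")
        stego[idx] = (int(cover[idx]) & ~((1 << n) - 1)) | int(chunk, 2)
        i += n
    return stego, edges

cover = (np.arange(64).reshape(8, 8) * 4).astype(np.uint8)   # hypothetical 8x8 image
stego, mask = embed(cover, "1011001110")
```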
This paper presents an evaluation of various methodologies used to determine relative significances of input variables in data-driven models. Significance analysis applied to manufacturing process parameters can be a useful tool in fault diagnosis for various types of manufacturing processes. It can also be applied to building models that are used in process control. The relative significances of input variables can be determined by various data mining methods, including relatively simple statistical procedures as well as more advanced machine learning systems. Several methodologies suitable for carrying out classification tasks which are characteristic of fault diagnosis were evaluated and compared from the viewpoint of their accuracy, robustness of results and applicability. Two types of testing data were used: synthetic data with assumed dependencies and real data obtained from the foundry industry. The simple statistical method based on contingency tables revealed the best overall performance, whereas advanced machine learning models, such as ANNs and SVMs, appeared to be of less value.
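As an illustration of the contingency-table approach, the sketch below scores discretized process parameters against a fault class with the chi-square statistic; larger scores suggest higher relevance. The parameter names and counts are hypothetical.

```python
import numpy as np

def chi_square(table):
    """Chi-square statistic of a contingency table (rows: parameter bins, cols: classes)."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

# Rows: low / medium / high parameter value; columns: OK castings / defective castings.
pouring_temp = [[40, 10], [35, 15], [10, 40]]
mould_humidity = [[30, 20], [28, 22], [27, 23]]
print(chi_square(pouring_temp), chi_square(mould_humidity))  # higher score -> more significant
```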
The growing popularity and development of data mining technologies bring serious threats to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of the sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., data mining results) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss the privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review game-theoretical approaches, which are proposed for analyzing the interactions among different users in a data mining scenario, each of whom has his own valuation on the sensitive information. By differentiating the responsibilities of different users with respect to the security of sensitive information, we would like to provide some useful insights into the study of PPDM.
The significant dependence on cyberspace has indeed brought new risks that often compromise, exploit and damage invaluable data and systems. Thus, the capability to proactively infer malicious activities is of paramount importance. In this context, inferring probing events, which are commonly the first stage of any cyber attack, renders a promising tactic to achieve that task. For the past three years, we have been receiving 12 GB of real malicious darknet data daily (i.e., Internet traffic destined to half a million routable yet unallocated IP addresses) from more than 12 countries. This paper exploits such data to propose a novel approach that aims at capturing the behavior of probing sources in an attempt to infer their orchestration (i.e., coordination) pattern. The latter defines a recently discovered characteristic of a new phenomenon of probing events that could be ominously leveraged to cause drastic Internet-wide and enterprise impacts as precursors of various cyber attacks. To accomplish its goals, the proposed approach leverages various signal and statistical techniques, information-theoretic metrics, fuzzy approaches with real malware traffic, and data mining methods. The approach is validated through one use case that arguably proves that a previously analyzed orchestrated probing event from last year is indeed still active, yet operating in a stealthy, very low rate mode. We envision that the proposed approach, which is tailored towards darknet data that is frequently, abundantly and effectively used to generate cyber threat intelligence, could be used by network security analysts, emergency response teams and/or observers of cyber events to infer large-scale orchestrated probing events for early cyber attack warning and notification.
Cyber crime produces huge volumes of log and transactional data, which results in a great deal of data to store and analyze, and it is difficult for forensic investigators to devote the time needed to find clues in and analyze these data. Network forensic analysis involves network traces and the detection of attacks. The traces include Intrusion Detection System and firewall logs, logs generated by network services and applications, and packet captures by sniffers. Because large amounts of data are generated with every network event, it is difficult for forensic investigators to find clues and analyze the data. Network forensics deals with the monitoring, capturing, recording, and analysis of network traffic in order to detect intrusions and investigate them. This paper focuses on data collection from the cyber system and the web browser. FTK 4.0 is discussed for memory forensic analysis and remote system forensics, the results of which can be used as evidence to aid investigation.
Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks exhaust the resources of a server/service and make it unavailable for legitimate users. With the increasing use of online services and attacks on these services, the importance of Intrusion Detection Systems (IDS) for the detection of DoS/DDoS attacks has also grown. The detection accuracy and CPU utilization of a data-mining-based IDS depend directly on the quality of the training dataset used to train it. Various preprocessing methods such as normalization, discretization and fuzzification are used by researchers to improve the quality of the training dataset. This paper evaluates the effect of various data preprocessing methods on the detection accuracy of a DoS/DDoS attack detection IDS and shows that the numeric-to-binary preprocessing method performs better than the other methods. Experimental results obtained using the KDD 99 dataset are provided to support the efficiency of the proposed combination.
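A minimal sketch of numeric-to-binary preprocessing, assuming each numeric attribute is reduced to one bit by thresholding at its column mean (the actual cut points used in the paper may differ); the example connection records are hypothetical.

```python
import numpy as np

def numeric_to_binary(X, thresholds=None):
    """Reduce every numeric feature column to a single bit by thresholding."""
    X = np.asarray(X, dtype=float)
    if thresholds is None:
        thresholds = X.mean(axis=0)          # one threshold per feature column
    return (X > thresholds).astype(np.uint8), thresholds

# Hypothetical connection records: [duration, src_bytes, dst_bytes, count]
records = [[0, 181, 5450, 8], [0, 239, 486, 8], [2, 0, 0, 511]]
binary, th = numeric_to_binary(records)
print(binary)                                # binary matrix fed to the IDS classifier
```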
The inappropriate use of features intended to improve the usability and interactivity of web applications has resulted in the emergence of various threats, including Cross-Site Scripting (XSS) attacks. In this work, we developed ETSS Detector, a generic and modular web vulnerability scanner that automatically analyzes web applications to find XSS vulnerabilities. ETSS Detector is able to identify and analyze all data entry points of the application and generate specific code injection tests for each one. The results show that correctly filling the input fields with only valid information improves the effectiveness of the tests, increasing the detection rate of XSS attacks.
Keeping up with rapid advances in research in various fields of engineering and technology is a challenging task. Decision makers including academics, program managers, venture capital investors, industry leaders and funding agencies not only need to be abreast of the latest developments but also be able to assess the effect of growth in certain areas on their core business. Though analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections like research publications, analyst reports, patent applications and competitor information to help them finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics and information visualization to aid the process of retrieval and analytics. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We have presented several methods and techniques that help in the analysis and better comprehension of search results. We have also presented methods for generating insights about emerging and popular trends in research, along with contextual differences between academic research and patenting profiles. We also present novel techniques for presenting topic evolution that help users understand how a particular area has evolved over time.
In this paper we propose a Twitter sentiment analytics system that mines opinion polarity about a given topic. Most current semantic sentiment analytics depends on polarity lexicons. However, many key tone words are frequently bipolar. In this paper we demonstrate a technique that accommodates the bipolarity of tone words through a context-sensitive tone lexicon learning mechanism, where the context is modeled by the semantic neighborhood of the main target. Performance analysis shows that the ability to contextualize tone word polarity significantly improves accuracy.
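The toy sketch below shows one way context-sensitive tone lexicon learning can work: the polarity of a bipolar word is learned separately per target topic from co-occurrence with seed sentiment words. The seed lists, tweets and scoring rule are illustrative assumptions, not the paper's exact mechanism.

```python
import re
from collections import defaultdict

POS_SEEDS = {"great", "love", "exciting"}
NEG_SEEDS = {"awful", "hate", "dangerous"}

def learn_contextual_lexicon(tweets, target):
    scores = defaultdict(lambda: [0, 0])              # word -> [pos co-occurrences, neg]
    for text in tweets:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if target not in words:
            continue                                  # only tweets mentioning the target
        pos, neg = len(words & POS_SEEDS), len(words & NEG_SEEDS)
        for w in words - POS_SEEDS - NEG_SEEDS - {target}:
            scores[w][0] += pos
            scores[w][1] += neg
    return {w: (p - n) / max(p + n, 1) for w, (p, n) in scores.items()}

tweets = ["The movie plot is unpredictable and exciting, love it",
          "This road is unpredictable and dangerous, hate it"]
print(learn_contextual_lexicon(tweets, "movie")["unpredictable"])   # 1.0 (positive here)
print(learn_contextual_lexicon(tweets, "road")["unpredictable"])    # -1.0 (negative here)
```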
Enhanced situational awareness is integral to risk management and response evaluation. Dynamic systems that incorporate both hard and soft data sources allow for comprehensive situational frameworks which can supplement physical models with conceptual notions of risk. The processing of widely available semi-structured textual data sources can produce soft information that is readily consumable by such a framework. In this paper, we augment the situational awareness capabilities of a recently proposed risk management framework (RMF) with the incorporation of soft data. We illustrate the beneficial role of the hard-soft data fusion in the characterization and evaluation of potential vessels in distress within Maritime Domain Awareness (MDA) scenarios. Risk features pertaining to maritime vessels are defined a priori and then quantified in real time using both hard (e.g., Automatic Identification System, Douglas Sea Scale) as well as soft (e.g., historical records of worldwide maritime incidents) data sources. A risk-aware metric to quantify the effectiveness of the hard-soft fusion process is also proposed. Though illustrated with MDA scenarios, the proposed hard-soft fusion methodology within the RMF can be readily applied to other domains.
Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions are verified by a human expert perusing the source text. So far, this problem is typically solved by offering overview-detail techniques, which present different views with different levels of abstractions. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that helps to solve some of these approaches' problems by combining characteristics from both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms as well as to assess and adapt these techniques. This helps to extract entities, concepts and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.
Recently, the threat of previously unknown cyber-attacks has been increasing because existing security systems are not able to detect them. Past cyber-attacks had simple purposes such as leaking personal information by attacking the PC or destroying the system. However, the goal of recent hacking attacks has changed from leaking information and destroying services to attacking large-scale systems such as critical infrastructures and state agencies. Existing defence technologies to counter these attacks are based on pattern matching methods, which are very limited. Because of this, in the event of new and previously unknown attacks, the detection rate becomes very low and false negatives increase. To defend against these unknown attacks, which cannot be detected with existing technology, we propose a new model based on big data analysis techniques that can extract information from a variety of sources to detect future attacks. We expect our model to serve as the basis of future Advanced Persistent Threat (APT) detection and prevention system implementations.
Mobile Ad-Hoc Networks (MANETs) consist of peer-to-peer, infrastructure-less communicating nodes that are highly dynamic. As a result, routing data becomes more challenging. Ultimately, routing protocols for such networks face the challenges of random topology changes, the nature of the link (symmetric or asymmetric) and power requirements during data transmission. Under such circumstances, both proactive and reactive routing are usually inefficient. We consider the Zone Routing Protocol (ZRP), which combines the qualities of the proactive (IARP) and reactive (IERP) protocols. In ZRP, an updated topological map of the zone centered on each node is maintained, and immediate routes are available inside each zone. In order to communicate outside a zone, a route discovery mechanism is employed, aided by the local routing information of the zones. In MANETs, security is always an issue. It is possible that a node turns malicious and hampers the normal flow of packets in the MANET. In order to overcome such issues, we use a clustering technique to separate nodes with intrusive behavior from those with normal behavior. We call this technique effective k-means clustering; it is motivated by k-means. We propose to implement an Intrusion Detection System on each node of the MANET that uses ZRP for packet flow, and then use effective k-means to separate the malicious nodes from the network. Thus, our ad-hoc network will be free from malicious activity and the normal flow of packets will be possible.
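For illustration, the sketch below clusters nodes into normal and intrusive groups with plain k-means over hypothetical per-node audit features (packet drop rate, forwarding delay); the paper's "effective k-means" variant differs in its details.

```python
import numpy as np

def kmeans(X, k=2, iters=50, seed=0):
    """Plain k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Hypothetical audit features per node: [packet drop rate, forwarding delay (ms)]
nodes = np.array([[0.02, 5], [0.03, 6], [0.01, 4], [0.70, 40], [0.65, 38]])
labels, centroids = kmeans(nodes)
malicious_cluster = np.argmax(centroids[:, 0])       # cluster with the highest drop rate
print(np.where(labels == malicious_cluster)[0])      # indices of suspected malicious nodes
```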
Data mining is the process of finding correlations in relational databases, and there are different techniques for identifying malicious database transactions. While many existing approaches profile SQL query structures and database user activities to detect intrusions, the log mining approach automatically discovers rules for identifying anomalous database transactions. Data mining is very helpful to end users for extracting useful business information from large databases. Multi-level and multi-dimensional data mining are employed to discover data item dependency rules, data sequence rules, domain dependency rules, and domain sequence rules from a database log containing legitimate transactions. Database transactions that do not comply with the rules are identified as malicious transactions. The log mining approach can achieve the desired true and false positive rates when the confidence and support thresholds are set appropriately. The implemented system incrementally maintains the data dependency rule sets and optimizes the performance of the intrusion detection process.
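A toy sketch of the rule-mining idea: data dependency rules of the form "a write of item X is accompanied by a read of item Y" are mined from legitimate transactions with support and confidence thresholds, and transactions violating the rules are flagged. The transaction encoding and thresholds are hypothetical.

```python
from collections import Counter

log = [{"writes": {"balance"}, "reads": {"account", "balance"}},
       {"writes": {"balance"}, "reads": {"account", "balance"}},
       {"writes": {"address"}, "reads": {"customer"}}]

def mine_rules(transactions, min_support=2, min_confidence=0.9):
    """Return (written item, read item) rules meeting support/confidence thresholds."""
    write_count, pair_count = Counter(), Counter()
    for t in transactions:
        for w in t["writes"]:
            write_count[w] += 1
            for r in t["reads"]:
                pair_count[(w, r)] += 1
    return {(w, r) for (w, r), c in pair_count.items()
            if c >= min_support and c / write_count[w] >= min_confidence}

rules = mine_rules(log)
suspect = {"writes": {"balance"}, "reads": set()}     # writes balance without the usual reads
violations = [(w, r) for (w, r) in rules if w in suspect["writes"] and r not in suspect["reads"]]
print(violations)                                     # non-empty -> flag as malicious
```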
This paper presents a methodology for developing a statistically sound, robust prognostic condition index and encapsulating this index as a series of highly accurate, transparent, human-readable rules. These rules can be used to further understand degradation phenomena and also provide transparency and trust for any underlying prognostic technique employed. A case study is presented on a wind turbine gearbox, utilising historical supervisory control and data acquisition (SCADA) data in conjunction with a physics-of-failure model. Training is performed without failure data, with the technique accurately identifying gearbox degradation and providing prognostic signatures up to 5 months before catastrophic failure occurred. A robust derivation of the Mahalanobis distance is employed to perform outlier analysis in the bivariate domain, enabling the rapid labelling of historical SCADA data on independent wind turbines. Following this, the RIPPER rule learner was utilised to extract transparent, human-readable rules from the labelled data. A mean classification accuracy of 95.98% against the autonomously derived condition was achieved on three independent test sets, with a mean kappa statistic of 93.96% reported. In total, 12 rules were extracted; with an independent domain expert providing critical analysis, two thirds of the rules were deemed intuitive in modelling fundamental degradation behaviour of the wind turbine gearbox.
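A minimal sketch of the outlier-analysis step, using the classical Mahalanobis distance over bivariate SCADA-style features (the paper uses a robust derivation); the feature pairs and values are hypothetical.

```python
import numpy as np

def mahalanobis(X):
    """Mahalanobis distance of each row from the sample mean."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Hypothetical bivariate SCADA records: [gearbox oil temperature (C), power output (MW)]
scada = np.array([[55.0, 1.90], [56.0, 2.02], [54.0, 1.79], [57.0, 2.12],
                  [55.5, 1.94], [56.5, 2.06], [75.0, 1.20]])
d = mahalanobis(scada)
print(d.round(2), int(d.argmax()))   # the last record has the largest distance -> "degraded" label
```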
We propose a novel phishing detection architecture based on transparent virtualization technologies and isolation of its own components. The architecture can be deployed as a security extension for virtual machines (VMs) running in the cloud. It uses fine-grained VM introspection (VMI) to extract, filter and scale a color-based fingerprint of web pages processed by a browser from the VM's memory. By analyzing the human perceptual similarity between fingerprints, the architecture can reveal and mitigate phishing attacks that are based on redirection to spoofed web pages, and it can also detect “Man-in-the-Browser” (MitB) attacks. To the best of our knowledge, the architecture is the first anti-phishing solution leveraging virtualization technologies. We explain details of the design and implementation, and we show results of an evaluation with real-world data.
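A toy sketch of comparing color-based page fingerprints: each rendered page is reduced to a small grid of average colors, and a high similarity between a visited page and a protected page's reference fingerprint suggests a spoofed look-alike. The fingerprint resolution, similarity measure and synthetic images are assumptions, not the system's actual VMI pipeline.

```python
import numpy as np

def fingerprint(page, grid=8):
    """Downscale an RGB page image to a grid of average block colors."""
    h, w, _ = page.shape
    page = page[:h - h % grid, :w - w % grid]
    return page.reshape(grid, h // grid, grid, w // grid, 3).mean(axis=(1, 3))

def similarity(fp_a, fp_b):
    return 1.0 - np.abs(fp_a - fp_b).mean() / 255.0   # 1.0 = identical color layout

legit = np.random.default_rng(0).integers(0, 256, (256, 256, 3)).astype(float)
spoof = np.clip(legit + np.random.default_rng(1).normal(0, 8, legit.shape), 0, 255)
print(round(similarity(fingerprint(legit), fingerprint(spoof)), 3))   # close to 1.0
```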
Information fusion deals with the integration and merging of data and information from multiple (heterogeneous) sources. In many cases, the information that needs to be fused has a security classification. The result of the fusion process is then by necessity restricted with the strictest information security classification of the inputs. This has severe drawbacks and limits the possible dissemination of the fusion results. It leads to decreased situational awareness: the organization holds information that would enable a better situation picture, but since parts of the information are restricted, it is not possible to distribute the most correct situational information. In this paper, we take steps towards defining fusion and data mining processes that can be used even when the underlying data cannot be disseminated. The method we propose here could be used to produce a classifier from which all sensitive information has been removed and for which it can be shown that an antagonist cannot, even in principle, obtain knowledge about the classified information by using the classifier or situation picture.
In the United States, the number of Phasor Measurement Units (PMUs) will increase from 166 networked devices in 2010 to 1043 in 2014. According to the Department of Energy, they are being installed in order to “evaluate and visualize reliability margin (which describes how close the system is to the edge of its stability boundary).” However, there is still much debate in academia and industry about the usefulness of phase angles as unambiguous predictors of dynamic stability. In this paper, using 4 years of actual data from the Hydro-Québec EMS, it is shown that phase angles enable satisfactory predictions of power transfer and dynamic security margins across a critical interface using random forest models, with both explanation level and R-squared accuracy exceeding 99%. A generalized linear model (GLM) is then implemented to predict phase angles from day-ahead to hour-ahead time frames, using historical phase angle values and load forecasts. Combining the GLM-based angle forecast with the random forest mapping of phase angles to power transfers results in a new data-driven approach for dynamic security monitoring.
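The sketch below illustrates the mapping step with a random forest regressor trained on synthetic phase-angle-difference data (the paper trains on 4 years of Hydro-Québec EMS measurements); the scikit-learn model and the linear-plus-noise data generator are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
angle_diff = rng.uniform(-30, 30, size=(500, 1))               # degrees across the interface
transfer = 55.0 * angle_diff[:, 0] + rng.normal(0, 20, 500)    # MW, roughly linear + noise

# Learn the phase angle -> power transfer mapping, then query it with forecast angles.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(angle_diff, transfer)
print(model.predict([[12.0], [-25.0]]).round(1))               # predicted MW transfer
```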
Network traffic is a rich source of information for security monitoring. However, the increasing volume of data to process raises issues, rendering holistic analysis of network traffic difficult. In this paper we propose a solution to cope with the tremendous amount of data to analyse for security monitoring purposes. We introduce an architecture dedicated to security monitoring of local enterprise networks. The application domain of such a system is mainly network intrusion detection and prevention, but it can also be used for forensic analysis. This architecture integrates two systems, one dedicated to scalable distributed data storage and management and the other dedicated to data exploitation. DNS data, NetFlow records, HTTP traffic and honeypot data are mined and correlated in a distributed system that leverages state-of-the-art big data solutions. Data correlation schemes are proposed and their performance is evaluated against several well-known big data frameworks, including Hadoop and Spark.
Unstructured data mining has become topical recently due to the availability of high-dimensional and voluminous digital content (known as "Big Data") across the enterprise spectrum. Relational Database Management Systems (RDBMS) have been employed over the past decades for content storage and management, but the ever-growing heterogeneity in today's data calls for a new storage approach. Thus, the NoSQL database has emerged as the preferred storage facility, since it supports unstructured data storage. This creates the need to explore efficient data mining techniques for such NoSQL systems, since the available tools and frameworks designed for RDBMS are often not directly applicable. In this paper, we focus on topic and term mining, based on clustering, in document-based NoSQL databases. This is achieved by adapting the architectural design of an analytics-as-a-service framework and proposing the Viterbi algorithm to enhance the accuracy of term classification in the system. The results from pilot testing of our work show higher accuracy in comparison to some previously proposed techniques, such as parallel search.
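As an illustration of the Viterbi step, the sketch below decodes the most likely tag sequence ("topic" vs. "stopword") for a token stream from a small hand-set hidden Markov model; the states, probabilities and vocabulary are hypothetical.

```python
import numpy as np

states = ["topic", "stop"]
start = np.log([0.5, 0.5])
trans = np.log([[0.7, 0.3],        # P(next state | current state)
                [0.4, 0.6]])
emit = {"nosql": np.log([0.8, 0.2]), "the": np.log([0.05, 0.95]),
        "cluster": np.log([0.7, 0.3])}

def viterbi(tokens):
    v = start + emit[tokens[0]]                     # log-prob of best path ending in each state
    back = []
    for tok in tokens[1:]:
        scores = v[:, None] + trans + emit[tok]     # (previous state, next state)
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0)
    path = [int(v.argmax())]
    for ptr in reversed(back):                      # backtrack the best path
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["the", "nosql", "cluster"]))         # ['stop', 'topic', 'topic']
```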
Although current Internet operations generate voluminous data, they remain largely oblivious to traffic data semantics. This poses many inefficiencies and challenges due to emergent or anomalous behavior impacting the vast array of Internet elements such as services and protocols. In this paper, we propose a Data Semantics Management System (DSMS) for learning Internet traffic data semantics to enable smarter semantics-driven networking operations. We extract networking semantics and build and utilize a dynamic ontology of network concepts to better recognize and act upon emergent or abnormal behavior. Our DSMS utilizes: (1) the Latent Dirichlet Allocation (LDA) algorithm for latent feature extraction and semantics reasoning; (2) big tables as a cloud-like data storage technique to maintain large-scale data; and (3) the Locality Sensitive Hashing (LSH) algorithm for reducing data dimensionality. Our preliminary evaluation using real Internet traffic shows the efficacy of DSMS for learning the behavior of normal and abnormal traffic data and for accurately detecting anomalies at low cost.
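A minimal sketch of random-hyperplane LSH as a dimensionality reduction step: high-dimensional traffic feature vectors are mapped to short bit signatures so that similar flows end up with small Hamming distances. The 200-dimensional features and 16-bit signatures are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
planes = rng.normal(size=(16, 200))            # 16 random hyperplanes in a 200-d feature space

def signature(x):
    """One bit per hyperplane: which side of the hyperplane the vector falls on."""
    return (planes @ x > 0).astype(np.uint8)

flow_a = rng.normal(size=200)
flow_b = flow_a + rng.normal(scale=0.05, size=200)   # nearly identical traffic pattern
flow_c = rng.normal(size=200)                        # unrelated traffic

print(np.sum(signature(flow_a) != signature(flow_b)))  # small Hamming distance
print(np.sum(signature(flow_a) != signature(flow_c)))  # larger Hamming distance
```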
With the arrival of the big data era, information privacy and security issues become even more crucial. The Mining Associations with Secrecy Konstraints (MASK) algorithm and its improved versions were proposed as data mining approaches for privacy-preserving association rules. The MASK algorithm adopts only a data perturbation strategy, which leads to a low privacy-preserving degree. Moreover, it is difficult to apply the MASK algorithm in practice because of its long execution time. This paper proposes a new algorithm based on data perturbation and query restriction (DPQR) that improves the privacy-preserving degree through multi-parameter perturbation. In order to improve time efficiency, the calculation of the inverse matrix is simplified by dividing the matrix into blocks; meanwhile, a further optimization is provided to reduce the number of database scans using set theory. Both theoretical analyses and experimental results prove that the proposed DPQR algorithm has better performance.
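The sketch below shows the kind of block-wise matrix inversion (via the Schur complement) that the time-efficiency optimization alludes to; the 2x2 block split and the example matrix are illustrative assumptions.

```python
import numpy as np

def block_inverse(M, k):
    """Invert M by splitting it into blocks A (k x k), B, C, D and using the Schur complement."""
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    A_inv = np.linalg.inv(A)
    S_inv = np.linalg.inv(D - C @ A_inv @ B)        # inverse of the Schur complement
    top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    return np.block([[top_left, -A_inv @ B @ S_inv],
                     [-S_inv @ C @ A_inv, S_inv]])

M = np.array([[4.0, 1, 0, 2], [1, 3, 1, 0], [0, 1, 2, 1], [2, 0, 1, 5]])
print(np.allclose(block_inverse(M, 2), np.linalg.inv(M)))   # True
```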