Bibliography
One of the challenges in giving communities wider access to scientific databases is the need for knowledge of database languages such as Structured Query Language (SQL). Although SQL has been published in many forms, not everybody can write SQL queries, and it is often impractical to expect the public to know the structure of the underlying databases. Novice users therefore need a way to query relational databases in their natural language. To address this problem, many natural language interfaces to structured databases have been developed, with the goal of providing a more intuitive way to generate database queries and deliver responses. Leveraging social media, which makes it possible to interact with a wide section of the population, together with natural language processing, researchers at the Atmospheric Radiation Measurement (ARM) Data Center at Oak Ridge National Laboratory (ORNL) have developed a concept that enables easy search and retrieval of data from several environmental data centers for the scientific community. Using a machine learning framework that maps natural language text to thousands of datasets, instruments, variables, and data streams, the prototype system would allow users to request data through Twitter and receive a link (via tweet) to applicable results in the project's search catalog, tailored to their keywords. This automated identification of relevant data from various petascale archives at ORNL could increase the convenience, accessibility, and use of the project's data by the broader community. In this paper we discuss how some data-intensive projects at ORNL are using innovative approaches to aid data discovery.
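The abstract does not describe the mapping in detail; a minimal sketch of the core idea (ranking catalog entries against free text by vector-space similarity) might look like the following, where the catalog strings and the TF-IDF/cosine pipeline are illustrative assumptions, not the project's actual model.

```python
# Minimal sketch: rank catalog entries against a tweet by TF-IDF cosine
# similarity. The catalog contents are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = [
    "surface meteorology temperature humidity Oklahoma site",
    "cloud radar reflectivity vertical profiles",
    "aerosol optical depth sun photometer measurements",
]
tweet = "looking for cloud radar data over Oklahoma"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(catalog)   # catalog vectors
query_vec = vectorizer.transform([tweet])        # tweet vector

scores = cosine_similarity(query_vec, doc_matrix)[0]
best = scores.argmax()
print(f"best match: catalog entry {best} (score {scores[best]:.2f})")
```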
Traditional anti-virus technologies have failed to keep pace with the proliferation of malware because their signature and heuristic updates are slow. Similarly, time and resource constraints make manual analysis of each malware sample impractical. There is a need to learn from this vast quantity of data, containing cyber attack patterns, in an automated manner in order to proactively adapt to ever-evolving threats. Machine learning offers unique advantages for learning from past cyber attacks to handle future threats. The purpose of this research is to propose a framework for multi-class classification of malware into well-known categories by applying different machine learning models to a corpus of malware analysis reports. These reports are generated automatically through an open-source malware sandbox. We applied extensive pre-modeling techniques for data cleaning, feature exploration, and feature engineering to prepare the training and test datasets. The best possible hyper-parameters were selected to build the machine learning models. The prepared datasets were then used to train the classifiers and compare their prediction accuracy. Finally, the results were validated through a comprehensive 10-fold cross-validation methodology. The best results were achieved by a Gaussian Naive Bayes classifier, with an accuracy of 96% on a random train-test split and a 10-fold cross-validation accuracy of 91.2%. The proposed framework can be deployed in an operational environment to learn from malware attacks and proactively adapt matching countermeasures.
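As a rough sketch of the modeling and validation step described above, the following uses scikit-learn's Gaussian Naive Bayes with 10-fold cross-validation; the feature matrix and labels are random placeholders standing in for the paper's sandbox-report features.

```python
# Sketch: Gaussian Naive Bayes with a random train-test split and 10-fold CV.
# X and y are assumed to come from the paper's pre-modeling pipeline;
# random data stands in here.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # placeholder report features
y = rng.integers(0, 4, size=500)    # placeholder malware-category labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
print("hold-out accuracy:", clf.score(X_te, y_te))

cv_scores = cross_val_score(GaussianNB(), X, y, cv=10)  # 10-fold CV
print("10-fold CV accuracy:", cv_scores.mean())
```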
With the exponential rise in cyber threats, organizations are now striving for better data mining techniques to analyze the security logs received from their IT infrastructures and ensure effective, automated cyber threat detection. Machine Learning (ML) based analytics for security machine data is the next emerging trend in cyber security, aimed at mining security data to uncover advanced targeted threat actors while minimizing the operational overhead of maintaining static correlation rules. However, the selection of an optimal machine learning algorithm for security log analytics remains an impediment to the success of data science in cyber security, owing to the risk of a large number of false-positive detections, especially in large-scale or global Security Operations Center (SOC) environments. This creates a dire need for an efficient machine learning based cyber threat detection model capable of minimizing false detection rates. In this paper, we propose optimal machine learning algorithms, together with an implementation framework, based on analytical and empirical evaluation of results gathered from various prediction, classification, and forecasting algorithms.
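The abstract's concern with false positives suggests comparing candidates on a false-positive-sensitive metric, not accuracy alone. The sketch below is one plausible way to do that (precision penalizes false positives); the candidate models and data are assumptions, not the paper's selection.

```python
# Sketch: compare candidate classifiers on security-log features using
# precision (penalizes false positives) alongside accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 15))     # placeholder log-derived features
y = rng.integers(0, 2, size=600)   # 1 = threat, 0 = benign

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gnb": GaussianNB(),
}
for name, model in candidates.items():
    prec = cross_val_score(model, X, y, cv=5, scoring="precision").mean()
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: precision={prec:.2f} accuracy={acc:.2f}")
```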
Every day, huge amounts of unstructured text are generated, much of it in the form of essays, research papers, patents, scholastic articles, and book chapters. Many plagiarism-detection tools have been developed to reduce the stealing and plagiarizing of Intellectual Property (IP). Current tools mainly use string matching algorithms to detect text copied from another source. The drawback of such tools is their inability to detect plagiarism when the structure of a sentence has been changed; replacement of keywords with their synonyms also goes undetected. This paper proposes a new method to detect such plagiarism using semantic knowledge graphs. The method uses Named Entity Recognition as well as semantic similarity between sentences to detect possible cases of plagiarism. Doubtful cases are visualized using semantic knowledge graphs for thorough analysis of authenticity. Rules for active and passive voice have also been considered in the proposed methodology.
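A minimal sketch of the two signals the method combines (shared named entities and sentence-level semantic similarity) is shown below using spaCy; the 0.85 threshold and the example sentences are illustrative assumptions, and the en_core_web_md model must be installed (python -m spacy download en_core_web_md).

```python
# Sketch: flag a sentence pair as suspicious when it shares named entities
# AND has high vector-based semantic similarity, even if the sentence
# structure (e.g. active vs. passive voice) differs.
import spacy

nlp = spacy.load("en_core_web_md")

source = nlp("Einstein published the theory of relativity in 1915.")
suspect = nlp("The theory of relativity was published by Einstein in 1915.")

shared_entities = {e.text for e in source.ents} & {e.text for e in suspect.ents}
similarity = source.similarity(suspect)   # cosine over word vectors

if similarity > 0.85 and shared_entities:
    print("possible plagiarism:", shared_entities, f"sim={similarity:.2f}")
```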
Feature extraction and feature selection are the first tasks in pre-processing input logs when detecting cybersecurity threats and attacks with data mining techniques from the field of Artificial Intelligence. When analyzing heterogeneous data derived from different sources, these tasks are time-consuming and difficult to manage efficiently. In this paper, we present an approach for handling feature extraction and feature selection using machine learning algorithms for security analytics of heterogeneous data derived from different network sensors. The approach is implemented in Apache Spark, using its Python API, PySpark.
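As a small sketch of what such a PySpark pipeline can look like, the following assembles raw sensor columns into a feature vector and then keeps the most label-relevant features with a chi-squared selector; the column names, toy rows, and choice of selector are assumptions for illustration.

```python
# Sketch: feature extraction (VectorAssembler) and feature selection
# (ChiSqSelector) over heterogeneous sensor columns in PySpark.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

spark = SparkSession.builder.appName("security-analytics").getOrCreate()
rows = [(7.0, 0.0, 120.0, 1.0), (1.0, 1.0, 15.0, 0.0), (9.0, 0.0, 300.0, 1.0)]
df = spark.createDataFrame(rows, ["conn_rate", "proto_flag", "bytes", "label"])

assembler = VectorAssembler(
    inputCols=["conn_rate", "proto_flag", "bytes"], outputCol="features")
features_df = assembler.transform(df)

selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         outputCol="selected", labelCol="label")
selector.fit(features_df).transform(features_df).select("selected", "label").show()
```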
Because there has been little research on firmware vulnerability mining, and traditional fuzzing-based vulnerability mining is inefficient, this paper proposes a new method for mining vulnerabilities in industrial control system firmware. Based on taint analysis, the method constructs test cases specifically for the variables that may trigger vulnerabilities, thus reducing the number of invalid test cases and improving test efficiency. Experimental results show that the method reduces the number of test cases by about 23% and effectively improves test efficiency.
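A toy sketch of the taint-guided idea follows: mutate only the input bytes that taint analysis has flagged as flowing into a vulnerable operation, instead of fuzzing the whole input. The taint map here is a stand-in; a real implementation would obtain it from taint analysis of the firmware.

```python
# Sketch: generate test cases that differ from a seed only at tainted bytes,
# so fuzzing effort is concentrated where it can trigger the vulnerability.
import random

def taint_guided_mutations(seed: bytes, tainted_offsets, n_cases=10):
    """Yield test cases mutated only at offsets marked tainted."""
    rng = random.Random(0)
    for _ in range(n_cases):
        case = bytearray(seed)
        for off in tainted_offsets:
            case[off] = rng.randrange(256)   # mutate only influential bytes
        yield bytes(case)

seed_input = b"\x00" * 32
tainted = [4, 5, 6, 7]   # offsets reported by taint analysis (assumed)
for case in taint_guided_mutations(seed_input, tainted, n_cases=3):
    print(case.hex())
```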
With the extensive application of cloud computing technology, security is of paramount importance in the cloud. Several surveys have covered intrusion detection techniques for the cloud computing environment. We summarize the literature on various attack taxonomies describing threats in the cloud environment, such as attacks on virtual machines, attacks on the virtual machine monitor, and attacks within the tenant network. We also review the many existing solutions proposed in the literature, including misuse detection techniques, behavioral analysis of network traffic, behavioral analysis of programs, and virtual machine introspection (VMI) techniques. In addition, we summarize some innovations in the field of cloud security, such as CloudVMI, data mining techniques, artificial intelligence, and blockchain technology. Our team has also designed and implemented a prototype system, CloudI (Cloud Introspection), characterized by high security, high performance, high expandability, and multiple functions.
Social media has become one of the most effective and precise barometers of public opinion. This paper discusses a strategy that uses Twitter data to gauge public opinion. The public expresses sentiments about particular entities with varying strength and intensity, and these sentiments are strongly related to their personal moods and emotions. Many methods and lexical resources have been proposed for extracting sentiment from natural language texts that express diverse opinions. This article proposes a path for boosting Twitter sentiment classification using various sentiment proportions as meta-level features. The analysis was performed on tweets about the product iPhone 6.
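One plausible reading of "sentiment proportions as meta-level features" is sketched below: represent each tweet by the fraction of positive and negative lexicon hits, then train a standard classifier on those features. The tiny lexicon, labels, and classifier choice are illustrative placeholders, not the paper's exact resources.

```python
# Sketch: per-tweet sentiment proportions as meta-level features for a
# downstream classifier.
from sklearn.linear_model import LogisticRegression

POS = {"great", "love", "fast", "amazing"}
NEG = {"slow", "hate", "broken", "poor"}

def meta_features(tweet: str):
    tokens = tweet.lower().split()
    n = max(len(tokens), 1)
    return [sum(t in POS for t in tokens) / n,   # positive proportion
            sum(t in NEG for t in tokens) / n]   # negative proportion

tweets = ["love the iPhone 6 camera", "battery life is poor and slow",
          "amazing screen, fast phone", "hate the broken home button"]
labels = [1, 0, 1, 0]                            # 1 = positive, 0 = negative

X = [meta_features(t) for t in tweets]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([meta_features("fast and amazing device")]))
```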
This article shows the analogy between natural language texts and quantum-like systems, using the calculation of the Bell test as an example. The applicability of the well-known Bell test to texts in Russian is investigated, and the possibility of using this test to separate texts into topics corresponding to a user query in an information retrieval system is shown.
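For reference, the Bell test is usually stated in its CHSH form, shown below: locally realistic (classical) systems obey the bound of 2, while quantum-like correlations can reach 2√2. How the article maps the measurement settings a, b, a', b' onto texts and queries is an assumption not detailed in this abstract.

```latex
% CHSH form of the Bell inequality: E(.,.) are pairwise correlations for
% two measurement settings per side.
\[
  S = \bigl| E(a,b) - E(a,b') + E(a',b) + E(a',b') \bigr| \le 2,
  \qquad
  S_{\text{quantum}} \le 2\sqrt{2}.
\]
```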
The Internet of Things (IoT) is experiencing exponential growth, which introduces new challenges for the management of IoT networks. The question that emerges is how we can trust the constrained infrastructure that is shortly expected to be formed by millions of 'things.' The answer is not to trust. This research introduces Amatista, a blockchain-based middleware for management in IoT. Amatista presents a novel zero-trust hierarchical mining process that allows the infrastructure and its transactions to be validated at different levels of trust. This research evaluates Amatista on Edison Arduino boards.
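Amatista's actual protocol is not spelled out in this abstract; purely as an illustration of hash-linked blocks with per-level proof-of-work, one could imagine higher trust levels being granted an easier difficulty target, as in this hypothetical sketch.

```python
# Illustrative sketch (not Amatista's protocol): hash-linked blocks whose
# mining difficulty varies with the trust level of the submitter.
import hashlib, json, time

def mine_block(prev_hash: str, payload: dict, difficulty: int) -> dict:
    """Find a nonce so the block hash starts with `difficulty` zeros."""
    ts = time.time()
    nonce = 0
    while True:
        block = {"prev": prev_hash, "payload": payload, "ts": ts, "nonce": nonce}
        digest = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            block["hash"] = digest
            return block
        nonce += 1

genesis = mine_block("0" * 64, {"event": "device-registered"}, difficulty=2)
# A more trusted tier might mine at a lower difficulty (assumed policy).
txn = mine_block(genesis["hash"], {"event": "sensor-reading"}, difficulty=1)
print(txn["hash"])
```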
Phishing has increased tremendously over the last few years and has become a serious threat to global security and economy, yet the existing literature dealing with the problem is scarce. Phishing is a deception technique that uses a combination of technology and social engineering to acquire sensitive information such as online banking passwords and credit card or bank account details [2]. Phishing can be carried out through emails and websites that collect confidential information: phishers design fraudulent websites that look similar to legitimate ones and lure users into visiting them. Users must therefore be aware of malicious websites in order to protect their sensitive data [1], but it is very difficult to distinguish a legitimate website from a fake one, especially for non-technical users [4]. Moreover, phishing sites are growing rapidly. The aim of this paper is to demonstrate phishing detection using fuzzy logic and to interpret the results using different defuzzification methods.
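A compact sketch of such a fuzzy pipeline follows: fuzzify one input feature (URL length, here assumed) with triangular membership functions, apply two toy rules, and defuzzify the aggregated output with the centroid method. The membership breakpoints and rules are assumptions for illustration, not the paper's rule base.

```python
# Sketch: triangular fuzzification, Mamdani-style rule clipping, and
# centroid defuzzification for a phishing-risk score.
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

url_len = 70.0                               # crisp input feature
low = tri(url_len, 0, 20, 50)                # membership in "short URL"
high = tri(url_len, 40, 80, 120)             # membership in "long URL"

risk = np.linspace(0, 100, 201)              # output universe: risk in [0,100]
legit_set = np.minimum(tri(risk, 0, 10, 50), low)     # IF short THEN risk LOW
phish_set = np.minimum(tri(risk, 50, 90, 100), high)  # IF long THEN risk HIGH
aggregate = np.maximum(legit_set, phish_set)

centroid = (risk * aggregate).sum() / aggregate.sum() # centroid defuzzification
print(f"phishing risk: {centroid:.1f}/100")
```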
In this paper, we examine the recent trend towards in-browser mining of cryptocurrencies; in particular, the mining of Monero through Coinhive and similar codebases. In this model, a user visiting a website will download JavaScript code that executes client-side in her browser, mines a cryptocurrency - typically without her consent or knowledge - and pays out the seigniorage to the website. Websites may consciously employ this as an alternative to, or a supplement for, advertisement revenue; may offer premium content in exchange for mining; or may be unwittingly serving the code as a result of a breach (in which case the seigniorage is collected by the attacker). The cryptocurrency Monero is preferred seemingly for its unfriendliness to the large-scale ASIC mining that would drive browser-based efforts out of the market, as well as for its purported privacy features. In this paper, we survey this landscape, conduct measurements to establish its prevalence and profitability, outline an ethical framework for considering whether it should be classified as an attack or a business opportunity, and make suggestions for the detection, mitigation, and/or prevention of browser-based mining for non-consenting users.
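One simple detection approach in this space (not necessarily the paper's) is signature matching against known miner loaders and APIs, as in the sketch below; the signature list is a small illustrative sample, not a complete blocklist.

```python
# Sketch: flag pages that load known in-browser mining scripts or
# instantiate their JS API.
import re

MINER_SIGNATURES = [
    r"coinhive\.min\.js",          # classic Coinhive loader
    r"CoinHive\.Anonymous\s*\(",   # Coinhive JS API instantiation
    r"cryptonight",                # Monero PoW algorithm name in scripts
]

def looks_like_miner(page_source: str) -> bool:
    return any(re.search(sig, page_source) for sig in MINER_SIGNATURES)

html = '<script src="https://coinhive.com/lib/coinhive.min.js"></script>'
print(looks_like_miner(html))   # True
```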
Modern infrastructure is heavily reliant on systems with interconnected computational and physical resources, known as Cyber-Physical Systems (CPSs). Building resilient CPSs is therefore a prime need, and continuous monitoring of CPS operational health is essential for improving resilience. This paper presents a framework for calculating and monitoring health in CPSs using data driven techniques. The main advantages of this data driven methodology are its ability to leverage the heterogeneous data streams available from CPSs and to perform monitoring with minimal a priori domain knowledge. The main objective of the framework is to warn operators of any degradation in the cyber, physical, or overall health of the CPS. The framework consists of four components: 1) data acquisition and feature extraction, 2) state identification and real-time state estimation, 3) cyber-physical health calculation, and 4) operator warning generation. Further, this paper presents an initial implementation of the first three phases of the framework on a CPS testbed involving a microgrid simulation and a cyber network which connects the grid with its controller. The feature extraction method and the use of unsupervised learning algorithms are discussed. Experimental results are presented for the first two phases; they show that the data reflected different operating states and that visualization techniques can be used to extract the relationships among data features.
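A minimal sketch of the first two phases follows: extract simple statistical features from windows of heterogeneous CPS streams, then identify operating states with an unsupervised clusterer. The signals, window size, and choice of k-means are assumptions; the paper's exact features and algorithms are not given in this abstract.

```python
# Sketch: windowed feature extraction from heterogeneous telemetry,
# followed by unsupervised operating-state identification.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
stream = np.column_stack([
    rng.normal(60, 1, 1000),     # e.g. grid frequency (Hz)
    rng.normal(1.0, 0.1, 1000),  # e.g. bus voltage (p.u.)
    rng.poisson(5, 1000),        # e.g. network packets per tick
])

window = 50
features = np.array([
    np.concatenate([w.mean(axis=0), w.std(axis=0)])   # mean/std per window
    for w in np.split(stream, len(stream) // window)
])

states = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print("per-window operating state:", states[:10])
```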
Currently, due to improvements in defensive systems, network covert channels are increasingly drawing the attention of cybercriminals and malware developers, as they can make malicious communication stealthy and thus bypass existing security solutions. At the same time, the data hiding methods used are becoming increasingly sophisticated as attackers, in order to stay under the radar, distribute the covert data among many connections, protocols, etc. The detection of such threats is therefore a pressing issue. In this paper we make an initial step in this direction by presenting a data mining-based detection of such advanced threats which relies on a pattern discovery technique. The initial experimental results indicate that the solution has potential and should be investigated further.
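One concrete form pattern discovery can take for covert timing channels (an assumption; the paper's technique is not detailed in this abstract) is discretizing inter-packet delays into symbols and flagging traffic whose symbol n-grams are unusually regular:

```python
# Sketch: n-gram pattern frequency over discretized inter-packet delays;
# covert timing channels tend to produce overly regular patterns.
from collections import Counter

delays = [0.10, 0.30, 0.10, 0.30, 0.10, 0.30, 0.10, 0.30]  # seconds

symbols = ["S" if d < 0.2 else "L" for d in delays]   # short/long symbols
bigrams = Counter(zip(symbols, symbols[1:]))

total = sum(bigrams.values())
top_pattern, count = bigrams.most_common(1)[0]
if count / total > 0.4:                               # regularity threshold (assumed)
    print(f"suspicious timing pattern {top_pattern}: {count}/{total} bigrams")
```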
In the light of the information revolution and the propagation of big social data, the dissemination of misleading information is difficult to control, owing to the rapid and intensive flow of information through unconfirmed sources in the form of propaganda and tendentious rumors. This causes confusion and loss of trust between individuals and groups, and even between governments and their citizens. It necessitates a consolidation of efforts to stop the penetration of false information by developing theoretical and practical methodologies that measure the credibility of the users of these virtual platforms. This paper presents a domain-based approach to predicting the trustworthiness of users of Online Social Networks (OSNs). By incorporating three machine learning algorithms, the experimental results verify the applicability of the proposed approach to classifying and predicting domain-based trustworthy OSN users.
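The abstract does not name its three algorithms or features; as one hedged illustration of combining three classifiers for such a prediction task, a simple voting ensemble could look like this (models, per-domain features, and labels are all assumptions):

```python
# Sketch: three classifiers combined by majority vote for domain-based
# trustworthiness prediction.
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 8))      # e.g. per-domain activity statistics
y = rng.integers(0, 2, size=400)   # 1 = trustworthy in this domain

ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```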
The power of artificial neural networks to form predictive models for phenomena that exhibit non-linear relationships is a given fact. Despite this advantage, artificial neural networks are known to suffer drawbacks such as long training times and computational intensity. The researchers propose a two-tiered approach to enhance the learning performance of artificial neural networks for time-series phenomena whose data exhibit predictable changes every calendar year. This paper focuses on the initial results of the first phase of the proposed algorithm, which incorporates clustering and classification prior to application of the backpropagation algorithm. The 2016–2017 zonal load data of France is used as the data set. K-means is chosen as the clustering algorithm, and a comparison is made between Naïve Bayes and k-Nearest Neighbors to determine the better classifier for this data set. The initial results show that electrical load behavior does not necessarily reflect calendar clustering, even without using the min-max temperature recorded during the inclusive months. Simulating the day-type classification process using one cluster, the initial results show that k-Nearest Neighbors outperforms the Naïve Bayes classifier for this data set and that the best feature for classification into day type is the daily min-max load. The classified load data are expected to reduce training time and improve the overall performance of short-term load demand predictive models in a future paper.
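A sketch of this first phase follows: cluster days of zonal load with k-means to obtain day types, then classify new days into those types with k-NN using the daily min-max load feature the paper found most useful. The synthetic data, cluster count, and neighbor count are assumptions.

```python
# Sketch: k-means day-type clustering followed by k-NN classification on
# the daily (min, max) load feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
daily_load = rng.normal(50, 10, size=(365, 24))   # placeholder hourly loads

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(daily_load)
day_type = kmeans.labels_                          # cluster index = day type

# Classification feature: daily min and max load.
min_max = np.column_stack([daily_load.min(axis=1), daily_load.max(axis=1)])
knn = KNeighborsClassifier(n_neighbors=5).fit(min_max, day_type)

new_day = rng.normal(50, 10, size=(1, 24))
print("predicted day type:", knn.predict(
    np.column_stack([new_day.min(axis=1), new_day.max(axis=1)])))
```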
Distributed diffusion is a powerful algorithm for multi-task state estimation which enables networked agents to interact with neighbors to process input data and diffuse information across the network. Compared to a centralized approach, diffusion offers multiple advantages that include robustness to node and link failures. In this paper, we consider distributed diffusion for multi-task estimation where networked agents must estimate distinct but correlated states of interest by processing streaming data. By exploiting the adaptive weights used for diffusing information, we develop attack models that drive normal agents to converge to states selected by the attacker. The attack models can be used for both stationary and non-stationary state estimation. In addition, we develop a resilient distributed diffusion algorithm under the assumption that the number of compromised nodes in the neighborhood of each normal node is bounded by F, and we show that resilience may be obtained at the cost of performance degradation. Finally, we evaluate the proposed attack models and resilient distributed diffusion algorithm using stationary and non-stationary multi-target localization.
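For context, the adaptive combination weights that the attack models exploit appear in the standard adapt-then-combine (ATC) diffusion LMS recursion from the diffusion literature, shown below under the usual linear-regression data model; the paper's exact variant may differ.

```latex
% ATC diffusion LMS at agent k, time i: adapt with local data, then combine
% neighbors' intermediate estimates via weights a_{lk}.
\[
\begin{aligned}
  \psi_{k,i} &= w_{k,i-1}
    + \mu_k\, u_{k,i}\bigl(d_k(i) - u_{k,i}^{\mathsf{T}} w_{k,i-1}\bigr)
    && \text{(adapt)} \\
  w_{k,i} &= \sum_{l \in \mathcal{N}_k} a_{lk}\, \psi_{l,i},
    \qquad a_{lk} \ge 0,\quad \sum_{l \in \mathcal{N}_k} a_{lk} = 1
    && \text{(combine)}
\end{aligned}
\]
```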
Cloud computing is an emerging technology that provides services to its users via the Internet. It also allows resources to be shared, thereby reducing cost and space. With the popularity of the cloud and its advantages, the information industry is shifting towards cloud services at a tremendous rate. Many cloud service providers on the Internet offer services to users, and these services have various parameters affecting their quality. It is difficult for users to select the cloud service best suited to their requirements. Our proposed approach is based on a data mining classification technique combined with fuzzy logic. The proposed algorithm uses cloud service design factors (security, agility, assurance, etc.) and international standards to suggest a cloud service. The main objective of this research is to enable end cloud users to choose the best service for their requirements while meeting international standards. We test our system with the major cloud providers Google, Microsoft, and Amazon.
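As a hedged illustration of scoring services against user requirements with fuzzy ratings (the paper's actual factor set, ratings, and aggregation are not given in this abstract), services could be ranked by a weighted aggregate of per-factor memberships:

```python
# Sketch: rank cloud services by a weighted aggregate of fuzzy ratings for
# each design factor. All ratings and weights are made-up examples.
services = {
    # fuzzy ratings in [0, 1] for (security, agility, assurance)
    "ProviderA": (0.9, 0.6, 0.8),
    "ProviderB": (0.7, 0.9, 0.6),
    "ProviderC": (0.8, 0.7, 0.9),
}
user_weights = (0.5, 0.2, 0.3)   # this user weights security most

def fuzzy_score(ratings, weights):
    return sum(r * w for r, w in zip(ratings, weights))

ranked = sorted(services,
                key=lambda s: fuzzy_score(services[s], user_weights),
                reverse=True)
print("recommended order:", ranked)
```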