Bibliography
In this paper we propose a Twitter sentiment analytics system that mines for opinion polarity about a given topic. Most current semantic sentiment analytics depends on polarity lexicons. However, many key tone words are frequently bipolar. In this paper we demonstrate a technique that accommodates the bipolarity of tone words through a context-sensitive tone lexicon learning mechanism, where the context is modeled by the semantic neighborhood of the main target. Performance analysis shows that the ability to contextualize tone word polarity significantly improves accuracy.
An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to consumption experiences is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering semantic patterns, in particular in tweets, in consumers' discussions on social media. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities, in the two forms of positive vs. negative opinions, and 2) deriving actual content with a semantic trend from the textual data. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study to discover semantic properties of textual data in a consumption context beyond sentiment analysis. In addition to the major findings, we apply LDA to the same data and draw latent topics that represent consumers' positive and negative opinions on social media.
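As a hedged illustration of the methods this abstract names, the following scikit-learn sketch applies cosine similarity, K-means clustering and LDA to a handful of made-up tweets; the toy corpus and all parameter values are assumptions, not the study's data or code.

```python
# Illustrative sketch (not the authors' code): cosine similarity, K-means
# clustering, and LDA topic modeling over two small sets of toy "tweets".
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

positive = ["amazon delivery was fast and cheap", "walmart checkout was quick today"]
negative = ["amazon support never answered my emails", "walmart parking was a nightmare"]
docs = positive + negative

# 1) Similarity between the positive and negative document sets.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
sim = cosine_similarity(X[: len(positive)], X[len(positive):])
print("pos-vs-neg cosine similarities:\n", sim)

# 2) K-means clustering of all documents (k=2 is an assumption).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", labels)

# 3) LDA topics from raw term counts (2 topics is an assumption).
counts = CountVectorizer(stop_words="english")
C = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)
terms = counts.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-3:][::-1]]
    print(f"topic {t}:", top)
```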
Keeping up with rapid advances in research in various fields of Engineering and Technology is a challenging task. Decision makers including academics, program managers, venture capital investors, industry leaders and funding agencies not only need to be abreast of the latest developments but also be able to assess the effect of growth in certain areas on their core business. Though analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections like research publications, analyst reports, patent applications and competitor information to help them finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics and information visualization to aid the process of retrieval and analytics. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We present several methods and techniques that help in analysis and better comprehension of search results. We also present methods for generating insights about emerging and popular trends in research, along with contextual differences between academic research and patenting profiles. Finally, we present novel techniques for presenting topic evolution that help users understand how a particular area has evolved over time.
In the United States, the number of Phasor Measurement Units (PMUs) will increase from 166 networked devices in 2010 to 1043 in 2014. According to the Department of Energy, they are being installed in order to “evaluate and visualize reliability margin (which describes how close the system is to the edge of its stability boundary).” However, there is still much debate in academia and industry around the usefulness of phase angles as unambiguous predictors of dynamic stability. In this paper, using four years of actual data from the Hydro-Québec EMS, it is shown that phase angles enable satisfactory predictions of power transfer and dynamic security margins across critical interfaces using random forest models, with both explanation level and R-squared accuracy exceeding 99%. A generalized linear model (GLM) is then implemented to predict phase angles from day-ahead to hour-ahead time frames, using historical phase angle values and load forecasts. Combining the GLM-based angle forecasts with the random forest mapping of phase angles to power transfers results in a new data-driven approach for dynamic security monitoring.
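A minimal sketch of the two-stage idea described above, on synthetic data: a random forest maps phase-angle differences to a power-transfer margin, and a plain linear regression stands in for the GLM angle forecast. Feature choices, sizes and coefficients are illustrative assumptions, not the paper's data.

```python
# Hedged sketch: (i) a random forest maps phase-angle differences to a
# power-transfer margin, and (ii) a simple linear model forecasts an angle
# from a load forecast plus a lagged angle. All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
angles = rng.normal(0.0, 10.0, size=(n, 3))                 # angle differences (deg)
transfer = 40 * angles[:, 0] - 15 * angles[:, 1] + rng.normal(0, 5, n)  # MW

# (i) Map angles to transfers with a random forest and report R^2 on held-out data.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(angles[:1500], transfer[:1500])
print("held-out R^2:", rf.score(angles[1500:], transfer[1500:]))

# (ii) Day-ahead style forecast of one angle from load forecast + 24-step lagged angle.
load = rng.normal(20000, 1500, n)
lagged = np.roll(angles[:, 0], 24)
X = np.column_stack([load, lagged])[24:]
y = angles[24:, 0]
glm = LinearRegression().fit(X, y)
print("forecast example:", glm.predict(X[:1]))
```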
Enhanced situational awareness is integral to risk management and response evaluation. Dynamic systems that incorporate both hard and soft data sources allow for comprehensive situational frameworks which can supplement physical models with conceptual notions of risk. The processing of widely available semi-structured textual data sources can produce soft information that is readily consumable by such a framework. In this paper, we augment the situational awareness capabilities of a recently proposed risk management framework (RMF) with the incorporation of soft data. We illustrate the beneficial role of the hard-soft data fusion in the characterization and evaluation of potential vessels in distress within Maritime Domain Awareness (MDA) scenarios. Risk features pertaining to maritime vessels are defined a priori and then quantified in real time using both hard (e.g., Automatic Identification System, Douglas Sea Scale) as well as soft (e.g., historical records of worldwide maritime incidents) data sources. A risk-aware metric to quantify the effectiveness of the hard-soft fusion process is also proposed. Though illustrated with MDA scenarios, the proposed hard-soft fusion methodology within the RMF can be readily applied to other domains.
When the system is in a normal state, actual SCADA measurements of power transfers across critical interfaces are continuously compared with limits determined offline and stored in look-up tables or nomograms, in order to assess whether the network is secure or insecure and to inform the dispatcher to take preventive action in the latter case. However, synchrophasors could change this paradigm by enabling new features, the phase-angle differences, which are well-known measures of system stress, with the added potential to increase system visibility. The paper develops a systematic approach to baseline the phase angles versus actual transfer limits across system interfaces and enable synchrophasor-based situational awareness (SBSA). Statistical methods are first used to determine seasonal exceedance levels of angle shifts that can allow real-time scoring and detection of atypical conditions. Next, key buses suitable for SBSA are identified using correlation and partitioning around medoids (PAM) clustering. It is shown that angle shifts of this subset of 15% of the network backbone buses can be effectively used as features in ensemble decision tree-based forecasting of seasonal security margins across critical interfaces.
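A small sketch of the bus-selection step, assuming synthetic angle-shift series: correlation distance between buses followed by a simplified k-medoids (PAM-style) update to pick representative buses. This is not the paper's implementation; sizes and the number of medoids are assumptions.

```python
# Minimal sketch of selecting representative buses with a PAM-style
# k-medoids update over correlation distance between bus angle-shift series.
import numpy as np

rng = np.random.default_rng(1)
n_buses, n_samples, k = 40, 500, 6
shifts = rng.normal(size=(n_buses, n_samples))          # angle-shift time series per bus
dist = 1.0 - np.corrcoef(shifts)                        # correlation distance matrix

def pam(dist, k, iters=20):
    """Simplified k-medoids: move each medoid to the member that minimises
    total within-cluster distance, until no medoid changes."""
    medoids = list(range(k))
    for _ in range(iters):
        assign = np.argmin(dist[:, medoids], axis=1)
        improved = False
        for ci in range(k):
            members = np.where(assign == ci)[0]
            costs = dist[np.ix_(members, members)].sum(axis=0)
            best = members[np.argmin(costs)]
            if best != medoids[ci]:
                medoids[ci] = best
                improved = True
        if not improved:
            break
    return medoids, np.argmin(dist[:, medoids], axis=1)

medoids, labels = pam(dist, k)
print("representative buses (medoid indices):", medoids)
```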
Information fusion deals with the integration and merging of data and information from multiple (heterogeneous) sources. In many cases, the information that needs to be fused has a security classification. The result of the fusion process is then by necessity restricted to the strictest information security classification of the inputs. This has severe drawbacks and limits the possible dissemination of the fusion results. It leads to decreased situational awareness: the organization holds information that would enable a better situation picture, but since parts of the information are restricted, it is not possible to distribute the most correct situational information. In this paper, we take steps towards defining fusion and data mining processes that can be used even when all the underlying data cannot be disseminated. The method we propose could be used to produce a classifier from which all the sensitive information has been removed and for which it can be shown that an antagonist cannot, even in principle, obtain knowledge about the classified information by using the classifier or the situation picture.
Sensors of diverse capabilities and modalities, carried by us or deeply embedded in the physical world, have invaded our personal, social, work, and urban spaces. Our relationship with these sensors is a complicated one. On the one hand, these sensors collect rich data that are shared and disseminated, often initiated by us, with a broad array of service providers, interest groups, friends, and family. Embedded in this data is information that can be used to algorithmically construct a virtual biography of our activities, revealing intimate behaviors and lifestyle patterns. On the other hand, we and the services we use increasingly depend, directly and indirectly, on information originating from these sensors for making a variety of decisions, both routine and critical, in our lives. The quality of these decisions and our confidence in them depend directly on the quality of the sensory information and our trust in the sources. Sophisticated adversaries, benefiting from the same technology advances as the sensing systems, can manipulate sensory sources and analyze data in subtle ways to extract sensitive knowledge, cause erroneous inferences, and subvert decisions. The consequences of these compromises will only amplify as our society increasingly depends on complex human-cyber-physical systems with increased reliance on sensory information and real-time decision cycles. Drawing upon examples of this two-faceted relationship with sensors in applications such as mobile health and sustainable buildings, this talk will discuss the challenges inherent in designing a sensor information flow and processing architecture that is sensitive to the concerns of both producers and consumers. For the pervasive sensing infrastructure to be trusted by both, it must be robust to active adversaries who are deceptively extracting private information, manipulating beliefs and subverting decisions. While completely solving these challenges would require a new science of resilient, secure and trustworthy networked sensing and decision systems that would combine the hitherto separate disciplines of distributed embedded systems, network science, control theory, security, behavioral science, and game theory, this talk will provide some initial ideas. These include an approach to enabling privacy-utility trade-offs that balance the tension between the risk of information sharing to the producer and the value of information sharing to the consumer, and methods to secure systems against physical manipulation of sensed information.
The inappropriate use of features intended to improve the usability and interactivity of web applications has resulted in the emergence of various threats, including Cross-Site Scripting (XSS) attacks. In this work, we developed ETSS Detector, a generic and modular web vulnerability scanner that automatically analyzes web applications to find XSS vulnerabilities. ETSS Detector is able to identify and analyze all data entry points of the application and generate specific code injection tests for each one. The results show that correctly filling the input fields with only valid information ensures better test effectiveness, increasing the detection rate of XSS attacks.
The growing popularity and development of data mining technologies bring serious threats to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of the sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact, unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., data mining result) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss his privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review the game-theoretical approaches proposed for analyzing the interactions among different users in a data mining scenario, each of whom has his own valuation of the sensitive information. By differentiating the responsibilities of different users with respect to the security of sensitive information, we would like to provide some useful insights into the study of PPDM.
We are currently living in the age of Big Data coming along with the challenge to grasp the golden opportunities at hand. This mixed blessing also dominates the relation between Big Data and trust. On the one side, large amounts of trust-related data can be utilized to establish innovative data-driven approaches for reputation-based trust management. On the other side, this is intrinsically tied to the trust we can put in the origins and quality of the underlying data. In this paper, we address both sides of trust and Big Data by structuring the problem domain and presenting current research directions and inter-dependencies. Based on this, we define focal issues which serve as future research directions for the track to our vision of Next Generation Online Trust within the FORSEC project.
With the arrival of the big data era, information privacy and security issues become even more crucial. The Mining Associations with Secrecy Konstraints (MASK) algorithm and its improved versions were proposed as data mining approaches for privacy-preserving association rules. The MASK algorithm adopts only a data perturbation strategy, which leads to a low privacy-preserving degree. Moreover, it is difficult to apply the MASK algorithm in practice because of its long execution time. This paper proposes a new algorithm based on data perturbation and query restriction (DPQR) to improve the privacy-preserving degree through multi-parameter perturbation. In order to improve time efficiency, the calculation of the inverse matrix is simplified by dividing the matrix into blocks; meanwhile, a further optimization is provided to reduce the number of database scans using set theory. Both theoretical analyses and experimental results show that the proposed DPQR algorithm has better performance.
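The speed-up from "dividing the matrix into blocks" refers to block-wise inversion; the generic numpy sketch below shows the standard Schur-complement identity such a scheme relies on. This is illustrative standard linear algebra, not the DPQR code, and the matrix is a random well-conditioned example.

```python
# Generic illustration (not the DPQR implementation) of inverting a matrix
# block-wise with the Schur complement: M = [[A, B], [C, D]].
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3
M = rng.normal(size=(n + m, n + m)) + (n + m) * np.eye(n + m)  # well conditioned
A, B = M[:n, :n], M[:n, n:]
C, D = M[n:, :n], M[n:, n:]

A_inv = np.linalg.inv(A)
S = D - C @ A_inv @ B                        # Schur complement of A
S_inv = np.linalg.inv(S)

top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
top_right = -A_inv @ B @ S_inv
bottom_left = -S_inv @ C @ A_inv
M_inv_blocks = np.block([[top_left, top_right], [bottom_left, S_inv]])

print("max abs error vs. direct inverse:",
      np.abs(M_inv_blocks - np.linalg.inv(M)).max())
```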
With advancements in technology, industry, e-commerce and research, a large amount of complex and pervasive digital data is being generated; it grows at an exponential rate and is often termed big data. Traditional data storage systems are not able to handle big data, and analyzing it becomes a challenge that traditional analytic tools cannot meet. Cloud computing can resolve the problem of handling, storing and analyzing big data, as it distributes the data across cloudlets. Cloud computing is arguably the best answer available to the problem of big data storage and analysis, but there is always a potential risk to the security of big data stored in the cloud, which needs to be addressed. Data privacy is one of the major issues when storing big data in a cloud environment. Data mining based attacks, a major threat to the data, allow an adversary or an unauthorized user to infer valuable and sensitive information by analyzing the results generated from computations performed on the raw data. This thesis proposes a secure k-means data mining approach that assumes the data to be distributed among different hosts and preserves the privacy of the data. The approach maintains the correctness and validity of the existing k-means algorithm, generating the final results correctly even in the distributed environment.
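A hedged sketch of the general idea of distributed k-means with some privacy protection, in which each host reveals only per-cluster sums and counts rather than raw records; the hosts, data and value of k below are synthetic assumptions, not the thesis' protocol.

```python
# Hedged sketch of distributed k-means: each host computes local per-cluster
# sums and counts; the coordinator only ever sees these aggregates.
import numpy as np

rng = np.random.default_rng(3)
hosts = [rng.normal(loc=c, size=(100, 2)) for c in (0.0, 5.0, 10.0)]   # 3 hosts
k = 3
centroids = rng.normal(5.0, 3.0, size=(k, 2))

for _ in range(10):
    sums = np.zeros((k, 2))
    counts = np.zeros(k)
    for data in hosts:                       # performed locally at each host
        assign = np.argmin(((data[:, None, :] - centroids) ** 2).sum(-1), axis=1)
        for c in range(k):
            sums[c] += data[assign == c].sum(axis=0)
            counts[c] += (assign == c).sum()
    # Coordinator updates centroids from aggregates only, never raw points.
    nonzero = counts > 0
    centroids[nonzero] = sums[nonzero] / counts[nonzero, None]

print("final centroids:\n", centroids)
```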
Network traffic is a rich source of information for security monitoring. However, the increasing volume of data to process raises issues, rendering holistic analysis of network traffic difficult. In this paper, we propose a solution to cope with the tremendous amount of data to analyse for security monitoring purposes. We introduce an architecture dedicated to security monitoring of local enterprise networks. The application domain of such a system is mainly network intrusion detection and prevention, but it can be used as well for forensic analysis. This architecture integrates two systems, one dedicated to scalable distributed data storage and management and the other dedicated to data exploitation. DNS data, NetFlow records, HTTP traffic and honeypot data are mined and correlated in a distributed system that leverages state-of-the-art big data solutions. Data correlation schemes are proposed and their performance is evaluated on several well-known big data frameworks, including Hadoop and Spark.
Security companies have recently realised that mining massive amounts of security data can help generate actionable intelligence and improve their understanding of Internet attacks. In particular, attack attribution and situational understanding are considered critical aspects to effectively deal with emerging, increasingly sophisticated Internet attacks. This requires highly scalable analysis tools to help analysts classify, correlate and prioritise security events, depending on their likely impact and threat level. However, this security data mining process typically involves a considerable number of features interacting in a non-obvious way, which makes it inherently complex. To deal with this challenge, we introduce MR-TRIAGE, a set of distributed algorithms built on MapReduce that can perform scalable multi-criteria data clustering on large security data sets and identify complex relationships hidden in massive datasets. The MR-TRIAGE workflow consists of a scalable data summarisation followed by scalable graph clustering algorithms into which we integrate multi-criteria evaluation techniques. The theoretical computational complexity of the proposed parallel algorithms is discussed and analysed. The experimental results demonstrate that the algorithms scale well and efficiently process large security datasets on commodity hardware. Our approach can effectively cluster any type of security events (e.g., spam emails, spear-phishing attacks, etc.) that share at least some commonalities among a number of predefined features.
Multiple-object tracking is an important task in automated video surveillance. In this paper, we present a multiple-human-tracking approach that takes the single-frame human detection results as input and associates them to form trajectories while improving the original detection results by making use of reliable temporal information in a closed-loop manner. It works by first forming tracklets, from which reliable temporal information is extracted, and then refining the detection responses inside the tracklets, which also improves the accuracy of tracklets' quantities. After this, local conservative tracklet association is performed and reliable temporal information is propagated across tracklets so that more detection responses can be refined. The global tracklet association is done last to resolve association ambiguities. Experimental results show that the proposed approach improves both the association and detection results. Comparison with several state-of-the-art approaches demonstrates the effectiveness of the proposed approach.
We propose a novel phishing detection architecture based on transparent virtualization technologies and isolation of its own components. The architecture can be deployed as a security extension for virtual machines (VMs) running in the cloud. It uses fine-grained VM introspection (VMI) to extract, filter and scale a color-based fingerprint of web pages, which a browser processes, from the VM's memory. By analyzing the human-perceptual similarity between the fingerprints, the architecture can reveal and mitigate phishing attacks based on redirection to spoofed web pages, and it can also detect “Man-in-the-Browser” (MitB) attacks. To the best of our knowledge, the architecture is the first anti-phishing solution leveraging virtualization technologies. We explain details of the design and implementation, and we show results of an evaluation with real-world data.
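As a hedged illustration of comparing color-based fingerprints by perceptual similarity, the sketch below reduces two page renderings to coarse RGB histograms and scores their histogram intersection; the random images and the decision threshold are assumptions, not the architecture's actual fingerprinting.

```python
# Illustrative sketch: coarse color histograms as page "fingerprints",
# compared with histogram intersection (1.0 = identical color profiles).
import numpy as np

def color_fingerprint(img, bins=4):
    """Coarse joint RGB histogram, normalized to sum to 1."""
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def similarity(fp_a, fp_b):
    """Histogram intersection in [0, 1]."""
    return np.minimum(fp_a, fp_b).sum()

rng = np.random.default_rng(4)
legit = rng.integers(0, 256, size=(64, 64, 3))                       # stand-in rendering
spoof = np.clip(legit + rng.integers(-10, 10, size=legit.shape), 0, 255)

score = similarity(color_fingerprint(legit), color_fingerprint(spoof))
print("perceptual similarity:", score, "-> flag as phishing if above ~0.9 (assumed threshold)")
```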
This paper presents a methodology for developing a statistically sound, robust prognostic condition index and encapsulating this index as a series of highly accurate, transparent, human-readable rules. These rules can be used to further understand degradation phenomena and also provide transparency and trust for any underlying prognostic technique employed. A case study is presented on a wind turbine gearbox, utilising historical supervisory control and data acquisition (SCADA) data in conjunction with a physics-of-failure model. Training is performed without failure data, with the technique accurately identifying gearbox degradation and providing prognostic signatures up to 5 months before catastrophic failure occurred. A robust derivation of the Mahalanobis distance is employed to perform outlier analysis in the bivariate domain, enabling the rapid labelling of historical SCADA data on independent wind turbines. Following this, the RIPPER rule learner is utilised to extract transparent, human-readable rules from the labelled data. A mean classification accuracy of 95.98% against the autonomously derived condition was achieved on three independent test sets, with a mean kappa statistic of 93.96% reported. In total, 12 rules were extracted; with an independent domain expert providing critical analysis, two-thirds of the rules were deemed intuitive in modelling fundamental degradation behaviour of the wind turbine gearbox.
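A minimal sketch of the labelling step described above, assuming synthetic bivariate SCADA-like features: a robust Mahalanobis distance (here via scikit-learn's MinCovDet) flags outliers against a chi-square threshold. The data, feature names and threshold are illustrative, not the paper's values.

```python
# Hedged sketch: robust Mahalanobis-distance outlier labelling of a
# bivariate feature (e.g. gearbox oil temperature vs. power output),
# producing labels that a rule learner such as RIPPER could then consume.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(5)
healthy = rng.multivariate_normal([60.0, 1500.0], [[4, 30], [30, 900]], 500)
degraded = rng.multivariate_normal([75.0, 1200.0], [[4, 30], [30, 900]], 20)
X = np.vstack([healthy, degraded])           # temp (deg C), power (kW)

mcd = MinCovDet(random_state=0).fit(X)       # robust location/covariance estimate
d2 = mcd.mahalanobis(X)                      # squared robust Mahalanobis distances
threshold = chi2.ppf(0.975, df=2)            # 97.5% quantile for 2 features (assumed)
labels = np.where(d2 > threshold, "degraded", "healthy")
print("flagged as degraded:", int((labels == "degraded").sum()))
```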
Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks exhaust the resources of a server/service and make it unavailable to legitimate users. With the increasing use of online services and attacks on these services, the importance of Intrusion Detection Systems (IDS) for detecting DoS/DDoS attacks has also grown. The detection accuracy and CPU utilization of a data-mining-based IDS depend directly on the quality of the training dataset used to train it. Various preprocessing methods such as normalization, discretization and fuzzification are used by researchers to improve the quality of the training dataset. This paper evaluates the effect of various data preprocessing methods on the detection accuracy of DoS/DDoS attack detection IDS and shows that the numeric-to-binary preprocessing method performs better than the other methods. Experimental results obtained using the KDD 99 dataset are provided to support the efficiency of the proposed combination.
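A hedged sketch of numeric-to-binary preprocessing (in the common WEKA sense, where any non-zero numeric value becomes 1) ahead of training a simple IDS classifier; the tiny feature matrix, feature names and labels below are illustrative assumptions, not the KDD 99 dataset.

```python
# Illustrative sketch: numeric -> binary preprocessing before training a
# simple decision-tree classifier on toy connection records.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_numeric = np.array([[0, 231, 5450],        # hypothetical features, e.g. wrong_fragment,
                      [0, 0, 146],           # duration, src_bytes
                      [3, 0, 0],
                      [1, 0, 0]], dtype=float)
y = np.array(["normal", "normal", "dos", "dos"])

X_binary = (X_numeric != 0).astype(int)      # numeric-to-binary preprocessing
clf = DecisionTreeClassifier(random_state=0).fit(X_binary, y)
print("predictions on training data:", clf.predict(X_binary))
```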