Biblio

Found 168 results

Filters: Keyword is natural language processing

Bhatia, Jaspreet, Breaux, Travis, Reidenberg, Joel, Norton, Thomas.  2016.  A Theory of Vagueness and Privacy Risk Perception. 2016 IEEE 24th International Requirements Engineering Conference (RE).

Ambiguity arises in requirements when a statement is unintentionally or otherwise incomplete, missing information, or when a word or phrase has more than one possible meaning. For web-based and mobile information systems, ambiguity, and vagueness in particular, undermines the ability of organizations to align their privacy policies with their data practices, which can confuse or mislead users, thus leading to an increase in privacy risk. In this paper, we introduce a theory of vagueness for privacy policy statements based on a taxonomy of vague terms derived from an empirical content analysis of 15 privacy policies. The taxonomy was evaluated in a paired comparison experiment and results were analyzed using the Bradley-Terry model to yield a rank order of vague terms in both isolation and composition. The theory predicts how vague modifiers to information actions and information types can be composed to increase or decrease overall vagueness. We further provide empirical evidence based on factorial vignette surveys to show how increases in vagueness will decrease users' acceptance of privacy risk and thus decrease users' willingness to share personal information.
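
As an illustration of the ranking step, the Bradley-Terry model can be fit to paired-comparison judgments in a few lines. The sketch below uses Hunter's MM algorithm on hypothetical win counts for invented vague terms; it is not the authors' implementation or data.

```python
# Minimal Bradley-Terry fit via Hunter's (2004) MM algorithm.
# Terms and win counts are hypothetical illustrations, not the paper's data.
import numpy as np

terms = ["generally", "typically", "may", "as needed"]
# wins[i, j] = number of judges who rated term i vaguer than term j
wins = np.array([
    [0, 3, 5, 2],
    [7, 0, 6, 4],
    [5, 4, 0, 3],
    [8, 6, 7, 0],
], dtype=float)

p = np.ones(len(terms))                # initial ability scores
for _ in range(200):
    total = wins + wins.T              # n_ij: comparisons between i and j
    W = wins.sum(axis=1)               # total wins of each term
    denom = total / (p[:, None] + p[None, :] + 1e-12)
    np.fill_diagonal(denom, 0.0)
    p = W / denom.sum(axis=1)          # MM update: p_i = W_i / sum_j n_ij/(p_i+p_j)
    p /= p.sum()                       # normalize for identifiability

for term, score in sorted(zip(terms, p), key=lambda t: -t[1]):
    print(f"{term:12s} {score:.3f}")
```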

Park, A. J., Beck, B., Fletche, D., Lam, P., Tsang, H. H.  2016.  Temporal analysis of radical dark web forum users. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). :880–883.

Extremist groups have turned to the Internet and social media sites as a means of sharing information amongst one another. This research study analyzes forum posts and finds people who show radical tendencies through the use of natural language processing and sentiment analysis. The forum data being used are from six Islamic forums on the Dark Web which are made available for security research. This research project uses a POS tagger to isolate keywords and nouns that can be utilized with the sentiment analysis program. Then the sentiment analysis program determines the polarity of the post. The post is scored as either positive or negative. These scores are then divided into monthly radical scores for each user. Once these time clusters are mapped, the change in opinions of the users over time may be interpreted as rising or falling levels of radicalism. Each user is then compared on a timeline to other radical users and events to determine possible connections or relationships. The ability to analyze a forum for an overall change in attitude can be an indicator of unrest and possible radical actions or terrorism.
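
The per-user monthly aggregation step described above can be sketched briefly. NLTK's VADER analyzer stands in for the paper's unnamed sentiment program, and the posts are invented; run nltk.download("vader_lexicon") once beforehand.

```python
# Sketch of monthly sentiment aggregation per forum user. VADER is a
# stand-in for the paper's sentiment program; posts are hypothetical.
from collections import defaultdict
from datetime import date
from nltk.sentiment import SentimentIntensityAnalyzer

posts = [  # (user, post date, text) -- illustrative only
    ("user_a", date(2013, 1, 5), "This community gives me hope."),
    ("user_a", date(2013, 2, 9), "The enemy must be destroyed."),
    ("user_b", date(2013, 1, 20), "Peaceful dialogue is the answer."),
]

sia = SentimentIntensityAnalyzer()
monthly = defaultdict(list)
for user, day, text in posts:
    score = sia.polarity_scores(text)["compound"]  # -1 (negative) .. +1
    monthly[(user, day.strftime("%Y-%m"))].append(score)

for (user, month), scores in sorted(monthly.items()):
    print(user, month, sum(scores) / len(scores))
```
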
Tomuro, Noriko, Lytinen, Steven, Hornsburg, Kurt.  2016.  Automatic Summarization of Privacy Policies Using Ensemble Learning. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. :133–135.

When customers purchase a product or sign up for a service from a company, they are often required to agree to a Privacy Policy or Terms of Service agreement. Many of these policies are lengthy, and a typical customer agrees to them without reading them carefully, if at all. To address this problem, we have developed a prototype automatic text summarization system which is specifically designed for privacy policies. Our system generates a summary of a policy statement by identifying important sentences from the statement, categorizing these sentences by which of five "statement categories" each sentence addresses, and displaying to a user a list of the sentences which match each category. Our system incorporates keywords identified by a human domain expert and rules that were obtained by machine learning, combined in an ensemble architecture. We have tested our system on a sample corpus of privacy statements, and preliminary results are promising.
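
The keyword-plus-classifier ensemble idea can be illustrated with a small sketch. The categories, keyword lists, and training sentences below are hypothetical stand-ins, not the paper's expert-curated rules.

```python
# Hedged sketch of a keyword-rule + learned-classifier ensemble for
# categorizing privacy-policy sentences. All data here is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

KEYWORDS = {"collection": ["collect", "gather"], "sharing": ["share", "third party"]}

train_sents = ["We collect your email address.", "We share data with partners."]
train_labels = ["collection", "sharing"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_sents, train_labels)

def categorize(sentence):
    # Expert keyword rule fires first; otherwise fall back to the model.
    lowered = sentence.lower()
    for category, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return category
    return clf.predict([sentence])[0]

print(categorize("We may gather usage statistics."))  # keyword rule
print(categorize("Partners receive your data."))      # learned model
```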

Liao, Xiaojing, Yuan, Kan, Wang, XiaoFeng, Li, Zhou, Xing, Luyi, Beyah, Raheem.  2016.  Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :755–766.

To adapt to the rapidly evolving landscape of cyber threats, security professionals are actively exchanging Indicators of Compromise (IOC) (e.g., malware signatures, botnet IPs) through public sources (e.g., blogs, forums, tweets). Such information, often presented in articles, posts, white papers, etc., can be converted into a machine-readable OpenIOC format for automatic analysis and quick deployment to various security mechanisms like an intrusion detection system. With hundreds of thousands of sources in the wild, IOC data are produced at high volume and velocity today, which makes them increasingly hard to manage by humans. Efforts to automatically gather such information from unstructured text, however, are impeded by the limitations of today's Natural Language Processing (NLP) techniques, which cannot meet the high standard (in terms of accuracy and coverage) expected of IOCs that could serve as direct input to a defense system. In this paper, we present iACE, an innovative solution for fully automated IOC extraction. Our approach is based upon the observation that the IOCs in technical articles are often described in a predictable way: being connected to a set of context terms (e.g., "download") through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., "malware", "download") within the sentences in a technical article, and further analyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that an IOC is commonly presented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., downloaded from an external source). Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates remarkable performance: it generated 900K OpenIOC items with a precision of 95% and coverage over 90%, far beyond what state-of-the-art NLP techniques and industry IOC tools can achieve, at a speed of thousands of articles per hour. Further, by correlating the IOCs mined from articles published over a 13-year span, our study sheds new light on the links across hundreds of seemingly unrelated attack instances, particularly their shared infrastructure resources, as well as the impacts of such open-source threat intelligence on security protection and the evolution of attack strategies.
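
A heavily simplified sketch of context-aware IOC extraction appears below. A regex finds candidate indicators, and a same-sentence context-term check stands in for the paper's grammatical-relation graph mining; the article text and term lists are invented.

```python
# Simplified IOC extraction: regex candidates plus a co-occurring context
# term in the same sentence (a crude stand-in for iACE's graph mining).
import re

CONTEXT_TERMS = {"download", "malware", "c2", "dropped", "connects"}
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "zip": re.compile(r"\b\S+\.zip\b"),
}

article = ("The malware was dropped as payload.zip and then "
           "connects to 203.0.113.7 for instructions. "
           "Version 10.0.113.7 of the library is unrelated.")

for sentence in re.split(r"(?<=[.!?])\s+", article):
    words = set(sentence.lower().split())
    if not words & CONTEXT_TERMS:
        continue  # no IOC-like context in this sentence: skip it,
                  # which suppresses the version-number false positive
    for ioc_type, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(sentence):
            print(f"{ioc_type}: {match}  (context: {words & CONTEXT_TERMS})")
```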

Macdonald, M., Frank, R., Mei, J., Monk, B.  2015.  Identifying digital threats in a hacker web forum. 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). :926–933.

Information threatening the security of critical infrastructures is exchanged over the Internet through communication platforms, such as online discussion forums. This information can be used by malicious hackers to attack critical computer networks and data systems. Much of the literature on the hacking of critical infrastructure has focused on developing typologies of cyber-attacks, but has not examined the communication activities of the actors involved. To address this gap in the literature, the language of hackers was analyzed to identify potential threats against critical infrastructures using automated analysis tools. First, discussion posts were collected from a selected hacker forum using a customized web crawler. Posts were analyzed using a part-of-speech tagger, which helped determine a list of keywords used to query the data. Next, a sentiment analysis tool scored these keywords, and the scores were then analyzed to determine the effectiveness of this method.
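
The part-of-speech step can be sketched as follows, with NLTK standing in for the paper's unnamed tagger and invented forum posts; the extracted nouns would feed the subsequent sentiment-scoring pass.

```python
# Sketch of POS-based keyword extraction from forum posts. NLTK is an
# assumed stand-in; posts are invented. Requires nltk.download("punkt")
# and nltk.download("averaged_perceptron_tagger") once beforehand.
from collections import Counter
import nltk

posts = [
    "The exploit targets the SCADA controller at the water plant.",
    "New botnet source code was leaked on the forum.",
]

noun_counts = Counter()
for post in posts:
    tokens = nltk.word_tokenize(post)
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("NN"):          # NN, NNS, NNP, NNPS
            noun_counts[word.lower()] += 1

# Top nouns become query keywords for the sentiment-analysis pass.
print(noun_counts.most_common(5))
```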

Alnaami, K., Ayoade, G., Siddiqui, A., Ruozzi, N., Khan, L., Thuraisingham, B.  2015.  P2V: Effective Website Fingerprinting Using Vector Space Representations. 2015 IEEE Symposium Series on Computational Intelligence. :59–66.

Language vector space models (VSMs) have recently proven to be effective across a variety of tasks. In VSMs, each word in a corpus is represented as a real-valued vector. These vectors can be used as features in many applications in machine learning and natural language processing. In this paper, we study the effect of vector space representations in cyber security. In particular, we consider a passive traffic analysis attack (website fingerprinting) that threatens users' navigation privacy on the web. By using anonymous communication, Internet users (such as online activists) may wish to hide the destination of web pages they access for reasons such as evading repressive governments. Traditional website fingerprinting studies collect packets from the users' network and extract features that are used by machine learning techniques to reveal the destination of certain web pages. In this work, we propose the packet-to-vector (P2V) approach, in which we model the website fingerprinting attack using word vector representations. We show how the suggested model outperforms previous website fingerprinting approaches.
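
The core idea translates into a short sketch: packet sizes in a trace are treated as "words" and each trace as a "sentence", so a word2vec model can embed them. gensim is an assumed stand-in for the authors' tooling, and the traces are synthetic.

```python
# Packet-to-vector sketch: embed packet-size tokens with word2vec, then
# average embeddings into a fixed-length trace feature for a classifier.
from gensim.models import Word2Vec
import numpy as np

traces = [  # each trace: sequence of (signed) packet sizes as string tokens
    ["+565", "-1500", "-1500", "+120"],
    ["+565", "-1500", "-900", "+120"],
    ["+300", "-700", "+300", "-700"],
]

model = Word2Vec(traces, vector_size=16, window=2, min_count=1, seed=7)

def trace_vector(trace):
    # Average the packet embeddings; a downstream classifier would map
    # this feature vector to a website label.
    return np.mean([model.wv[tok] for tok in trace], axis=0)

features = np.stack([trace_vector(t) for t in traces])
print(features.shape)  # (3, 16)
```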

Darabseh, A., Namin, A. S.  2015.  On Accuracy of Classification-Based Keystroke Dynamics for Continuous User Authentication. 2015 International Conference on Cyberworlds (CW). :321–324.

The aim of this research is to advance user active authentication using keystroke dynamics. Through this research, we assess the performance and influence of various keystroke features on keystroke dynamics authentication systems. In particular, we investigate the performance of keystroke features on a subset of the most frequently used English words. The performance of four features is analyzed: i) key duration, ii) flight time latency, iii) digraph time latency, and iv) word total time duration. Two machine learning techniques are employed for assessing keystroke authentication. The selected classification methods are the support vector machine (SVM) and the k-nearest neighbor classifier (K-NN). The logged experimental data were captured for 28 users. The experimental results show that key duration time offers the best performance among all four keystroke features, followed by word total time.
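
The timing features named above can be computed directly from key events. A minimal sketch on a hypothetical key-event log:

```python
# Keystroke timing features from a hypothetical key-event log: key duration
# (key-up minus key-down), flight time (release-to-next-press gap), digraph
# latency (press-to-press gap), and word total time.
events = [  # (key, key_down_ms, key_up_ms) for the word "the"
    ("t", 0.0, 95.0),
    ("h", 160.0, 240.0),
    ("e", 310.0, 400.0),
]

durations = [up - down for _, down, up in events]
flights = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
digraphs = [events[i + 1][1] - events[i][1] for i in range(len(events) - 1)]
word_total = events[-1][2] - events[0][1]  # word total time duration

print("key durations:", durations)   # [95.0, 80.0, 90.0]
print("flight times:", flights)      # [65.0, 70.0]
print("digraph times:", digraphs)    # [160.0, 150.0]
print("word total:", word_total)     # 400.0
```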

Darabseh, A., Namin, A. S.  2015.  On Accuracy of Keystroke Authentications Based on Commonly Used English Words. 2015 International Conference of the Biometrics Special Interest Group (BIOSIG). :1–8.

The aim of this research is to advance user active authentication using keystroke dynamics. Through this research, we assess the performance and influence of various keystroke features on keystroke dynamics authentication systems. In particular, we investigate the performance of keystroke features on a subset of the most frequently used English words. The performance of four features is analyzed: i) key duration, ii) flight time latency, iii) digraph time latency, and iv) word total time duration. Experiments are performed to measure the performance of each feature individually as well as the performance of different subsets of these features. Four machine learning techniques are employed for assessing keystroke authentication. The selected classification methods are the two-class support vector machine (TC-SVM), one-class support vector machine (OC-SVM), k-nearest neighbor classifier (K-NN), and Naive Bayes classifier (NB). The logged experimental data were captured for 28 users. The experimental results show that key duration time offers the best performance among all four keystroke features, followed by word total time. Furthermore, our results show that TC-SVM and K-NN perform best among the four classifiers.
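
For the classifier comparison, scikit-learn offers ready stand-ins for all four methods. The sketch below runs them on synthetic keystroke-timing features; the paper's 28-user dataset is not reproduced.

```python
# Hedged sketch of the four-classifier comparison on synthetic
# keystroke-timing features (key duration, flight time).
import numpy as np
from sklearn.svm import SVC, OneClassSVM
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
genuine = rng.normal(loc=[90, 65], scale=5, size=(50, 2))   # target user
impostor = rng.normal(loc=[70, 80], scale=5, size=(50, 2))  # other users
X = np.vstack([genuine, impostor])
y = np.array([1] * 50 + [0] * 50)

for name, clf in [("TC-SVM", SVC()), ("K-NN", KNeighborsClassifier()),
                  ("NB", GaussianNB())]:
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))

# The one-class SVM is trained on the genuine user's samples only.
oc = OneClassSVM(nu=0.1).fit(genuine)
accepted = (oc.predict(genuine) == 1).mean()
print("OC-SVM genuine-acceptance rate:", accepted)
```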

Yang, Wei, Xiao, Xusheng, Pandita, Rahul, Enck, William, Xie, Tao.  2014.  Improving Mobile Application Security via Bridging User Expectations and Application Behaviors. Proceedings of the 2014 Symposium and Bootcamp on the Science of Security. :32:1–32:2.

To keep malware out of mobile application markets, existing techniques analyze the security aspects of application behaviors and summarize patterns of these security aspects to determine what applications do. However, user expectations (reflected via user perception in combination with user judgment) are often not incorporated into such analysis to determine whether application behaviors are within user expectations. This poster presents our recent work on bridging the semantic gap between user perceptions of the application behaviors and the actual application behaviors.

Hassen, H., Khemakhem, M.  2014.  A secured distributed OCR system in a pervasive environment with authentication as a service in the Cloud. Multimedia Computing and Systems (ICMCS), 2014 International Conference on. :1200-1205.

In this paper we explore the potential for securing a distributed Arabic Optical Character Recognition (OCR) system via cloud computing technology in a pervasive and mobile environment. The goal of the system is to achieve full accuracy, high speed, and security when handling large vocabularies and large document collections. These requirements are addressed by integrating the recognition process and the security mechanisms with multiprocessing and distributed computing technologies.

Babour, A., Khan, J. I.  2014.  Tweet Sentiment Analytics with Context Sensitive Tone-Word Lexicon. Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on. 1:392-399.

In this paper we propose a Twitter sentiment analytics approach that mines for opinion polarity about a given topic. Most current semantic sentiment analytics depend on polarity lexicons. However, many key tone words are frequently bipolar. In this paper we demonstrate a technique which can accommodate the bipolarity of tone words through a context-sensitive tone-lexicon learning mechanism, where the context is modeled by the semantic neighborhood of the main target. Performance analysis shows that the ability to contextualize tone-word polarity significantly improves accuracy.
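
The bipolarity problem can be made concrete with a toy context-keyed lexicon: the same tone word maps to different polarities depending on the topic neighborhood. The entries and tweets below are invented.

```python
# Toy context-sensitive tone lexicon: polarity is looked up per
# (tone word, context topic) pair rather than per word alone.
LEXICON = {
    ("killer", "product"): +1,   # "killer feature" is praise...
    ("killer", "crime"): -1,     # ...but negative near crime topics
    ("cheap", "price"): +1,
    ("cheap", "quality"): -1,
}

def score(tweet_tokens, context):
    return sum(LEXICON.get((tok, context), 0) for tok in tweet_tokens)

print(score("this killer phone is cheap".split(), "product"))  # +1
print(score("cheap build feels flimsy".split(), "quality"))    # -1
```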

Khosmood, F., Nico, P. L., Woolery, J.  2014.  User identification through command history analysis. Computational Intelligence in Cyber Security (CICS), 2014 IEEE Symposium on. :1-7.

As any veteran of the editor wars can attest, Unix users can be fiercely and irrationally attached to the commands they use and the manner in which they use them. In this work, we investigate the problem of identifying users out of a large set of candidates (25-97) through their command-line histories. Using standard algorithms and feature sets inspired by the natural language authorship attribution literature, we demonstrate conclusively that individual users can be identified with a high degree of accuracy through their command-line behavior. Further, we report on the best-performing feature combinations, from the many thousands that are possible, both in terms of accuracy and generality. We validate our work by experimenting on three user corpora comprising data gathered over three decades at three distinct locations. These are the Greenberg user profile corpus (168 users), the Schonlau masquerading corpus (50 users), and the Cal Poly command history corpus (97 users). The first two are well-known corpora published in 1991 and 2001, respectively. The last was developed by the authors in a year-long study in 2014 and represents the most recent corpus of its kind. For a 50-user configuration, we find feature sets that can successfully identify users with over 90% accuracy on the Cal Poly, Greenberg, and one variant of the Schonlau corpus, and over 87% on the other Schonlau variant.
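
In the spirit of the feature sets described above, a minimal sketch: token n-grams over command lines feed a linear SVM. The command histories are invented, and the actual corpora are not bundled here.

```python
# Command-history authorship attribution sketch: TF-IDF over command
# token n-grams plus a linear SVM. Histories are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

histories = {
    "alice": ["ls -la", "vim notes.md", "git status", "git commit -m wip"],
    "bob": ["emacs main.c", "gcc main.c -o main", "./main", "emacs main.c"],
}

X_text, y = [], []
for user, commands in histories.items():
    for cmd in commands:
        X_text.append(cmd)
        y.append(user)

# token_pattern=r"\S+" keeps flags like "-la" as tokens
vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), token_pattern=r"\S+")
X = vec.fit_transform(X_text)
clf = LinearSVC().fit(X, y)

print(clf.predict(vec.transform(["git diff", "emacs util.c"])))
```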

Koch, S., John, M., Wörner, M., Müller, A., Ertl, T.  2014.  VarifocalReader – In-Depth Visual Analysis of Large Text Documents. Visualization and Computer Graphics, IEEE Transactions on. 20:1723-1732.

Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions are verified by a human expert perusing the source text. To date, this problem has typically been solved with overview-detail techniques, which present different views at different levels of abstraction. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that addresses some of these problems by combining characteristics of both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms, as well as to assess and adapt these techniques. This helps to extract entities, concepts, and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.

Heimerl, F., Lohmann, S., Lange, S., Ertl, T.  2014.  Word Cloud Explorer: Text Analytics Based on Word Clouds. System Sciences (HICSS), 2014 47th Hawaii International Conference on. :1833-1842.

Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We believe, however, that this simple yet powerful visualization paradigm holds greater potential for text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.
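
The frequency distillation underlying a word cloud is only a few lines of code. This sketch just counts and filters words; rendering and the Word Cloud Explorer's NLP layers are out of scope, and the text is arbitrary.

```python
# Minimal word-cloud frequency distillation: count words, drop stop
# words, keep the top-k that would drive font sizes in the display.
from collections import Counter
import re

STOP = {"the", "a", "of", "and", "to", "in", "is", "for", "that", "with"}
text = ("Word clouds distill text down to the words that appear with the "
        "highest frequency, and the frequency drives the font size.")

words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
for word, count in Counter(words).most_common(5):
    print(word, count)
```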

Baughman, A. K., Chuang, W., Dixon, K. R., Benz, Z., Basilico, J.  2014.  DeepQA Jeopardy! Gamification: A Machine-Learning Perspective. Computational Intelligence and AI in Games, IEEE Transactions on. 6:55-66.

DeepQA is a large-scale natural language processing (NLP) question-and-answer system that responds across a breadth of structured and unstructured data, from hundreds of analytics that are combined with over 50 models, trained through machine learning. After the 2011 historic milestone of defeating the two best human players in the Jeopardy! game show, the technology behind IBM Watson, DeepQA, is undergoing gamification into real-world business problems. Gamifying a business domain for Watson is a composite of functional, content, and training adaptation for non-game play. During domain gamification for medical, financial, government, or any other business, each system change affects the machine-learning process. Whereas the original Watson Jeopardy! had a class distribution of positive-to-negative labels of 1:100, in adaptation the computed training instances, question-and-answer pairs transformed into true-false labels, result in a very low positive-to-negative ratio of 1:100,000. Such extreme initial class imbalance during domain gamification poses a big challenge for the Watson machine-learning pipelines. The combination of ingested corpus sets, question-and-answer pairs, configuration settings, and NLP algorithms contributes to the challenging data state. We propose several data engineering techniques, such as answer key vetting and expansion, source ingestion, oversampling classes, and question set modifications to increase the computed true labels. In addition, algorithm engineering, such as an implementation of the Newton-Raphson logistic regression with a regularization term, relaxes the constraints of class imbalance during training adaptation. We conclude by empirically demonstrating that data and algorithm engineering are complementary and indispensable to overcoming the challenges in this first Watson gamification for real-world business problems.
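
The "Newton-Raphson logistic regression with a regularization term" named above can be sketched in NumPy. The data below are synthetic and deliberately imbalanced; this is an illustration of the method, not IBM's pipeline.

```python
# Newton-Raphson for L2-regularized logistic regression on a small
# synthetic, deliberately imbalanced dataset.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X_pos = rng.normal(+1.0, 1.0, size=(5, 2))     # rare positive class
X_neg = rng.normal(-1.0, 1.0, size=(500, 2))   # abundant negative class
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 5 + [0] * 500, dtype=float)

lam = 1.0                       # L2 regularization strength
w = np.zeros(X.shape[1])
for _ in range(25):             # Newton-Raphson iterations
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) + lam * w
    R = p * (1 - p)             # diagonal weights of the Hessian
    H = X.T @ (X * R[:, None]) + lam * np.eye(X.shape[1])
    w -= np.linalg.solve(H, grad)

print("weights:", w)
print("train accuracy:", ((sigmoid(X @ w) > 0.5) == y).mean())
```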

Marukatat, R., Somkiadcharoen, R., Nalintasnai, R., Aramboonpong, T.  2014.  Authorship Attribution Analysis of Thai Online Messages. Information Science and Applications (ICISA), 2014 International Conference on. :1-4.

This paper presents a framework to identify the authors of Thai online messages. The identification is based on 53 writing attributes, and the selected algorithms are a support vector machine (SVM) and a C4.5 decision tree. Experimental results indicate that the overall accuracies achieved by the SVM and the C4.5 were 79% and 75%, respectively. This difference was not statistically significant (at a 95% confidence interval). As for the performance of identifying individual authors, in some cases the SVM was clearly better than the C4.5, but there were also cases where neither could distinguish one author from another.
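
A hedged sketch of the comparison with scikit-learn, using CART as a stand-in for C4.5 and invented writing attributes in place of the paper's 53 Thai-specific features:

```python
# SVM vs. decision tree (CART standing in for C4.5) on hypothetical
# stylometric features for three invented authors.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# 3 hypothetical attributes (e.g. average message length, punctuation
# rate, emoticon rate) for 20 messages per author
X = np.vstack([rng.normal(m, 0.5, size=(20, 3)) for m in ([5, 1, 0],
                                                          [3, 2, 1],
                                                          [7, 0, 2])])
y = np.repeat(["author1", "author2", "author3"], 20)

for name, clf in [("SVM", SVC()), ("tree", DecisionTreeClassifier())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "accuracy: %.2f" % scores.mean())
```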

Colbaugh, R., Glass, K., Bauer, T.  2013.  Dynamic information-theoretic measures for security informatics. 2013 IEEE International Conference on Intelligence and Security Informatics. :45–49.

Many important security informatics problems require consideration of dynamical phenomena for their solution; examples include predicting the behavior of individuals in social networks and distinguishing malicious and innocent computer network activities based on activity traces. While information theory offers powerful tools for analyzing dynamical processes, to date the application of information-theoretic methods in security domains has focused on static analyses (e.g., cryptography, natural language processing). This paper leverages information-theoretic concepts and measures to quantify the similarity of pairs of stochastic dynamical systems, and shows that this capability can be used to solve important problems which arise in security applications. We begin by presenting a concise review of the information theory required for our development, and then address two challenging tasks: 1) characterizing the way influence propagates through social networks, and 2) distinguishing malware from legitimate software based on the instruction sequences of the disassembled programs. In each application, case studies involving real-world datasets demonstrate that the proposed techniques outperform standard methods.
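
One concrete instance of comparing dynamical processes information-theoretically (an illustration, not necessarily the paper's exact measure): fit first-order Markov chains to two symbol sequences, such as disassembled instruction streams, and compute a KL divergence rate between them.

```python
# Compare two symbol-sequence processes via first-order Markov chains
# and a KL divergence rate. Sequences and alphabet are invented.
import numpy as np

def transition_matrix(seq, symbols):
    idx = {s: i for i, s in enumerate(symbols)}
    counts = np.ones((len(symbols), len(symbols)))  # Laplace smoothing
    for a, b in zip(seq, seq[1:]):
        counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def kl_rate(P, Q, pi):
    # sum_i pi_i * sum_j P_ij * log(P_ij / Q_ij)
    return float((pi[:, None] * P * np.log(P / Q)).sum())

symbols = ["mov", "push", "call", "jmp"]
benign = ["mov", "push", "call", "mov", "push", "call"] * 20
suspect = ["jmp", "jmp", "call", "mov", "jmp", "call"] * 20

P = transition_matrix(benign, symbols)
Q = transition_matrix(suspect, symbols)
pi = np.full(len(symbols), 1 / len(symbols))  # uniform weighting, for brevity
print("KL divergence rate:", kl_rate(P, Q, pi))
```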