Biblio

Filters: Keyword is information extraction
2022-06-06
Hung, Benjamin W.K., Muramudalige, Shashika R., Jayasumana, Anura P., Klausen, Jytte, Libretti, Rosanne, Moloney, Evan, Renugopalakrishnan, Priyanka.  2019.  Recognizing Radicalization Indicators in Text Documents Using Human-in-the-Loop Information Extraction and NLP Techniques. 2019 IEEE International Symposium on Technologies for Homeland Security (HST). :1–7.
Among the operational shortfalls that hinder law enforcement from achieving greater success in preventing terrorist attacks is the difficulty in dynamically assessing individualized violent extremism risk at scale given the enormous amount of primarily text-based records in disparate databases. In this work, we undertake the critical task of employing natural language processing (NLP) techniques and supervised machine learning models to classify textual data in analyst and investigator notes and reports for radicalization behavioral indicators. This effort to generate structured knowledge will build towards an operational capability to assist analysts in rapidly mining law enforcement and intelligence databases for cues and risk indicators. In the near-term, this effort also enables more rapid coding of biographical radicalization profiles to augment a research database of violent extremists and their exhibited behavioral indicators.
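To make the kind of pipeline this abstract describes more concrete, the following is a minimal sketch of supervised text classification for indicator detection using scikit-learn. It is not the authors' model; the labels, example snippets, and choice of TF-IDF plus logistic regression are placeholder assumptions.

```python
# Hedged sketch: TF-IDF features + a linear classifier for flagging
# indicator sentences. The texts and labels below are invented
# placeholders, not material from the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "subject expressed admiration for a banned organization",
    "routine note about a scheduled community meeting",
]
train_labels = [1, 0]  # 1 = behavioral indicator present, 0 = none

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)
print(clf.predict(["note mentions praise for extremist material"]))
```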
2021-02-22
Si, Y., Zhou, W., Gai, J..  2020.  Research and Implementation of Data Extraction Method Based on NLP. 2020 IEEE 14th International Conference on Anti-counterfeiting, Security, and Identification (ASID). :11–15.
In order to accurately extract data from unstructured Chinese text, this paper proposes a rule-based method built on natural language processing and regular expressions. The method uses the linguistic expression patterns of the target data in the text, together with related knowledge, to build feature word lists and rule templates that are matched against the text. Experimental results show that the accuracy of the designed algorithm is 94.09%.
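A minimal sketch of this style of rule-based extraction follows: a feature word list combined with a regular-expression template applied to Chinese text. The feature words, pattern, and sample sentence are illustrative assumptions, not the rules from the paper.

```python
# Hedged sketch of rule-based extraction with a feature word list and a
# regular-expression template; words, pattern, and sentence are invented.
import re

# Feature words that signal a quantity statement (illustrative).
FEATURE_WORDS = ["产量", "面积"]  # e.g. "output", "area"

# Template: <feature word> <linking words> <number> <unit>
RULE = re.compile(
    r"(?P<field>{})[为是达到约]*(?P<value>\d+(?:\.\d+)?)(?P<unit>[万亿]?[吨亩千克]+)".format(
        "|".join(FEATURE_WORDS)
    )
)

text = "今年小麦产量达到500万吨。"
for m in RULE.finditer(text):
    print(m.group("field"), m.group("value"), m.group("unit"))
```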
2019-09-04
Xiong, M., Li, A., Xie, Z., Jia, Y..  2018.  A Practical Approach to Answer Extraction for Constructing QA Solution. 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). :398–404.
Question answering (QA) systems play an increasingly important role in the Internet age. A growing proportion of Internet users rely on QA systems to obtain knowledge and solve problems, especially in the modern agricultural field. However, answer quality in QA varies widely with the level of the answering agricultural expert, so answer quality assessment is important. Because of the lexical gap between questions and answers, existing approaches are not quite satisfactory. A practical approach, RCAS, is proposed to rank candidate answers; it utilizes support sets to reduce the impact of the lexical gap between questions and answers. First, similar questions are retrieved and support sets are produced from their high-quality answers. Based on the assumption that high-quality answers share intrinsic similarity, the quality of candidate answers is then evaluated through their distance from the support sets. Second, unlike existing approaches, prior knowledge from similar question-answer pairs is used to bridge the direct lexical and semantic gaps between questions and answers. Experiments are conducted on approximately 0.15 million question-answer pairs about agriculture, dietetics, and food from Yahoo! Answers. The results show that our approach ranks candidate answers more precisely.
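The support-set idea can be sketched roughly as follows: score each candidate answer by its similarity to high-quality answers of similar questions. TF-IDF vectors with cosine similarity stand in here for whatever representation RCAS actually uses; the example texts are invented.

```python
# Hedged sketch: rank candidate answers by closeness to a "support set"
# of high-quality answers drawn from similar questions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

support_set = [
    "Apply nitrogen fertilizer in early spring before the jointing stage.",
    "Top-dress nitrogen at jointing to improve wheat yield.",
]
candidates = [
    "Fertilize the wheat with nitrogen around the jointing stage.",
    "I think the weather has been strange this year.",
]

vec = TfidfVectorizer().fit(support_set + candidates)
S = vec.transform(support_set)
C = vec.transform(candidates)

# Score = mean similarity to the support set; higher is better.
scores = cosine_similarity(C, S).mean(axis=1)
for cand, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(round(float(score), 3), cand)
```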
2018-01-23
Kilgallon, S., Rosa, L. De La, Cavazos, J..  2017.  Improving the effectiveness and efficiency of dynamic malware analysis with machine learning. 2017 Resilience Week (RWS). :30–36.

As the malware threat landscape is constantly evolving and over one million new malware strains are being generated every day [1], early automatic detection of threats constitutes a top priority of cybersecurity research, and amplifies the need for more advanced detection and classification methods that are effective and efficient. In this paper, we present the application of machine learning algorithms to predict the length of time malware should be executed in a sandbox to reveal its malicious intent. We also introduce a novel hybrid approach to malware classification based on static binary analysis and dynamic analysis of malware. Static analysis extracts information from a binary file without executing it, and dynamic analysis captures the behavior of malware in a sandbox environment. Our experimental results show that by turning the aforementioned problems into machine learning problems, it is possible to get an accuracy of up to 90% on the prediction of the malware analysis run time and up to 92% on the classification of malware families.
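The two learning tasks described here, regressing the required sandbox run time and classifying the malware family, can be sketched as below. The random feature vectors are placeholders for the static and dynamic features engineered in the paper, and random forests are one plausible model choice, not necessarily the authors'.

```python
# Hedged sketch of the two tasks: run-time regression and family
# classification. Features and labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 16))            # e.g. byte entropy, imports, API-call counts
run_time = rng.random(200) * 300     # seconds needed to reveal behavior
family = rng.integers(0, 4, 200)     # four hypothetical malware families

reg = RandomForestRegressor(n_estimators=50).fit(X[:150], run_time[:150])
clf = RandomForestClassifier(n_estimators=50).fit(X[:150], family[:150])

print("predicted run time (s):", reg.predict(X[150:151]))
print("predicted family id:", clf.predict(X[150:151]))
```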

2017-12-20
Che, H., Liu, Q., Zou, L., Yang, H., Zhou, D., Yu, F..  2017.  A Content-Based Phishing Email Detection Method. 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C). :415–422.

Phishing emails have seriously affected users due to their enormous increase in numbers and exquisite camouflage. Users spend much more effort distinguishing email properties, so current phishing email detection systems demand more creativity and consideration in filtering on behalf of users. The proposed research adopts creative computing to detect phishing emails for users through a combination of computing techniques and social engineering concepts. To achieve this, fraud types are summarised into social engineering criteria through a literature review; a semantic web database is established to extract and store information; and a fuzzy logic control algorithm is constructed to allocate email categories. The proposed approach helps users distinguish the categories of emails and, furthermore, gives advice based on the allocated category. To illustrate the approach, a case study is presented that simulates a phishing email receiving scenario.
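A rough sketch of fuzzy category allocation is given below. The cue scores, membership functions, and single rule are invented for illustration; the paper's semantic-web cues and rule base are not reproduced here.

```python
# Hedged sketch of fuzzy-logic category allocation for emails.
def tri(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def phishing_degree(urgency, link_mismatch, credential_request):
    """Combine cue scores in [0, 1] into a fuzzy 'phishing' degree."""
    high_urgency = tri(urgency, 0.4, 1.0, 1.6)
    bad_links = tri(link_mismatch, 0.3, 1.0, 1.7)
    asks_credentials = tri(credential_request, 0.3, 1.0, 1.7)
    # Rule: IF urgency is high AND (links mismatch OR credentials requested)
    return min(high_urgency, max(bad_links, asks_credentials))

print(phishing_degree(urgency=0.9, link_mismatch=0.8, credential_request=0.2))
```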

2017-05-30
Jadhao, Ankita R., Agrawal, Avinash J..  2016.  A Digital Forensics Investigation Model for Social Networking Site. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies. :130:1–130:4.

Social networking is fundamentally shifting the way we communicate, share ideas, and form opinions. People of every age group use social media and e-commerce sites for their needs. Nowadays, a large share of illegal activity is conducted through social networks and instant messages, and present systems are not capable of finding all suspicious words. In this paper, we give a brief description of the problem, review the different frameworks developed so far, and propose a better system that can identify criminal activity through social networking more efficiently. Ontology Based Information Extraction (OBIE) is used to identify the domain of words, and association rule mining is used to generate rules. A heuristic method checks the user database for malicious users according to predefined elements, and the Naïve Bayes method is used to identify the context behind a message or post. The experimental results can then be used by the cyber crime department to take further action concerning the victim.
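The Naïve Bayes step, classifying whether the context of a post is actually suspicious, might look roughly like the sketch below. The posts, labels, and bag-of-words representation are invented placeholders; the ontology and association rules from the paper are not modeled here.

```python
# Hedged sketch of Naive Bayes context classification over posts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

posts = [
    "meet me tonight to move the package quietly",
    "the package from the online store arrived today",
    "delete the messages before anyone checks",
    "I always delete old messages to save space",
]
labels = [1, 0, 1, 0]  # 1 = suspicious context, 0 = benign context

model = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
model.fit(posts, labels)
print(model.predict(["move the package tonight and stay quiet"]))
```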

2017-05-19
Nahshon, Yoav, Peterfreund, Liat, Vansummeren, Stijn.  2016.  Incorporating Information Extraction in the Relational Database Model. Proceedings of the 19th International Workshop on Web and Databases. :6:1–6:7.

Modern information extraction pipelines are typically constructed by (1) loading textual data from a database into a special-purpose application, (2) applying a myriad of text-analytics functions to the text, which produce a structured relational table, and (3) storing this table in a database. Obviously, this approach can lead to laborious development processes, complex and tangled programs, and inefficient control flows. Towards solving these deficiencies, we embark on an effort to lay the foundations of a new generation of text-centric database management systems. Concretely, we extend the relational model by incorporating into it the theory of document spanners which provides the means and methods for the model to engage the Information Extraction (IE) tasks. This extended model, called Spannerlog, provides a novel declarative method for defining and manipulating textual data, which makes possible the automation of the typical work method described above. In addition to formally defining Spannerlog and illustrating its usefulness for IE tasks, we also report on initial results concerning its expressive power.
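The flavor of the document-spanner idea can be sketched as follows: a regex with capture groups acts as an extractor whose output spans are stored as an ordinary relational table. This only illustrates the general concept; it is not Spannerlog syntax or semantics, and the document, relation, and pattern are invented.

```python
# Hedged sketch: a regex "spanner" whose extracted spans populate a
# relational table (SQLite), which can then be queried relationally.
import re
import sqlite3

doc = "Alice met Bob in Paris. Carol met Dan in Rome."
spanner = re.compile(
    r"(?P<p1>[A-Z][a-z]+) met (?P<p2>[A-Z][a-z]+) in (?P<city>[A-Z][a-z]+)"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meeting (p1 TEXT, p2 TEXT, city TEXT, start INT, stop INT)")
for m in spanner.finditer(doc):
    conn.execute(
        "INSERT INTO meeting VALUES (?, ?, ?, ?, ?)",
        (m.group("p1"), m.group("p2"), m.group("city"), m.start(), m.end()),
    )

for row in conn.execute("SELECT * FROM meeting WHERE city = 'Rome'"):
    print(row)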

Rheinländer, Astrid, Lehmann, Mario, Kunkel, Anja, Meier, Jörg, Leser, Ulf.  2016.  Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale. Proceedings of the 2016 International Conference on Management of Data. :759–771.

In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, starting from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exists, their seamless, flexible, and scalable combination into a web scale end-to-end text analytics system still is a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us to crawl a corpus of ~1 TB of web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.
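The pipeline structure described above (boilerplate removal, sentence detection, entity tagging, applied stage by stage) is sketched below with trivial stand-in implementations. The paper uses dedicated best-of-breed tools for each step and runs them inside Stratosphere; nothing here reflects those tools.

```python
# Hedged sketch of a staged text-analytics pipeline with toy stages.
import re

def strip_boilerplate(html):
    return re.sub(r"<[^>]+>", " ", html)            # crude tag removal

def split_sentences(text):
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def tag_entities(sentence):
    # Placeholder "NER": capitalized tokens only.
    return re.findall(r"\b[A-Z][a-z]+\b", sentence)

def pipeline(html):
    text = strip_boilerplate(html)
    return [(s, tag_entities(s)) for s in split_sentences(text)]

print(pipeline("<html><body>Aspirin reduces fever. See your doctor.</body></html>"))
```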

2017-03-07
Zhang, Ce, Shin, Jaeho, Ré, Christopher, Cafarella, Michael, Niu, Feng.  2016.  Extracting Databases from Dark Data with DeepDive. Proceedings of the 2016 International Conference on Management of Data. :847–859.

DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data — scientific papers, Web classified ads, customer service notes, and so on — were instead in a relational database, it would give analysts access to a massive and highly valuable new set of "big data" to exploit. DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.

2015-05-06
Sung-Hwan Ahn, Nam-Uk Kim, Tai-Myoung Chung.  2014.  Big data analysis system concept for detecting unknown attacks. 2014 16th International Conference on Advanced Communication Technology (ICACT). :269–272.

Recently, the threat of previously unknown cyber-attacks is increasing because existing security systems are not able to detect them. Past cyber-attacks had the simple purposes of leaking personal information by attacking PCs or destroying systems. However, the goal of recent hacking attacks has shifted from leaking information and destroying services to attacking large-scale systems such as critical infrastructures and state agencies. Existing defence technologies to counter these attacks are based on pattern-matching methods, which are very limited: in the event of new and previously unknown attacks, the detection rate becomes very low and false negatives increase. To defend against these unknown attacks, which cannot be detected with existing technology, we propose a new model based on big data analysis techniques that can extract information from a variety of sources to detect future attacks. We expect our model to be the basis of future Advanced Persistent Threat (APT) detection and prevention system implementations.

2015-05-04
Lin Chen, Lu Zhou, Chunxue Liu, Quan Sun, Xiaobo Lu.  2014.  Occlusive vehicle tracking via processing blocks in Markov random field. 2014 International Conference on Progress in Informatics and Computing (PIC). :294–298.

Vehicle video detection and tracking technology has played an important role in the ITS (Intelligent Transportation Systems) field in recent years. Occlusion among vehicles is one of the most difficult problems in vehicle tracking. To handle occlusion, this paper proposes an effective solution that applies a Markov Random Field (MRF) to traffic images. The contour of each vehicle is first detected using background subtraction; then a number of blocks carrying the vehicle's texture and motion information are filled inside each vehicle, and several kinds of information are extracted from each block for the subsequent tracking. For each occluded block, two groups of clique functions in the MRF model are defined, representing spatial correlation and motion coherence respectively. By calculating each occluded block's total energy function, we resolve the attribution of occluded blocks to vehicles. The experimental results show that our method can handle occlusion effectively and track each vehicle continuously.
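The energy-minimization idea can be sketched roughly as follows: assign an occluded block to the vehicle label that minimizes the sum of a spatial-correlation term (agreement with neighboring blocks) and a motion-coherence term (fit to the candidate vehicle's motion). The potentials below are toy stand-ins, not the clique functions defined in the paper.

```python
# Hedged sketch of block assignment by minimizing a two-term energy.
import math

def spatial_energy(label, neighbor_labels):
    # Penalize disagreement with spatially adjacent blocks.
    return sum(1.0 for nl in neighbor_labels if nl != label)

def motion_energy(block_motion, vehicle_motion):
    # Penalize deviation from the candidate vehicle's motion vector.
    return math.dist(block_motion, vehicle_motion)

def assign_block(block_motion, neighbor_labels, vehicle_motions):
    # Total energy = spatial correlation + motion coherence.
    energies = {
        label: spatial_energy(label, neighbor_labels) + motion_energy(block_motion, v)
        for label, v in vehicle_motions.items()
    }
    return min(energies, key=energies.get)

# Toy example: two candidate vehicles with different motion vectors.
print(assign_block(block_motion=(4.8, 0.1),
                   neighbor_labels=["A", "A", "B"],
                   vehicle_motions={"A": (5.0, 0.0), "B": (-2.0, 0.0)}))
```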