Biblio
As the malware threat landscape is constantly evolving, with over one million new malware strains generated every day [1], early automatic detection of threats is a top priority of cybersecurity research, which amplifies the need for detection and classification methods that are both effective and efficient. In this paper, we present the application of machine learning algorithms to predict the length of time malware should be executed in a sandbox to reveal its malicious intent. We also introduce a novel hybrid approach to malware classification based on static binary analysis and dynamic analysis of malware. Static analysis extracts information from a binary file without executing it, while dynamic analysis captures the behavior of malware in a sandbox environment. Our experimental results show that by casting the aforementioned problems as machine learning problems, it is possible to reach an accuracy of up to 90% on predicting the malware analysis run time and up to 92% on classifying malware families.
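To make the hybrid idea concrete, the following is a minimal sketch (not the paper's implementation) of training a classifier on concatenated static and dynamic feature vectors; the feature groups, their dimensions, the number of families, and the random forest choice are all illustrative assumptions.

```python
# Hybrid static + dynamic malware classification, sketched with random data.
# Real systems would extract static features (e.g., PE-header fields, section
# entropy) and dynamic features (e.g., sandbox API-call frequencies).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples = 500
static_feats = rng.random((n_samples, 64))    # assumed static feature vector
dynamic_feats = rng.random((n_samples, 128))  # assumed dynamic feature vector
X = np.hstack([static_feats, dynamic_feats])  # hybrid feature vector
y = rng.integers(0, 8, n_samples)             # 8 hypothetical malware families

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```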
Phishing emails have seriously affected users due to their enormous increase in number and their exquisite camouflage. Users spend considerable effort distinguishing email properties, so current phishing email detection systems demand more creativity and consideration in filtering on behalf of users. The proposed research adopts creative computing to detect phishing emails for users through a combination of computing techniques and social engineering concepts. To achieve this target, fraud types are summarised into social engineering criteria through a literature review; a semantic web database is established to extract and store information; and a fuzzy logic control algorithm is constructed to allocate email categories. The proposed approach will help users distinguish the categories of emails and, furthermore, give advice based on the allocated category. To illustrate the approach, a case study is presented that simulates a phishing-email-receiving scenario.
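A fuzzy logic controller for category allocation could look roughly like the sketch below; the two input scores, the triangular membership functions, and the rule base are invented for illustration and are not the paper's actual design.

```python
# Mamdani-style fuzzy categorisation of an email from two assumed input
# scores in [0, 1]: a suspicious-keyword score and a sender-mismatch score.
def tri(x, a, b, c):
    """Triangular membership function with feet at a, c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def categorise(keyword_score, mismatch_score):
    # Fuzzify each input into "low" and "high" membership degrees.
    kw_low = tri(keyword_score, -0.5, 0.0, 0.6)
    kw_high = tri(keyword_score, 0.4, 1.0, 1.5)
    mm_low = tri(mismatch_score, -0.5, 0.0, 0.6)
    mm_high = tri(mismatch_score, 0.4, 1.0, 1.5)
    # Rule strength = min of antecedents; category degree = max over rules.
    degrees = {
        "phishing":   min(kw_high, mm_high),
        "suspicious": max(min(kw_high, mm_low), min(kw_low, mm_high)),
        "legitimate": min(kw_low, mm_low),
    }
    return max(degrees, key=degrees.get), degrees

print(categorise(0.8, 0.7))  # -> ('phishing', {...})
```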
Social networking is fundamentally shifting the way we communicate, share ideas, and form opinions. People of every age group use social media and e-commerce sites for their needs. Nowadays, almost every illegal activity involves social networks and instant messages, yet present systems are not capable of finding all suspicious words. In this paper, we provide a brief description of the problem, review the different frameworks developed so far, and propose a better system that can identify criminal activity through social networking more efficiently. The system uses Ontology-Based Information Extraction (OBIE) to identify the domain of a word and association rule mining to generate rules. A heuristic method checks the user database for malicious users according to predefined elements, and a Naïve Bayes method is used to identify the context behind a message or post. The experimental results are then used by the cybercrime department for further action on a victim's case.
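The Naïve Bayes step described above amounts to standard text classification; a minimal sketch follows, where the toy corpus and labels are illustrative assumptions rather than the paper's dataset.

```python
# Naïve Bayes classification of message context (suspicious vs. benign),
# sketched with a tiny assumed training corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "meet me tonight to move the stolen goods",    # suspicious (assumed)
    "the attack on the server starts at midnight", # suspicious (assumed)
    "happy birthday, see you at the party",        # benign (assumed)
    "uploading the holiday photos to the album",   # benign (assumed)
]
labels = ["suspicious", "suspicious", "benign", "benign"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["they plan to attack the bank tonight"]))
```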
Modern information extraction pipelines are typically constructed by (1) loading textual data from a database into a special-purpose application, (2) applying a myriad of text-analytics functions to the text, which produce a structured relational table, and (3) storing this table in a database. Obviously, this approach can lead to laborious development processes, complex and tangled programs, and inefficient control flows. Towards solving these deficiencies, we embark on an effort to lay the foundations of a new generation of text-centric database management systems. Concretely, we extend the relational model by incorporating into it the theory of document spanners, which provides the means and methods for the model to engage in Information Extraction (IE) tasks. This extended model, called Spannerlog, provides a novel declarative method for defining and manipulating textual data, which makes it possible to automate the typical work method described above. In addition to formally defining Spannerlog and illustrating its usefulness for IE tasks, we also report on initial results concerning its expressive power.
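Although Spannerlog has its own declarative syntax, the core notion of a document spanner (a regex formula with capture variables that maps a document to a relation of spans) can be approximated in plain Python; the pattern and relation schema below are illustrative assumptions, not Spannerlog code.

```python
# A regex formula with capture variables x and y, evaluated as a spanner:
# it maps the document to a relation whose tuples carry (span, content).
import re

doc = "Alice met Bob in Paris. Carol met Dan in Oslo."
pattern = re.compile(r"(?P<x>[A-Z][a-z]+) met [A-Z][a-z]+ in (?P<y>[A-Z][a-z]+)")

rows = [
    {v: (m.span(v), m.group(v)) for v in ("x", "y")}
    for m in pattern.finditer(doc)
]
for row in rows:
    print(row)
# {'x': ((0, 5), 'Alice'), 'y': ((17, 22), 'Paris')} ...
```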
In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, starting from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exists, their seamless, flexible, and scalable combination into a web-scale end-to-end text analytics system is still a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us to crawl a corpus of ~1 TB of web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.
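The pipeline shape described here (HTML text extraction, then sentence detection, then entity recognition) can be sketched end to end with the standard library alone; the real system uses best-of-breed tools at each stage, and the dictionary tagger and regexes below are toy assumptions.

```python
# Toy IE pipeline: HTML -> plain text -> sentences -> tagged entities.
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.chunks).split())

def sentences(text):
    # Naive sentence detection on terminal punctuation.
    return re.split(r"(?<=[.!?])\s+", text)

ENTITY_DICT = {"aspirin": "Drug", "diabetes": "Disease"}  # assumed lexicon

def tag_entities(sent):
    return [(w, ENTITY_DICT[w.lower()]) for w in re.findall(r"\w+", sent)
            if w.lower() in ENTITY_DICT]

html = "<html><body><p>Aspirin is studied for diabetes. It is cheap.</p></body></html>"
for s in sentences(extract_text(html)):
    print(s, tag_entities(s))
```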
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts access to a massive and highly valuable new set of "big data" to exploit. DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
Recently, the threat of previously unknown cyber-attacks has been increasing because existing security systems are unable to detect them. Past cyber-attacks had the simple purposes of leaking personal information by attacking a PC or destroying a system. However, the goal of recent hacking attacks has shifted from leaking information and destroying services to attacking large-scale systems such as critical infrastructures and state agencies. Moreover, the existing defence technologies countering these attacks are based on pattern-matching methods, which are very limited: in the event of new and previously unknown attacks, the detection rate becomes very low and false negatives increase. To defend against these unknown attacks, which cannot be detected with existing technology, we propose a new model based on big data analysis techniques that can extract information from a variety of sources to detect future attacks. We expect our model to be the basis of future Advanced Persistent Threat (APT) detection and prevention system implementations.
The technology of vehicle video detection and tracking has played an important role in the ITS (Intelligent Transportation Systems) field in recent years. The occlusion phenomenon among vehicles is one of the most difficult problems in vehicle tracking. To handle occlusion, this paper proposes an effective solution that applies a Markov Random Field (MRF) model to traffic images. The contour of each vehicle is first detected using background subtraction; then a number of blocks carrying the vehicle's texture and motion information are filled in inside each vehicle region. We extract several kinds of information from each block for the subsequent tracking. For each occluded block, two groups of clique functions in the MRF model are defined, representing spatial correlation and motion coherence respectively. By calculating each occluded block's total energy function, we finally resolve the attribution of occluded blocks. The experimental results show that our method can handle occlusion problems effectively and track each vehicle continuously.
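The abstract does not give the exact clique potentials; a plausible form of the block-energy computation, with all notation assumed rather than taken from the paper, is sketched below: an occluded block i is assigned to the vehicle v that minimises a sum of spatial-correlation and motion-coherence potentials over its already-labelled neighbours.

```latex
% Hypothetical block energy for occlusion resolution (notation assumed):
% N(i) are neighbouring blocks labelled v, f_i texture features,
% u_i motion vectors, and lambda_s, lambda_m the two clique-group weights.
\[
  E_v(i) \;=\; \sum_{j \in N(i)} \Big( \lambda_s\, V_s(f_i, f_j)
           \;+\; \lambda_m\, V_m(\mathbf{u}_i, \mathbf{u}_j) \Big),
  \qquad
  \hat{v}(i) \;=\; \arg\min_v E_v(i).
\]
```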