Biblio
Advanced Persistent Threat (APT) attacks have become a major network threat in recent years. Among APT attack techniques, sending a phishing email with malicious documents attached is considered one of the most effective. Although many users have the impression that documents are harmless, a malicious document may in fact contain shellcode to attack victims. To cope with this problem, we design and implement a malicious document detector called Forensor to differentiate malicious documents from benign ones. Forensor integrates several open-source tools and methods. It first inspects the file format to retrieve the objects inside a document, and then automatically decrypts simple encryption methods commonly used in malware, e.g., XOR, rot, and shift, to discover potential shellcode. An emulator is used to verify the presence of shellcode. If shellcode is discovered, the file is considered malicious. The experiment used 9,000 benign files and more than 10,000 malware samples from a well-known sample-sharing website. The result shows no false negatives and only 2 false positives.
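A minimal sketch of the brute-force decoding step described above, assuming single-byte XOR, additive rot, and rotate-style bit shifts; the signature list and function names are illustrative, not Forensor's actual code, and a real detector would pass candidates to an emulator for confirmation.

```python
# Illustrative only: try the simple encodings the abstract mentions over an
# embedded object and flag buffers exposing shellcode-like byte patterns.
SUSPICIOUS = [
    b"\xe8\x00\x00\x00\x00",   # CALL $+5, a common GetPC idiom in shellcode
    b"MZ\x90\x00",             # embedded PE header
    b"urlmon",                 # typical download-and-execute import string
]

def xor_decode(buf, key):
    return bytes(b ^ key for b in buf)

def rot_decode(buf, n):
    return bytes((b + n) & 0xFF for b in buf)

def shift_decode(buf, n):
    # rotate-style shift so no bits are lost
    return bytes(((b << n) | (b >> (8 - n))) & 0xFF for b in buf)

def candidate_decodings(buf):
    for key in range(256):
        yield ("xor", key, xor_decode(buf, key))
    for n in range(1, 256):
        yield ("rot", n, rot_decode(buf, n))
    for n in range(1, 8):
        yield ("shift", n, shift_decode(buf, n))

def looks_like_shellcode(obj: bytes):
    """Return the first decoding that exposes a suspicious pattern, if any."""
    for name, param, decoded in candidate_decodings(obj):
        if any(sig in decoded for sig in SUSPICIOUS):
            return name, param
    return None
```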
Compute-intensive simulations typically place substantial workloads on an online simulation platform backed by limited computing clusters and storage resources. Some (or most) of the simulations initiated by users may carry input parameters/files that have already been provided by other (or the same) users in the past. Unfortunately, these duplicate simulations can degrade the performance of the platform by drastically consuming the limited resources shared by a large number of users. To minimize or avoid conducting repeated simulations, we present a novel system, called SUPERMAN (SimUlation ProvEnance Recycling MANager), that records simulation provenances and recycles the results of past simulations. This system presents a great opportunity not only to reutilize existing results but also to perform various analytics helpful for those who are not familiar with the platform. The system also offers interoperability with other systems by collecting the provenances in a standardized format. In our simulated experiments, we found that over half of past computing jobs could be answered by our system without actual executions.
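A hypothetical sketch of the recycling idea: key each simulation by a digest of its parameters and input files and reuse a stored result when the key repeats. The class and function names are placeholders; SUPERMAN's actual provenance schema is richer and standardized.

```python
import hashlib
import json

def provenance_key(params: dict, input_paths: list[str]) -> str:
    """Digest of the simulation's parameters plus the content of its input files."""
    h = hashlib.sha256(json.dumps(params, sort_keys=True).encode())
    for path in sorted(input_paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

class ResultRecycler:
    def __init__(self):
        self._store = {}          # key -> past result (in-memory stand-in)

    def run(self, params, input_paths, simulate):
        key = provenance_key(params, input_paths)
        if key in self._store:    # duplicate simulation: recycle the old result
            return self._store[key]
        result = simulate(params, input_paths)
        self._store[key] = result
        return result
```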
Community question answering (cQA) has become an important research topic due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap poses a new challenge for question retrieval in cQA. In this paper, we propose to model and learn distributed word representations with metadata of category information within cQA pages for question retrieval, using two novel category-powered models. One is a basic category-powered model called MB-NET, and the other is an enhanced category-powered model called ME-NET, which can better learn the distributed word representations and alleviate the lexical gap problem. To deal with the variable size of word representation vectors, we employ the Fisher kernel framework to transform them into fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches significantly outperform state-of-the-art retrieval models for question retrieval in cQA. Moreover, we further validate our approaches in large-scale automatic evaluation experiments, which show that promising and significant performance improvements can be achieved.
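A simplified Fisher-vector sketch of the fixed-length transformation step: a variable-length set of word vectors is mapped to gradients of a diagonal-covariance GMM with respect to its component means. This is an illustrative approximation under assumed settings, not the paper's exact Fisher kernel formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(word_vectors, n_components=8, seed=0):
    """Background model fitted on word vectors from the whole collection."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=seed).fit(word_vectors)

def fisher_vector(gmm, word_vectors):
    X = np.asarray(word_vectors)                  # (N, D) embeddings of one question
    gamma = gmm.predict_proba(X)                  # (N, K) soft assignments
    N = X.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (X - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g_mu = gamma[:, k, None] * diff           # gradient w.r.t. component mean
        parts.append(g_mu.sum(axis=0) / (N * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(parts)                    # fixed length K * D
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalisation
```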
The performance of clustering is a crucial challenge, especially for pattern recognition. Model aggregation has a positive impact on the efficiency of data clustering: it is used to obtain less cluttered decision boundaries by aggregating the resulting clustering models. In this paper, we study an aggregation scheme to improve the stability and accuracy of clustering, which makes it possible to find a reliable and robust clustering model. We demonstrate the advantages of our aggregation method by running Fuzzy C-Means (FCM) clustering on the Reuters-21578 corpus. Experimental studies show that our scheme optimizes the bias-variance trade-off of the selected model and achieves enhanced clustering of unstructured textual resources.
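A minimal FCM sketch of the base clusterer named above; the aggregation scheme itself is not specified in the abstract, so the closing note about averaging memberships across runs is only an assumed, simple variant.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard FCM: alternate membership and center updates until convergence."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # random fuzzy memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        new_U = 1.0 / (dist ** (2 / (m - 1)))
        new_U /= new_U.sum(axis=1, keepdims=True)
        if np.abs(new_U - U).max() < tol:
            U = new_U
            break
        U = new_U
    return centers, U

# One simple (assumed) aggregation: align clusters from several random
# restarts by matching centers, then average the membership matrices and
# assign each document to its highest averaged membership.
```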
In this article, I present the questions that I seek to answer in my PhD research. I propose to analyze natural language text with the help of semantic annotations and to mine important events for navigating large text corpora. Semantic annotations such as named entities, geographic locations, and temporal expressions can help us mine events from the given corpora. These events in turn provide us with a useful means to discover the knowledge locked within the corpora. I pose three problems that can help unlock this knowledge vault in semantically annotated text corpora: (i) identifying important events; (ii) semantic search; and (iii) event analytics.
Object Injection Vulnerability (OIV) is an emerging threat to web applications. It involves accepting external inputs during deserialization operations and using those inputs for sensitive operations such as file access, modification, and deletion. The challenge is automating the detection process. When the application is large, traditional approaches such as data flow analysis become hard to apply. Recent approaches fall short in narrowing down the list of source files to aid developers in discovering OIVs and lack the flexibility to check for the presence of OIVs through various known APIs. In this work, we address these limitations by exploring a concept borrowed from the information retrieval domain, Latent Semantic Indexing (LSI), to discover OIVs. The approach analyzes application source code and builds an initial term-document matrix, which is then transformed systematically using singular value decomposition to reduce the search space. The approach identifies a small set of documents (source files) that are likely responsible for OIVs. We apply the LSI concept to three open-source PHP applications that have been reported to contain OIVs. Our initial evaluation results suggest that the proposed LSI-based approach can identify known OIVs and discover new vulnerabilities.
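An illustrative LSI ranking sketch of the term-document-matrix-plus-SVD step: project source files and a query of deserialization/sink API names into a low-rank latent space and rank the files most associated with those terms. The query terms, tokenization, and rank are assumptions, not the paper's exact setup.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_suspicious_files(file_texts, file_names, k=50):
    """Return (file, score) pairs, highest LSI similarity to sink terms first."""
    query = "unserialize __wakeup __destruct file_put_contents unlink"  # assumed sinks
    vec = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]+")
    X = vec.fit_transform(file_texts + [query])          # term-document matrix
    svd = TruncatedSVD(n_components=min(k, X.shape[1] - 1), random_state=0)
    Z = svd.fit_transform(X)                             # reduced latent space
    scores = cosine_similarity(Z[-1:], Z[:-1])[0]        # query vs. each file
    return sorted(zip(file_names, scores), key=lambda t: -t[1])
```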
Twitter provides a public streaming API that is strictly rate-limited, making it difficult to simultaneously achieve good coverage and relevance when monitoring tweets for a specific topic of interest. In this paper, we address the tweet acquisition challenge to enhance monitoring of tweets based on the client/application needs in an online adaptive manner, such that the quality and quantity of the results improve over time. We propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track based on an explore-exploit strategy. Our experimental studies show that TAS significantly improves recall of relevant tweets and that its performance improves further as the topics become more specific.
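A hedged sketch of an explore-exploit phrase selector in the spirit of TAS; the epsilon-greedy policy and reward accounting here are illustrative, and the paper's actual selection strategy may differ.

```python
import random
from collections import defaultdict

class PhraseSelector:
    def __init__(self, phrases, epsilon=0.1):
        self.phrases = list(phrases)
        self.epsilon = epsilon
        self.relevant = defaultdict(int)    # relevant tweets seen per phrase
        self.total = defaultdict(int)       # total tweets seen per phrase

    def select(self, k=5):
        """Pick k phrases to track in the next streaming window."""
        if random.random() < self.epsilon:
            return random.sample(self.phrases, k)                    # explore
        rate = lambda p: self.relevant[p] / self.total[p] if self.total[p] else 1.0
        return sorted(self.phrases, key=rate, reverse=True)[:k]      # exploit

    def update(self, phrase, n_tweets, n_relevant):
        self.total[phrase] += n_tweets
        self.relevant[phrase] += n_relevant
```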
Cybersecurity is a problem of growing relevance that impacts all facets of society. As a result, many researchers have become interested in studying cybercriminals and online hacker communities in order to develop more effective cyber defenses. In particular, analysis of hacker community contents may reveal existing and emerging threats that pose great risk to individuals, businesses, and government. Thus, we are interested in developing an automated methodology for identifying tangible and verifiable evidence of potential threats within hacker forums, IRC channels, and carding shops. To identify threats, we couple machine learning methodology with information retrieval techniques. Our approach allows us to distill potential threats from the entirety of collected hacker contents. We present several examples of identified threats found through our analysis techniques. Results suggest that hacker communities can be analyzed to aid in cyber threat detection, thus providing promising direction for future work.
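A generic sketch of coupling IR-style features with a classifier to surface threat-related posts for analyst review; the features, model, and labels are placeholders rather than the authors' exact methodology.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_threat_classifier(posts, labels):
    """posts: forum/IRC/carding-shop texts; labels: 1 = potential threat."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    return model.fit(posts, labels)

def top_threats(model, new_posts, n=20):
    """Rank unseen posts by predicted threat probability."""
    scores = model.predict_proba(new_posts)[:, 1]
    return sorted(zip(new_posts, scores), key=lambda t: -t[1])[:n]
```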
With the spectacular increase in online activities such as e-transactions, security and privacy issues are more significant than ever. Large numbers of database security breaches occur at a very high rate on a daily basis. There is therefore a crucial need in the field of database forensics to make several redundant copies of sensitive data found in database server artifacts, audit logs, caches, table storage, etc., for analysis purposes. A large volume of metadata is available in the database infrastructure for investigation purposes, but most of the effort lies in retrieving and analyzing that information from computing systems. Thus, in this paper we focus on the significance of metadata in database forensics. We propose a system that performs forensic analysis of a database by generating its metadata file independently of the DBMS used. We also aim to generate digital evidence against criminals for presentation in a court of law, in the form of who, what, when, where, why, and how the fraudulent transaction occurred. We therefore present a system to detect major database attacks as well as anti-forensic attacks by developing an open-source database forensics tool. Finally, we point out the challenges in the field of forensics and how these challenges can be used as opportunities to stimulate the area of database forensics.
The EIIM model for ER allows for the creation and maintenance of persistent entity identity structures. It accomplishes this through a collection of batch configurations that allow updates and asserted fixes to be made to the identity knowledge base (IKB). The model also provides a batch IR configuration that performs no maintenance activity but instead allows access to the identity information. This batch IR configuration is limited in a few ways: it is driven by the same rules used for maintaining the IKB, has no inherent method to identify "close" matches, and can only identify and return positive matches. By decoupling this configuration and moving it into an interactive role under the umbrella of an Identity Management Service, a more robust access method can be provided for the use of identity information. This more robust access improves the quality of the information along multiple Information Quality dimensions.
This paper presents a novel visual analytics technique developed to support exploratory search tasks over event data document collections. The technique supports discovery and exploration by clustering results and overlaying cluster summaries onto coordinated timeline and map views. Users can also explore and interact with search results by selecting clusters to filter and re-cluster the data, with animation used to smooth the transition between views. The technique demonstrates a number of advantages over alternative methods for displaying and exploring geo-referenced search results and spatio-temporal data. Firstly, cluster summaries can be presented in a manner that makes them easy to read and scan. Listing representative events from each cluster also helps the process of discovery by preserving the diversity of results. Finally, clicking on visual representations of geo-temporal clusters provides a quick and intuitive way to navigate across space and time simultaneously, which removes the need to overload users with too many event labels at any one time. The technique was evaluated with a group of nineteen users and compared with an equivalent text-based exploratory search engine.
Keeping up with rapid advances in research in various fields of Engineering and Technology is a challenging task. Decision makers, including academics, program managers, venture capital investors, industry leaders and funding agencies, not only need to be abreast of the latest developments but also be able to assess the effect of growth in certain areas on their core business. Though analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections like research publications, analyst reports, patent applications, competitor information, etc., to help them finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics and information visualization to aid the process of retrieval and analytics. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We present several methods and techniques that help in the analysis and better comprehension of search results. We also present methods for generating insights about emerging and popular trends in research, along with contextual differences between academic research and patenting profiles, and novel techniques for presenting topic evolution that help users understand how a particular area has evolved over time.
Efficient and secure search on encrypted data is an important problem in computer science. Users with large amounts of data or information in multiple documents face problems with storage and security. Cloud services have become popular due to the reduced cost of storage and flexibility of use, but there is a risk of data loss, misuse and theft. The reliability and security of data stored in the cloud is a matter of concern, specifically for critical applications and those for which security and privacy of the data are important. Cryptographic techniques provide solutions for preserving the confidentiality of data but make the data unusable for many applications. In this paper we report a novel approach to securely store data at a remote location and perform search in constant time without the need to decrypt documents. We use Bloom filters to perform simple as well as advanced search operations such as case-sensitive search, sentence search and approximate search.
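A minimal sketch of Bloom-filter-based keyword search over encrypted documents: per document, keywords are inserted into a Bloom filter under a keyed hash, so the server can test keyword membership in constant time without decrypting the document itself. The filter parameters and keying scheme are illustrative, not the paper's construction.

```python
import hashlib
import hmac
from math import ceil

class KeywordIndex:
    def __init__(self, key: bytes, m_bits: int = 4096, k_hashes: int = 7):
        self.key, self.m, self.k = key, m_bits, k_hashes
        self.bits = bytearray(ceil(m_bits / 8))   # one filter per document

    def _positions(self, word: str):
        # k keyed hash positions so the server cannot guess keywords offline
        for i in range(self.k):
            digest = hmac.new(self.key, f"{i}:{word}".encode(), hashlib.sha256)
            yield int.from_bytes(digest.digest()[:8], "big") % self.m

    def add(self, word: str):
        for pos in self._positions(word.lower()):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word: str) -> bool:
        # constant-time membership test; false positives possible, no false negatives
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word.lower()))
```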
In the era of big data, many users and companies have started to move their data to cloud storage to simplify data management and reduce data maintenance cost. However, security and privacy issues become major concerns because third-party cloud service providers are not always trustworthy. Although data contents can be protected by encryption, the access patterns, which contain important information, are still exposed to clouds or malicious attackers. In this paper, we apply the ORAM algorithm to enable privacy-preserving access to big data deployed in distributed file systems built upon hundreds or thousands of servers in a single or multiple geo-distributed cloud sites. Since the ORAM algorithm would lead to serious access-load imbalance among storage servers, we study a data placement problem to achieve a load-balanced storage system with improved availability and responsiveness. Due to the NP-hardness of this problem, we propose a low-complexity algorithm that can deal with large-scale problem sizes with respect to big data. Extensive simulations show that our proposed algorithm finds results close to the optimal solution and significantly outperforms a random data placement algorithm.
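A low-complexity placement sketch in the spirit of the load-balancing goal above, not the paper's actual algorithm: assign each data block, weighted by its expected ORAM access load, to the currently least-loaded server (the classic longest-processing-time heuristic).

```python
import heapq

def place_blocks(block_loads, n_servers):
    """block_loads: {block_id: expected accesses}; returns {block_id: server}."""
    heap = [(0.0, s) for s in range(n_servers)]   # (current load, server id)
    heapq.heapify(heap)
    placement = {}
    # place the heaviest blocks first, always onto the least-loaded server
    for block, load in sorted(block_loads.items(), key=lambda t: -t[1]):
        server_load, server = heapq.heappop(heap)
        placement[block] = server
        heapq.heappush(heap, (server_load + load, server))
    return placement
```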
The revolution in the field of technology has led to the development of cloud computing, which delivers on-demand, easy access to large shared pools of online stored data, software and applications. It has changed the way IT resources are utilized, but at the cost of security breaches such as phishing attacks, impersonation, and loss of confidentiality and integrity. This research work therefore deals with the core problem of providing absolute security to the mobile consumers of a public cloud, improving user mobility by allowing users to access data stored on the public cloud securely using tokens, without depending upon a third party to generate them. This paper presents an approach that simplifies the process of authenticating and authorizing mobile users by implementing a middleware-centric framework called the MiLAMob model together with a large online data storage system, i.e., HDFS. It allows consumers to access data from HDFS via mobile devices or through social networking sites, e.g., Facebook, Gmail and Yahoo, using the OAuth 2.0 protocol. For authentication, tokens are generated using a one-time password generation technique and then encrypted using the AES method. By implementing flexible user-based policies and standards, the model improves the authorization process.
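An illustrative sketch of the token step only, assuming a random one-time password and AES-GCM from the `cryptography` library; the key handling, token format, and function names are assumptions, not the MiLAMob specification.

```python
import os
import secrets
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def issue_token(user_id: str, key: bytes) -> tuple[bytes, bytes]:
    """Generate a one-time password and return it AES-encrypted as an opaque token."""
    otp = secrets.token_hex(16)                     # one-time password
    nonce = os.urandom(12)                          # unique per token
    token = AESGCM(key).encrypt(nonce, f"{user_id}:{otp}".encode(), None)
    return nonce, token

def verify_token(nonce: bytes, token: bytes, key: bytes) -> str:
    # Raises if the token was tampered with; returns "user_id:otp" otherwise.
    return AESGCM(key).decrypt(nonce, token, None).decode()

# key = AESGCM.generate_key(bit_length=256)   # held by the middleware, never by clients
```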
Source code authorship attribution is the task of determining the author of source code whose author is not explicitly known. One specific method of source code authorship attribution that has been shown to be extremely effective is the SCAP method. This method, however, relies on a parameter L that has heretofore been quite nebulous. In the SCAP method, each candidate author's known work is represented as a profile of that author, where the parameter L defines the profile's maximum length. In this study, alternative approaches for selecting a value for L were investigated. Several alternative approaches were found to perform better than the baseline approach used in the SCAP method. The approach that performed the best was empirically shown to improve the performance from 91.0% to 97.2% measured as a percentage of documents correctly attributed using a data set consisting of 7,231 programs written in Java and C++.
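A sketch of the SCAP-style profile and scoring referenced above: an author's profile is the L most frequent character n-grams of their known code, and attribution scores an unknown file by profile overlap (simplified profile intersection). The n-gram length shown is a commonly used choice, not necessarily the study's exact setting.

```python
from collections import Counter

def profile(source_code: str, L: int, n: int = 6) -> set[str]:
    """The L most frequent character n-grams of an author's known code."""
    grams = Counter(source_code[i:i + n] for i in range(len(source_code) - n + 1))
    return {g for g, _ in grams.most_common(L)}

def attribute(unknown_code: str, author_profiles: dict[str, set[str]],
              L: int, n: int = 6) -> str:
    """Return the author whose profile shares the most n-grams with the unknown file."""
    target = profile(unknown_code, L, n)
    return max(author_profiles, key=lambda a: len(author_profiles[a] & target))
```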
Effective Personalized Mobile Search Using KNN implements an architecture to improve the effectiveness of personalization for users over large data sets while maintaining the security of the data. User preferences are gathered through clickthrough data, which is sent to the server in encrypted form and classified into content concepts and location concepts. To improve classification and minimize processing time, the KNN (K-Nearest Neighbor) algorithm is used. The identified preferences (location and content) are merged to provide effective preferences to the user. The system makes use of four entropies to balance the weight between content concepts and location concepts. The system implements a client-server architecture: the role of the client is to collect user queries and maintain them in files for future reference, while the server is responsible for tasks such as training, re-ranking of the search results, and concept extraction. User preference privacy is ensured through privacy parameters as well as encryption techniques. Experiments were carried out on an Android-based mobile device. The results show that the system gives significantly improved results over the previous algorithm for large data sets while maintaining security.
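A rough sketch of the KNN classification step only: clickthrough snippets are labelled as content or location concepts by majority vote among their k nearest labelled neighbours. The features, labels, and example inputs are placeholders, not the system's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def train_concept_classifier(snippets, concept_labels, k=5):
    """concept_labels: e.g. 'content' or 'location' per clickthrough snippet."""
    model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=k))
    return model.fit(snippets, concept_labels)

# Example (hypothetical data):
# clf = train_concept_classifier(["cheap hotels", "taipei 101", "running shoes"],
#                                ["content", "location", "content"])
# clf.predict(["night market"])
```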
This paper presents a novel design of content fingerprints based on maximization of the mutual information across the distortion channel. We use the information bottleneck method to optimize the filters and quantizers that generate these fingerprints. A greedy optimization scheme is used to select filters from a dictionary and allocate fingerprint bits. We test the performance of this method for audio fingerprinting and show substantial improvements over existing learning-based fingerprints.
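A simplified sketch of the greedy selection idea: score each candidate filter by the empirical mutual information between the bit it produces on clean frames and on their distorted counterparts, then keep the best-scoring filters. The paper's information-bottleneck optimization of filters and quantizers is richer than this illustrative version.

```python
import numpy as np

def bit(frames, filt, threshold=0.0):
    """1-bit quantizer: sign of the filter response per frame."""
    return (frames @ filt > threshold).astype(int)

def mutual_information(a, b):
    """Empirical MI (in bits) between two binary sequences."""
    joint = np.histogram2d(a, b, bins=2)[0] / len(a)
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def select_filters(clean, distorted, dictionary, n_bits):
    """clean, distorted: (N, D) aligned frames; dictionary: list of (D,) filters."""
    scores = [mutual_information(bit(clean, f), bit(distorted, f))
              for f in dictionary]
    return list(np.argsort(scores)[::-1][:n_bits])   # indices of the chosen filters
```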