Biblio

Filters: Keyword is information retrieval
2018-01-10
Oosterhuis, Harrie, de Rijke, Maarten.  2017.  Sensitive and Scalable Online Evaluation with Theoretical Guarantees. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. :77–86.
Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.
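
As a rough illustration of how multileaved comparisons work in general, the sketch below implements team-draft multileaving, an earlier method from this line of work, not PPM itself: each ranker contributes documents to a combined list, and clicks are credited to the ranker whose "team" supplied the clicked document. All rankings and clicks are hypothetical.

```python
import random

def team_draft_multileave(rankings, length):
    """Team-draft multileaving: rankers take turns (in random order each
    round) contributing their highest-ranked document not yet placed."""
    combined, teams = [], {i: set() for i in range(len(rankings))}
    while len(combined) < length:
        order = list(range(len(rankings)))
        random.shuffle(order)
        for i in order:
            if len(combined) >= length:
                break
            doc = next(d for d in rankings[i] if d not in combined)
            combined.append(doc)
            teams[i].add(doc)
    return combined, teams

def credit_clicks(clicked_docs, teams):
    """Credit each click to the ranker whose 'team' contributed the document."""
    wins = {i: 0 for i in teams}
    for doc in clicked_docs:
        for i, docs in teams.items():
            if doc in docs:
                wins[i] += 1
    return wins

# Hypothetical rankings from three rankers over documents a-f.
rankings = [list("abcdef"), list("bacfed"), list("cabdef")]
combined, teams = team_draft_multileave(rankings, 6)
print(combined, credit_clicks(["b", "c"], teams))
```
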
2017-12-20
Chen, C. K., Lan, S. C., Shieh, S. W..  2017.  Shellcode detector for malicious document hunting. 2017 IEEE Conference on Dependable and Secure Computing. :527–528.

Advanced Persistent Threat (APT) attacks have become a major network threat in recent years. Among APT attack techniques, sending a phishing email with malicious documents attached is considered one of the most effective. Although many users have the impression that documents are harmless, a malicious document may in fact contain shellcode to attack victims. To cope with the problem, we design and implement a malicious document detector called Forensor to differentiate malicious documents. Forensor integrates several open-source tools and methods. It first inspects the file format to retrieve the objects inside the document, and then automatically decrypts simple encryption methods, e.g., XOR, rot, and shift, commonly used in malware, to discover potential shellcode. An emulator is used to verify the presence of shellcode. If shellcode is discovered, the file is considered malicious. The experiment used 9,000 benign files and more than 10,000 malware samples from a well-known sample-sharing website. The result shows no false negatives and only 2 false positives.
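
To make the decryption step concrete, here is a minimal sketch of the brute-force de-obfuscation idea: try every single-byte XOR key and byte rotation on an embedded stream and look for byte patterns that commonly indicate shellcode. The signatures are illustrative placeholders, not Forensor's actual rules.

```python
# Illustrative shellcode indicators; a real detector uses a richer rule set.
SIGNATURES = [
    b"\x90" * 8,            # NOP sled
    b"cmd.exe",             # command execution
    b"LoadLibraryA",        # Windows API resolution
    b"http://",             # download-and-execute payloads
]

def xor(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)

def rot(data: bytes, shift: int) -> bytes:
    return bytes((b + shift) % 256 for b in data)

def scan_stream(data: bytes):
    """Yield (transform, parameter, signature) for every decoding that
    exposes a suspicious pattern."""
    for k in range(256):
        decoded = xor(data, k)
        for sig in SIGNATURES:
            if sig in decoded:
                yield ("xor", k, sig)
    for s in range(1, 256):
        decoded = rot(data, s)
        for sig in SIGNATURES:
            if sig in decoded:
                yield ("rot", s, sig)

# Example: a stream hiding "cmd.exe" behind XOR key 0x41.
hidden = xor(b"...padding...cmd.exe...", 0x41)
print(list(scan_stream(hidden)))
```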

2017-12-12
Suh, Y. K., Ma, J..  2017.  SuperMan: A Novel System for Storing and Retrieving Scientific-Simulation Provenance for Efficient Job Executions on Computing Clusters. 2017 IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS*W). :283–288.

Compute-intensive simulations typically place substantial workloads on an online simulation platform backed by limited computing clusters and storage resources. Some (or most) of the simulations initiated by users may involve input parameters/files that have already been provided by other (or the same) users in the past. Unfortunately, these duplicate simulations may degrade the performance of the platform by drastically consuming the limited resources shared by a number of users on the platform. To minimize or avoid conducting repeated simulations, we present a novel system, called SUPERMAN (SimUlation ProvEnance Recycling MANager), that can record simulation provenances and recycle the results of past simulations. This system presents a great opportunity to not only reutilize existing results but also perform various analytics helpful for those who are not familiar with the platform. The system also offers interoperability across other systems by collecting the provenances in a standardized format. In our simulated experiments, we found that our system could answer over half of past computing jobs without actually executing them.
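
A minimal sketch of the recycling idea, assuming a simple in-memory store (SUPERMAN's actual provenance format and backend are not described here): simulations are keyed by a hash of their parameters and input-file digests, so duplicate jobs are answered from the cache.

```python
import hashlib
import json

class SimulationCache:
    """Provenance-based result recycling: key each job by a canonical hash
    of its input parameters and input-file digests."""

    def __init__(self):
        self._store = {}  # provenance key -> cached result

    @staticmethod
    def key(params: dict, input_files: dict) -> str:
        canonical = json.dumps(
            {"params": params,
             "files": {name: hashlib.sha256(data).hexdigest()
                       for name, data in input_files.items()}},
            sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def run(self, params, input_files, simulate):
        k = self.key(params, input_files)
        if k in self._store:               # duplicate job: recycle the result
            return self._store[k]
        result = simulate(params, input_files)
        self._store[k] = result
        return result

cache = SimulationCache()
out1 = cache.run({"dt": 0.01}, {"mesh.dat": b"..."}, lambda p, f: "computed")
out2 = cache.run({"dt": 0.01}, {"mesh.dat": b"..."}, lambda p, f: "never runs")
print(out1, out2)  # both "computed"; the second call hits the cache
```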

Zhou, G., Huang, J. X..  2017.  Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering. 29:1226–1239.

Community question answering (cQA) has become an important issue due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem poses a new challenge for question retrieval in cQA. In this paper, we propose to model and learn distributed word representations with metadata of category information within cQA pages for question retrieval, using two novel category-powered models. One is a basic category-powered model called MB-NET, and the other is an enhanced category-powered model called ME-NET, which can better learn the distributed word representations and alleviate the lexical gap problem. To deal with the variable size of word representation vectors, we employ the Fisher kernel framework to transform them into fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches can significantly outperform state-of-the-art retrieval models for question retrieval in cQA. Moreover, we further evaluate our approaches in large-scale automatic evaluation experiments. The evaluation results show that promising and significant performance improvements can be achieved.
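
To illustrate the Fisher kernel step that maps variable-size sets of word vectors to fixed-length vectors, here is a simplified sketch using the gradient with respect to the means of a diagonal GMM; the paper's exact formulation and the MB-NET/ME-NET representations themselves are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(word_vectors, gmm):
    """Fixed-length encoding of a variable-size set of word vectors:
    gradient of the log-likelihood w.r.t. the GMM means (a common
    simplified Fisher vector)."""
    X = np.atleast_2d(word_vectors)
    gamma = gmm.predict_proba(X)                      # (N, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)                 # diagonal std devs (K, D)
    fv = []
    for k in range(gmm.n_components):
        diff = (X - gmm.means_[k]) / sigma[k]         # whitened residuals
        g = gamma[:, k:k + 1] * diff
        fv.append(g.sum(axis=0) / (len(X) * np.sqrt(gmm.weights_[k])))
    return np.concatenate(fv)                         # length K * D, fixed

rng = np.random.default_rng(0)
corpus_vectors = rng.normal(size=(1000, 50))          # stand-in word embeddings
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(corpus_vectors)
q1 = fisher_vector(rng.normal(size=(5, 50)), gmm)     # 5-word question
q2 = fisher_vector(rng.normal(size=(12, 50)), gmm)    # 12-word question
print(q1.shape, q2.shape)                             # both (400,)
```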

2017-11-03
Zulkarnine, A. T., Frank, R., Monk, B., Mitchell, J., Davies, G..  2016.  Surfacing collaborated networks in dark web to find illicit and criminal content. 2016 IEEE Conference on Intelligence and Security Informatics (ISI). :109–114.
The Tor Network, a hidden part of the Internet, is becoming an ideal hosting ground for illegal activities and services, including large drug markets, financial fraud, espionage, and child sexual abuse. Researchers and law enforcement rely on manual investigations, which are both time-consuming and ultimately inefficient. The first part of this paper explores illicit and criminal content identified by prominent researchers in the dark web. We previously developed a web crawler that automatically searched websites on the internet based on pre-defined keywords and followed the hyperlinks in order to create a map of the network. This crawler has previously demonstrated success in locating and extracting data on child exploitation images, videos, keywords and linkages on the public internet. However, because Tor functions differently at the TCP level and uses socket connections, crawling Tor poses further technical challenges. Some of the other inherent challenges for advanced Tor crawling include scalability, content selection tradeoffs, and social obligation. We discuss these challenges and the measures taken to meet them. Our modified web crawler for Tor, termed the “Dark Crawler”, has been able to access Tor while simultaneously accessing the public internet. We present initial findings regarding what extremist and terrorist content is present in Tor and how this content is connected in a mapped network that facilitates dark web crimes. Our results so far indicate that the most popular websites in the dark web act as catalysts for dark web expansion by providing the necessary knowledge base, support and services to build Tor hidden services and onion websites.
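
As a sketch of the access layer such a crawler needs (not the Dark Crawler's actual code), the snippet below routes .onion requests through a local Tor SOCKS proxy while fetching clearnet URLs directly, and does a simple keyword-matching crawl. It assumes Tor is listening on 127.0.0.1:9050 and that `requests[socks]` is installed.

```python
import re
from collections import deque

import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve DNS inside Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch(url: str, timeout: int = 60) -> str:
    """Fetch a page, using the Tor proxy only for .onion hosts."""
    proxies = TOR_PROXIES if ".onion" in url else None
    resp = requests.get(url, proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    return resp.text

def crawl(seed_urls, keywords, max_pages=100):
    """Breadth-first crawl that records pages matching any keyword."""
    seen, hits, frontier = set(), [], deque(seed_urls)
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except requests.RequestException:
            continue
        if any(k in html.lower() for k in keywords):
            hits.append(url)
        frontier.extend(re.findall(r'href="(https?://[^"]+)"', html))
    return hits
```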
Iliou, C., Kalpakis, G., Tsikrika, T., Vrochidis, S., Kompatsiaris, I..  2016.  Hybrid Focused Crawling for Homemade Explosives Discovery on Surface and Dark Web. 2016 11th International Conference on Availability, Reliability and Security (ARES). :229–234.
This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly traverse the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the network type. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed approach both for the Surface and the Dark Web.
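
A toy sketch of classifier-guided hyperlink selection, the core of any focused crawler: score candidate links with a relevance classifier over their anchor text and expand the most promising link first. The training anchors below are invented, and the per-network adaptations the paper describes are not shown.

```python
import heapq

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented labeled anchors standing in for a real topical training set.
train_anchors = ["synthesis instructions", "chemistry precursors",
                 "forum rules", "contact us", "payment methods"]
train_labels = [1, 1, 0, 0, 0]                   # 1 = on-topic

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_anchors), train_labels)

frontier = []  # max-heap via negated classifier scores

def enqueue(url: str, anchor_text: str):
    score = clf.predict_proba(vec.transform([anchor_text]))[0, 1]
    heapq.heappush(frontier, (-score, url))

def next_link():
    neg_score, url = heapq.heappop(frontier)
    return url, -neg_score

enqueue("http://example.org/a", "precursor chemistry synthesis")
enqueue("http://example.org/b", "contact us")
print(next_link())   # the on-topic link comes out first
```
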
2017-09-19
Djellali, Choukri, Adda, Mehdi.  2016.  A New Scalable Aggregation Scheme for Fuzzy Clustering Taking Unstructured Textual Resources As a Case. Proceedings of the 20th International Database Engineering & Applications Symposium. :199–204.

The performance of clustering is a crucial challenge, especially for pattern recognition. Model aggregation has a positive impact on the efficiency of data clustering: aggregating the resulting clustering models yields more cluttered decision boundaries. In this paper, we study an aggregation scheme to improve the stability and accuracy of clustering, which makes it possible to find a reliable and robust clustering model. We demonstrate the advantages of our aggregation method by running Fuzzy C-Means (FCM) clustering on the Reuters-21578 corpus. Experimental studies showed that our scheme optimized the bias-variance trade-off of the selected model and achieved enhanced clustering for unstructured textual resources.
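
One simple way to realize such an aggregation scheme (not necessarily the authors' exact construction) is to average the fuzzy co-association matrices of several FCM runs started from different initializations, as sketched below with a hand-rolled FCM.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means returning the membership matrix U (N x c)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))        # rows sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        w = d ** (-2.0 / (m - 1.0))                    # standard FCM update
        U = w / w.sum(axis=1, keepdims=True)
    return U

def aggregate(X, c, runs=10):
    """Average the fuzzy co-association matrices of several FCM runs,
    yielding a stabler similarity structure than any single run."""
    S = np.zeros((len(X), len(X)))
    for r in range(runs):
        U = fcm(X, c, seed=r)
        S += U @ U.T          # co-association: sum_k u_ik * u_jk
    return S / runs

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(6, 1, (30, 5))])
S = aggregate(X, c=2)
print(S[:30, :30].mean(), S[:30, 30:].mean())  # within >> between similarity
```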

2017-05-19
Gupta, Dhruv.  2016.  Event Search and Analytics: Detecting Events in Semantically Annotated Corpora for Search & Analytics. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. :705–705.

In this article, I present the questions that I seek to answer in my PhD research. I propose to analyze natural language text with the help of semantic annotations and to mine important events for navigating large text corpora. Semantic annotations such as named entities, geographic locations, and temporal expressions can help us mine events from a given corpus. These events then provide a useful means of discovering the knowledge locked inside the collection. I pose three problems that can help unlock this knowledge vault in semantically annotated text corpora: i. identifying important events; ii. semantic search; and iii. event analytics.

2017-03-20
Shahriar, Hossain, Haddad, Hisham.  2016.  Object Injection Vulnerability Discovery Based on Latent Semantic Indexing. Proceedings of the 31st Annual ACM Symposium on Applied Computing. :801–807.

Object Injection Vulnerability (OIV) is an emerging threat for web applications. It involves accepting external inputs during deserialization operations and using those inputs for sensitive operations such as file access, modification, and deletion. The challenge is the automation of the detection process. When the application size is large, it becomes hard to perform traditional approaches such as data flow analysis. Recent approaches fall short in narrowing down the list of source files to aid developers in discovering OIV, and in the flexibility to check for the presence of OIV through various known APIs. In this work, we address these limitations by exploring a concept borrowed from the information retrieval domain, Latent Semantic Indexing (LSI), to discover OIV. The approach analyzes application source code and builds an initial term-document matrix, which is then transformed systematically using singular value decomposition to reduce the search space. The approach identifies a small set of documents (source files) that are likely responsible for OIVs. We apply the LSI concept to three open source PHP applications that have been reported to contain OIVs. Our initial evaluation results suggest that the proposed LSI-based approach can identify known OIVs and uncover new vulnerabilities.
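
A minimal sketch of the LSI pipeline described above: build a TF-IDF term-document matrix over source files, reduce it with truncated SVD, and rank files against a query of deserialization-related terms. The toy "files" and query terms are illustrative, not the paper's feature set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented token streams standing in for real PHP source files.
source_files = {
    "upload.php":  "move_uploaded_file sanitize filename copy",
    "session.php": "unserialize cookie data __destruct file_put_contents",
    "render.php":  "echo template render html output",
}

vec = TfidfVectorizer()
X = vec.fit_transform(source_files.values())          # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)    # reduce the search space
X_lsi = svd.fit_transform(X)

query = vec.transform(["unserialize __destruct file_put_contents"])
scores = cosine_similarity(svd.transform(query), X_lsi)[0]
ranked = sorted(zip(source_files, scores), key=lambda t: -t[1])
print(ranked)   # session.php should rank as the likeliest OIV candidate
```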

2017-03-07
Sadri, Mehdi, Mehrotra, Sharad, Yu, Yaming.  2016.  Online Adaptive Topic Focused Tweet Acquisition. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :2353–2358.

Twitter provides a public streaming API that is strictly limited, making it difficult to simultaneously achieve good coverage and relevance when monitoring tweets for a specific topic of interest. In this paper, we address the tweet acquisition challenge to enhance monitoring of tweets based on the client/application needs in an online adaptive manner, such that the quality and quantity of the results improve over time. We propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track based on an explore-exploit strategy. Our experimental studies show that TAS significantly improves recall of relevant tweets, and that its performance improves when the topics are more specific.
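
As a concrete (if simplified) illustration of the explore-exploit idea, the sketch below selects tracking phrases epsilon-greedily by their observed relevance rates; TAS's actual strategy and relevance feedback loop are more sophisticated.

```python
import random

class PhraseSelector:
    """Epsilon-greedy phrase selection: mostly track the phrases with the
    best observed relevance rate, but occasionally explore."""

    def __init__(self, phrases, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {p: [0, 0] for p in phrases}   # phrase -> [relevant, total]

    def select(self, k):
        def rate(p):
            rel, tot = self.stats[p]
            return rel / tot if tot else float("inf")   # try untried phrases first
        if random.random() < self.epsilon:
            return random.sample(list(self.stats), k)        # explore
        return sorted(self.stats, key=rate, reverse=True)[:k]  # exploit

    def update(self, phrase, tweets, is_relevant):
        for t in tweets:
            self.stats[phrase][0] += is_relevant(t)
            self.stats[phrase][1] += 1

sel = PhraseSelector(["flu", "vaccine", "fever", "cold"])
tracked = sel.select(k=2)
sel.update(tracked[0], ["toy tweet about flu shots"], lambda t: "flu" in t)
print(tracked, sel.stats)
```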

Benjamin, V., Li, W., Holt, T., Chen, H..  2015.  Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops. 2015 IEEE International Conference on Intelligence and Security Informatics (ISI). :85–90.

Cybersecurity is a problem of growing relevance that impacts all facets of society. As a result, many researchers have become interested in studying cybercriminals and online hacker communities in order to develop more effective cyber defenses. In particular, analysis of hacker community contents may reveal existing and emerging threats that pose great risk to individuals, businesses, and government. Thus, we are interested in developing an automated methodology for identifying tangible and verifiable evidence of potential threats within hacker forums, IRC channels, and carding shops. To identify threats, we couple machine learning methodology with information retrieval techniques. Our approach allows us to distill potential threats from the entirety of collected hacker contents. We present several examples of identified threats found through our analysis techniques. Results suggest that hacker communities can be analyzed to aid in cyber threat detection, thus providing promising direction for future work.
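
A toy sketch of the machine learning plus information retrieval coupling described above: represent hacker-forum posts as TF-IDF vectors and train a linear classifier to flag threat-bearing content. The posts and labels are invented stand-ins, not data from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented forum posts; 1 = potential threat, 0 = benign chatter.
posts = [
    "selling fresh dumps and fullz, escrow accepted",
    "new 0day exploit for router firmware, source included",
    "how do I install this linux distro?",
    "great tutorial, thanks for sharing",
]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LinearSVC().fit(vec.fit_transform(posts), labels)

new_post = ["private exploit pack for sale, bitcoin only"]
print(clf.predict(vec.transform(new_post)))   # expect [1]
```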

2015-05-06
Khanuja, H., Suratkar, S.S..  2014.  Role of metadata in forensic analysis of database attacks. Advance Computing Conference (IACC), 2014 IEEE International. :457-462.

With the spectacular increase in online activities like e-transactions, security and privacy issues are more significant than ever. Large numbers of database security breaches are occurring at a very high rate on a daily basis. There is therefore a crucial need in the field of database forensics to make several redundant copies of sensitive data found in database server artifacts, audit logs, cache, table storage, etc. for analysis purposes. A large volume of metadata is available in database infrastructure for investigation purposes, but most of the effort lies in the retrieval and analysis of that information from computing systems. In this paper we therefore focus on the significance of metadata in database forensics. We propose a system to perform forensic analysis of a database by generating its metadata file independently of the DBMS used. We also aim to generate digital evidence against criminals, suitable for presentation in a court of law, in the form of the who, what, when, where, why, and how of a fraudulent transaction. To this end, we present a system that detects major database attacks as well as anti-forensic attacks, implemented as an open source database forensics tool. Finally, we point out the challenges in the field of database forensics and how these challenges can be treated as opportunities to stimulate research in the area.

Kobayashi, F., Talburt, J.R..  2014.  Decoupling Identity Resolution from the Maintenance of Identity Information. Information Technology: New Generations (ITNG), 2014 11th International Conference on. :349-354.

The EIIM (entity identity information management) model for entity resolution (ER) allows for the creation and maintenance of persistent entity identity structures. It accomplishes this through a collection of batch configurations that allow updates and asserted fixes to be made to the identity knowledge base (IKB). The model also provides a batch identity resolution (IR) configuration that performs no maintenance activity but instead allows access to the identity information. This batch IR configuration is limited in a few ways: it is driven by the same rules used for maintaining the IKB, has no inherent method to identify "close" matches, and can only identify and return positive matches. By decoupling this configuration and moving it into an interactive role under the umbrella of an Identity Management Service, a more robust access method can be provided for the use of identity information. This more robust access improves the quality of the information along multiple information quality dimensions.
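
A minimal sketch of what such an interactive identity-resolution service adds over the batch configuration: besides positive matches, it can surface "close" matches for review. Here difflib similarity stands in for whatever matching rules a real IKB would use, and the entries are invented.

```python
import difflib

# Invented identity knowledge base: entity id -> canonical name.
identity_knowledge_base = {
    "EID-001": "Jonathan Q. Smith",
    "EID-002": "Jon Smith",
    "EID-003": "Maria Garcia",
}

def resolve(query_name, cutoff_exact=0.95, cutoff_close=0.6):
    """Return both positive matches and 'close' matches for human review."""
    scored = [(difflib.SequenceMatcher(None, query_name, name).ratio(), eid, name)
              for eid, name in identity_knowledge_base.items()]
    exact = [(eid, name) for s, eid, name in scored if s >= cutoff_exact]
    close = [(eid, name) for s, eid, name in scored
             if cutoff_close <= s < cutoff_exact]
    return {"positive": exact, "close": close}

print(resolve("Jon Smyth"))   # no positive match, but close matches surface
```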

2015-05-05
Craig, P., Roa Seiler, N., Olvera Cervantes, A.D..  2014.  Animated Geo-temporal Clusters for Exploratory Search in Event Data Document Collections. Information Visualisation (IV), 2014 18th International Conference on. :157-163.

This paper presents a novel visual analytics technique developed to support exploratory search tasks over event data document collections. The technique supports discovery and exploration by clustering results and overlaying cluster summaries onto coordinated timeline and map views. Users can also explore and interact with search results by selecting clusters to filter and re-cluster the data, with animation used to smooth the transition between views. The technique demonstrates a number of advantages over alternative methods for displaying and exploring geo-referenced search results and spatio-temporal data. Firstly, cluster summaries can be presented in a manner that makes them easy to read and scan. Listing representative events from each cluster also helps the process of discovery by preserving the diversity of results. Finally, clicking on visual representations of geo-temporal clusters provides a quick and intuitive way to navigate across space and time simultaneously, removing the need to overload users with too many event labels at any one time. The technique was evaluated with a group of nineteen users and compared with an equivalent text-based exploratory search engine.

Dey, L., Mahajan, D., Gupta, H..  2014.  Obtaining Technology Insights from Large and Heterogeneous Document Collections. Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on. 1:102-109.

Keeping up with rapid advances in research in various fields of engineering and technology is a challenging task. Decision makers, including academics, program managers, venture capital investors, industry leaders and funding agencies, not only need to be abreast of the latest developments but also be able to assess the effect of growth in certain areas on their core business. Though analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections like research publications, analyst reports, patent applications, competitor information, etc. to help them finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics and information visualization to aid the process of retrieval and analytics. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We present several methods and techniques that help in the analysis and better comprehension of search results. We also present methods for generating insights about emerging and popular trends in research, along with contextual differences between academic research and patenting profiles, and novel techniques for presenting topic evolution that help users understand how a particular area has evolved over time.

Pal, S.K., Sardana, P., Sardana, A..  2014.  Efficient search on encrypted data using bloom filter. Computing for Sustainable Global Development (INDIACom), 2014 International Conference on. :412-416.

Efficient and secure search on encrypted data is an important problem in computer science. Users with large amounts of data or information spread across multiple documents face problems with its storage and security. Cloud services have also become popular due to the reduced cost of storage and flexibility of use, but they carry risks of data loss, misuse and theft. The reliability and security of data stored in the cloud is a matter of concern, specifically for critical applications and those for which the security and privacy of the data are important. Cryptographic techniques provide solutions for preserving the confidentiality of data but make the data unusable for many applications. In this paper we report a novel approach to securely store data at a remote location and perform search in constant time without the need to decrypt documents. We use Bloom filters to perform simple as well as advanced search operations like case-sensitive search, sentence search and approximate search.
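
A minimal sketch of the scheme's core data structure, under the assumption of one keyed Bloom filter per document: keywords are inserted under a client-side secret via HMAC, so the server can answer membership queries in constant time without decrypting anything or learning the plaintext keywords.

```python
import hashlib
import hmac

class KeyedBloomIndex:
    """Per-document searchable index: a Bloom filter whose hash positions
    are derived with a secret HMAC key held only by the client."""

    def __init__(self, secret: bytes, m: int = 1024, k: int = 5):
        self.secret, self.m, self.k = secret, m, k
        self.bits = bytearray(m // 8)

    def _positions(self, word: str):
        for i in range(self.k):
            mac = hmac.new(self.secret, f"{i}:{word}".encode(),
                           hashlib.sha256).digest()
            yield int.from_bytes(mac[:8], "big") % self.m

    def add(self, word: str):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, word: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))

key = b"client-side secret, never sent to the server"
index = KeyedBloomIndex(key)
for w in "invoice for cloud storage audit".split():
    index.add(w)
print(index.might_contain("audit"), index.might_contain("salary"))
# True, False (up to the Bloom filter's small false-positive rate)
```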

Li, Peng, Guo, Song.  2014.  Load balancing for privacy-preserving access to big data in cloud. Computer Communications Workshops (INFOCOM WKSHPS), 2014 IEEE Conference on. :524-528.

In the era of big data, many users and companies have started to move their data to cloud storage to simplify data management and reduce data maintenance cost. However, security and privacy issues become major concerns because third-party cloud service providers are not always trustworthy. Although data contents can be protected by encryption, the access patterns, which contain important information, are still exposed to clouds or malicious attackers. In this paper, we apply the ORAM algorithm to enable privacy-preserving access to big data deployed in distributed file systems built upon hundreds or thousands of servers in a single or multiple geo-distributed cloud sites. Since the ORAM algorithm would lead to serious access-load imbalance among storage servers, we study a data placement problem to achieve a load-balanced storage system with improved availability and responsiveness. Due to the NP-hardness of this problem, we propose a low-complexity algorithm that can deal with large-scale problem sizes with respect to big data. Extensive simulations show that our proposed algorithm finds results close to the optimal solution and significantly outperforms a random data placement algorithm.
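
As a flavor of a low-complexity placement heuristic (not the paper's exact algorithm), the sketch below greedily assigns blocks, in decreasing order of expected access rate, to the currently least-loaded server. The access rates are hypothetical.

```python
import heapq

def place(blocks, num_servers):
    """Greedy load-balanced data placement: always put the next-hottest
    block on the currently least-loaded server."""
    heap = [(0.0, s) for s in range(num_servers)]   # (load, server id)
    heapq.heapify(heap)
    assignment = {}
    for block, rate in sorted(blocks.items(), key=lambda kv: -kv[1]):
        load, server = heapq.heappop(heap)
        assignment[block] = server
        heapq.heappush(heap, (load + rate, server))
    return assignment

# Hypothetical per-block access rates induced by ORAM requests.
blocks = {"b1": 9.0, "b2": 7.5, "b3": 3.0, "b4": 2.5, "b5": 2.0, "b6": 1.0}
print(place(blocks, num_servers=3))
```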

Singh, S., Sharma, S..  2014.  Improving security mechanism to access HDFS data by mobile consumers using middleware-layer framework. Computing, Communication and Networking Technologies (ICCCNT), 2014 International Conference on. :1-7.

Revolution in the field of technology has led to the development of cloud computing, which delivers on-demand and easy access to large shared pools of online stored data, software and applications. It has changed the way IT resources are utilized, but at the cost of security breaches such as phishing attacks, impersonation, and loss of confidentiality and integrity. This research work therefore deals with the core problem of providing absolute security to the mobile consumers of a public cloud: improving user mobility by letting users access data stored on the public cloud securely using tokens, without depending upon a third party to generate them. This paper presents an approach that simplifies the process of authenticating and authorizing mobile users by implementing a middleware-centric framework called the MiLAMob model with a huge online data storage system, i.e., HDFS. It allows consumers to access data from HDFS via mobile devices or through social networking sites, e.g., Facebook, Gmail, Yahoo, etc., using the OAuth 2.0 protocol. For authentication, tokens are generated using a one-time password generation technique and then encrypted using the AES method. By implementing flexible user-based policies and standards, this model improves the authorization process.
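
A minimal sketch of the token step under stated assumptions: a one-time password derived HOTP-style (RFC 4226) and a token encrypted with AES. AES-GCM is chosen here for illustration, since the abstract only says "AES method"; MiLAMob's actual token format is not shown.

```python
import hashlib
import hmac
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 one-time password from a shared secret and counter."""
    mac = hmac.new(secret, counter.to_bytes(8, "big"), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                       # dynamic truncation
    code = int.from_bytes(mac[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

secret = os.urandom(20)                           # shared OTP seed
token = f"user=alice;otp={hotp(secret, counter=1)}".encode()

aes_key = AESGCM.generate_key(bit_length=128)
nonce = os.urandom(12)
ciphertext = AESGCM(aes_key).encrypt(nonce, token, None)

# The consumer presents the ciphertext; the middleware decrypts and verifies.
print(AESGCM(aes_key).decrypt(nonce, ciphertext, None))
```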

2015-05-04
Tennyson, M.F., Mitropoulos, F.J..  2014.  Choosing a profile length in the SCAP method of source code authorship attribution. SOUTHEASTCON 2014, IEEE. :1-6.

Source code authorship attribution is the task of determining the author of source code whose author is not explicitly known. One specific method of source code authorship attribution that has been shown to be extremely effective is the SCAP method. This method, however, relies on a parameter L that has heretofore been quite nebulous. In the SCAP method, each candidate author's known work is represented as a profile of that author, where the parameter L defines the profile's maximum length. In this study, alternative approaches for selecting a value for L were investigated. Several alternative approaches were found to perform better than the baseline approach used in the SCAP method. The approach that performed the best was empirically shown to improve the performance from 91.0% to 97.2% measured as a percentage of documents correctly attributed using a data set consisting of 7,231 programs written in Java and C++.
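
For context, a bare-bones sketch of the SCAP representation the parameter L controls: an author profile is the set of the L most frequent byte-level n-grams of their known code, and attribution picks the author whose profile shares the most n-grams with the unknown document (the simplified profile intersection). The code samples are invented.

```python
from collections import Counter

def scap_profile(source: str, n: int = 6, L: int = 1500):
    """SCAP-style author profile: the L most frequent n-grams of the
    author's known source code; L caps the profile length."""
    grams = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {g for g, _ in grams.most_common(L)}

def similarity(profile_a: set, profile_b: set) -> int:
    # Simplified Profile Intersection: size of the common n-gram set.
    return len(profile_a & profile_b)

known = {"alice": "for(int i=0;i<n;i++){sum+=a[i];}",
         "bob":   "while (queue.hasNext()) { process(queue.next()); }"}
unknown = "for(int j=0;j<m;j++){total+=b[j];}"

profiles = {a: scap_profile(code) for a, code in known.items()}
test = scap_profile(unknown)
print(max(profiles, key=lambda a: similarity(profiles[a], test)))  # "alice"
```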

Swati, K., Patankar, A.J..  2014.  Effective personalized mobile search using KNN. Data Science Engineering (ICDSE), 2014 International Conference on. :157-160.

Effective Personalized Mobile Search Using KNN implements an architecture to improve the effectiveness of personalization over large data sets while maintaining the security of the data. User preferences are gathered through clickthrough data, which is sent to the server in encrypted form and classified into content concepts and location concepts. To improve classification and minimize processing time, the KNN (K-Nearest Neighbor) algorithm is used. The identified preferences (location and content) are merged to provide effective preferences to the user, and the system uses four entropies to balance the weight between content concepts and location concepts. The system implements a client-server architecture: the role of the client is to collect user queries and maintain them in files for future reference, while the server is responsible for tasks like training, re-ranking of the obtained search results, and concept extraction. User preference privacy is ensured through privacy parameters as well as encryption techniques. Experiments were carried out on an Android-based mobile device. The results show that the system gives significantly improved results over the previous algorithm for large data sets while maintaining security.
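
A toy sketch of the KNN classification step described above: map clickthrough phrases to TF-IDF vectors and classify them into content vs. location concepts with a K-Nearest-Neighbor model. The training phrases are invented stand-ins for real clickthrough data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented clickthrough phrases labeled with their concept type.
phrases = ["italian restaurant", "cheap pizza", "new york", "near seattle",
           "sushi menu", "downtown boston"]
concepts = ["content", "content", "location", "location",
            "content", "location"]

vec = TfidfVectorizer()
knn = KNeighborsClassifier(n_neighbors=3).fit(vec.fit_transform(phrases),
                                              concepts)
# Classify a new phrase by its nearest labeled neighbors.
print(knn.predict(vec.transform(["pizza near boston"])))
```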

Naini, R., Moulin, P..  2014.  Fingerprint information maximization for content identification. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :3809-3813.

This paper presents a novel design of content fingerprints based on maximization of the mutual information across the distortion channel. We use the information bottleneck method to optimize the filters and quantizers that generate these fingerprints. A greedy optimization scheme is used to select filters from a dictionary and allocate fingerprint bits. We test the performance of this method for audio fingerprinting and show substantial improvements over existing learning based fingerprints.
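
A minimal sketch of the design criterion, with synthetic stand-ins for the filter outputs: pick the quantizer threshold that maximizes the empirical mutual information between the fingerprint bit computed from the clean signal and the bit computed from its distorted version.

```python
import numpy as np

def mutual_information(bx, by):
    """Empirical mutual information (in bits) between two binary sequences."""
    bx, by = np.asarray(bx, dtype=int), np.asarray(by, dtype=int)
    joint = np.zeros((2, 2))
    np.add.at(joint, (bx, by), 1)                 # 2x2 joint histogram
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
clean = rng.normal(size=5000)                     # stand-in filter outputs
distorted = clean + 0.5 * rng.normal(size=5000)   # distortion channel

# Sweep candidate quantizer thresholds and keep the MI-maximizing one.
best = max(np.linspace(-1.5, 1.5, 31),
           key=lambda t: mutual_information(clean > t, distorted > t))
print("best threshold:", round(best, 2))          # near 0 for symmetric data
```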