Bibliography
Caching query results is an effective technique for Web search engines. A state-of-the-art approach, the Static-Dynamic Cache (SDC), is widely used in practice. The replacement policy is the key factor in the performance of a cache system, and many policies, such as LIRS, ARC, CLOCK, SKLRU, and RANDOM, have been studied across different research areas. In this paper, we discuss replacement policies for the static-dynamic cache and conduct experiments on real, large-scale query logs from two well-known commercial Web search engine companies. The experimental results show that the ARC replacement policy works well with the static-dynamic cache, especially for large-scale query result caches.
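A minimal sketch of the static-dynamic split, assuming a simple LRU policy for the dynamic segment (the paper pairs SDC with ARC and other policies; LRU just keeps the example short, and the class and method names are hypothetical):

```python
from collections import OrderedDict

class StaticDynamicCache:
    """Static segment: the historically most frequent queries, filled offline
    and never evicted. Dynamic segment: evicts via a pluggable policy (LRU here)."""

    def __init__(self, static_entries, dynamic_capacity):
        self.static = dict(static_entries)     # read-only, built from past logs
        self.dynamic = OrderedDict()           # insertion order doubles as LRU queue
        self.capacity = dynamic_capacity

    def get(self, query):
        if query in self.static:
            return self.static[query]
        if query in self.dynamic:
            self.dynamic.move_to_end(query)    # mark as most recently used
            return self.dynamic[query]
        return None                            # cache miss

    def put(self, query, results):
        if query in self.static:
            return                             # static part is never overwritten
        self.dynamic[query] = results
        self.dynamic.move_to_end(query)
        if len(self.dynamic) > self.capacity:
            self.dynamic.popitem(last=False)   # evict least recently used entry
```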
Spammers use automated content spinning techniques to evade plagiarism detection by search engines. Text spinners help spammers evade plagiarism detectors by automatically restructuring sentences and replacing words or phrases with their synonyms. Prior work on spun content detection relies on knowledge of the dictionary used by the text spinning software. In this work, we propose an approach to detect spun content and its seed without needing the text spinner's dictionary. Our key idea is that text spinners introduce stylometric artifacts that can be leveraged to detect spun documents. We implement and evaluate our proposed approach on a corpus of spun documents generated using a popular text spinning software. The results show that our approach can not only accurately detect whether a document is spun but also identify its source (or seed) document, all without needing the dictionary used by the text spinner.
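A minimal sketch of the underlying idea, using a toy stylometric profile (function-word rates, mean word length, type-token ratio); the paper's actual feature set and classifier are not specified here, and `likely_seed` is a hypothetical helper:

```python
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that", "it", "for"]

def stylometric_features(text):
    """Toy stylometric profile; spinners tend to perturb these unevenly."""
    words = text.lower().split()
    counts = Counter(words)
    n = max(len(words), 1)
    feats = [counts[w] / n for w in FUNCTION_WORDS]   # function-word rates
    feats.append(sum(len(w) for w in words) / n)      # mean word length
    feats.append(len(counts) / n)                     # type-token ratio
    return feats

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def likely_seed(spun_doc, candidates):
    """Return the candidate whose stylometric profile is closest to the spun
    document; a real detector would add a trained classifier on top."""
    probe = stylometric_features(spun_doc)
    return max(candidates, key=lambda d: cosine(probe, stylometric_features(d)))
```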
The emergence of new network applications, such as network intrusion detection systems and packet-level accounting, requires packet classification to report all matched rules instead of only the best matched rule. Although several schemes have been proposed recently to address the multimatch packet classification problem, most of them require either huge memory or expensive ternary content addressable memory (TCAM) to store the intermediate data structure, or they suffer from steep performance degradation under certain types of classifiers. In this paper, we decompose the operation of multimatch packet classification from a complicated multidimensional search into several single-dimensional searches, and present an asynchronous pipeline architecture based on a signature tree structure to combine the intermediate results returned from the single-dimensional searches. By spreading edges of the signature tree across multiple hash tables at different stages, the pipeline can achieve a high throughput via interstage parallel access to hash tables. To further exploit intrastage parallelism, two edge-grouping algorithms are designed to evenly divide the edges associated with each stage into multiple work-conserving hash tables. To avoid collisions in hash table lookups, a hybrid perfect hash table construction scheme is proposed. Extensive simulation using realistic classifiers and traffic traces shows that the proposed pipeline architecture outperforms the HyperCuts and B2PC schemes in classification speed by at least one order of magnitude, while having a similar storage requirement. In particular, with different types of classifiers of 4K rules, the proposed pipeline architecture achieves a throughput between 26.8 and 93.1 Gb/s using perfect hash tables.
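A minimal software sketch of the decomposition, assuming hypothetical `RangeTable`, `edges`, and `leaves` structures; the paper's asynchronous hardware pipeline, edge-grouping, and perfect hashing are not modeled:

```python
class RangeTable:
    """Single-dimensional search: maps one field value to matching labels."""
    def __init__(self, ranges):
        self.ranges = ranges                       # list of (lo, hi, label)

    def lookup(self, value):
        return {lbl for lo, hi, lbl in self.ranges if lo <= value <= hi}

def classify(field_tables, edges, leaves, packet):
    """edges[d] is a hash set of valid partial signatures of length d+1
    (the signature-tree edges at stage d); leaves maps full signatures to
    rule IDs. Returns ALL matched rules, not just the best one."""
    label_sets = [t.lookup(v) for t, v in zip(field_tables, packet)]
    partial = {()}                                 # root of the signature tree
    for d, labels in enumerate(label_sets):
        partial = {sig + (lbl,) for sig in partial for lbl in labels
                   if sig + (lbl,) in edges[d]}    # prune via hash lookup
    return {leaves[sig] for sig in partial if sig in leaves}

# Two fields, two rules: R1 = (0-10, 0-5), R2 = (5-20, 3-9).
f0 = RangeTable([(0, 10, "a"), (5, 20, "b")])
f1 = RangeTable([(0, 5, "x"), (3, 9, "y")])
edges = [{("a",), ("b",)}, {("a", "x"), ("b", "y")}]
leaves = {("a", "x"): "R1", ("b", "y"): "R2"}
print(classify([f0, f1], edges, leaves, (7, 4)))   # {'R1', 'R2'}
```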
The huge amount of user log data collected by search engine providers creates new opportunities to understand user loyalty and defection behavior at an unprecedented scale. However, it also poses a great challenge to analyze this behavior and glean insights from such complex, large-scale data. In this paper, we introduce LoyalTracker, a visual analytics system that tracks user loyalty and switching behavior across multiple search engines from vast amounts of user log data. We propose a new interactive visualization technique (the flow view) based on a flow metaphor, which conveys a visual summary of the loyalty dynamics of thousands of users over time. Two other visualization techniques, a density map and a word cloud, are integrated to let analysts gain further insights into the patterns identified by the flow view. Case studies and interviews with domain experts demonstrate the usefulness of our technique in understanding user loyalty and switching behavior in search engines.
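The kind of aggregation a flow view consumes can be sketched as follows (a toy transform, not LoyalTracker's implementation): each user is assigned a dominant engine per time window, and window-to-window transitions become the widths of the flow bands.

```python
from collections import Counter, defaultdict

def loyalty_flows(log, window_days=7):
    """Given (user, engine, day) query records, assign each user a dominant
    engine per window and count window-to-window transitions; these counts
    are the band widths a flow (Sankey-style) view would render."""
    usage = defaultdict(Counter)                   # (user, window) -> engine counts
    for user, engine, day in log:
        usage[(user, day // window_days)][engine] += 1

    dominant = {(u, w): c.most_common(1)[0][0] for (u, w), c in usage.items()}
    flows = Counter()
    for (user, w), engine in dominant.items():
        nxt = dominant.get((user, w + 1))
        if nxt is not None:
            flows[(w, engine, nxt)] += 1           # users moving engine -> nxt
    return flows

# Example: one user defects from EngineA to EngineB after the first week.
log = [("u1", "EngineA", 1), ("u1", "EngineA", 3), ("u1", "EngineB", 8)]
print(loyalty_flows(log))                          # {(0, 'EngineA', 'EngineB'): 1}
```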
Cross-Site Scripting (XSS) is a common attack technique that lets attackers inject code into a web page's output, which is then delivered to the visitor's browser, where the injected code executes automatically and steals sensitive information. To protect users from XSS attacks, many client-side solutions have been implemented; the most widely used are filters that sanitize malicious input. However, many of these filters do not protect against newer, sophisticated attacks such as multiple points of injection and injection into scripts. This paper proposes and implements an approach based on encoding unfiltered reflections for detecting web applications that are vulnerable to the above-mentioned sophisticated attacks. Results show that the proposed approach achieves a higher and more accurate exploit detection rate. In addition, an implementation that blocks the execution of malicious scripts has been contributed to XSS-Me, an open-source Mozilla Firefox security extension that detects reflected XSS vulnerabilities; this could be an effective solution if it were integrated into the browser itself rather than enforced as an extension.
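A minimal sketch of reflection testing in this spirit (a hypothetical probe harness, not the paper's tool): send a unique, HTML-significant marker in one parameter and check whether the response reflects it verbatim or properly encoded.

```python
import html
import urllib.parse
import urllib.request
import uuid

def reflects_unencoded(url, param):
    """Return True if the page reflects the probe without HTML encoding,
    i.e. it is a reflected-XSS candidate for this parameter."""
    marker = f'"><x{uuid.uuid4().hex[:8]}>'        # unique probe with HTML metacharacters
    query = urllib.parse.urlencode({param: marker})
    with urllib.request.urlopen(f"{url}?{query}") as resp:
        body = resp.read().decode(errors="replace")
    if marker in body:
        return True                                # reflected verbatim: vulnerable
    if html.escape(marker) in body:
        return False                               # reflection is properly encoded
    return False                                   # not reflected at all
```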
Effective Personalized Mobile Search Using KNN implements an architecture that improves the effectiveness of personalization over large data sets while maintaining data security. User preferences are gathered through clickthrough data, which is sent to the server in encrypted form and classified into content concepts and location concepts. To improve classification and minimize processing time, the KNN (K-Nearest Neighbors) algorithm is used. The identified location and content preferences are merged to provide effective preferences to the user. The system uses four entropies to balance the weights between content concepts and location concepts, and it implements a client-server architecture: the client collects user queries and maintains them in files for future reference, while the server carries out training, reranking of the obtained search results, and concept extraction. User preference privacy is ensured through privacy parameters and through encryption techniques. Experiments were carried out on an Android-based mobile device; the results show that the system significantly improves on the previous algorithm for large data sets while maintaining security.
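A minimal sketch of the KNN step, classifying an extracted concept as a content or location concept; the feature vectors, training data, and helper names here are illustrative, not the paper's:

```python
import math
from collections import Counter

def knn_classify(query_vec, labeled_vecs, k=3):
    """Plain k-nearest-neighbors over concept feature vectors.
    labeled_vecs is a list of (vector, label) pairs with label
    'content' or 'location'."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(labeled_vecs, key=lambda p: dist(query_vec, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Example: classify a concept extracted from clickthrough data.
training = [([0.9, 0.1], "location"), ([0.8, 0.2], "location"),
            ([0.1, 0.9], "content"), ([0.2, 0.8], "content")]
print(knn_classify([0.85, 0.15], training))        # -> 'location'
```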
With the rapid increase in cloud services collecting and using user data to offer personalized experiences, ensuring that these services comply with their privacy policies has become a business imperative for building user trust. However, most compliance efforts in industry today rely on manual review processes and audits designed to safeguard user data, and are therefore resource intensive and lack coverage. In this paper, we present our experience building and operating a system to automate privacy policy compliance checking in Bing. Central to the design of the system are (a) Legalease, a language that allows specification of privacy policies that impose restrictions on how user data is handled, and (b) Grok, a data inventory for Map-Reduce-like big data systems that tracks how user data flows among programs. Grok maps code-level schema elements to data types in Legalease, in essence annotating existing programs with information flow types with minimal human input. Compliance checking is thus reduced to information flow analysis of big data systems. The system, bootstrapped by a small team, checks compliance daily of millions of lines of ever-changing source code written by several thousand developers.
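A minimal sketch of compliance checking as information-flow typing, in the spirit of Legalease/Grok but not their actual syntax or engine: columns carry data-type labels, clauses restrict (data type, purpose) pairs, and a job complies only if every flow it induces is allowed.

```python
# Hypothetical clause format: later clauses override earlier ones,
# loosely mimicking nested allow/deny exceptions in a privacy policy.
DENY, ALLOW = "deny", "allow"

policy = [
    (DENY,  {"IPAddress"}, {"Advertising"}),       # e.g. "no IP address for ads"
    (ALLOW, {"IPAddress"}, {"AbuseDetection"}),    # carve-out for abuse detection
]

# Data inventory: code-level schema elements annotated with data types.
column_types = {"clicks.ip": "IPAddress", "clicks.query": "SearchQuery"}

def flow_allowed(data_type, purpose):
    """Default-allow here; a matching later clause overrides earlier ones."""
    verdict = True
    for kind, types, purposes in policy:
        if data_type in types and purpose in purposes:
            verdict = (kind == ALLOW)
    return verdict

def job_compliant(columns_read, purpose):
    """A job complies if every column it reads may flow to its purpose."""
    return all(flow_allowed(column_types[c], purpose) for c in columns_read)

print(job_compliant(["clicks.ip"], "Advertising"))     # False
print(job_compliant(["clicks.ip"], "AbuseDetection"))  # True
```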