Biblio
Caching query results is an efficient technique for Web search engines. A state-of-the-art approach named Static-Dynamic Cache (SDC) is widely used in practice. The replacement policy is a key factor in the performance of a cache system, and policies such as LIRS, ARC, CLOCK, SKLRU, and RANDOM have been widely studied across different research areas. In this paper, we discuss replacement policies for the static-dynamic cache and conduct experiments on real large-scale query logs from two major commercial Web search engine companies. The experimental results show that the ARC replacement policy works well with the static-dynamic cache, especially for a large-scale query results cache.
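A minimal sketch of the static-dynamic split described above, assuming a static segment frozen from historical query frequencies and a dynamic segment managed by a replacement policy (plain LRU here for brevity; the paper argues for ARC). The class and names are illustrative, not the SDC implementation itself.

```python
from collections import Counter, OrderedDict

class StaticDynamicCache:
    """Toy static-dynamic query-result cache: a frozen static segment
    holding historically popular queries plus an LRU-managed dynamic segment."""

    def __init__(self, training_log, static_size, dynamic_size):
        top = Counter(training_log).most_common(static_size)
        self.static = {q: f"results({q})" for q, _ in top}  # never evicted
        self.dynamic = OrderedDict()                        # LRU-managed
        self.dynamic_size = dynamic_size

    def get(self, query):
        if query in self.static:                 # static hit
            return self.static[query]
        if query in self.dynamic:                # dynamic hit: refresh recency
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        result = f"results({query})"             # miss: evaluate the query
        self.dynamic[query] = result
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)     # evict least recently used
        return result

# usage: seed the static segment from a past log, then serve a new log
cache = StaticDynamicCache(["a", "a", "b", "c", "a"], static_size=1, dynamic_size=2)
for q in ["a", "b", "d", "b", "e", "a"]:
    cache.get(q)
```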
Wikipedia is one of the most popular information platforms on the Internet. The user access pattern to Wikipedia pages depends on their relevance in the current worldwide social discourse. We use publicly available statistics about the top-1000 most popular pages on each day to estimate the efficiency of caches supporting the platform. While the data volumes are moderate, the main goal of Wikipedia caches is to reduce access times for page views and edits. We study the impact of the most popular pages on the achievable cache hit rate in comparison to Zipf request distributions, and we include daily dynamics in popularity.
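As a worked illustration of the Zipf comparison mentioned above, the sketch below computes the ideal hit rate of a cache that pins the C most popular of N pages when requests follow a Zipf distribution with exponent s. The parameter values are illustrative, not measurements from the Wikipedia traces.

```python
def zipf_hit_rate(cache_size, catalog_size, s=0.8):
    """Expected hit rate when the cache pins the `cache_size` most popular
    of `catalog_size` pages and requests follow Zipf with exponent `s`."""
    weights = [1.0 / (rank ** s) for rank in range(1, catalog_size + 1)]
    return sum(weights[:cache_size]) / sum(weights)

# e.g. how much traffic the daily top pages alone could absorb (toy numbers)
for c in (1000, 10_000, 100_000):
    print(c, round(zipf_hit_rate(c, catalog_size=1_000_000), 3))
```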
Knowledge Graphs (KGs) are becoming core components of most artificial intelligence applications. Linked Data, as a method of publishing KGs, allows applications to traverse within, and even out of, the graph thanks to globally dereferenceable identifiers denoting entities, in the form of IRIs. However, as we show in this work, after analyzing several popular datasets (namely DBpedia, LOD Cache, and Web Data Commons JSON-LD data), many entities are represented using literal strings where IRIs should be used, diminishing the advantages of using Linked Data. To remedy this, we propose an approach for identifying such strings and replacing them with their corresponding entity IRIs. The proposed approach identifies relations between entities based on both ontological axioms and data profiling information, and converts strings to entity IRIs based on the types of entities linked by each relation. Our approach showed 98% recall and 76% precision in identifying such strings, and 97% precision in converting them to their corresponding IRIs in the considered KG. Further, we analyzed how the connectivity of the KG increases when new relevant links are added to the entities as a result of our method. Our experiments on a subset of the Spanish DBpedia data show that our method could add 25% more links to the KG and improve the overall connectivity by 17%.
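A hypothetical sketch of the string-to-IRI replacement step: a relation's expected range (obtained from axioms or profiling) restricts which entities a literal may denote, and a label index resolves the string to an IRI. The dictionaries and function name are invented for illustration and are not the paper's implementation.

```python
# Hypothetical sketch: replace literal objects with entity IRIs when the
# relation's expected range identifies which entities the string may denote.
RANGE_OF = {               # relation -> expected entity type (axioms/profiling)
    "dbo:birthPlace": "dbo:Place",
    "dbo:author": "dbo:Person",
}
LABEL_INDEX = {            # (type, normalized label) -> entity IRI
    ("dbo:Place", "madrid"): "dbr:Madrid",
    ("dbo:Person", "miguel de cervantes"): "dbr:Miguel_de_Cervantes",
}

def to_iri(subject, relation, obj):
    """Return (subject, relation, IRI) if the literal can be resolved,
    otherwise keep the original triple."""
    expected_type = RANGE_OF.get(relation)
    if expected_type and isinstance(obj, str):
        iri = LABEL_INDEX.get((expected_type, obj.strip().lower()))
        if iri:
            return (subject, relation, iri)
    return (subject, relation, obj)

print(to_iri("dbr:Don_Quixote", "dbo:author", "Miguel de Cervantes"))
```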
A disk-based search system distributes a large index across multiple disks on one or more machines, where documents are typically assigned to disks at random in order to achieve load balancing. However, random distribution degrades clustering, which is required for efficient index compression. Using the GOV2 dataset, we demonstrate the effect of various ordering techniques on index compression, and then quantify the effect of various document distribution approaches on compression and load balancing. We explore runtime performance by simulating a disk-based search system for a scaled-out 10xGOV2 index over ten disks using two standard approaches, document and term distribution, as well as a hybrid approach: small-term distribution. We find that small-term distribution has the best performance, especially in the presence of list caching, and argue that this rarely discussed distribution approach can improve disk-based search performance for many real-world installations.
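A hedged sketch contrasting the distribution approaches named above, under one plausible reading of "small-term distribution" (short posting lists kept whole on a single disk, long lists split by document). The threshold and hashing scheme are illustrative assumptions, not the paper's definitions.

```python
def document_distribution(term, postings, num_disks):
    """Split a posting list across all disks by document id."""
    shards = {d: [] for d in range(num_disks)}
    for doc_id in postings:
        shards[doc_id % num_disks].append(doc_id)
    return shards

def term_distribution(term, postings, num_disks):
    """Keep the whole posting list on a single disk chosen by term hash."""
    return {hash(term) % num_disks: list(postings)}

def small_term_distribution(term, postings, num_disks, threshold=3):
    """Hybrid: short lists are term-distributed, long lists document-distributed."""
    if len(postings) <= threshold:
        return term_distribution(term, postings, num_disks)
    return document_distribution(term, postings, num_disks)

print(small_term_distribution("rare", [7, 42], num_disks=4))
print(small_term_distribution("the", list(range(10)), num_disks=4))
```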
Modern CDNs cache and deliver a highly diverse set of traffic classes, including web pages, images, videos, and software downloads. It is economically advantageous for a CDN to cache and deliver all traffic classes using a shared distributed cache server infrastructure. However, such sharing of cache resources across multiple traffic classes poses significant cache provisioning challenges that are the focus of this paper. Managing a vast shared caching infrastructure requires careful modeling of user request sequences for each traffic class. Using extensive traces from Akamai's CDN, we show how each traffic class has drastically different object access patterns, object size distributions, and cache resource requirements. We introduce the notion of a footprint descriptor, a succinct representation of the cache requirements of a request sequence. Leveraging novel connections to Fourier analysis, we develop a footprint descriptor calculus that allows us to predict the cache requirements when different traffic classes are added, subtracted, and scaled, to within a prediction error of 2.5%. We integrated our footprint calculus into the cache provisioning operations of the production CDN and show how it is used to solve key challenges in cache sizing, traffic mixing, and cache partitioning.
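The footprint descriptor itself is not specified in the abstract; as a simplified stand-in for "a succinct representation of the cache requirements of a request sequence", the sketch below computes a stack-distance histogram of a trace and reads off the LRU hit rate as a function of cache size. It is not Akamai's calculus, only an illustration of the kind of per-class summary involved.

```python
from collections import Counter

def stack_distances(requests):
    """Yield, for each request, the number of distinct objects accessed since
    the previous request for the same object (None for first accesses)."""
    last_seen = {}
    for i, obj in enumerate(requests):
        if obj in last_seen:
            yield len(set(requests[last_seen[obj] + 1 : i]))
        else:
            yield None
        last_seen[obj] = i

def hit_rate_curve(requests, max_size):
    """LRU hit rate for every cache size up to max_size (unit-size objects)."""
    histogram = Counter(d for d in stack_distances(requests) if d is not None)
    total = len(requests)
    curve, hits = [], 0
    for size in range(1, max_size + 1):
        hits += histogram.get(size - 1, 0)   # distance < size => hit
        curve.append(hits / total)
    return curve

print(hit_rate_curve(list("abcabcabd"), max_size=4))
```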
The Internet is becoming more and more content-oriented. CDNs (Content Distribution Networks) have become a popular architecture compatible with the current Internet, and revolutionary new paradigms such as ICN (Information Centric Networking) have been studied. One of the main components in both CDNs and ICN is in-network caching. Despite the extensive use of caches in current and future Internet architectures, analysis of the performance of general cache networks is still quite limited due to the complex interplay among various components and the resulting analytical intractability. For mathematical tractability, we consider 'static' cache policies and study the asymptotic delay performance of those policies in cache networks, focusing in particular on the impact of heterogeneous content popularities and nodes' geographical 'importance' on caching policies. Furthermore, our simulation results suggest that they perform quite similarly to popular 'dynamic' policies such as LFU (Least-Frequently-Used) and LRU (Least-Recently-Used). We believe that our theoretical findings provide useful engineering implications, such as when and how various factors affect caching performance.
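One illustrative reading of a 'static' policy, in which cache contents are fixed ahead of time from popularity estimates and node importance and never change at runtime (in contrast to LRU/LFU). The allocation rule below is an assumption for illustration, not one of the policies analyzed in the paper.

```python
def static_placement(popularity, node_importance, total_budget):
    """Allocate cache slots to nodes in proportion to their importance,
    then pin the globally most popular contents at each node; the cached
    contents never change at runtime (hence 'static')."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    total_importance = sum(node_importance.values())
    placement = {}
    for node, importance in node_importance.items():
        slots = round(total_budget * importance / total_importance)
        placement[node] = ranked[:slots]
    return placement

print(static_placement({"c1": 0.5, "c2": 0.3, "c3": 0.15, "c4": 0.05},
                       {"edge": 1.0, "core": 2.0}, total_budget=3))
```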
Many Linked Open Data applications require fresh copies of RDF data in their local repositories. Since RDF documents constantly change and those changes are not automatically propagated to the LOD applications, it is important to regularly visit the RDF documents to refresh the local copies and keep them up-to-date. For this purpose, crawling strategies determine which RDF documents should be preferentially fetched. Traditional crawling strategies rely only on how an RDF document has been modified in the past. In contrast, we predict at the triple level whether a change will occur in the future. We use the weekly snapshots of the DyLDO dataset as well as the monthly snapshots of the Wikidata dataset. First, we conduct an in-depth analysis of the life span of triples in RDF documents. Through this analysis, we identify which triples are stable and which are ephemeral. We introduce different features based on the triples and apply a simple but effective linear regression model. Second, we propose a novel crawling strategy based on the linear regression model. We consider two experimental setups in which we vary the amount of available bandwidth and iteratively observe the quality of the local copies over time. The results demonstrate that the novel crawling strategy outperforms the state of the art in both setups.
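A hedged sketch of the prediction-driven strategy: fit a linear model from simple per-triple history features to an observed change indicator, then rank documents by their predicted amount of change and crawl the top ones under a bandwidth budget. The feature choices and toy data are assumptions, not the paper's feature set.

```python
import numpy as np

def fit_change_model(features, changed):
    """Least-squares linear model mapping triple features to a change indicator."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias column
    coeffs, *_ = np.linalg.lstsq(X, changed, rcond=None)
    return coeffs

def crawl_order(doc_triples, coeffs, budget):
    """Score each document by the summed predicted change of its triples
    and return the `budget` documents to fetch first."""
    def doc_score(triple_features):
        X = np.hstack([triple_features, np.ones((triple_features.shape[0], 1))])
        return float((X @ coeffs).sum())
    ranked = sorted(doc_triples, key=lambda d: doc_score(doc_triples[d]), reverse=True)
    return ranked[:budget]

# toy data: features = [age of triple in weeks, number of past changes]
train_X = np.array([[1, 0], [10, 0], [2, 3], [8, 1]], dtype=float)
train_y = np.array([0, 0, 1, 1], dtype=float)
model = fit_change_model(train_X, train_y)
docs = {"doc_a": np.array([[2, 2], [3, 1]], dtype=float),
        "doc_b": np.array([[12, 0]], dtype=float)}
print(crawl_order(docs, model, budget=1))
```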
Map-based services are becoming increasingly important in many applications. These services often need to show geospatial objects (e.g., cities and parks) in Web browsers, and being able to retrieve such objects efficiently is critical to achieving a low response time for user queries. In this demonstration we present a browser-based caching technique to store and load geospatial objects on a map in a Web page. The technique employs a hierarchical structure to store and index polygons, and performs intelligent prefetching and cache replacement by utilizing information about the user's recent browsing activity. We demonstrate the usage of the technique in an application called TwitterMap for visualizing more than 1 billion tweets in real time. We show its effectiveness by using different replacement policies. The technique is implemented as a general-purpose JavaScript library, making it suitable for other applications as well.
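A simplified stand-in for the browser-side cache, written in Python rather than the JavaScript library itself: map objects are keyed by (zoom, x, y) tiles, evicted with LRU, and adjacent tiles are prefetched on each access. The tile keying and prefetch rule are illustrative assumptions, not the library's hierarchical polygon index.

```python
from collections import OrderedDict

class TileCache:
    """Toy cache for map objects keyed by (zoom, x, y) tiles, with LRU
    replacement and neighbor prefetching."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch                  # callback that loads a tile remotely
        self.tiles = OrderedDict()

    def get(self, zoom, x, y, prefetch=True):
        key = (zoom, x, y)
        if key in self.tiles:
            self.tiles.move_to_end(key)     # refresh recency on a hit
        else:
            self._put(key, self.fetch(*key))
        value = self.tiles[key]
        if prefetch:                        # guess the next pan: adjacent tiles
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nkey = (zoom, x + dx, y + dy)
                if nkey not in self.tiles:
                    self._put(nkey, self.fetch(*nkey))
        return value

    def _put(self, key, value):
        self.tiles[key] = value
        while len(self.tiles) > self.capacity:
            self.tiles.popitem(last=False)  # evict least recently used tile

cache = TileCache(capacity=16, fetch=lambda z, x, y: f"polygons@{z}/{x}/{y}")
print(cache.get(10, 512, 340))
```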
Most popular Web applications rely on persistent databases based on languages like SQL for declarative specification of data models and the operations that read and modify them. As applications scale up in user base, they often face challenges responding quickly enough to the high volume of requests. A common aid is caching of database results in the application's memory space, taking advantage of program-specific knowledge of which caching schemes are sound and useful; this knowledge is typically embodied in handwritten modifications that make the program less maintainable. These modifications also require nontrivial reasoning about the read-write dependencies across operations. In this paper, we present a compiler optimization that automatically adds sound SQL caching to Web applications coded in the Ur/Web domain-specific functional language, with no modifications required to source code. We use a custom cache implementation that supports concurrent operations without compromising the transactional semantics of the database abstraction. Through experiments with microbenchmarks and production Ur/Web applications, we show that our optimization in many cases enables an easy doubling or more of an application's throughput, requiring nothing more than passing an extra command-line flag to the compiler.
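A minimal sketch of the read-write dependency idea, without the compiler automation or the concurrency machinery: cached SELECT results are dropped whenever a write touches a table they read. The regex-based table extraction is a deliberate simplification for illustration, not the Ur/Web analysis.

```python
import re

class QueryCache:
    """Toy read/write-aware SQL result cache: cached SELECT results are
    invalidated whenever a write touches one of the tables they read."""

    def __init__(self, execute):
        self.execute = execute            # callback that runs SQL for real
        self.results = {}                 # sql text -> cached result
        self.readers = {}                 # table -> set of cached sql texts

    def query(self, sql):
        if sql not in self.results:
            self.results[sql] = self.execute(sql)
            for table in re.findall(r"FROM\s+(\w+)", sql, re.IGNORECASE):
                self.readers.setdefault(table, set()).add(sql)
        return self.results[sql]

    def update(self, sql):
        self.execute(sql)
        written = re.findall(r"(?:UPDATE|INSERT\s+INTO|DELETE\s+FROM)\s+(\w+)",
                             sql, re.IGNORECASE)
        for table in written:             # drop every cached read of that table
            for cached_sql in self.readers.pop(table, set()):
                self.results.pop(cached_sql, None)

db = QueryCache(execute=lambda sql: f"rows({sql})")
db.query("SELECT name FROM users")        # miss: runs the query, caches result
db.query("SELECT name FROM users")        # hit: served from the cache
db.update("UPDATE users SET name = 'x'")  # invalidates the cached read
```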
End-users in emerging markets experience poor web performance due to a combination of three factors: high server response time, limited edge bandwidth, and the complexity of web pages. The absence of cloud infrastructure in developing regions and the limited bandwidth experienced by edge nodes constrain the effectiveness of conventional caching solutions in these contexts. This paper describes the design, implementation, and deployment of xCache, a cloud-managed Internet caching architecture that aims to proactively profile popular web pages and maintain the liveness of popular content at software-defined edge caches to enhance the cache hit rate with minimal bandwidth overhead. xCache uses a Cloud Controller that continuously analyzes active cloud-managed web pages and derives an object-group representation of web pages based on the objects of a page. Using this object-group representation, xCache computes a bandwidth-aware utility measure to derive the most valuable configuration for each edge cache. Our preliminary real-world deployment across university campuses in three developing regions demonstrates its potential compared to conventional caching by improving cache hit rates by about 15%. Our evaluations of xCache have also shown that it can be applied in conjunction with other web optimization solutions such as Shandian, and can improve page load times by more than 50%.
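A hypothetical sketch of a bandwidth-aware utility: pages (treated as object groups) are ranked by expected cache-hit bytes saved per unit of refresh bandwidth and selected greedily under an edge bandwidth budget. The fields and the greedy rule are assumptions for illustration, not xCache's actual policy.

```python
def select_pages(pages, bandwidth_budget):
    """pages: list of dicts with requests/day, cacheable bytes of the page's
    object groups, and refresh bytes/day needed to keep them fresh."""
    def utility(p):
        # hit bytes saved per day divided by the bandwidth spent keeping it live
        return p["requests_per_day"] * p["cacheable_bytes"] / p["refresh_bytes_per_day"]
    chosen, used = [], 0.0
    for page in sorted(pages, key=utility, reverse=True):
        if used + page["refresh_bytes_per_day"] <= bandwidth_budget:
            chosen.append(page["url"])
            used += page["refresh_bytes_per_day"]
    return chosen

pages = [
    {"url": "news.example/home", "requests_per_day": 5000,
     "cacheable_bytes": 900_000, "refresh_bytes_per_day": 40_000_000},
    {"url": "uni.example/portal", "requests_per_day": 2000,
     "cacheable_bytes": 600_000, "refresh_bytes_per_day": 5_000_000},
]
print(select_pages(pages, bandwidth_budget=30_000_000))
```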