Biblio
In this paper we propose a mechanism of prediction of domestic human activity in a smart home context. We use those predictions to adapt the behavior of home appliances whose impact on the environment is delayed (for example the heating). The behaviors of appliances are built by a reinforcement learning mechanism. We compare the behavior built by the learning approach with both a merely reactive behavior and a state-remanent behavior.
Sensing platforms are becoming batteryless to enable the vision of the Internet of Things, where trillions of devices collect data, interact with each other, and interact with people. However, these batteryless sensing platforms—that rely purely on energy harvesting—are rarely able to maintain a sense of time after a power failure. This makes working with sensor data that is time sensitive especially difficult. We propose two novel, zero-power timekeepers that use remanence decay to measure the time elapsed between power failures. Our approaches compute the elapsed time from the amount of decay of a capacitive device, either on-chip Static Random-Access Memory (SRAM) or a dedicated capacitor. This enables hourglass-like timers that give intermittently powered sensing devices a persistent sense of time. Our evaluation shows that applications using either timekeeper can keep time accurately through power failures as long as 45s with low overhead.
Cloud computing provides a shared pool of resources for large-scale distributed applications. Recent trends such as fog computing and edge computing spread the workload of clouds closer towards the edge of the network and the users. Exploiting the edge resources efficiently requires managing the resources and directing user traffic to the correct edge servers. In this paper we propose to profile and group users according to their interest profiles. We consider edge caching as an example and through our evaluation show the potential benefits of directing users from the same group to the same caches. We investigate a range of workloads and parameters and the same conclusions apply. Our results highlight the importance of grouping users and demonstrate the potential benefits of this approach.
Vehicular users are expected to consume large amounts of data, for both entertainment and navigation purposes. This will put a strain on cellular networks, which will be able to cope with such a load only if proper caching is in place; this in turn begs the question of which caching architecture is the best-suited to deal with vehicular content consumption. In this paper, we leverage a large-scale, crowd-sourced trace to (i) characterize the vehicular traffic demand, in terms of overall magnitude and content breakup; (ii) assess how different caching approaches perform against such a real-world load; (iii) study the effect of recommendation systems and local content items. We define a price-of-fog metric, expressing the additional caching capacity to deploy when moving from traditional, centralized caching architectures to a "fog computing" approach, where caches are closer to the network edge. We find that for location-specific items, such as the ones that vehicular users are most likely to request, such a price almost disappears. Vehicular networks thus make a strong case for the adoption of mobile-edge caching, as we are able to reap the benefit thereof – including a reduction in the distance travelled by data, within the core network – with little or none of the associated disadvantages.
Centralized spectrum management is one of the key dynamic spectrum access (DSA) mechanisms proposed to govern the spectrum sharing between government incumbent users (IUs) and commercial secondary users (SUs). In the current centralized DSA designs, the operation data of both government IUs and commercial SUs needs to be shared with a central server. However, the operation data of government IUs is often classified information and the SU operation data may also be commercial secret. The current system design dissatisfies the privacy requirement of both IUs and SUs since the central server is not necessarily trust-worthy for holding such sensitive operation data. To address the privacy issue, this paper presents a privacy-preserving centralized DSA system (P2-SAS), which realizes the complex spectrum allocation process of DSA through efficient secure multi-party computation. In P2-SAS, none of the IU or SU operation data would be exposed to any snooping party, including the central server itself. We formally prove the correctness and privacy-preserving property of P2-SAS and evaluate its scalability and practicality using experiments based on real-world data. Experiment results show that P2-SAS can respond an SU's spectrum request in 6.96 seconds with communication overhead of less than 4 MB.
Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages.
It is expected that DRAM memory will be augmented, and perhaps eventually replaced, by one of several up-and-coming memory technologies. These are all non-volatile, in that they retain their contents without power. This allows primary memory to be used as a fast disk replacement. It also enables more aggressive programming models that directly leverage persistence of primary memory. However, it is challenging to maintain consistency of memory in such an environment. There is no consensus on the right programming model for doing so, and subtle differences can have large, and sometimes surprising, effects on the implementation and its performance. The existing literature describes multiple programming systems that provide point solutions to the selective persistence for user data structures. Real progress in this area requires a choice of programming model, which we cannot reasonably make without a real understanding of the design space. Point solutions are insufficient. We systematically explore what we consider to be the most promising part of the space, precisely defining semantics and identifying implementation costs. This allows us to be much more explicit and precise about semantic and implementation trade-offs that were usually glossed over in prior work. It also exposes some promising new design alternatives.
Accelerated Processing Unit (APU) is a heterogeneous multicore processor that contains general-purpose CPU cores and a GPU in a single chip. It also supports Heterogeneous System Architecture (HSA) that provides coherent physically-shared memory between the CPU and the GPU. In this paper, we present the design and implementation of a high-performance IPsec gateway using a low-cost commodity embedded APU. The HSA supported by the APUs eliminates the data copy overhead between the CPU and the GPU, which is unavoidable in the previous discrete GPU approaches. The gateway is implemented in OpenCL to exploit the GPU and uses zero-copy packet I/O APIs in DPDK. The IPsec gateway handles the real-world network traffic where each packet has a different workload. The proposed packet scheduling algorithm significantly improves GPU utilization for such traffic. It works not only for APUs but also for discrete GPUs. With three CPU cores and one GPU in the APU, the IPsec gateway achieves a throughput of 10.36 Gbps with an average latency of 2.79 ms to perform AES-CBC+HMAC-SHA1 for incoming packets of 1024 bytes.
With the prevalence of personal Bluetooth devices, potential breach of user privacy has been an increasing concern. To date, sniffing Bluetooth traffic has been widely considered an extremely intricate task due to Bluetooth's indiscoverable mode, vendor-dependent adaptive hopping behavior, and the interference in the open 2.4 GHz band. In this paper, we present BlueEar -a practical Bluetooth traffic sniffer. BlueEar features a novel dual-radio architecture where two Bluetooth-compliant radios coordinate with each other on learning the hopping sequence of indiscoverable Bluetooth networks, predicting adaptive hopping behavior, and mitigating the impacts of RF interference. Experiment results show that BlueEar can maintain a packet capture rate higher than 90% consistently in real-world environments, where the target Bluetooth network exhibits diverse hopping behaviors in the presence of dynamic interference from coexisting Wi-Fi devices. In addition, we discuss the privacy implications of the BlueEar system, and present a practical countermeasure that effectively reduces the packet capture rate of the sniffer to 20%. The proposed countermeasure can be easily implemented on the Bluetooth master device while requiring no modification to slave devices like keyboards and headsets.
There exist a multitude of execution models available today for a developer to target. The choices vary from general purpose processors to fixed-function hardware accelerators with a large number of variations in-between. There is a growing demand to assess the potential benefits of porting or rewriting an application to a target architecture in order to fully exploit the benefits of performance and/or energy efficiency offered by such targets. However, as a first step of this process, it is necessary to determine whether the application has characteristics suitable for acceleration. In this paper, we present Peruse, a tool to characterize the features of loops in an application and to help the programmer understand the amenability of loops for acceleration. We consider a diverse set of features ranging from loop characteristics (e.g., loop exit points) and operation mixes (e.g., control vs data operations) to wider code region characteristics (e.g., idempotency, vectorizability). Peruse is language, architecture, and input independent and uses the intermediate representation of compilers to do the characterization. Using static analyses makes Peruse scalable and enables analysis of large applications to identify and extract interesting loops suitable for acceleration. We show analysis results for unmodified applications from the SPEC CPU benchmark suite, Polybench, and HPC workloads. For an end-user it is more desirable to get an estimate of the potential speedup due to acceleration. We use the workload characterization results of Peruse as features and develop a machine-learning based model to predict the potential speedup of a loop when off-loaded to a fixed function hardware accelerator. We use the model to predict the speedup of loops selected by Peruse and achieve an accuracy of 79%.
There is a growing interest in modeling and predicting the behavior of financial systems and supply chains. In this paper, we focus on the the analysis of the resMBS supply chain; it is associated with the US residential mortgage backed securities and subprime mortgages that were critical in the 2008 US financial crisis. We develop models based on financial institutions (FI), and their participation described by their roles (Role) on financial contracts (FC). Our models are based on an intuitive assumption that FIs will form communities within an FC, and FIs within a community are more likely to collaborate with other FIs in that community, and play the same role, in another FC. Inspired by the Latent Dirichlet Allocation (LDA) and topic models, we develop two probabilistic financial community models. In FI-Comm, each FC (document) is a mix of topics where a topic is a distribution over FIs (words). In Role-FI-Comm, each topic is a distribution over Role-FI pairs (words). Experimental results over 5000+ financial prospecti demonstrate the effectiveness of our models.
The Internet of Things (IoT) offers new opportunities, but alongside those come many challenges for security and privacy. Most IoT devices offer no choice to users of where data is published, which data is made available and what identities are used for both devices and users. The aim of this work is to explore new middleware models and techniques that can provide users with more choice as well as enhance privacy and security. This paper outlines a new model and a prototype of a middleware system that implements this model.
Vehicle trajectories are one of the most important data in location-based services. The quality of trajectories directly affects the services. However, in the real applications, trajectory data are not always sampled densely. In this paper, we study the problem of recovering the entire route between two distant consecutive locations in a trajectory. Most existing works solve the problem without using those informative historical data or solve it in an empirical way. We claim that a data-driven and probabilistic approach is actually more suitable as long as data sparsity can be well handled. We propose a novel route recovery system in a fully probabilistic way which incorporates both temporal and spatial dynamics and addresses all the data sparsity problem introduced by the probabilistic method. It outperforms the existing works with a high accuracy (over 80%) and shows a strong robustness even when the length of routes to be recovered is very long (about 30 road segments) or the data is very sparse.
One essential requirement for supporting analytics for Big Medical Data systems is the provision of a suitable level of traceability to data or processes ('Items') in large volumes of data. Systems should be designed from the outset to support usage of such Items across the spectrum of medical use and over time in order to promote traceability, to simplify maintenance and to assist analytics. The philosophy proposed in this paper is to design medical data systems using a 'description-driven' approach in which meta-data and the description of medical items are saved alongside the data, simplifying item re-use over time and thereby enabling the traceability of these items over time and their use in analytics. Details are given of a big data system in neuroimaging to demonstrate aspects of provenance data capture, collaborative analysis and longitudinal information traceability. Evidence is presented that the description-driven approach leads to simplicity of design and ease of maintenance following the adoption of a unified approach to Item management.
Defenders of enterprise networks have a critical need to quickly identify the root causes of malware and data leakage. Increasingly, USB storage devices are the media of choice for data exfiltration, malware propagation, and even cyber-warfare. We observe that a critical aspect of explaining and preventing such attacks is understanding the provenance of data (i.e., the lineage of data from its creation to current state) on USB devices as a means of ensuring their safe usage. Unfortunately, provenance tracking is not offered by even sophisticated modern devices. This work presents ProvUSB, an architecture for fine-grained provenance collection and tracking on smart USB devices. ProvUSB maintains data provenance by recording reads and writes at the block layer and reliably identifying hosts editing those blocks through attestation over the USB channel. Our evaluation finds that ProvUSB imposes a one-time 850 ms overhead during USB enumeration, but approaches nearly-bare-metal runtime performance (90% of throughput) on larger files during normal execution, and less than 0.1% storage overhead for provenance in real-world workloads. ProvUSB thus provides essential new techniques in the defense of computer systems and USB storage devices.
Provenance workflows capture movement and transformation of data in complex environments, such as document management in large organizations, content generation and sharing in in social media, scientific computations, etc. Sharing and processing of provenance workflows brings numerous benefits, e.g., improving productivity in an organization, understanding social media interaction patterns, etc. However, directly sharing provenance may also disclose sensitive information such as confidential business practices, or private details about participants in a social network. We propose an algorithm that privately extracts sequential association rules from provenance workflow datasets. Finding such rules has numerous practical applications, such as capacity planning or identifying hot-spots in provenance graphs. Our approach provides good accuracy and strong privacy, by leveraging on the exponential mechanism of differential privacy. We propose an heuristic that identifies promising candidate rules and makes judicious use of the privacy budget. Experimental results show that the our approach is fast and accurate, and clearly outperforms the state-of-the-art. We also identify influential factors in improving accuracy, which helps in choosing promising directions for future improvement.
Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.
In this paper, we propose ParaRegex, a novel approach for fast parallel regular expression matching. ParaRegex is a framework that implements data-parallel regular expression matching for deterministic finite automaton based methods. Experimental evaluation shows that ParaRegex produces a fast matching engine with speeds of up to 6 times compared to sequential implementations on a commodity 8-thread workstation.
We propose a method for classifying Wi-Fi enabled mobile handheld devices (smartphones) and non-handheld devices (laptops) in a completely passive way, that is resorting neither to traffic probes on network edge devices nor to deep packet inspection techniques to read application layer information. Instead, classification is performed starting from probe requests Wi-Fi frames, which can be sniffed with inexpensive commercial hardware. We extract distinctive features from probe request frames (how many probe requests are transmitted by each device, how frequently, etc.) and take a machine learning approach, training four different classifiers to recognize the two types of devices. We compare the performance of the different classifiers and identify a solution based on a Random Decision Forest that correctly classify devices 95% of the times. The classification method is then used as a pre-processing stage to analyze network traffic traces from the wireless network of a university building, with interesting considerations on the way different types of devices uses the network (amount of data exchanged, duration of connections, etc.). The proposed methodology finds application in many scenarios related to Wi-Fi network management/optimization and Wi-Fi based services.
This paper introduces typy, a statically typed programming language embedded by reflection into Python. typy features a fragmentary semantics, i.e. it delegates semantic control over each term, drawn from Python's fixed concrete and abstract syntax, to some contextually relevant user-defined semantic fragment. The delegated fragment programmatically 1) typechecks the term (following a bidirectional protocol); and 2) assigns dynamic meaning to the term by computing a translation to Python. We argue that this design is expressive with examples of fragments that express the static and dynamic semantics of 1) functional records; 2) labeled sums (with nested pattern matching a la ML); 3) a variation on JavaScript's prototypal object system; and 4) typed foreign interfaces to Python and OpenCL. These semantic structures are, or would need to be, defined primitively in conventionally structured languages. We further argue that this design is compositionally well-behaved. It avoids the expression problem and the problems of grammar composition because the syntax is fixed. Moreover, programs are semantically stable under fragment composition (i.e. defining a new fragment will not change the meaning of existing program components.)
Association and classification are two important tasks in data mining. Literature abounds with works that unify these two techniques. This paper presents a new algorithm called Particle Swarm Optimization trained Classification Association Rule Mining (PSOCARM) for associative classification that generates class association rules (CARs) from transactional database by formulating a combinatorial global optimization problem, without having to specify minimal support and confidence unlike other conventional associative classifiers. We devised a new rule pruning scheme in order to reduce the number of rules and increasing the generalization aspect of the classifier. We demonstrated its effectiveness for phishing email and phishing website detection. Our experimental results indicate the superiority of our proposed algorithm with respect to accuracy and the number of rules generated as compared to the state-of-the-art algorithms.
{Phishing is a social engineering tactic used to trick people into revealing personal information [Zielinska, Tembe, Hong, Ge, Murphy-Hill, & Mayhorn 2014]. As phishing emails continue to infiltrate users' mailboxes, what social engineering techniques are the phishers using to successfully persuade victims into releasing sensitive information? Cialdini's [2007] six principles of persuasion (authority, social proof, liking/similarity, commitment/consistency, scarcity, and reciprocation) have been linked to elements of phishing emails [Akbar 2014; Ferreira, & Lenzini 2015]; however, the findings have been conflicting. Authority and scarcity were found as the most common persuasion principles in 207 emails obtained from a Netherlands database [Akbar 2014], while liking/similarity was the most common principle in 52 personal emails available in Luxemborg and England [Ferreira et al. 2015]. The purpose of this study was to examine the persuasion principles present in emails available in the United States over a period of five years. Two reviewers assessed eight hundred eighty-seven phishing emails from Arizona State University, Brown University, and Cornell University for Cialdini's six principles of persuasion. Each email was evaluated using a questionnaire adapted from the Ferreira et al. [2015] study. There was an average agreement of 87% per item between the two raters. Spearman's Rho correlations were used to compare email characteristics over time. During the five year period under consideration (2010–2015), the persuasion principles of commitment/consistency and scarcity have increased over time, while the principles of reciprocation and social proof have decreased over time. Authority and liking/similarity revealed mixed results with certain characteristics increasing and others decreasing. The commitment/consistency principle could be seen in the increase of emails referring to elements outside the email to look more reliable, such as Google Docs or Adobe Reader (rs(850) = .12
Phishing is one of the most dangerous information security threats present in the world today, with losses toping 5.9 billion dollars in 2013. Evolving from the original concept of phishing, spear phishing also attempts to scam individuals online, however it uses personalized mail to yield a far higher success rate. This paper suggests an increased threat of spear phishing success due to the presence of social media. Assessing this new threat is important not only to the individuals, but also to companies whose employees may specifically be targeted through their social media accounts. The paper presents the design and implementation of an architecture to determine phishing susceptibility of a user through their social media accounts, and methods to reduce the threat. Preliminary testing shows that social media provides a publicly accessible resource to assess targeted individuals for phishing attacks through their accounts.
Privacy protection in Internet of Things (IoTs) has long been the topic of extensive research in the last decade. The perceptual layer of IoTs suffers the most significant privacy disclosing because of the limitation of hardware resources. Data encryption and anonymization are the most common methods to protect private information for the perceptual layer of IoTs. However, these efforts are ineffective to avoid privacy disclosure if the communication environment exists unknown wireless nodes which could be malicious devices. Therefore, in this paper we derive an innovative and passive method called Horizontal Hierarchy Slicing (HHS) method to detect the existence of unknown wireless devices which could result negative means to the privacy. PAM algorithm is used to cluster the HHS curves and analyze whether unknown wireless devices exist in the communicating environment. Link Quality Indicator data are utilized as the network parameters in this paper. The simulation results show their effectiveness in privacy protection.
The mainstream approach to protecting the privacy of mobile users in location-based services (LBSs) is to alter (e.g., perturb, hide, and so on) the users’ actual locations in order to reduce exposed sensitive information. In order to be effective, a location-privacy preserving mechanism must consider both the privacy and utility requirements of each user, as well as the user’s overall exposed locations (which contribute to the adversary’s background knowledge). In this article, we propose a methodology that enables the design of optimal user-centric location obfuscation mechanisms respecting each individual user’s service quality requirements, while maximizing the expected error that the optimal adversary incurs in reconstructing the user’s actual trace. A key advantage of a user-centric mechanism is that it does not depend on third-party proxies or anonymizers; thus, it can be directly integrated in the mobile devices that users employ to access LBSs. Our methodology is based on the mutual optimization of user/adversary objectives (maximizing location privacy versus minimizing localization error) formalized as a Stackelberg Bayesian game. This formalization makes our solution robust against any location inference attack, that is, the adversary cannot decrease the user’s privacy by designing a better inference algorithm as long as the obfuscation mechanism is designed according to our privacy games. We develop two linear programs that solve the location privacy game and output the optimal obfuscation strategy and its corresponding optimal inference attack. These linear programs are used to design location privacy–preserving mechanisms that consider the correlation between past, current, and future locations of the user, thus can be tuned to protect different privacy objectives along the user’s location trace. We illustrate the efficacy of the optimal location privacy–preserving mechanisms obtained with our approach against real location traces, showing their performance in protecting users’ different location privacy objectives.