Biblio
In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to the wide variety of textual information on social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, products, and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schemas, or hand-crafted features, they can be quickly adapted to new domains, genres, and languages. We demonstrate on real datasets covering various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. bio-medical), and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.
With the increasing incentive of enterprises to ingest as much data as they can into what are commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, assessing data quality and cleaning large volumes of heterogeneous data sources become essential tasks in unveiling the value of big data. The prevalent use of unstructured and semi-structured data in large volumes makes current data cleaning tools (primarily designed for relational data) not directly applicable. We present CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., data represented as RDF triples). This demonstration shows how CLAMS is able to discover the constraints and the schemas they are defined on simultaneously. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data. CLAMS interacts with human experts both to validate the discovered constraints and to suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and evaluated on a real data set of 1.2 billion triples. It has been able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing significant value to the enterprise.
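As a minimal sketch of the kind of rule CLAMS targets, the snippet below mines an "at most one value per subject" constraint from toy RDF-style triples and flags the violating subjects. The predicates, data, and support threshold are hypothetical; this is not the CLAMS constraint language or its discovery algorithm.

```python
# Illustrative sketch only: mining a simple "at most one value per subject"
# constraint from RDF-style triples and flagging the violating triples.
# The predicates, data, and 0.6 support threshold are hypothetical; this is
# not the actual CLAMS constraint language or discovery algorithm.
from collections import defaultdict

triples = [
    ("emp:alice", "hasCountry", "Canada"),
    ("emp:alice", "hasCountry", "CA"),        # inconsistent duplicate value
    ("emp:bob",   "hasCountry", "Germany"),
    ("emp:carol", "hasCountry", "France"),
    ("emp:carol", "worksFor",   "org:acme"),
]

def candidate_single_valued(triples, support=0.6):
    """Predicates that are single-valued for at least `support` of their
    subjects -- candidates for a key-like integrity constraint."""
    values = defaultdict(set)                 # (subject, predicate) -> objects
    for s, p, o in triples:
        values[(s, p)].add(o)
    per_pred = defaultdict(list)
    for (s, p), objs in values.items():
        per_pred[p].append(len(objs) == 1)
    return [p for p, flags in per_pred.items()
            if sum(flags) / len(flags) >= support]

def violations(triples, predicate):
    """Subjects breaking the single-value constraint for `predicate`."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)
    return {s: objs for s, objs in values.items() if len(objs) > 1}

for p in candidate_single_valued(triples):
    print(p, violations(triples, p))   # hasCountry: {'emp:alice': {...}}
```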
In a software project as large and as rapidly evolving as the Linux kernel, automated testing systems are an integral component of the development process. Extensive build and regression tests can catch potential problems in changes before they appear in a stable release. Current systems, however, do not systematically incorporate the configuration system Kconfig. In this work, we present an approach to identify relationships between configuration options. These relationships allow us to find source files which might be affected by a change to a configuration option and hence require retesting. Our findings show that the majority of configuration options affect only a few files, while very few options influence almost all files in the code base. We further observe that developers sometimes value usability over clean dependency modelling, leading to counterintuitive outliers in our results.
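The core mapping from configuration options to potentially affected files can be approximated by scanning the sources for CONFIG_* references, as in the simplified sketch below. The directory name and option are hypothetical, and the paper's actual analysis also accounts for Kconfig and Makefile dependency information rather than source grepping alone.

```python
# Simplified illustration: map Kconfig options to the source files that
# reference them, so a change to an option suggests files to retest.
# The directory layout and option names here are hypothetical; the paper's
# analysis also uses Kconfig and Makefile dependency information.
import os
import re
from collections import defaultdict

CONFIG_RE = re.compile(r"\bCONFIG_[A-Z0-9_]+\b")

def option_to_files(source_root):
    """Scan .c/.h files and record which CONFIG_ options each file mentions."""
    mapping = defaultdict(set)                # option -> set of files
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue
            for option in CONFIG_RE.findall(text):
                mapping[option].add(path)
    return mapping

if __name__ == "__main__":
    mapping = option_to_files("drivers/")     # hypothetical kernel subtree
    # Files that might need retesting if CONFIG_PM is toggled:
    print(sorted(mapping.get("CONFIG_PM", set()))[:10])
```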
In this paper, we focus on the definition of estimators to predict method calls in Android apps. The estimation models are based on information from requirements specification documents (e.g., number of actors, number of use cases, and number of classes in the conceptual model). We used a dataset containing information on 23 Android apps. After data cleaning, we applied linear regression to build estimation models on 21 data points. The results suggest that measures gathered from requirements specification documents can be considered good predictors for estimating the number of internal calls (i.e., methods invoking other methods present in the app) and external calls (i.e., invocations of APIs), as well as their sum.
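A sketch of the modelling step is shown below: an ordinary least-squares fit of method-call counts against requirements measures. The numbers are synthetic stand-ins for the 21 cleaned data points, and the coefficients are not the paper's estimates.

```python
# Sketch of the estimation step: ordinary least squares relating requirements
# measures (actors, use cases, conceptual classes) to method-call counts.
# The data below is synthetic and only stands in for the paper's 21 apps.
import numpy as np

rng = np.random.default_rng(0)
n = 21
actors   = rng.integers(1, 8, n)
usecases = rng.integers(3, 25, n)
classes  = rng.integers(5, 40, n)
# Hypothetical "true" relationship plus noise, just to have a target.
calls = 15 * usecases + 8 * classes + 30 * actors + rng.normal(0, 40, n)

X = np.column_stack([np.ones(n), actors, usecases, classes])
coef, *_ = np.linalg.lstsq(X, calls, rcond=None)
pred = X @ coef

ss_res = np.sum((calls - pred) ** 2)
ss_tot = np.sum((calls - calls.mean()) ** 2)
print("coefficients:", np.round(coef, 2))
print("R^2:", round(1 - ss_res / ss_tot, 3))
```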
Mutex locks have traditionally been the most common mechanism for protecting shared data structures in parallel programs. However, the robustness of such locks against process failures has not been studied thoroughly. Most (user-level) mutex algorithms are designed around the assumption that processes are reliable, meaning that a process may not fail while executing the lock acquisition and release code, or while inside the critical section. If such a failure does occur, then the liveness properties of a conventional mutex lock may cease to hold until the application or operating system intervenes by cleaning up the internal structure of the lock. For example, a process that is attempting to acquire an otherwise starvation-free mutex may be blocked forever waiting for a failed process to release the critical section. Adding to the difficulty, if the failed process recovers and attempts to acquire the same mutex again without appropriate cleanup, then the mutex may become corrupted to the point where it loses safety, notably the mutual exclusion property. We address this challenge by formalizing the problem of recoverable mutual exclusion, and proposing several solutions that vary both in their assumptions regarding hardware support for synchronization, and in their time complexity. Compared to known solutions, our algorithms are more robust as they do not restrict where or when a process may crash, and provide stricter guarantees in terms of time complexity, which we define in terms of remote memory references.
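The core idea can be pictured with a conceptual sketch: the lock records its holder's identity in memory that survives the crash, so a process that restarts can detect that it was inside the critical section, repair the protected data, and release. This is only an illustration of the problem setting, not one of the paper's algorithms; Python threads and an in-memory cell stand in for processes and persistent memory.

```python
# Conceptual sketch of recoverable mutual exclusion, NOT one of the paper's
# algorithms: the lock persists its holder's id so that a process which
# crashed inside the critical section can detect this on restart and repair.
import threading

class RecoverableLock:
    def __init__(self):
        self._cas_guard = threading.Lock()   # simulates an atomic CAS
        self.owner = None                    # would live in persistent memory

    def _cas_owner(self, expected, new):
        with self._cas_guard:
            if self.owner == expected:
                self.owner = new
                return True
            return False

    def acquire(self, pid):
        while not self._cas_owner(None, pid):
            pass                             # spin (no fairness guarantees)

    def release(self, pid):
        assert self.owner == pid
        self.owner = None

    def recover(self, pid):
        """Called by a process when it restarts after a crash."""
        if self.owner == pid:
            # We crashed while holding the lock: repair protected data here,
            # then release so other processes can make progress again.
            self.release(pid)

lock = RecoverableLock()
lock.acquire(pid=1)
# ... process 1 crashes inside its critical section ...
lock.recover(pid=1)                          # on restart: detects ownership
lock.acquire(pid=2)                          # others can now proceed
lock.release(pid=2)
```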
Despite the growing promotion of the “open data” movement, the collection, cleaning, management, interpretation, and dissemination of open data is laborious and cost intensive, particularly for non-profits with limited resources. In this paper, we describe how non-profit organizations (NPOs) use open data, building on prior literature that focuses on understanding challenges that NPOs face. Based on 15 interviews of staff from 10 NPOs, our results suggest that NPOs use data to develop narratives to build a case for support from grantors and other stakeholders. We then present empirical results based on the usage of a data portal we created, which suggests that technologies should be designed to not only make data accessible, but also to facilitate communication and support relationships between expert data analysts and NPOs.
Walrasian equilibrium prices have a remarkable property: they allow each buyer to purchase a bundle of goods that she finds the most desirable, while guaranteeing that the induced allocation over all buyers will globally maximize social welfare. However, this clean story has two caveats. First, the prices may induce indifferences; in fact, the minimal equilibrium prices necessarily induce indifferences. Accordingly, buyers may need to coordinate with one another to arrive at a socially optimal outcome; the prices alone are not sufficient to coordinate the market. Second, although natural procedures converge to Walrasian equilibrium prices on a fixed population, in practice buyers typically observe prices without participating in a price computation process. These prices cannot be perfect Walrasian equilibrium prices, but instead somehow reflect distributional information about the market. To better understand the performance of Walrasian prices when facing these two problems, we give two results. First, we propose a mild genericity condition on valuations under which the minimal Walrasian equilibrium prices induce allocations with low over-demand, no matter how the buyers break ties. In fact, under genericity the over-demand of any good can be bounded by 1, which is the best possible at the minimal prices. We demonstrate our results for unit-demand valuations and give an extension to matroid-based valuations (MBV), conjectured to be equivalent to gross substitute valuations (GS). Second, we use techniques from learning theory to argue that the over-demand and welfare induced by a price vector converge to their expectations uniformly over the class of all price vectors, with sample complexity linear and quadratic, respectively, in the number of goods in the market. These results make no assumption on the form of the valuation functions. Together, the two results imply that under a mild genericity condition, the exact Walrasian equilibrium prices computed in a market are guaranteed to induce both low over-demand and high welfare when used in a new market where agents are sampled independently from the same distribution, whenever the number of agents is larger than the number of commodities in the market.
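For reference, the objects the abstract refers to can be written out as follows. The notation is ours and the definitions are the standard ones, restated to match the abstract rather than quoted from the paper.

```latex
% Demand correspondence of buyer $i$ with valuation $v_i$ at prices $p$ over goods $[m]$:
\[
  D_i(p) \;=\; \arg\max_{S \subseteq [m]} \Big( v_i(S) - \sum_{j \in S} p_j \Big).
\]
% $(p, S_1, \dots, S_n)$ is a Walrasian equilibrium if the $S_i$ are disjoint,
% every good with $p_j > 0$ is allocated, and $S_i \in D_i(p)$ for all $i$.
% Over-demand of good $j$ when each buyer $i$ picks some $T_i \in D_i(p)$
% (ties broken arbitrarily), with unit supply of each good:
\[
  \mathrm{OD}_j(p) \;=\; \max\bigl( 0,\; |\{\, i : j \in T_i \,\}| - 1 \bigr).
\]
% The genericity result bounds $\mathrm{OD}_j(p^{\min}) \le 1$ for every good $j$
% at the minimal Walrasian prices $p^{\min}$, for any tie-breaking.
```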
With the increase in training offered in online environments over the last decade, one of the key elements of these training events is mediation, as an enabler of better results and learning achievements. In this context the role of mediation is key, especially when there is no teacher or tutor figure in contact with the participants. The aim of this research, which is derived from a doctoral program, is to analyze the practices of mediation in massive open online courses (MOOCs) in the area of energy sustainability and to propose a mediation model that considers open innovation as a key element. In this way it answers the research question regarding the relationship between the practices of teaching and learning technological mediation and MOOC courses. The research is the result of cooperation between government institutions in Mexico, such as CONACYT and SENER, and the Technological Institute of Monterrey, the latter offering 10 massive courses on the subject of energy sustainability, whose course name in Spanish is "Generación de energías limpias y energías convencionales" [Generation of clean energy and conventional energy]. Its participants constitute the research population, who will be consulted through various qualitative and quantitative instruments regarding mediation and its relationship with learning. The results of this study thus make it possible to answer the research question, in addition to generating proposals in the field of mediation that promote open innovation. It is expected that the results represent a contribution toward determining the relationship between mediation and learning outcomes in MOOC courses.
The execution of distributed applications is captured by the events generated by the individual components. However, understanding the behavior of these applications from their event logs can be a complex and error-prone task, compounded by the fact that applications continuously change, rendering any acquired knowledge obsolete. We describe our experiences applying a suite of process-aware analytic tools to a number of real-world scenarios, and distill our lessons learned. For example, we have seen that these tools are used iteratively, where insights gained at one stage inform the configuration decisions made at an earlier stage. We have also observed that data onboarding, where the raw data is cleaned and transformed, is the most critical stage in the pipeline and requires the most manual effort and domain knowledge. In particular, missing, inconsistent, and low-resolution event time stamps are recurring problems that require better solutions. The experiences and insights presented here will assist practitioners applying process analytic tools to real scenarios, and reveal to researchers some of the more pressing challenges in this space.
Today's infrastructure as a service (IaaS) cloud environments rely upon full trust in the provider to secure applications and data. Cloud providers do not offer the ability to create hardware-rooted cryptographic identities for IaaS cloud resources or sufficient information to verify the integrity of systems. Trusted computing protocols and hardware like the TPM have long promised a solution to this problem. However, these technologies have not seen broad adoption because of their complexity of implementation, low performance, and lack of compatibility with virtualized environments. In this paper we introduce keylime, a scalable trusted cloud key management system. keylime provides an end-to-end solution for both bootstrapping hardware rooted cryptographic identities for IaaS nodes and for system integrity monitoring of those nodes via periodic attestation. We support these functions in both bare-metal and virtualized IaaS environments using a virtual TPM. keylime provides a clean interface that allows higher level security services like disk encryption or configuration management to leverage trusted computing without being trusted computing aware. We show that our bootstrapping protocol can derive a key in less than two seconds, we can detect system integrity violations in as little as 110ms, and that keylime can scale to thousands of IaaS cloud nodes.
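The periodic attestation loop can be pictured roughly as below: a verifier sends a fresh nonce, the node returns a quote binding that nonce to its measurement registers, and the verifier compares the reported values against a known-good list. The quote format, key handling, PCR values, and function names are invented for illustration; they are not keylime's actual protocol or interfaces.

```python
# Rough illustration of periodic integrity attestation. The quote format and
# names here are invented; this is not keylime's protocol or API.
import hashlib
import hmac
import os

SHARED_KEY = os.urandom(32)                  # stand-in for a TPM-resident key
GOLDEN_PCRS = {0: "a" * 64, 7: "b" * 64}     # expected measurements (hex)

def node_quote(nonce, pcrs, key=SHARED_KEY):
    """Node side: bind current PCR values to the verifier's nonce."""
    msg = nonce + "".join(f"{i}:{v}" for i, v in sorted(pcrs.items()))
    return hmac.new(key, msg.encode(), hashlib.sha256).hexdigest()

def verify(nonce, reported_pcrs, quote, key=SHARED_KEY):
    """Verifier side: check quote authenticity/freshness and the PCR whitelist."""
    expected = node_quote(nonce, reported_pcrs, key)
    if not hmac.compare_digest(expected, quote):
        return False                         # quote forged or stale
    return reported_pcrs == GOLDEN_PCRS

def attestation_round(current_pcrs):
    nonce = os.urandom(16).hex()
    quote = node_quote(nonce, current_pcrs)
    return verify(nonce, current_pcrs, quote)

print(attestation_round(dict(GOLDEN_PCRS)))              # True: node healthy
print(attestation_round({0: "a" * 64, 7: "c" * 64}))     # False: PCR 7 changed
```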
Fully automated vehicles will require new functionalities for perception, navigation and decision making -- an Autonomous Driving Intelligence (ADI). We consider architectural cases for such functionalities and investigate how they integrate with legacy platforms. The cases range from a robot replacing the driver -- with entire reuse of existing vehicle platforms -- to a clean-slate design. Focusing on Heavy Commercial Vehicles (HCVs), we assess these cases from the perspectives of business, safety, dependability, verification, and realization. The original contributions of this paper are the classification of the architectural cases themselves and the analysis that follows. The analysis reveals that although full reuse of vehicle platforms is appealing, it will require explicitly dealing with the accidental complexity of the legacy platforms, including adding corresponding diagnostics and error handling to the ADI. The current fail-safe design of the platform will also tend to limit availability. Allowing changes to the platforms will enable more optimized designs and fail-operational behaviour, but will require higher initial development cost and specific emphasis on partitioning and control to limit the influence of safety requirements. For all cases, the design and verification of the ADI will pose a grand challenge and relate to the evolution of the regulatory framework, including safety standards.
As the number of mobile devices populating the Internet keeps growing at a tremendous pace, context-aware services have gained a lot of traction thanks to the wide set of potential use cases they can be applied to. Environmental sensing applications, emergency services, and location-aware messaging are just a few examples of applications that are expected to increase in popularity in the next few years. The MobilityFirst future Internet architecture, a clean-slate Internet architecture design, provides the necessary abstractions for creating and managing context-aware services. Starting from these abstractions, we design a context services framework based on three fundamental mechanisms: an easy way to specify context based on human-understandable techniques, i.e. the use of names; an architecture-supported management mechanism that allows both conveniently deploying the service and efficiently providing management capabilities; and a native delivery system that reduces the burden on network components and the overhead of deploying such applications. In this paper, we present an emergency alert system for vehicles assisting first responders that exploits users' location awareness to deliver quick and reliable alert messages to interested vehicles. By deploying a demo of the system on a nationwide testbed, we aim to provide a better understanding of the dynamics involved in our designed framework.
The Internet routing ecosystem is facing substantial scalability challenges on the data plane. "Clean slate" addressing schemes such as IPv6 introduce additional constraints on efficient forwarding table (FIB) implementations, from both lookup time and memory footprint perspectives, due to their significant classification width. In this work, we propose an abstraction layer able to represent IPv6 FIBs on existing IP and even MPLS infrastructure. The feasibility of the proposed representations is confirmed by an extensive simulation study on real IPv6 forwarding tables, including low-level experimental performance evaluation.
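The general idea behind such an abstraction layer can be sketched as follows: classify an IPv6 destination against the FIB once at the ingress, then forward on a short MPLS label inside the core. The prefixes, labels, and next hops below are invented, and the paper's actual representations and encodings are more involved than this one-table mapping.

```python
# Simplified illustration of mapping an IPv6 FIB onto label-switched
# infrastructure: longest-prefix match at the ingress, label switching inside.
# Prefixes, labels, and next hops are invented for this sketch.
import ipaddress

# IPv6 FIB entries and the labels assigned to their forwarding classes.
fib = {
    ipaddress.ip_network("2001:db8::/32"):      100,
    ipaddress.ip_network("2001:db8:abcd::/48"): 200,
    ipaddress.ip_network("2400:cb00::/32"):     300,
}
label_to_nexthop = {100: "core-1", 200: "core-2", 300: "core-3"}

def ingress_classify(dst):
    """Longest-prefix match over the IPv6 FIB, returning an MPLS label."""
    addr = ipaddress.ip_address(dst)
    best = max((net for net in fib if addr in net),
               key=lambda net: net.prefixlen, default=None)
    return fib[best] if best is not None else None

label = ingress_classify("2001:db8:abcd::17")
print(label, label_to_nexthop[label])   # 200 core-2: the core switches on
                                        # short labels, not 128-bit addresses
```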
As demand for wireless mobile connectivity continues to explode, cellular network infrastructure capacity requirements continue to grow. While 5G tries to address capacity requirements at the radio layer, the load on the cellular core network infrastructure (the Evolved Packet Core (EPC)) stresses the network infrastructure. Our work examines the architecture and protocols of current cellular infrastructures and the workload on the EPC. We study the challenges in dimensioning capacity and review design alternatives to support the significant scale-up desired, even for the near future. We break down the workload on the network infrastructure into its components: signaling event transactions, database or lookup transactions, and packet processing. We quantitatively show the control plane and data plane load on the various components of the EPC and estimate how future 5G cellular network workloads will scale. This analysis helps us understand the scalability challenges for future 5G EPC network components. Other efforts to scale the 5G cellular network take a system view where the control plane is separated from the data path and terminated on a centralized SDN controller, which configures the data path on a widely distributed switching infrastructure. Our analysis of the workload informs us of the feasibility of various design alternatives and motivates our efforts to develop a clean-slate approach, called CleanG.
The use of multi-terminal HVDC to integrate wind power coming from the North Sea opens the door to a new transmission system model, the DC-Independent System Operator (DC-ISO). The DC-ISO will face highly stressed and varying conditions that require new risk assessment tools to ensure security of supply. This paper proposes a novel risk-based static security assessment methodology named risk-based DC security assessment (RB-DCSA). It combines a probabilistic approach to include uncertainties and a fuzzy inference system to quantify the systemic and individual component risk associated with operational scenarios considering uncertainties. The proposed methodology is illustrated using a multi-terminal HVDC system where the variability of wind speed at the offshore wind farms is included.
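A toy Mamdani-style fuzzy inference step is sketched below, combining a contingency probability and a severity estimate into a single risk index. The membership functions, rule base, and defuzzification grid are invented for illustration; they are not the RB-DCSA rules.

```python
# Toy Mamdani-style fuzzy inference: combine a contingency probability and a
# severity estimate into a risk index. Membership functions and rules are
# invented for illustration, not taken from RB-DCSA.
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def risk_index(prob, severity):
    # Fuzzify inputs (low / high) on a 0..1 scale.
    p_low, p_high = tri(prob, -1, 0, 1),     tri(prob, 0, 1, 2)
    s_low, s_high = tri(severity, -1, 0, 1), tri(severity, 0, 1, 2)
    # Rule strengths (min as AND, max as OR).
    r_low  = min(p_low, s_low)                              # low risk
    r_med  = max(min(p_low, s_high), min(p_high, s_low))    # medium risk
    r_high = min(p_high, s_high)                            # high risk
    # Aggregate the clipped output sets and defuzzify by centroid.
    z = np.linspace(0, 1, 101)
    out = np.maximum.reduce([np.minimum(r_low,  tri(z, -0.5, 0.0, 0.5)),
                             np.minimum(r_med,  tri(z,  0.0, 0.5, 1.0)),
                             np.minimum(r_high, tri(z,  0.5, 1.0, 1.5))])
    return float(np.sum(z * out) / np.sum(out)) if out.sum() else 0.0

print(round(risk_index(prob=0.8, severity=0.9), 2))   # stressed scenario: larger index
print(round(risk_index(prob=0.1, severity=0.2), 2))   # benign scenario: smaller index
```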
In the Internet-of-Things (IoT), users might share part of their data with different IoT prosumers, which offer applications or services. Within this open environment, the existence of an adversary introduces security risks. These can be related, for instance, to the theft of user data, and they vary depending on the security controls that each IoT prosumer has put in place. To minimize such risks, users might seek an "optimal" set of prosumers. However, assuming the adversary has the same information as the users about the existing security measures, he can then deduce which prosumers will be preferred (e.g., those with the highest security levels) and attack them more intensively. This paper proposes a decision-support approach that minimizes security risks in the above scenario. We propose a non-cooperative, two-player game entitled Prosumers Selection Game (PSG). The Nash Equilibria of PSG determine subsets of prosumers that optimize users' payoffs. We refer to any game solution as the Nash Prosumers Selection (NPS), which is a vector of probabilities over subsets of prosumers. We show that when using NPS, a user faces the least expected damage. Additionally, we show that under NPS every prosumer, even the least secure one, is selected with some non-zero probability. We have also performed simulations to compare NPS against two different heuristic selection algorithms; NPS proves to be approximately 38% more effective in terms of security-risk mitigation.
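A small numerical illustration of the equilibrium idea follows: treat the interaction as a zero-sum matrix game between a user selecting a prosumer and an adversary choosing which prosumer to attack, and approximate the mixed equilibrium by fictitious play. The damage values are made up and the actual PSG is defined over subsets of prosumers, so this only conveys the flavour of the randomized selection.

```python
# Numerical illustration only: a zero-sum "select a prosumer" vs. "attack a
# prosumer" matrix game, with the mixed equilibrium approximated by
# fictitious play. Damage values are invented; PSG itself is richer.
import numpy as np

# damage[i, j]: expected damage when the user selects prosumer i and the
# adversary attacks prosumer j (only a direct hit causes damage, scaled by
# how weak prosumer i's security controls are).
weakness = np.array([0.2, 0.5, 0.9])          # index 2 is the least secure
damage = np.eye(3) * weakness

def fictitious_play(damage, rounds=20000):
    n_user, n_adv = damage.shape
    user_counts = np.ones(n_user)
    adv_counts = np.ones(n_adv)
    for _ in range(rounds):
        adv_mix = adv_counts / adv_counts.sum()
        user_counts[np.argmin(damage @ adv_mix)] += 1     # user minimizes
        user_mix = user_counts / user_counts.sum()
        adv_counts[np.argmax(user_mix @ damage)] += 1     # adversary maximizes
    return user_counts / user_counts.sum(), adv_counts / adv_counts.sum()

user_mix, adv_mix = fictitious_play(damage)
print("user selection probabilities:", np.round(user_mix, 3))
print("expected damage:", round(float(user_mix @ damage @ adv_mix), 4))
# Even the least secure prosumer is selected with non-zero probability,
# matching the qualitative property stated in the abstract.
```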
Nowadays, Online Social Networks (OSNs) are very popular and have become an integral part of our lives. People depend on Online Social Networks for various purposes. The activities of most users are normal, but a few users exhibit unusual and suspicious behavior. We term this suspicious and unusual behavior malicious behavior. Malicious behavior in Online Social Networks includes a wide range of unethical activities and actions performed by individuals or communities to manipulate the thought processes of OSN users in order to fulfill their vested interests. Such malicious behavior needs to be checked and its effects minimized. To minimize the effects of such malicious activities, we require a proper detection and containment strategy. Such a strategy will protect millions of users across OSNs from misinformation and security threats. In this paper, we discuss the different studies performed in the area of malicious behavior analysis and propose a framework for the detection of malicious behavior in OSNs.
With the advancement of technology, the world has not only become a better place to live in but has also lost some of the privacy and security of shared data. Information in any form is never safe from unauthorized access. In this paper, we propose an approach to protect data using visual cryptography. Text rendered in a sixteen-segment display format is broken into two shares that individually reveal no information about the original images. By this process we have obtained satisfactory results in statistical and structural tests.
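A minimal (2, 2) visual-secret-sharing sketch of the underlying idea is shown below: each secret pixel is expanded into identical or complementary subpixel pairs, so neither share alone carries information while stacking them reveals the secret. The sixteen-segment rendering from the paper is omitted, and the expansion scheme is the textbook construction rather than the paper's exact one.

```python
# Minimal (2,2) visual secret sharing on a binary image: each secret pixel is
# expanded into a pair of subpixels. White pixels get identical patterns in
# the two shares, black pixels get complementary ones, so each share alone is
# random while stacking (OR of black subpixels) reveals the secret.
import numpy as np

rng = np.random.default_rng(1)
patterns = np.array([[1, 0], [0, 1]])          # the two subpixel patterns

def make_shares(secret):
    """secret: 2-D array of 0 (white) / 1 (black). Returns two shares with
    horizontal pixel expansion 2."""
    h, w = secret.shape
    s1 = np.zeros((h, 2 * w), dtype=int)
    s2 = np.zeros((h, 2 * w), dtype=int)
    for y in range(h):
        for x in range(w):
            p = patterns[rng.integers(2)]
            s1[y, 2 * x:2 * x + 2] = p
            # same pattern for white, complementary for black
            s2[y, 2 * x:2 * x + 2] = p if secret[y, x] == 0 else 1 - p
    return s1, s2

secret = np.array([[1, 0, 1],
                   [0, 1, 0]])
share1, share2 = make_shares(secret)
stacked = share1 | share2                       # physical stacking = OR
# Black secret pixels become fully black pairs; white ones stay half black.
print((stacked.reshape(2, 3, 2).sum(axis=2) == 2).astype(int))
```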
Passive radar, also known as Green Radar, exploits available commercial communication signals and is useful for target detection and tracking. Recent communications standards frequently employ wideband Orthogonal Frequency Division Multiplexing (OFDM) waveforms for broadcasting. This paper focuses on recent developments in target detection algorithms in the OFDM passive radar framework, where the channel estimates are derived with the matched filter concept using knowledge of the transmitted signals. The MUSIC algorithm, modified to solve this two-dimensional delay-Doppler detection problem, is first reviewed. As the target detection problem can be represented with sparse signals, this paper employs compressive sensing and compares it with the detection capability of the 2-D MUSIC algorithm. It is found that the previously proposed single-time-sample compressive sensing cannot significantly reduce the leakage from the direct signal component. Furthermore, this paper proposes a compressive sensing method utilizing multiple time samples, namely l1-SVD, for the detection of multiple targets. In the comparison between MUSIC and compressive sensing, the results show that l1-SVD can decrease the direct signal leakage, but its computational resource requirements remain a major issue. This paper also presents the detection performance of these two algorithms for closely spaced targets.
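The matched-filter baseline that the detection methods build on can be illustrated with a toy delay-Doppler map: correlate the surveillance channel against delayed and Doppler-shifted copies of the reference signal. The waveform, delay, and Doppler values below are invented; this is not the paper's OFDM processing chain, its modified MUSIC, or l1-SVD.

```python
# Toy delay-Doppler map via the cross-ambiguity (matched filter) idea that
# passive-radar detection builds on. Parameters are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)
N, fs = 4096, 1.0e6                       # samples, sample rate (Hz)
ref = np.exp(1j * 2 * np.pi * rng.random(N))      # reference waveform
true_delay, true_dopp = 37, 1500.0        # samples, Hz
t = np.arange(N) / fs
surv = 0.5 * np.roll(ref, true_delay) * np.exp(1j * 2 * np.pi * true_dopp * t)
surv += 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

delays = np.arange(0, 64)
dopplers = np.arange(-3000, 3001, 250.0)
amb = np.zeros((len(delays), len(dopplers)))
for i, d in enumerate(delays):
    shifted = np.roll(ref, d)
    for k, fd in enumerate(dopplers):
        amb[i, k] = abs(np.vdot(shifted * np.exp(1j * 2 * np.pi * fd * t), surv))

i, k = np.unravel_index(np.argmax(amb), amb.shape)
print("estimated delay (samples):", delays[i])    # 37
print("estimated Doppler (Hz):", dopplers[k])     # 1500
```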
With the increase in signal bandwidths, conventional analog-to-digital converters (ADCs), operating on the basis of the Shannon/Nyquist theorem, are forced to work at very high rates, leading to low dynamic range and high power consumption. This paper describes an analog-to-information converter based on compressive sensing techniques. The high sampling rate, which is the main drawback of conventional ADCs, is successfully reduced to a quarter of the conventional rate. The system also offers the advantage of low power dissipation.
Compressive sampling and sparse reconstruction theory is applied to a linearly frequency-modulated continuous-wave hybrid lidar/radar system. The goal is to show that high-resolution time-of-flight measurements to underwater targets can be obtained using far fewer samples than dictated by the Nyquist sampling theorem. Traditional mixing/down-conversion and matched filter signal processing methods are reviewed and compared to the compressive sampling and sparse reconstruction methods. Simulated evidence is provided to show the possible sampling rate reductions, and experiments are used to observe the effects that turbid underwater environments have on recovery. Results show that by using compressive sensing theory and sparse reconstruction, it is possible to achieve significant sample rate reduction while maintaining centimeter range resolution.
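The sample-rate-reduction argument can be sketched numerically: a de-chirped FMCW return is sparse in frequency, so far fewer random time samples than Nyquist suffice, recovered here with a small orthogonal matching pursuit. The frequencies, sample counts, and solver are illustrative choices, not the paper's system or reconstruction algorithm.

```python
# Sketch of compressive sampling for a de-chirped FMCW return: the beat signal
# is sparse in frequency, so random sub-Nyquist samples plus a greedy sparse
# solver (OMP) recover the target beat frequencies. Parameters are invented.
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 512, 96, 2                      # Nyquist length, measurements, sparsity
n = np.arange(N)
beat_bins = [43, 157]                     # two targets -> two beat frequencies
x = sum(np.cos(2 * np.pi * b * n / N) for b in beat_bins)

keep = np.sort(rng.choice(N, M, replace=False))   # random sub-Nyquist samples
y = x[keep]
# Sensing matrix: cosine dictionary restricted to the kept sample times.
A = np.cos(2 * np.pi * np.outer(keep, np.arange(N)) / N)

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick k atoms, least-squares fit."""
    residual, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    return sorted(support), coef

support, _coef = omp(A, y, K)
print("recovered beat-frequency bins:", support)   # [43, 157]
print("sample-rate reduction: x%.1f" % (N / M))
```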
The Internet is facing many challenges that cannot be solved easily through ad hoc patches. To address these challenges, many research programs and projects have been initiated and many solutions are being proposed. However, before we can arrive at a new architecture that motivates Internet service providers (ISPs) to deploy and evolve it, we need to address two issues: 1) understand the current status better by appropriately evaluating the existing Internet; and 2) find out how various incentives and strategies will affect the deployment of the new architecture. For the first issue, we define a series of quantitative metrics that can potentially unify results from several measurement projects using different approaches and can be an intrinsic part of a future Internet architecture (FIA) for monitoring and evaluation. Using these metrics, we systematically evaluate the current interdomain routing system and reveal many "autonomous-system-level" observations and key lessons for new Internet architectures. In particular, the evaluation results reveal the imbalance underlying the interdomain routing system and how the deployment of FIAs can benefit from these findings. With these findings, for the second issue, appropriate deployment strategies for future architecture changes can be formed with balanced incentives for both customers and ISPs. The results can be used to shape the short- and long-term goals for new architectures that are simple evolutions of the current Internet (so-called dirty-slate architectures) and, to some extent, for clean-slate architectures.
Although computational systems are looking towards post CMOS devices in the pursuit of lower power, the expected inherent unreliability of such devices makes it difficult to design robust systems without additional power overheads for guaranteeing robustness. As such, algorithmic structures with inherent ability to tolerate computational errors are of significant interest. We propose to cast applications as stochastic algorithms based on Markov chains (MCs) as such algorithms are both sufficiently general and tolerant to transition errors. We show with four example applications—Boolean satisfiability, sorting, low-density parity-check decoding and clustering—how applications can be cast as MC algorithms. Using algorithmic fault injection techniques, we demonstrate the robustness of these implementations to transition errors with high error rates. Based on these results, we make a case for using MCs as an algorithmic template for future robust low-power systems.
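The fault-tolerance argument can be illustrated with a Markov-chain algorithm such as the classical random-walk 2-SAT solver: each step flips a variable from an unsatisfied clause, and injected erroneous transitions merely slow convergence rather than break correctness. The instance, fault model, and parameters below are invented for illustration; the paper studies its own SAT, sorting, LDPC-decoding, and clustering formulations with algorithmic fault injection.

```python
# Illustration of the "Markov chain as error-tolerant algorithm" idea using a
# random-walk 2-SAT solver with injected transition faults. The instance and
# fault model are invented; they are not the paper's benchmarks.
import random

random.seed(3)
n = 12
# Random satisfiable 2-SAT instance: clauses are built to agree with a planted
# assignment on their first literal, so a solution is guaranteed to exist.
planted = [random.choice([True, False]) for _ in range(n)]
clauses = []
for _ in range(4 * n):
    i, j = random.sample(range(n), 2)
    clauses.append(((i, planted[i]), (j, random.choice([True, False]))))

def unsatisfied(assign):
    return [c for c in clauses
            if assign[c[0][0]] != c[0][1] and assign[c[1][0]] != c[1][1]]

def random_walk_sat(fault_rate=0.0, max_steps=100000):
    assign = [random.choice([True, False]) for _ in range(n)]
    for step in range(max_steps):
        unsat = unsatisfied(assign)
        if not unsat:
            return step                       # reached a satisfying state
        if random.random() < fault_rate:
            var = random.randrange(n)         # faulty (erroneous) transition
        else:
            var = random.choice(random.choice(unsat))[0]  # intended transition
        assign[var] = not assign[var]
    return max_steps

print("steps, fault-free: ", random_walk_sat(0.0))
print("steps, 10% faults: ", random_walk_sat(0.10))   # typically still converges
```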
Workarounds to computer access in healthcare are sufficiently common that they often go unnoticed. Clinicians focus on patient care, not cybersecurity. We argue and demonstrate that understanding workarounds to healthcare workers' computer access requires not only analyses of computer rules, but also interviews and observations with clinicians. In addition, we illustrate the value of shadowing clinicians and conducting focus groups to understand their motivations and tradeoffs for circumvention. Ethnographic investigation of the medical workplace emerges as a critical research method because, in the inevitable conflict between even well-intentioned people and the machines, it is the people who are the more creative, flexible, and motivated. We conducted interviews and observations with hundreds of medical workers and with 19 cybersecurity experts, CIOs, CMIOs, a CTO, and IT workers to obtain their perceptions of computer security. We also shadowed clinicians as they worked. We present dozens of ways workers ingeniously circumvent security rules. The clinicians we studied were not "black hat" hackers, but professionals seeking to accomplish their work despite the security technologies and regulations.