Biblio
Users’ online behaviors, such as ratings and examinations of items, are recognized as one of the most valuable sources of information for learning users’ preferences in order to make personalized recommendations. However, most previous work focuses on modeling only one type of user behavior, such as numerical ratings or browsing records, which are referred to as explicit feedback and implicit feedback, respectively. In this article, we study a Semisupervised Collaborative Recommendation (SSCR) problem with labeled feedback (explicit feedback) and unlabeled feedback (implicit feedback), in analogy to the well-known Semisupervised Learning (SSL) setting with labeled and unlabeled instances. SSCR poses two fundamental challenges: the heterogeneity of the two types of user feedback and the uncertainty of the unlabeled feedback. In response, we design a novel Self-Transfer Learning (sTL) algorithm that iteratively identifies and integrates likely positive unlabeled feedback, inspired by the general forward/backward process in machine learning. The merit of sTL is its ability to learn users’ preferences from heterogeneous behaviors in a joint and selective manner. We conduct extensive empirical studies of sTL and several very competitive baselines on three large datasets. The experimental results show that our sTL is significantly better than the state-of-the-art methods.
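The abstract does not spell out the sTL update rules; as a rough, generic illustration of the "iteratively identify and integrate likely positive unlabeled feedback" loop, a self-training sketch over a toy matrix-factorization model might look like the following (the model, thresholds, and selection rule are assumptions, not the paper's actual algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 80, 8

# Toy data: explicit ratings (labeled) and examination records (unlabeled).
labeled = [(int(u), int(i), 1.0) for u, i in zip(rng.integers(0, n_users, 200),
                                                 rng.integers(0, n_items, 200))]
unlabeled = {(int(u), int(i)) for u, i in zip(rng.integers(0, n_users, 400),
                                              rng.integers(0, n_items, 400))}

U = rng.normal(scale=0.1, size=(n_users, k))   # user factors
V = rng.normal(scale=0.1, size=(n_items, k))   # item factors

def fit(triples, epochs=5, lr=0.05, reg=0.01):
    """Plain matrix-factorization SGD on (user, item, rating) triples."""
    for _ in range(epochs):
        for u, i, r in triples:
            pu, qi = U[u].copy(), V[i].copy()
            err = r - pu @ qi
            U[u] += lr * (err * qi - reg * pu)
            V[i] += lr * (err * pu - reg * qi)

train = list(labeled)
for _ in range(3):                              # identify-and-integrate rounds
    fit(train)
    scores = {(u, i): U[u] @ V[i] for (u, i) in unlabeled}
    # Treat the highest-scoring unlabeled pairs as "likely positive" feedback.
    top = sorted(scores, key=scores.get, reverse=True)[:50]
    train += [(u, i, 1.0) for (u, i) in top]
    unlabeled -= set(top)
```

Each round scores the remaining unlabeled pairs with the current model and promotes the most confident ones into the training set, which is the generic pseudo-labeling pattern the abstract alludes to.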
A power-efficient Delta-Sigma (ΔΣ) analog-to-digital converter (ADC) with an embedded programmable-gain control function for various smart sensor applications is presented. It consists of a programmable-gain switched-capacitor ΔΣ modulator followed by a digital decimation filter for down-sampling. The programmable function is realized with programmable loop-filter coefficients implemented using a capacitor array. The coefficients are controlled while keeping the pole locations of the noise transfer function fixed, so the stability of the designed closed-loop transfer function is assured. The proposed gain control method helps the ADC optimize its performance as the input signal magnitude varies. The gain controllability requires negligible additional energy consumption and area. The power-efficient programmable-gain ADC (PGADC) is well-suited for sensor devices. The gain can be programmed from 0 to 18 dB in 6 dB steps. Measurements show that the PGADC achieves 15.2-bit resolution and 12.4-bit noise-free resolution with 99.9% reliability. The chip operates with a 3.3 V analog supply and a 1.8 V digital supply, while consuming only 97 μA analog current and 37 μA digital current. The analog core area is 0.064 mm² in a standard 0.18-μm CMOS process.
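Since the gain range is specified as 0 to 18 dB in 6 dB steps, the corresponding linear voltage-gain factors follow directly; the mapping onto binary-weighted capacitor ratios is a common switched-capacitor implementation choice and only an assumption here, not stated in the abstract:

```latex
\[
  G = 10^{G_{\mathrm{dB}}/20}, \qquad
  G_{\mathrm{dB}} \in \{0,\, 6,\, 12,\, 18\}\ \mathrm{dB}
  \;\Longrightarrow\;
  G \approx \{1,\, 2,\, 4,\, 8\},
\]
```

i.e., each 6 dB step roughly doubles the signal gain, which would map naturally onto a 1:2:4:8 capacitor array.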
Head portraits are popular in traditional painting. Automating portrait painting is challenging, as the human visual system is sensitive to the slightest irregularities in human faces. Applying generic painting techniques often deforms facial structures. On the other hand, existing portrait painting techniques are mainly designed for the graphite style and/or are based on image analogies, which require an example painting as well as its original unpainted version; this limits their domain of applicability. We present a new technique for transferring the painting from one head portrait onto another. Unlike previous work, our technique only requires the example painting and is not restricted to a specific style. We impose novel spatial constraints by locally transferring the color distributions of the example painting. This better captures the painting texture and maintains the integrity of facial structures. We generate a solution through Convolutional Neural Networks, and we present an extension to video, where motion is exploited to reduce temporal inconsistencies and the shower-door effect. Our approach transfers the painting style while maintaining the identity of the input photograph. In addition, it significantly reduces facial deformations over the state of the art.
Humans are often able to generalize knowledge learned from a single exemplar. In this paper, we present a novel integration of mental simulation and analogical generalization algorithms into a cognitive robotic architecture that enables a similarly rudimentary generalization capability in robots. Specifically, we show how a robot can generate variations of a given scenario, run those new scenarios in a physics simulator, and use the results to generate generalized action scripts via analogical mapping. The generalized action scripts then allow the robot to perform the originally learned activity in a wider range of scenarios with different types of objects, without the need for additional exploration or practice. In a proof-of-concept demonstration, we show how the robot can generalize from a previously learned pick-and-place action performed with a single arm on an object with a handle to a two-arm pick-and-place action on a cylindrical object with no handle.
Although computing students may enjoy it when their instructors teach using analogies, it is unknown to what extent these analogies are useful for their learning. This study examines the value of analogies when used to introduce three introductory computing topics. The value of these analogies may be evident during the teaching process itself (short term), in subsequent exams (long term), or in students' ability to apply their understanding to related non-technical areas (transfer). Comparing results between an experimental group (analogy) and a control group (no analogy), we find potential value for analogies in short-term learning. However, no solid evidence was found to support analogies as valuable for students in the long term or for knowledge transfer. Specific demographic groups were examined and promising preliminary findings are presented.
Modern information extraction pipelines are typically constructed by (1) loading textual data from a database into a special-purpose application, (2) applying a myriad of text-analytics functions to the text, which produce a structured relational table, and (3) storing this table in a database. Obviously, this approach can lead to laborious development processes, complex and tangled programs, and inefficient control flows. Towards solving these deficiencies, we embark on an effort to lay the foundations of a new generation of text-centric database management systems. Concretely, we extend the relational model by incorporating into it the theory of document spanners, which provides the means and methods for the model to engage in Information Extraction (IE) tasks. This extended model, called Spannerlog, provides a novel declarative method for defining and manipulating textual data, which makes possible the automation of the typical work method described above. In addition to formally defining Spannerlog and illustrating its usefulness for IE tasks, we also report on initial results concerning its expressive power.
In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, ranging from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exists, their seamless, flexible, and scalable combination into a web-scale end-to-end text analytics system is still a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health-related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us to crawl a corpus of ~1 TB of web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate the scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.
Text mining has emerged as an essential tool for revealing the hidden value in data. It is an emerging technique for companies around the world and is suitable both for large, enduring analyses and for discrete investigations, since there is a need to track disruptive technologies, explore internal knowledge bases, and review enormous data sets. Most of the information produced from conversation transcripts is in an unstructured format. These data contain ambiguity, redundancy, duplication, typographical errors, and more, and processing and analyzing such unstructured data is a difficult task. However, several text-mining techniques are available to extract keywords from these unstructured conversation transcripts. Keyword extraction is the process of identifying the most significant words in a context, which helps decisions to be made much faster. The main objective of the proposed work is to extract keywords from meeting transcripts using Swarm Intelligence (SI) techniques. Here, the Stochastic Diffusion Search (SDS) algorithm is used for keyword extraction and the Firefly algorithm for clustering. These techniques can be applied to an extensive range of optimization problems and produced better results when compared with the existing technique.
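The abstract does not describe how SDS is adapted to keyword extraction; as a minimal sketch of the algorithm's standard test/diffusion structure, one could score candidate words as follows, where the choice of hypothesis space (candidate words) and partial test (occurrence in a randomly drawn sentence) is purely an illustrative assumption:

```python
import random
from collections import Counter

# Toy "transcript" and candidate vocabulary.
sentences = [
    "swarm intelligence techniques for keyword extraction",
    "keyword extraction from meeting transcripts",
    "stochastic diffusion search is a swarm intelligence algorithm",
    "the firefly algorithm is used for clustering",
]
vocabulary = sorted({w for s in sentences for w in s.split()})

N_AGENTS, ITERATIONS = 50, 200
agents = [{"hyp": random.choice(vocabulary), "active": False} for _ in range(N_AGENTS)]

for _ in range(ITERATIONS):
    # Test phase: an agent is active if its word occurs in a randomly drawn sentence.
    for a in agents:
        a["active"] = a["hyp"] in random.choice(sentences).split()
    # Diffusion phase: inactive agents copy an active agent's hypothesis,
    # otherwise re-sample a new candidate word at random.
    for a in agents:
        if not a["active"]:
            other = random.choice(agents)
            a["hyp"] = other["hyp"] if other["active"] else random.choice(vocabulary)

# Words attracting the largest clusters of agents are the keyword candidates.
print(Counter(a["hyp"] for a in agents).most_common(5))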
Many books and papers describe how to do data science. While those texts are useful, it can also be important to reflect on anti-patterns, i.e., common classes of errors seen when large communities of researchers and commercial software engineers use, and misuse, data mining tools. This technical briefing will present those errors and show how to avoid them.
The web has been flooded with multimedia sources such as images, videos, animations, and audio, which has in turn led computer vision researchers to focus on extracting content from these sources. Scene text recognition basically involves two major steps, namely text localization and text recognition. This paper presents an end-to-end text recognition approach to extract the characters alone from a complex natural scene. Candidate objects are localized using Maximally Stable Extremal Regions (MSER); edges are identified using the Canny edge detection method; binary classification is then performed using a connected-component method, which separates text from non-text objects; and finally, stroke analysis is applied to analyse the style of each character, leading to character recognition. Experimental results were obtained by testing the approach on the ICDAR 2015 dataset, where text was recognized from most of the scene images with good precision.
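A minimal sketch of the localization stages named above, using OpenCV's MSER and Canny implementations; the connected-component filtering here uses simple geometric thresholds as a stand-in for the paper's text/non-text classifier, and the stroke-analysis recognizer is not reproduced (the abstract gives no details), so the file name and thresholds are assumptions:

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                     # hypothetical input image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()                          # stable-region candidate detector
regions, bboxes = mser.detectRegions(gray)

edges = cv2.Canny(gray, 50, 150)                  # edge map used to refine candidates

# Connected components over the edge map; simple geometric filters stand in
# for the paper's binary text/non-text classification.
n, labels, stats, _ = cv2.connectedComponentsWithStats(edges)
candidates = []
for i in range(1, n):                             # label 0 is the background
    x, y, w, h, area = stats[i]
    aspect = w / float(h)
    if 10 < area < 5000 and 0.1 < aspect < 10:    # assumed thresholds, illustration only
        candidates.append((x, y, w, h))
```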
By reflecting the degree of proximity or remoteness of documents, similarity measures play a key role in text analytics. Traditional measures, e.g., cosine similarity, assume that documents are represented in an orthogonal space formed by words as dimensions: words are considered independent from each other, and document similarity is computed based on lexical overlap. The same assumption is made in the bag-of-concepts representation of documents, where the space is formed by concepts. This paper proposes new semantic similarity measures that do not rely on the orthogonality assumption. Employing Wikipedia as an external resource, we introduce five similarity measures based on concept-concept relatedness. Experimental results on real text datasets reveal that eliminating the orthogonality assumption improves the quality of text clustering algorithms.
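The abstract does not give the formulas of the five measures; one common way to drop the orthogonality assumption is a soft-cosine-style similarity that plugs a concept-concept relatedness matrix (e.g., derived from Wikipedia) into the inner product. The sketch below illustrates that general idea only, with a made-up relatedness matrix:

```python
import numpy as np

def soft_cosine(d1: np.ndarray, d2: np.ndarray, S: np.ndarray) -> float:
    """d1, d2: bag-of-concepts vectors; S[i, j]: relatedness of concepts i and j."""
    num = d1 @ S @ d2
    den = np.sqrt(d1 @ S @ d1) * np.sqrt(d2 @ S @ d2)
    return float(num / den) if den else 0.0

# Toy example with three concepts; S = identity recovers plain cosine similarity.
S = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.1],
              [0.0, 0.1, 1.0]])
d1 = np.array([2.0, 0.0, 1.0])
d2 = np.array([0.0, 3.0, 0.0])
print(soft_cosine(d1, d2, S))   # non-zero despite the documents sharing no concept
```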
The identification of relevant information in large text databases is a challenging task, one reason being human beings' limitations in handling large volumes of data. A common solution for scavenging data from texts is the word cloud. A word cloud illustrates word usage in a document by resizing individual words proportionally to how frequently they appear. Even though word clouds are easy to understand, they are not particularly efficient, because they are static. In addition, the presented information lacks context, i.e., words are not explained and may lead to radically erroneous interpretations. To tackle these problems we developed VCloud, a tool that allows the user to interact with word clouds, thereby enabling informative and interactive data exploration. Furthermore, our tool also allows one to compare two data sets presented as word clouds. We evaluated VCloud using real data about the evolution of gastritis research through the years. The papers indexed by PubMed related to this medical context were selected for visualization and data analysis using VCloud. A domain expert explored these visualizations and was able to extract useful information from them. This illustrates how VCloud can be a valuable tool for visual text analytics.
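A tiny sketch of the sizing rule stated above (font size proportional to word frequency); the point-size range is an arbitrary assumption, and the interactive and comparative features of VCloud are not modeled here:

```python
from collections import Counter

def font_sizes(text: str, min_pt: int = 10, max_pt: int = 48) -> dict:
    """Map each word to a font size scaled linearly with its frequency."""
    counts = Counter(text.lower().split())
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1
    return {w: min_pt + (max_pt - min_pt) * (c - lo) / span for w, c in counts.items()}

print(font_sizes("gastritis research research pubmed pubmed pubmed"))
```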
Analyzing and gaining insights from a large amount of textual conversations can be quite challenging for a user, especially when the discussions become very long. During my doctoral research, I have focused on integrating Information Visualization (InfoVis) with Natural Language Processing (NLP) techniques to better support the user's task of exploring and analyzing conversations. For this purpose, I have designed a visual text analytics system that supports user exploration, starting from a possibly large set of conversations, then narrowing down to a subset of conversations, and eventually drilling down to a set of comments of one conversation. While our approach has so far been evaluated mainly through lab studies, in my ongoing and future work I plan to evaluate it via online longitudinal studies.
In this article, I present the questions that I seek to answer in my PhD research. I propose to analyze natural language text with the help of semantic annotations and to mine important events for navigating large text corpora. Semantic annotations such as named entities, geographic locations, and temporal expressions can help us mine events from the given corpora. These events thus provide us with useful means to discover the knowledge locked in them. I pose three problems that can help unlock this knowledge vault in semantically annotated text corpora: (i) identifying important events; (ii) semantic search; and (iii) event analytics.
Online conversations, such as blogs, provide a rich amount of information and opinions about popular queries. Given a query, traditional blog sites return a set of conversations often consisting of thousands of comments with a complex thread structure. Since the interfaces of these blog sites do not provide any overview of the data, it becomes very difficult for the user to explore and analyze such a large amount of conversational data. In this paper, we present MultiConVis, a visual text analytics system designed to support the exploration of a collection of online conversations. Our system tightly integrates NLP techniques for topic modeling and sentiment analysis with information visualizations, taking into account the unique characteristics of online conversations. The resulting interface supports user exploration, starting from a possibly large set of conversations, then narrowing down to a subset of conversations, and eventually drilling down to the set of comments of one conversation. Our evaluations, through case studies with domain experts and a formal user study with regular blog readers, illustrate the potential benefits of our approach compared to a traditional blog reading interface.
It is estimated that 50% of the global population lives in urban areas occupying just 0.4% of the Earth's surface. Understanding urban activity involves monitoring population density and its changes over time in urban environments. Currently, there are limited mechanisms to non-intrusively monitor population density in real time. The pervasive use of cellular phones in urban areas provides a unique opportunity to study population density by monitoring mobility patterns in near real time. Cellular carriers such as AT&T harvest such data through their cell towers; however, this data is proprietary and the carriers restrict access to it due to privacy concerns. In this work, we propose a system that passively senses population density and infers mobility patterns in an urban area by monitoring the power spectral density in cellular frequency bands using the periodic beacons sent by each cellphone, without knowing who the users are or where they are located. A wireless sensor network platform is being developed to perform spectral monitoring along with environmental measurements. Algorithms are developed to generate real-time, fine-resolution population estimates.
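An illustrative sketch of the sensing primitive described above: estimate the power spectral density of a sampled slice of a cellular band and use the in-band power as a proxy for phone activity. The sample rate, sub-band, and signal model are assumptions; the actual hardware and estimation algorithms of the proposed system are not specified in the abstract.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
fs = 1e6                                    # assumed complex sample rate of the sensor
t = np.arange(200_000) / fs
iq = 0.1 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))  # noise floor
iq += 0.5 * np.exp(2j * np.pi * 150e3 * t)  # toy stand-in for a phone's uplink burst

f, psd = welch(iq, fs=fs, nperseg=4096, return_onesided=False)
band = (f > 100e3) & (f < 200e3)            # hypothetical sub-band of interest
in_band_power = psd[band].sum() * (f[1] - f[0])
print(f"in-band power: {in_band_power:.3e}")  # tracked over time as an activity proxy
```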
Spoofing is a serious threat to the widespread use of Global Navigation Satellite Systems (GNSSs) such as GPS and can be expected to play an important role in the security of many future IoT systems that rely on time, location, or navigation information. In this paper, we focus on the technique of multi-receiver GPS spoofing detection, so far only proposed theoretically. This technique promises to detect malicious spoofing signals by making use of the reported positions of several GPS receivers deployed in a fixed constellation. We scrutinize the assumptions of prior work, in particular the error models, and investigate how these models and their results can be improved due to the correlation of errors at co-located receiver positions. We show that by leveraging spatial noise correlations, the false acceptance rate of the countermeasure can be improved while preserving the sensitivity to attacks. As a result, receivers can be placed significantly closer together than previously expected, which broadens the applicability of the countermeasure. Based on theoretical and practical investigations, we build the first realization of a multi-receiver countermeasure and experimentally evaluate its performance both in authentic and in spoofing scenarios.
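A simplified version of the multi-receiver check described above: under a single spoofing signal, all receivers compute (nearly) the same position, so the pairwise distances of the reported fixes collapse compared to the known fixed constellation. The noise level and decision threshold below are illustrative assumptions; the paper's error model and its exploitation of spatial noise correlations are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def spoofing_suspected(reported, constellation, threshold_m=2.0):
    """reported, constellation: (n, 2) arrays of local ENU positions in metres."""
    def pairwise(p):
        return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    # Under spoofing all receivers report (nearly) the same fix, so the reported
    # geometry collapses relative to the known antenna layout.
    deviation = np.abs(pairwise(reported) - pairwise(constellation))
    return bool(deviation.max() > threshold_m)

constellation = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
authentic = constellation + rng.normal(scale=0.3, size=constellation.shape)
spoofed = np.tile([2.5, 2.5], (4, 1)) + rng.normal(scale=0.3, size=(4, 2))

print(spoofing_suspected(authentic, constellation))   # expected: False
print(spoofing_suspected(spoofed, constellation))     # expected: True
```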
Ubiquitous WiFi infrastructure and smartphones offer a great opportunity to study physical activities. In this paper, we present MobiCamp, a large-scale testbed for studying the mobility-related activities of residents on a campus. MobiCamp consists of ~2,700 APs, ~95,000 smartphones, and an app with ~2,300 opt-in volunteer users. More specifically, we capture how mobile users interact with different types of buildings, with other users, and with classroom courses, etc. To achieve this goal, we first obtain a relatively complete coverage of the users' mobility traces by utilizing four types of information from SNMP and by relaxing the location granularity to roughly room level. The popular app then provides user attributes (grade, gender, etc.) and fine-grained behavior information (phone usage, course timetables, etc.) for the sampled population. These detailed mobile data are then correlated with the mobility traces from SNMP to estimate the entire campus population's physical activities. We use two applications to show the power of MobiCamp.
We study the trade-off between the benefits obtained by communication and the risks due to exposing the location of the transmitter. To study this problem, we introduce a game between two teams of mobile agents, the P-bots team and the E-bots team. The E-bots attempt to eavesdrop and collect information while evading the P-bots; the P-bots attempt to prevent this by performing patrol and pursuit. The game models a typical use case of micro-robots, i.e., their use for (industrial) espionage. We evaluate strategies for both teams, using analysis and simulations.
With their high penetration rate and relatively good clock accuracy, smartphones are replacing watches in several market segments. Modern smartphones have more than one clock source to complement each other: NITZ (Network Identity and Time Zone), NTP (Network Time Protocol), and GNSS (Global Navigation Satellite System), including GPS. NITZ information is delivered by the cellular core network, indicating the network name and clock information. NTP provides a facility to synchronize the clock with a time server. Among these clock sources, only NITZ and NTP are updated without user interaction, as location services require manual activation. In this paper, we analyze security aspects of these clock sources and their impact on the security features of modern smartphones. In particular, we investigate NITZ and NTP procedures over cellular networks (2G, 3G, and 4G) and Wi-Fi communication, respectively. Furthermore, we analyze several European, Asian, and American cellular networks from the NITZ perspective. We identify three classes of vulnerabilities: specification issues in a cellular protocol, configuration issues in cellular network deployments, and implementation issues in different mobile OSs. We demonstrate how an attacker with a low-cost setup can spoof NITZ and NTP messages to cause denial-of-service attacks. Finally, we propose methods for securely synchronizing the clock on smartphones.
Future transportation systems rely heavily on the integrity of the spatial information provided by their means of transportation, such as vehicles and planes. In critical applications (e.g., collision avoidance), tampering with this data can result in life-threatening situations. It is therefore essential for the safety of these systems to securely verify this information. While there is a considerable body of work on the secure verification of locations, movement of nodes has received only little attention in the literature. This paper proposes a new method to securely verify the spatial movement of a mobile sender in all dimensions, i.e., position, speed, and direction. Our scheme uses Doppler shift measurements from different locations to verify a prover's motion. We provide a formal proof of the security of the scheme and demonstrate its applicability to air traffic communications. Our results indicate that it is possible to reliably verify the motion of aircraft in currently operational systems with an equal error rate of zero.
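A simplified check in the spirit of the scheme above: given a claimed position and velocity, compute the Doppler shift each ground receiver should observe and compare it with the measured shifts. The carrier frequency, geometry, noise level, and acceptance threshold are illustrative assumptions; the paper's formal security treatment is not reproduced here.

```python
import numpy as np

C = 299_792_458.0          # speed of light, m/s
F0 = 1.09e9                # assumed carrier frequency of an ADS-B-like link, Hz
rng = np.random.default_rng(0)

def expected_doppler(pos, vel, receivers):
    """Doppler shift (Hz) each receiver should observe for the claimed pos/vel."""
    los = receivers - pos                                  # line-of-sight vectors
    u = los / np.linalg.norm(los, axis=1, keepdims=True)
    return F0 * (u @ vel) / C                              # positive when closing in

def motion_plausible(measured, pos, vel, receivers, tol_hz=30.0):
    """Accept the claim if all measured shifts match the prediction within tol_hz."""
    return bool(np.max(np.abs(measured - expected_doppler(pos, vel, receivers))) < tol_hz)

receivers = np.array([[0.0, 0.0, 0.0], [20e3, 0.0, 0.0], [0.0, 20e3, 0.0]])
pos = np.array([5e3, 5e3, 10e3])             # claimed position (m, local frame)
vel = np.array([200.0, 50.0, 0.0])           # claimed velocity (m/s)

measured = expected_doppler(pos, vel, receivers) + rng.normal(scale=5.0, size=3)
print(motion_plausible(measured, pos, vel, receivers))    # consistent claim -> True
print(motion_plausible(measured, pos, -vel, receivers))   # tampered velocity -> False
```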
Radio network information is leaked well beyond the perimeter in which the radio network is deployed. We investigate attacks in which a person's location can be inferred from the radio characteristics of wireless links (e.g., the received signal strength). An attacker can deploy a network of receivers that measure the received signal strength of the radio signals transmitted by the legitimate wireless devices inside a perimeter, allowing the attacker to learn the locations of people moving in the vicinity of the devices inside the perimeter. In this paper, we develop the first solution to this location privacy problem in which neither the attacker nodes nor the tracked moving object transmit any RF signals. We first model the radio network leakage attack using a Stackelberg game. Next, we define utility and cost functions related to the defender's and attacker's actions. Finally, using these utility and cost functions, we find the optimal strategy for the defender by applying a greedy method. We evaluate our game-theoretic framework experimentally and find that our approach significantly reduces the chance of an attacker determining the location of people inside a perimeter.
Today, mobile data owners lack consent and control over the release and utilization of their location data. Third-party applications continuously process and access location data without the data owner's granular control and without knowledge of how the location data is being used. The proliferation of GPS-enabled IoT devices will lead to larger-scale abuses of trust. In this paper we present the first design and implementation of a privacy module built into the GPSD daemon, a low-level GPS interface that runs on GPS-enabled devices. The integration of the privacy module ensures that data owners have granular control over the release of their GPS location. We describe the design of our privacy module and its integration into the GPSD daemon.
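A hypothetical sketch of the kind of granular, per-consumer release control such a privacy module could enforce before a fix leaves the device, here realized as simple precision reduction; this is not gpsd's actual interface and not necessarily the paper's policy model, only an illustration of the idea.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    decimals: int          # retained decimal places: ~4 -> ~11 m, ~2 -> ~1.1 km
    allow: bool = True

# Hypothetical per-application policies set by the data owner.
POLICIES = {"weather_app": Policy(decimals=1),
            "navigation": Policy(decimals=5),
            "flashlight": Policy(decimals=0, allow=False)}

def release(app: str, lat: float, lon: float):
    """Return a (possibly coarsened) fix for the requesting app, or None if denied."""
    policy = POLICIES.get(app, Policy(decimals=0, allow=False))
    if not policy.allow:
        return None                              # data owner denies this consumer
    return round(lat, policy.decimals), round(lon, policy.decimals)

print(release("weather_app", 47.37689, 8.54169))   # coarse fix only
print(release("flashlight", 47.37689, 8.54169))    # no fix released
```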
With an immense number of threats pouring in from nation states, hacktivists, terrorists, and cybercriminals, a globally secure infrastructure becomes a major obligation. Most critical infrastructures were primarily designed to work in isolation from the normal communication network, but due to the advent of the "Smart Grid", which uses advanced and intelligent approaches to control critical infrastructure, it is necessary for these cyber-physical systems to have access to the communication system. Consequently, such critical systems have become prime targets; hence, the security of critical infrastructure is currently one of the most challenging research problems. Performing an extensive security analysis involving experiments with cyber-attacks on a live industrial control system (ICS) is not possible. Therefore, researchers generally resort to test beds and complex simulations to answer questions related to SCADA systems. Since all conclusions are drawn from the test bed, it is necessary to perform validation against a physical model. This paper examines the fidelity of a virtual SCADA test bed with respect to a physical test bed and allows the effects of cyber-attacks on both systems to be studied.
Security and privacy are crucial for all IT systems and services. The diversity of applications places high demands on the knowledge and experience of software developers and IT professionals. Besides programming skills, security and privacy aspects are required as well and must be considered during development. If developers have not been trained in these topics, it is especially difficult for them to prevent problematic security issues such as vulnerabilities. In this work we present an interactive e-learning platform focusing on solving real-world cybersecurity tasks in a sandboxed web environment. With our platform students can learn and understand how security vulnerabilities can be exploited in different scenarios. The platform has been evaluated in four university IT security courses with around 1100 participants over three years.