Bibliography
The explosion in Internet-connected household devices, such as light bulbs, smoke alarms, power switches, and webcams, is creating new vectors for attacking "smart homes" at an unprecedented scale. A common perception is that smart-home IoT devices are protected from Internet attacks by the perimeter security offered by home routers. In this paper we demonstrate how an attacker can infiltrate the home network via a doctored smart-phone app. Unbeknownst to the user, this app scouts for vulnerable IoT devices within the home, reports them to an external entity, and modifies the firewall to allow the external entity to attack the IoT devices directly. The ability to infiltrate smart homes via doctored smart-phone apps demonstrates that home routers are poor protection against Internet attacks and highlights the need for increased security for IoT devices.
Authenticating a user based on her unique behavioral biometric traits has been extensively researched over the past few years. The most researched behavioral biometric techniques are based on keystroke and mouse dynamics. These schemes, however, have been shown to be vulnerable to human and robotic attacks that mimic the user's behavioral pattern in order to impersonate her. In this paper, we aim to verify the user's identity through active, cognition-based user interaction in the authentication process. Such interaction promises two key advantages. First, it may enhance the security of the authentication process, as multiple rounds of active interaction serve as a mechanism to protect against several types of attacks, including zero-effort attacks, trained expert attackers, and automated attacks. Second, it may enhance the usability of the authentication process by actively engaging the user. We explore the cognitive authentication paradigm through simple interactive challenges, called Dynamic Cognitive Games, in which objects float around within an image and the user's task is to match the objects with their respective target(s) and drag and drop them to the target location(s). Specifically, we introduce, build, and study Gametrics ("game-based biometrics"), an authentication mechanism based on the unique way the user solves such challenges, captured by multiple features related to her cognitive abilities and mouse dynamics. Based on a comprehensive dataset collected in both online and lab settings, we show that Gametrics can identify users with high accuracy (false negative rates, FNR, as low as 0.02) while rejecting zero-effort attackers (false positive rates, FPR, as low as 0.02). Moreover, Gametrics shows promising results in defending against expert attackers who try to learn and later mimic the user's pattern of solving the challenges (FPR for expert human attackers as low as 0.03). Furthermore, we argue that the proposed biometric is hard to replay or spoof by automated means, such as robots or malware.
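As a concrete illustration (not the authors' implementation), the following minimal Python sketch extracts the kind of mouse-dynamics features a Gametrics-style verifier might compute from a single drag-and-drop trace; the feature names and the (timestamp, x, y) trace format are illustrative assumptions.

```python
# Illustrative sketch: a few mouse-dynamics features from one drag trace.
import math

def drag_features(trace):
    """trace: list of (t, x, y) samples for a single object drag."""
    dists, speeds, angles = [], [], []
    for (t0, x0, y0), (t1, x1, y1) in zip(trace, trace[1:]):
        d = math.hypot(x1 - x0, y1 - y0)
        dt = (t1 - t0) or 1e-6
        dists.append(d)
        speeds.append(d / dt)
        angles.append(math.atan2(y1 - y0, x1 - x0))
    path_len = sum(dists)
    straight = math.hypot(trace[-1][1] - trace[0][1], trace[-1][2] - trace[0][2])
    return {
        "drag_time": trace[-1][0] - trace[0][0],           # cognitive + motor latency
        "mean_speed": sum(speeds) / len(speeds),
        "path_efficiency": straight / (path_len or 1e-6),  # 1.0 = perfectly straight drag
        "direction_changes": sum(
            1 for a0, a1 in zip(angles, angles[1:]) if abs(a1 - a0) > 0.5
        ),
    }

trace = [(0.00, 10, 10), (0.05, 18, 12), (0.10, 30, 15), (0.18, 52, 20)]
print(drag_features(trace))
```

Features like these, aggregated over several game rounds, would then feed a per-user classifier.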
The function of query auto-completion in modern search engines is to help users formulate queries quickly and precisely. Conventional context-aware methods primarily rank candidate queries according to term- and query-level relationships to the context. However, most sessions are extremely short, and capturing search intents through such relationships becomes difficult when the context contains only a few queries. In this paper, we investigate the feasibility of discovering search intents within short contexts for query auto-completion. The class distribution of the search session (i.e., issued queries and click behavior) is derived as the search intent. Several distribution-based features are proposed to estimate the proximity between candidates and search intents. Finally, we apply learning-to-rank to predict the user's intended query according to these features. We also design an ensemble model to combine the benefits of our proposed features and conventional term-based approaches. Extensive experiments have been conducted on the publicly available AOL search engine log. The experimental results demonstrate that our approach significantly outperforms six competitive baselines. Performance at different numbers of keystrokes is also evaluated. Furthermore, an in-depth analysis is made to justify the usability of search intent classification for query auto-completion.
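The toy Python sketch below illustrates the general idea of a distribution-based proximity feature under heavy simplification; the keyword-overlap "classifier" and the intent classes are stand-ins invented for illustration, not the paper's classification model.

```python
# Toy sketch: score completion candidates by proximity between intent
# distributions of the (short) session and of each candidate.
import numpy as np

def intent_distribution(texts, classes):
    """Stand-in classifier: class score = keyword overlap, normalized."""
    scores = np.array([sum(any(w in t for t in texts) for w in kws)
                       for kws in classes.values()], dtype=float)
    return scores / scores.sum() if scores.sum() else scores

classes = {"shopping": ["buy", "price", "cheap"],
           "travel":   ["flight", "hotel", "airport"]}
session = ["cheap flights", "hotel near airport"]   # issued queries so far
p_session = intent_distribution(session, classes)

for cand in ["hotel booking", "buy shoes"]:
    p_cand = intent_distribution([cand], classes)
    proximity = float(p_session @ p_cand)           # one simple proximity feature
    print(cand, round(proximity, 3))                # travel candidate scores higher
```

In the paper, features of this flavor would be combined with term-based features inside a learning-to-rank model rather than used directly.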
Important information regarding the learning experience and relative preparedness of Computer Science students can be obtained by analyzing their coding activity at a fine-grained level, using an online IDE that records student code editing, compiling, and testing activities down to the individual keystroke. We report results from analyses of student coding patterns using such an online IDE. In particular, we gather data from a group of students performing an assigned programming lab, using the online IDE to gather statistics. We extract high-level statistics from the student data and apply supervised learning techniques to identify those that are the most salient predictors of student success, as measured by later performance in the class. We use these results to make predictions of course performance for another student group, and report on the reliability of those predictions.
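To make the prediction step concrete, here is a minimal scikit-learn sketch under assumed inputs; the statistics (compiles per hour, error-fix time, and so on) and the toy numbers are hypothetical, and the paper does not specify which supervised learner it uses.

```python
# Sketch: train on one cohort's IDE-derived statistics, predict for another.
from sklearn.linear_model import LogisticRegression

# rows: [compiles_per_hour, error_fix_time_min, test_runs, keystrokes_k]
cohort_A = [[12, 3.0, 25, 40], [4, 11.0, 5, 18], [9, 4.5, 19, 33], [3, 15.0, 3, 12]]
passed_A = [1, 0, 1, 0]                        # later course success

model = LogisticRegression().fit(cohort_A, passed_A)
cohort_B = [[10, 5.0, 20, 30]]                 # a student from the new group
print(model.predict(cohort_B), model.predict_proba(cohort_B))
```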
In this paper, we consider side-channel mechanisms, specifically using smart device ambient light sensors, to capture information about user computing activity. We distinguish keyboard keystrokes using only the ambient light sensor readings from a smart watch worn on the user's non-dominant hand. Additionally, we investigate the feasibility of capturing screen emanations for determining user browser usage patterns. The experimental results expose privacy and security risks, as well as the potential for new mobile user interfaces and applications.
In traditional programming courses, students have usually been at least partly graded using pen-and-paper exams. One problem with such exams is that they connect only partially to the practice conducted within the courses. Testing students in a more practical environment has been constrained by the limited resources needed, for example, for authentication. In this work, we study whether students in a programming course can be identified in an exam setting based solely on their typing patterns. We replicate an earlier study that indicated that keystroke analysis can be used for identifying programmers. Then, we examine how a controlled machine-exam setting affects identification accuracy, i.e., whether students can be identified reliably in a machine exam using typing profiles built from their programming assignments during the course. Finally, we investigate identification accuracy in an uncontrolled machine exam, where students can complete the exam at any time using any computer they want. Our results indicate that even though identification accuracy deteriorates when identifying students in an exam, it is high enough to identify students reliably if the identification is not required to be exact, but the top k closest matches are regarded as correct.
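For intuition, here is a minimal sketch of one common approach to typing-pattern identification, keystroke digraph-latency profiles matched by nearest distance; the paper's exact feature set and matcher may differ, and all digraphs and numbers below are made up.

```python
# Sketch: identify an exam typist by the closest stored typing profile.
import numpy as np

def profile(latencies_by_digraph):
    """dict like {'th': [0.11, 0.13]} -> fixed-order vector of mean latencies."""
    return np.array([np.mean(latencies_by_digraph[d])
                     for d in sorted(latencies_by_digraph)])

students = {
    "ada":   profile({"th": [0.11, 0.12], "he": [0.09, 0.10], "in": [0.14, 0.15]}),
    "grace": profile({"th": [0.20, 0.22], "he": [0.18, 0.17], "in": [0.25, 0.24]}),
}
exam_sample = profile({"th": [0.12], "he": [0.10], "in": [0.16]})

# Rank students by profile distance; accept if the truth is in the top k.
ranked = sorted(students, key=lambda s: np.linalg.norm(students[s] - exam_sample))
print(ranked[:1])   # k = 1 here -> ['ada']
```

The top-k relaxation the abstract mentions corresponds to checking `ranked[:k]` instead of only the single closest profile.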
Twitter is one of the most popular microblogging social systems, providing a set of distinctive posting services that operate in real time. The flexibility of these services has attracted unethical individuals, so-called "spammers", who aim to spread malicious, phishing, and misleading information. Unfortunately, the existence of spam causes significant problems for search quality and user privacy. In the battle against spam, various detection methods have been designed that automate the detection process by combining the "features" concept with machine learning methods. However, the existing features are not effective enough to keep up with spammers' tactics, because they are easy to manipulate. Also, graph features are not suitable for Twitter-based applications, despite the high performance obtainable when applying them. In this paper, going beyond simple statistical features such as the number of hashtags and the number of URLs, we examine the time property by advancing the design of several features used in the literature and proposing new time-based features. The new feature design is divided between robust advanced statistical features that explicitly incorporate the time attribute and behavioral features that identify posting-behavior patterns. The experimental results show that the new features correctly classify the majority of spammers with an accuracy higher than 93% when using the Random Forest learning algorithm, applied to a collected and annotated dataset. The results outperform the accuracy of state-of-the-art features by about 6%, proving the significance of leveraging time in detecting spam accounts.
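A minimal sketch of the general recipe follows; the feature definitions here are illustrative stand-ins, not the authors' exact time-based features, and the toy data is invented.

```python
# Sketch: per-account time-based posting features + Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def time_features(timestamps):
    """timestamps: sorted posting times (seconds) for one account."""
    gaps = np.diff(np.asarray(timestamps, dtype=float))
    return [
        gaps.mean(),        # average inter-tweet time
        gaps.std(),         # burstiness: spammers often post in bursts
        np.median(gaps),
        (gaps < 10).mean(), # fraction of near-instant consecutive posts
    ]

account_timestamps = [
    [0, 4, 9, 12, 15, 21],           # burst-posting account
    [0, 3600, 9800, 40000, 86400],   # normal account
]
labels = [1, 0]                      # 1 = spammer, 0 = legitimate

X = np.array([time_features(ts) for ts in account_timestamps])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(X))
```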
In the last few years, the wide adoption of service computing delivered over the Internet has created immense security challenges for service providers. Cyber criminals use advanced malware, such as polymorphic botnets, to participate in our everyday online activities and to access desired information such as personal details, credit card numbers, and banking credentials. The polymorphic botnet attack is one of the biggest attacks in the history of cybercrime, and millions of computers around the world are currently infected by botnet clients. A botnet attack is an intelligent, highly coordinated distributed attack consisting of a large number of bots that generate large volumes of spam e-mail and launch distributed denial of service (DDoS) attacks on victim machines in a heterogeneous network environment. It is therefore necessary to detect malicious bots and prevent their planned attacks in the cloud environment. A number of techniques for detecting malicious bots in a network have been developed in the literature. This paper identifies the ineffectiveness of signature-based detection, network-traffic-based detection such as NetFlow or traffic-flow detection, and anomaly-based detection. We propose a real-time malware detection methodology based on Domain Generation Algorithms. It increases throughput in terms of early detection of malicious bots and achieves high accuracy in identifying suspicious behavior.
Botnets play major roles in a vast number of threats to network security, such as DDoS attacks, generation of spam emails, and information theft. Detecting botnets is a difficult task due to the complexity and performance issues of analyzing the huge amounts of data from real large-scale networks. In major botnet malware, the use of Domain Generation Algorithms (DGAs) reduces the chance of detection by whitelist/blacklist schemes, so DGA botnets have higher survivability. This paper proposes a DGA botnet detection scheme based on DNS traffic analysis which utilizes measures such as entropy, the meaning level of the domain, frequency of n-gram appearances, and the Mahalanobis distance for domain classification. The proposed method improves on the Phoenix botnet detection mechanism: in the classification phase, a modified Mahalanobis distance is used instead of the original, and the clustering phase is based on a modified k-means algorithm for better effectiveness. The effectiveness of the proposed method was measured and compared with the Phoenix, Linguistic, and SVM Light methods. The experimental results show that the accuracy of the proposed botnet detection scheme ranges from 90% to 99.97%, depending on the botnet type.
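The sketch below illustrates the kind of per-domain measures such a scheme relies on, in simplified form: character entropy, n-gram frequency against benign traffic, and a plain (not the paper's modified) Mahalanobis distance to a cluster of benign feature vectors. The benign domain list and thresholds are toy assumptions.

```python
# Sketch: simple lexical features + Mahalanobis distance for DGA scoring.
import math
from collections import Counter
import numpy as np

def entropy(domain):
    counts, n = Counter(domain), len(domain)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def bigram_score(domain, benign_bigrams):
    """Fraction of the domain's bigrams seen in benign traffic."""
    bigrams = [domain[i:i + 2] for i in range(len(domain) - 1)]
    return sum(b in benign_bigrams for b in bigrams) / max(len(bigrams), 1)

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

benign = ["google", "wikipedia", "youtube", "amazon", "reddit", "netflix"]
benign_bigrams = {d[i:i + 2] for d in benign for i in range(len(d) - 1)}
feats = np.array([[entropy(d), bigram_score(d, benign_bigrams)] for d in benign])
mean, cov = feats.mean(axis=0), np.cov(feats.T) + 1e-6 * np.eye(2)

for d in ["twitter", "xkqzjwpvgh"]:   # real name vs. DGA-looking name
    x = np.array([entropy(d), bigram_score(d, benign_bigrams)])
    print(d, round(mahalanobis(x, mean, cov), 2))  # larger => more DGA-like
```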
The hyperlink structure of the World Wide Web is modeled as a directed, dynamic, huge web graph. Web graphs are analyzed to determine PageRank, fight web spam, detect communities, and so on, by performing tasks such as clustering, classification, and reachability. These tasks involve operations such as graph navigation, checking link existence, and identifying active links, which demand scanning entire graphs. Frequent scanning of very large graphs incurs heavy I/O and memory overhead. To rectify these issues, several data structures have been proposed to represent graphs compactly. Even though the problem of representing graphs has been actively studied in the literature, there has been much less focus on the representation of dynamic graphs. In this paper, we propose Tree-Dictionary-Representation (TDR), a compressed graph representation that supports the dynamic nature of graphs as well as the various graph operations. Our experimental study shows that this representation works efficiently with limited main memory and provides fast traversal of edges.
While email plays an increasingly important role on the Internet, compromised email accounts pose severe challenges, especially for the administrators of institutional email service providers. Inspired by previous experience in spam filtering and compromised-account detection, we propose several criteria, such as Success Outdegree Proportion, Reverse PageRank, Recipient Clustering Coefficient, and Legitimate Recipient Proportion, for detecting compromised email accounts from the perspective of graph topology. Specifically, several widely used social network analysis metrics are adapted to the characteristics of mail-log analysis. We evaluate our methods on a dataset constructed by mining one month (30 days) of mail logs from a university with 118,617 local users and 11,460,399 mail log entries. The experimental results demonstrate that our methods achieve very positive performance, and we show that they can be efficiently applied to even larger datasets.
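An illustrative networkx sketch of two of these graph-topology criteria follows; it is not the authors' implementation, and the tiny mail log is invented.

```python
# Sketch: sender -> recipient digraph from mail-log entries, plus two
# of the proposed topological criteria.
import networkx as nx

log = [("alice", "bob"), ("alice", "carol"), ("bob", "alice"),
       ("bob", "carol"), ("spammer", "u1"), ("spammer", "u2"), ("spammer", "u3")]

G = nx.DiGraph()
G.add_edges_from(log)

# Reverse PageRank: PageRank on the reversed graph, scoring senders by how
# broadly they reach recipients; bulk senders that never receive replies
# stand out from ordinary correspondents.
rev_pr = nx.pagerank(G.reverse())

# Recipient clustering coefficient: how tightly a sender's recipients are
# connected to each other (undirected view used for this sketch).
cc = nx.clustering(G.to_undirected())

for user in ("alice", "spammer"):
    print(user, round(rev_pr[user], 3), round(cc[user], 3))
```

Here `alice`'s recipients also write to each other (high clustering), while `spammer`'s recipients are mutually unconnected, the pattern these criteria are designed to expose.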
Tremendous amounts of data are generated daily. Accordingly, unstructured text data distributed through news, blogs, and social media has gained much attention from researchers, as it contains abundant information about consumers' opinions. However, as the usefulness of text data increases, attempts to gain profit by distorting text data, maliciously or non-maliciously, are also increasing. Various types of spam detection techniques have therefore been studied to prevent the side effects of spamming. The most representative studies include e-mail spam detection, web spam detection, and opinion spam detection. "Spam" is recognized on the basis of three characteristics and actions: (1) if a certain user is recognized as a spammer, then all content created by that user is recognized as spam; (2) if certain content is exposed to other users (regardless of those users' intention), then that content is recognized as spam; and (3) any content that contains malicious or non-malicious false information is recognized as spam. Many studies have addressed type (1) and type (2) spamming by analyzing various metadata, such as user networks and spam words. For type (3), however, relatively few studies have been conducted because it is difficult to determine the veracity of a given word or piece of information. In this study, we regard a hashtag that is irrelevant to the content of a blog post as spam and devise a methodology to detect such spam hashtags.
Text messaging is used by more people around the world than any other communications technology. As such, it presents a desirable medium for spammers. While this problem has been studied by many researchers over the years, the recent increase in legitimate bulk traffic (e.g., account verification, 2FA, etc.) has dramatically changed the mix of traffic seen in this space, reducing the effectiveness of previous spam classification efforts. This paper demonstrates the performance degradation of those detectors when used on a large-scale corpus of text messages containing both bulk and spam messages. Against our labeled dataset of text messages collected over 14 months, the precision and recall of past classifiers fall to 23.8% and 61.3% respectively. However, using our classification techniques and labeled clusters, precision and recall rise to 100% and 96.8%. We not only show that our collected dataset helps to correct many of the overtraining errors seen in previous studies, but also present insights into a number of current SMS spam campaigns.
We present algorithmic techniques for parallel PDE solvers that leverage numerical smoothness properties of physics simulation to detect and correct silent data corruption within local computations. We initially model such silent hardware errors (which are of concern for extreme scale) via injected DRAM bit flips. Our mitigation approach generalizes previously developed "robust stencils" and uses modified linear algebra operations that spatially interpolate to replace large outlier values. Prototype implementations for 1D hyperbolic and 3D elliptic solvers, tested on up to 2048 cores, show that this error mitigation enables tolerating orders of magnitude higher bit-flip rates. The runtime overhead of the approach generally decreases with greater solver scale and complexity, becoming no more than a few percent in some cases. A key advantage is that silent data corruption can be handled transparently with data in cache, reducing the cost of false-positive detections compared to rollback approaches.
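A toy sketch of the outlier-correction idea follows; it is not the paper's production kernel, just a 1D illustration under the stated smoothness assumption, with the detection threshold chosen arbitrarily.

```python
# Sketch: in a smooth 1D field, a value that differs wildly from its
# neighbors is likely a silent bit flip; repair it by spatial interpolation
# instead of rolling back.
import numpy as np

def correct_outliers(u, tol=10.0):
    """u: 1D solution array; tol: local-variation units that count as an outlier."""
    for i in range(1, len(u) - 1):
        neighbor_avg = 0.5 * (u[i - 1] + u[i + 1])
        local_scale = abs(u[i + 1] - u[i - 1]) + 1e-12
        if abs(u[i] - neighbor_avg) > tol * local_scale:
            u[i] = neighbor_avg       # linear interpolation repairs the cell
    return u

u = np.sin(np.linspace(0, np.pi, 64))
u[20] = 1e8                           # injected DRAM bit flip
correct_outliers(u)
assert abs(u[20] - np.sin(np.linspace(0, np.pi, 64))[20]) < 1e-2
```

Because the repair uses data already in cache, a false positive merely smooths one cell slightly, which is why this is cheaper than rollback-based recovery.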
In this paper we propose Mastino, a novel defense system to detect malware download events. A download event is a 3-tuple that identifies the action of downloading a file from a URL, triggered by a client (machine). Mastino utilizes global situational awareness, continuously monitoring various network- and system-level events of client machines across the Internet, and provides real-time classification of both files and URLs to clients upon submission of a new, unknown file or URL to the system. To enable detection of download events, Mastino builds a large download graph that captures the subtle relationships among the entities of download events, i.e., files, URLs, and machines. We implemented a prototype version of Mastino and evaluated it in a large-scale real-world deployment. Our experimental evaluation shows that Mastino can accurately classify malware download events with an average of 95.5% true positives (TP) while incurring less than 0.5% false positives (FP). In addition, we show that Mastino can classify a new download event as either benign or malware in a fraction of a second, and is therefore suitable as a real-time defense system.
Traditional sensitive data disclosure analysis faces two challenges: identifying sensitive data that is not generated by specific API calls, and reporting potential disclosures when the disclosed data is recognized as sensitive only after the sink operations. We address these issues by developing BidText, a novel static technique to detect sensitive data disclosures. BidText formulates the problem as a type system, in which variables are typed with the text labels that they encounter (e.g., during key-value pair operations). The type system features a novel bi-directional propagation technique that propagates variable label sets through forward and backward data flow. A data disclosure is reported if a parameter at a sink point is typed with a sensitive text label. BidText is evaluated on 10,000 Android apps. It reports 4,406 apps that have sensitive data disclosures, with 4,263 apps having log-based disclosures and 1,688 having disclosures due to other sinks such as HTTP requests. Existing techniques report only 64.0% of what BidText reports, and manual inspection shows that BidText's false positive rate is 10%.
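BidText itself is a static type system over app code; the following highly simplified toy only shows the core idea of label sets flowing both with and against assignments to a fixed point, over an invented straight-line "program".

```python
# Toy sketch of bi-directional label propagation (deliberately
# over-approximate: every assignment propagates labels both ways).
SENSITIVE = {"password", "imei"}

# straight-line toy program: each statement is (dst, src)
stmts = [("v1", "key_password"), ("v2", "v1"), ("sink_arg", "v2")]

labels = {v: set() for s in stmts for v in s}
labels["key_password"].add("password")   # text label from a key-value pair key

changed = True
while changed:                           # iterate to a fixed point
    changed = False
    for dst, src in stmts:
        for a, b in ((dst, src), (src, dst)):   # forward AND backward flow
            new = labels[b] - labels[a]
            if new:
                labels[a] |= new
                changed = True

if labels["sink_arg"] & SENSITIVE:
    print("disclosure: sink parameter typed with", labels["sink_arg"] & SENSITIVE)
```

The backward direction is what lets a disclosure be reported even when the data is recognized as sensitive only after the sink operation has already consumed it.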
We propose a web intrusion detection and recognition system built on Storm, a distributed, reliable, fault-tolerant real-time data stream processing system. The system combines machine learning, feature selection via TF-IDF (Term Frequency–Inverse Document Frequency), and an optimized cosine similarity algorithm to detect attacks and malicious behavior in real time, with a low false positive rate and a high detection rate, in order to protect the security of user data. Comparative experiments show that the system improves both the intrusion recognition rate and the false positive rate to some extent, and can better carry out the intrusion detection work.
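A minimal sketch of the classification core follows, using scikit-learn as a stand-in for the Storm topology; the attack strings and the similarity threshold are illustrative assumptions.

```python
# Sketch: TF-IDF vectors over HTTP request strings, matched to known
# attacks by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

attacks = ["/index.php?id=1' OR '1'='1",           # SQL injection
           "/search?q=<script>alert(1)</script>"]  # XSS
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit(attacks)

def is_intrusion(request, threshold=0.4):
    sims = cosine_similarity(vec.transform([request]), vec.transform(attacks))
    return sims.max() >= threshold

print(is_intrusion("/index.php?id=2' OR '1'='1"))  # True: near-duplicate attack
print(is_intrusion("/home"))                       # False: benign request
```

In the actual system this matching would run inside Storm bolts over the live request stream rather than as a standalone function.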
With the growth of the Internet, the world has transformed into a global market in which all monetary and business activities are carried out online. Being the most important resource of this developing scene, the Internet is also a vulnerable target and hence needs to be secured against users with malicious intent. Since the Internet has no central surveillance component, attackers occasionally find paths to bypass a system's security using varied and evolving hacking techniques, and one such collection of attacks is intrusion. An intrusion is the act of breaking into a system by compromising the security arrangements the system has in place. The technique of examining system information for possible intrusions is known as intrusion detection. For the last two decades, automatic intrusion detection has been an important research topic. Researchers have developed Intrusion Detection Systems (IDS) capable of detecting attacks in several available environments; the latest on the scene are machine learning approaches. Machine learning techniques are a set of evolving algorithms that learn from experience, improve their performance in situations they have already encountered, and enjoy a broad range of applications in speech recognition, pattern detection, outlier analysis, etc. There are a number of machine learning techniques developed for different applications, and there is no universal technique that works equally well on all datasets. In this work, we evaluate all the machine learning algorithms provided by Weka against the standard dataset for intrusion detection, i.e., KDD Cup 99. The metrics considered are false positive rate, precision, ROC, and true positive rate.
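The paper runs Weka's learners; the Python/scikit-learn sketch below mirrors the same evaluation protocol (train several classifiers, report TPR, FPR, precision, and ROC AUC) on stand-in synthetic data, since KDD Cup 99 itself must be downloaded separately.

```python
# Sketch: evaluate multiple classifiers with the metrics used in the paper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, precision_score, roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (DecisionTreeClassifier(), GaussianNB()):
    pred = clf.fit(Xtr, ytr).predict(Xte)
    tn, fp, fn, tp = confusion_matrix(yte, pred).ravel()
    print(type(clf).__name__,
          "TPR=%.3f" % (tp / (tp + fn)),
          "FPR=%.3f" % (fp / (fp + tn)),
          "prec=%.3f" % precision_score(yte, pred),
          "AUC=%.3f" % roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))
```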
Face recognition has attained greater importance in biometric authentication due to its non-intrusive property of identifying individuals at varying stand-off distances. Face recognition based on multi-spectral imaging has recently gained prime importance due to its ability to capture spatial and spectral information across the spectrum. Our first contribution in this paper is to apply extended multi-spectral face recognition to two different age groups. The second contribution is to show empirically the performance of face recognition for the two age groups. Thus, in this paper, we developed a multi-spectral imaging sensor to capture a facial database for two different age groups (≤ 15 years and ≥ 20 years) at nine different spectral bands covering the 530 nm to 1000 nm range. We then collected a new set of facial images for the two age groups, comprising 168 individuals. Extensive experimental evaluation is performed independently on the two age-group databases using four different state-of-the-art face recognition algorithms. We evaluate the verification and identification rates across individual spectral bands and the fused spectral band for the two age groups. The evaluation results show a higher recognition rate for the age group ≥ 20 years than for ≤ 15 years, which indicates the variation in face recognition across different age groups.
The face is the most dominant and distinct communication tool of human beings. Automatic analysis of facial behavior allows machines to understand and interpret a human's states and needs for natural interactions. This research focuses on developing advanced computer vision techniques to process and analyze facial images for the recognition of various facial behaviors. Specifically, this research consists of two parts: automatic facial landmark detection and tracking, and facial behavior analysis and recognition using the tracked facial landmark points. In the first part, we develop several facial landmark detection and tracking algorithms on facial images with varying conditions, such as varying facial expressions, head poses and facial occlusions. First, to handle facial expression and head pose variations, we introduce a hierarchical probabilistic face shape model and a discriminative deep face shape model to capture the spatial relationships among facial landmark points under different facial expressions and face poses to improve facial landmark detection. Second, to handle facial occlusion, we improve upon the effective cascade regression framework and propose the robust cascade regression framework for facial landmark detection, which iteratively predicts the landmark visibility probabilities and landmark locations. The second part of this research applies our facial landmark detection and tracking algorithms to facial behavior analysis, including facial action recognition and face pose estimation. For facial action recognition, we introduce a novel regression framework for joint facial landmark detection and facial action recognition. For head pose estimation, we are working on a robust algorithm that can perform head pose estimation under facial occlusion.
Recognizing the authenticity of facial expressions is quite difficult for humans. It is therefore an interesting topic for the computer vision community, as algorithms for estimating facial expression authenticity may serve as indicators of deception. This paper discusses the state-of-the-art methods developed for smile veracity estimation and proposes a plan for the development and validation of a novel approach to automated discrimination between genuine and posed facial expressions. The proposed fully automated technique is based on extending high-dimensional Local Binary Patterns (LBP) to the spatio-temporal domain and combining them with the dynamics of facial landmark movements. The proposed technique will be validated on several existing smile databases and on a novel database created with a high-speed camera. Finally, the developed framework will be applied to the detection of deception in real-life scenarios.
Automatic face recognition techniques applied to particular groups or mass databases introduce error cases, and error prevention is crucial for the court. Reranking recognition results based on anthropological analysis can significantly improve the accuracy of automatic methods. Previous studies focused on manual facial comparison. This paper proposes a weighted facial-similarity computing method based on morphological analysis of component characteristics. The search sequence of face recognition results is reranked according to this similarity, and interfering candidates can be removed. Within this research project, standardized photographs, surveillance videos, 3D face images, and identity card photographs of 241 male subjects from China were acquired. The ranking results were modified by modeling selected individual features from the DMV atlas. The improved method raises the accuracy of face recognition through anthropological and morphological theory.
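To illustrate the reranking step, here is a toy Python sketch; the facial components, the weights, and the similarity numbers are invented for illustration, whereas the paper derives its weights from morphological analysis rather than fixing them by hand.

```python
# Sketch: combine per-component morphological similarities into one
# weighted score, then rerank the recognizer's candidate list by it.
weights = {"eyes": 0.3, "nose": 0.25, "mouth": 0.25, "jaw": 0.2}

def weighted_similarity(component_sims):
    return sum(weights[c] * component_sims[c] for c in weights)

candidates = {"id_017": {"eyes": 0.9, "nose": 0.8, "mouth": 0.7, "jaw": 0.6},
              "id_042": {"eyes": 0.5, "nose": 0.9, "mouth": 0.4, "jaw": 0.8}}
reranked = sorted(candidates,
                  key=lambda c: weighted_similarity(candidates[c]),
                  reverse=True)
print(reranked)   # ['id_017', 'id_042']
```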
The face is crucial to human identity, and face identification has become crucial to information security. It is important to understand and address the problems and challenges across the different aspects of facial feature extraction and face identification. In this tutorial, we identify and discuss four research challenges in current face detection/recognition research and related research areas: (1) Unavoidable Facial Feature Alterations, (2) Voluntary Facial Feature Alterations, (3) Uncontrolled Environments, and (4) Accuracy Control on Large-scale Datasets. We also describe several applications (spin-offs) of facial feature studies in the tutorial.
Augmented reality is poised to become a dominant computing paradigm over the next decade. With promises of three-dimensional graphics and interactive interfaces, augmented reality experiences will rival the very best science fiction novels. This breakthrough also brings in unique challenges on how users can authenticate one another to share rich content between augmented reality headsets. Traditional authentication protocols fall short when there is no common central entity or when access to the central authentication server is not available or desirable. Looks Good To Me (LGTM) is an authentication protocol that leverages the unique hardware and context provided with augmented reality headsets to bring innate human trust mechanisms into the digital world to solve authentication in a usable and secure way. LGTM works over point to point wireless communication so users can authenticate one another in a variety of circumstances and is designed with usability at its core, requiring users to perform only two actions: one to initiate and one to confirm. Users intuitively authenticate one another, using seemingly only each other's faces, but under the hood LGTM uses a combination of facial recognition and wireless localization to bootstrap trust from a wireless signal, to a location, to a face, for secure and usable authentication.
Heterogeneous face recognition aims to identify or verify a person's identity by matching facial images of different modalities. In practice, its performance is highly influenced by modality inconsistency, appearance occlusions, illumination variations, and expressions. In this paper, a new method, an ensemble of sparse cross-modal metrics, is proposed to tackle these challenging issues. In particular, a weak sparse cross-modal metric learning method is first developed to measure distances between samples of the two modalities. It learns to adjust rank-one cross-modal metrics to satisfy two sets of triplet-based cross-modal distance constraints in a compact form. Meanwhile, group-based feature selection is performed to enforce that features in the same position in the two modalities are selected simultaneously. By neglecting features that correspond to "noise" in the face regions (eyeglasses, expressions, and so on), the performance of the learned weak metrics can be markedly improved. Finally, an ensemble framework combines the results of the differently learned sparse metrics into a strong one. Extensive experiments on various face datasets demonstrate the benefit of such feature selection, especially when heavy occlusions exist. The proposed ensemble metric learning is shown to be superior to several state-of-the-art methods in heterogeneous face recognition.
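To fix ideas, here is one plausible form of the triplet-based constraint with a rank-one metric decomposition; the notation is assumed for illustration and may differ from the paper's exact formulation.

```latex
% Notation assumed for illustration; not reproduced verbatim from the paper.
% For a probe x_i in one modality, with a same-identity sample y_i^+ and a
% different-identity sample y_i^- in the other modality, each triplet asks
% the learned metric M (a weighted sum of rank-one terms) to separate the
% two pair distances by a margin:
\[
  d_M(x_i, y_i^-) - d_M(x_i, y_i^+) \ge 1,
  \qquad
  d_M(x, y) = (x - y)^{\top} M \,(x - y),
  \qquad
  M = \sum_{k} w_k\, u_k u_k^{\top},\quad w_k \ge 0,
\]
% with sparsity encouraged on the weights w_k so that only a few rank-one
% metrics (and, via group selection, only modality-aligned features) survive.
```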