Biblio
Software effort estimation (SEE) is a crucial step in software development. Effort data missing usually occurs in real-world data collection. Focusing on the missing data problem, existing SEE methods employ the deletion, ignoring, or imputation strategy to address the problem, where the imputation strategy was found to be more helpful for improving the estimation performance. Current imputation methods in SEE use classical imputation techniques for missing data imputation, yet these imputation techniques have their respective disadvantages and might not be appropriate for effort data. In this paper, we aim to provide an effective solution for the effort data missing problem. Incompletion includes the drive factor missing case and effort label missing case. We introduce the low-rank recovery technique for addressing the drive factor missing case. And we employ the semi-supervised regression technique to perform imputation in the case of effort label missing. We then propose a novel effort data imputation approach, named low-rank recovery and semi-supervised regression imputation (LRSRI). Experiments on 7 widely used software effort datasets indicate that: (1) the proposed approach can obtain better effort data imputation effects than other methods; (2) the imputed data using our approach can apply to multiple estimators well.
The prodigious amount of user-generated content continues to grow at an enormous rate. While it greatly facilitates the flow of information and ideas among people and communities, it may pose great threat to our individual privacy. In this paper, we demonstrate that the private traits of individuals can be inferred from user-generated content by using text classification techniques. Specifically, we study three private attributes on Twitter users: religion, political leaning, and marital status. The ground truth labels of the private traits can be readily collected from the Twitter bio field. Based on the tweets posted by the users and their corresponding bios, we show that text classification yields a high accuracy of identification of these personal attributes, which poses a great privacy risk on user-generated content. We further propose a constrained utility maximization framework for preserving user privacy. The goal is to maximize the utility of data when modifying the user-generated content, while degrading the prediction performance of the adversary. The KL divergence is minimized between the prior knowledge about the private attribute and the posterior probability after seeing the user-generated data. Based on this proposed framework, we investigate several specific data sanitization operations for privacy preservation: add, delete, or replace words in the tweets. We derive the exact transformation of the data under each operation. The experiments demonstrate the effectiveness of the proposed framework.
We present history-independent alternatives to a B-tree, the primary indexing data structure used in databases. A data structure is history independent (HI) if it is impossible to deduce any information by examining the bit representation of the data structure that is not already available through the API. We show how to build a history-independent cache-oblivious B-tree and a history-independent external-memory skip list. One of the main contributions is a data structure we build on the way–-a history-independent packed-memory array (PMA). The PMA supports efficient range queries, one of the most important operations for answering database queries. Our HI PMA matches the asymptotic bounds of prior non-HI packed-memory arrays and sparse tables. Specifically, a PMA maintains a dynamic set of elements in sorted order in a linear-sized array. Inserts and deletes take an amortized O(log2 N) element moves with high probability. Simple experiments with our implementation of HI PMAs corroborate our theoretical analysis. Comparisons to regular PMAs give preliminary indications that the practical cost of adding history-independence is not too large. Our HI cache-oblivious B-tree bounds match those of prior non-HI cache-oblivious B-trees. Searches take O(logB N) I/Os; inserts and deletes take O((log2 N)/B+ logB N) amortized I/Os with high probability; and range queries returning k elements take O(logB N + k/B) I/Os. Our HI external-memory skip list achieves optimal bounds with high probability, analogous to in-memory skip lists: O(logB N) I/Os for point queries and amortized O(logB N) I/Os for inserts/deletes. Range queries returning k elements run in O(logB N + k/B) I/Os. In contrast, the best possible high-probability bounds for inserting into the folklore B-skip list, which promotes elements with probability 1/B, is just Theta(log N) I/Os. This is no better than the bounds one gets from running an in-memory skip list in external memory.
We investigate minimization of tree pattern queries that use the child relation, descendant relation, node labels, and wildcards. We prove that minimization for such tree patterns is Sigma2P-complete and thus solve a problem first attacked by Flesca, Furfaro, and Masciari in 2003. We first provide an example that shows that tree patterns cannot be minimized by deleting nodes. This example shows that the M-NR conjecture, which states that minimality of tree patterns is equivalent to their nonredundancy, is false. We then show how the example can be turned into a gadget that allows us to prove Sigma2P-completeness.
In this paper we provide a framework to analyze the effect of uniform sampling on graph optimization problems. Interestingly, we apply this framework to a general class of graph optimization problems that we call heavy subgraph problems, and show that uniform sampling preserves a 1-ε approximate solution to these problems. This class contains many interesting problems such as densest subgraph, directed densest subgraph, densest bipartite subgraph, d-max cut, and d-sum-max clustering. As an immediate impact of this result, one can use uniform sampling to solve these problems in streaming, turnstile or Map-Reduce settings. Indeed, our results by characterizing heavy subgraph problems address Open Problem 13 at the IITK Workshop on Algorithms for Data Streams in 2006 regarding the effects of subsampling, in the context of graph streams. Recently Bhattacharya et al. in STOC 2015 provide the first one pass algorithm for the densest subgraph problem in the streaming model with additions and deletions to its edges, i.e., for dynamic graph streams. They present a (0.5-ε)-approximation algorithm using \textasciitildeO(n) space, where factors of ε and log(n) are suppressed in the \textasciitildeO notation. In this paper we improve the (0.5-ε)-approximation algorithm of Bhattacharya et al. by providing a (1-ε)-approximation algorithm using \textasciitildeO(n) space.
Unlike most social media, where automatic archiving of data is the default, Snapchat defaults to ephemerality: deleting content shortly after it is viewed by a receiver. Interviews with 25 Snapchat users show that ephemerality plays a key role in shaping their practices. Along with friend-adding features that facilitate a network of mostly close relations, default deletion affords everyday, mundane talk and reduces self-consciousness while encouraging playful interaction. Further, although receivers can save content through screenshots, senders are notified; this selective saving with notification supports complex information norms that preserve the feel of ephemeral communication while supporting the capture of meaningful content. This dance of giving and taking, sharing and showing, and agency for both senders and receivers provides the basis for a rich design space of mechanisms, levels, and domains for ephemerality.
Conventional overwriting-based and encryption-based secure deletion schemes can only sanitize data. However, the past existence of the deleted data may leave artifacts in the layout at all layers of a computing system. These structural artifacts may be utilized by the adversary to infer sensitive information about the deleted data or even to fully recover them. The conventional secure deletion solutions unfortunately cannot sanitize them. In this work, we introduce truly secure deletion, a novel security notion that is much stronger than the conventional secure deletion. Truly secure deletion requires sanitizing both the obsolete data as well as the corresponding structural artifacts, so that the resulting storage layout after a delete operation is indistinguishable from that the deleted data never appeared. We propose TedFlash, a Truly secure deletion scheme for Flash-based block devices. TedFlash can successfully sanitize both the data and the structural artifacts, while satisfying the design constraints imposed for flash memory. Security analysis and experimental evaluation show that TedFlash can achieve the truly secure deletion guarantee with a small additional overhead compared to conventional secure deletion solutions.
In the cloud storage, users lose direct control over their data. How to surely delete data in the cloud becomes a crucial problem for a secure cloud storage system. The existing way to this problem is to encrypt the data before outsourcing and destroy the encryption key when deleting. However, this solution may cause heavy computation overhead for the user-side and the encrypted data remains intact in the cloud after the deletion operation. To solve this challenge problem, we propose a novel method to surely delete data in the cloud storage by overwriting. Different from existing works, our scheme is efficient in the user-side and is able to wipe out the deleted data from the drives of the cloud servers.
There are currently no requirements (technical or otherwise) that routing paths must be contained within national boundaries. Indeed, some paths experience international detours, i.e., originate in one country, cross international boundaries and return to the same country. In most cases these are sensible traffic engineering or peering decisions at ISPs that serve multiple countries. In some cases such detours may be suspicious. Characterizing international detours is useful to a number of players: (a) network engineers trying to diagnose persistent problems, (b) policy makers aiming at adhering to certain national communication policies, (c) entrepreneurs looking for opportunities to deploy new networks, or (d) privacy-conscious states trying to minimize the amount of internal communication traversing different jurisdictions. In this paper we characterize international detours in the Internet during the month of January 2016. To detect detours we sample BGP RIBs every 8 hours from 461 RouteViews and RIPE RIS peers spanning 30 countries. We use geolocation of ASes which geolocates each BGP prefix announced by each AS, mapping its presence at IXPs and geolocation infrastructure IPs. Finally, we analyze each global BGP RIB entry looking for detours. Our analysis shows more than 5K unique BGP prefixes experienced a detour. 132 prefixes experienced more than 50% of the detours. We observe about 544K detours. Detours either last for a few days or persist the entire month. Out of all the detours, more than 90% were transient detours that lasted for 72 hours or less. We also show different countries experience different characteristics of detours.
One of the most challenging issues facing academic conferences and educational institutions today is plagiarism detection. Typically, these entities wish to ensure that the work products submitted to them have not been plagiarized from another source (e.g., authors submitting identical papers to multiple journals). Assembling large centralized databases of documents dramatically improves the effectiveness of plagiarism detection techniques, but introduces a number of privacy and legal issues: all document contents must be completely revealed to the database operator, making it an attractive target for abuse or attack. Moreover, this content aggregation involves the disclosure of potentially sensitive private content, and in some cases this disclosure may be prohibited by law. In this work, we introduce Elxa, the first scalable centralized plagiarism detection system that protects the privacy of the submissions. Elxa incorporates techniques from the current state of the art in plagiarism detection, as evaluated by the information retrieval community. Our system is designed to be operated on existing cloud computing infrastructure, and to provide incentives for the untrusted database operator to maintain the availability of the network. Elxa can be used to detect plagiarism in student work, duplicate paper submissions (and their associated peer reviews), similarities between confidential reports (e.g., malware summaries), or any approximate text reuse within a network of private documents. We implement a prototype using the Hadoop MapReduce framework, and demonstrate that it is feasible to achieve competitive detection effectiveness in the private setting.
The popularity of digital currencies, especially cryptocurrencies, has been continuously growing since the appearance of Bitcoin. Bitcoin is a peer-to-peer (P2P) cryptocurrency protocol enabling transactions between individuals without the need of a trusted authority. Its network is formed from resources contributed by individuals known as miners. Users of Bitcoin currency create transactions that are stored in a specialised data structure called a block chain. Bitcoin's security lies in a proof-of-work scheme, which requires high computational resources at the miners. These miners have to be synchronised with any update in the network, which produces high data traffic rates. Despite advances in mobile technology, no cryptocurrencies have been proposed for mobile devices. This is largely due to the lower processing capabilities of mobile devices when compared with conventional computers and the poorer Internet connectivity to that of the wired networking. In this work, we propose LocalCoin, an alternative cryptocurrency that requires minimal computational resources, produces low data traffic and works with off-the-shelf mobile devices. LocalCoin replaces the computational hardness that is at the root of Bitcoin's security with the social hardness of ensuring that all witnesses to a transaction are colluders. It is based on opportunistic networking rather than relying on infrastructure and incorporates characteristics of mobile networks such as users' locations and their coverage radius in order to employ an alternative proof-of-work scheme. Localcoin features (i) a lightweight proof-of-work scheme and (ii) a distributed block chain.
Friends, family and colleagues at work may repeatedly observe how their peers unlock their smartphones. These "insiders" may combine multiple partial observations to form a hypothesis of a target's secret. This changing landscape requires that we update the methods used to assess the security of unlocking mechanisms against human shoulder surfing attacks. In our paper, we introduce a methodology to study shoulder surfing risks in the insider threat model. Our methodology dissects the authentication process into minimal observations by humans. Further processing is based on simulations. The outcome is an estimate of the number of observations needed to break a mechanism. The flexibility of this approach benefits the design of new mechanisms. We demonstrate the application of our methodology by performing an analysis of the SwiPIN scheme published at CHI 2015. Our results indicate that SwiPIN can be defeated reliably by a majority of the population with as few as 6 to 11 observations.
Path prediction on the Internet has been a topic of research in the networking community for close to a decade. Applications of path prediction solutions have ranged from optimizing selection of peers in peer- to-peer networks to improving and debugging CDN predictions. Recently, revelations of traffic correlation and surveillance on the Internet have raised the topic of path prediction in the context of network security. Specifically, predicting network paths can allow us to identify and avoid given organizations on network paths (e.g., to avoid traffic correlation attacks in Tor) or to infer the impact of hijacks and interceptions when direct measurements are not available. In this poster we propose the design and implementation of PathCache which aims to reuse measurement data to estimate AS level paths on the Internet. Unlike similar systems, PathCache does not assume that routing on the Internet is destination based. Instead, we develop an algorithm to compute confidence in paths between ASes. These multiple paths ranked by their confidence values are returned to the user.
Resource discovery in unstructured peer-to-peer networks causes a search query to be flooded throughout the network via random nodes, leading to security and privacy issues. The owner of the search query does not have control over the transmission of its query through the network. Although algorithms have been proposed for policy-compliant query or data routing in a network, these algorithms mainly deal with authentic route computation and do not provide mechanisms to actually verify the network paths taken by the query. In this work, we propose an approach to deal with the problem of verifying network paths taken by a search query during resource discovery, and detection of malicious forwarding of search query. Our approach aims at being secure and yet very scalable, even in the presence of huge number of nodes in the network.
WebRTC is one of the latest additions to the ever growing repository of Web browser technologies, which push the envelope of native Web application capabilities. WebRTC allows real-time peer-to-peer audio and video chat, that runs purely in the browser. Unlike existing video chat solutions, such as Skype, that operate in a closed identity ecosystem, WebRTC was designed to be highly flexible, especially in the domains of signaling and identity federation. This flexibility, however, opens avenues for identity fraud. In this paper, we explore the technical underpinnings of WebRTC's identity management architecture. Based on this analysis, we identify three novel attacks against endpoint authenticity. To answer the identified threats, we propose and discuss defensive strategies, including security improvements for the WebRTC specifications and mitigation techniques for the identity and service providers.
In this study, we report on techniques and analyses that enable us to capture Internet-wide activity at individual IP address-level granularity by relying on server logs of a large commercial content delivery network (CDN) that serves close to 3 trillion HTTP requests on a daily basis. Across the whole of 2015, these logs recorded client activity involving 1.2 billion unique IPv4 addresses, the highest ever measured, in agreement with recent estimates. Monthly client IPv4 address counts showed constant growth for years prior, but since 2014, the IPv4 count has stagnated while IPv6 counts have grown. Thus, it seems we have entered an era marked by increased complexity, one in which the sole enumeration of active IPv4 addresses is of little use to characterize recent growth of the Internet as a whole. With this observation in mind, we consider new points of view in the study of global IPv4 address activity. Our analysis shows significant churn in active IPv4 addresses: the set of active IPv4 addresses varies by as much as 25% over the course of a year. Second, by looking across the active addresses in a prefix, we are able to identify and attribute activity patterns to networkm restructurings, user behaviors, and, in particular, various address assignment practices. Third, by combining spatio-temporal measures of address utilization with measures of traffic volume, and sampling-based estimates of relative host counts, we present novel perspectives on worldwide IPv4 address activity, including empirical observation of under-utilization in some areas, and complete utilization, or exhaustion, in others.
Proxy Mobile IPv6 (PMIPv6) is an IP mobility protocol. In a PMIPv6 domain, local mobility anchor is involved in control as well as data communication. To ease the load on a mobility anchor and avoid single point of failure, the PMIPv6 standard provides the opportunity of having multiple mobility anchors. In this paper, we propose a Software Defined Networking (SDN) based solution to provide load balancing among mobility anchors, in a SDN based PMIPv6 domain. In the proposed solution, a mobility controller performs acts as a central control entity, and performs load monitoring on the mobility anchors. On detecting the load crossing over a threshold for a certain mobility anchor, the controller moves some traffic from highly loaded mobility anchor to relatively less loaded mobility anchor. Analytical model and primitive performance evaluation of the proposed solution is presented in this paper, which demonstrates 5% and 40% improvement in uplink and downlink traffic disruption periods, respectively
We discuss a key engineering challenge in implementing the Identifier- Locator Network Protocol (ILNP), as described in IRTF Experimental RFCs 6740–6748: enabling legacy applications that use the C sockets API. We have built the first two OS kernel implementations of ILNPv6 (ILNP as a superset of IPv6), in both the Linux OS kernel and the FreeBSD OS kernel. Our evaluation is in comparison with IPv6, in the context of a topical and challenging scenario: host mobility implemented as a purely end-to-end function. Our experiments show that ILNPv6 has excellent potential for deployment using existing IPv6 infrastructure, whilst offering the new properties and functionality of ILNP.
The IETF has developed protocols that promote a healthy IPv4 and IPv6 co-existence. The Happy Eyeballs (HE) algorithm, for instance, prevents bad user experience in situations where IPv6 connectivity is broken. Using an active test (happy) that measures TCP connection establishment times, we evaluate the effects of the HE algorithm. The happy test measures against ALEXA top 10K websites from 80 SamKnows probes connected to dual-stacked networks representing 58 different ASes. Using a 3-years long (2013 - 2016) dataset, we show that TCP connect times to popular websites over IPv6 have considerably improved over time. As of May 2016, 18% of these websites are faster over IPv6 with 91% of the rest at most 1 ms slower. The historical trend shows that only around 1% of the TCP connect times over IPv6 were ever above the HE timer value (300 ms), which leaves around 2% chance for IPv4 to win a HE race towards these websites. As such, 99% of these websites prefer IPv6 connections more than 98% of the time. We show that although absolute TCP connect times (in ms) are not that far apart in both address families, HE with a 300 ms timer value tends to prefer slower IPv6 connections in around 90% of the cases. We show that lowering the HE timer value to 150 ms gives us a margin benefit of 10% while retaining same preference levels over IPv6.
Millions of users worldwide resort to mobile VPN clients to either circumvent censorship or to access geo-blocked content, and more generally for privacy and security purposes. In practice, however, users have little if any guarantees about the corresponding security and privacy settings, and perhaps no practical knowledge about the entities accessing their mobile traffic. In this paper we provide a first comprehensive analysis of 283 Android apps that use the Android VPN permission, which we extracted from a corpus of more than 1.4 million apps on the Google Play store. We perform a number of passive and active measurements designed to investigate a wide range of security and privacy features and to study the behavior of each VPN-based app. Our analysis includes investigation of possible malware presence, third-party library embedding, and traffic manipulation, as well as gauging user perception of the security and privacy of such apps. Our experiments reveal several instances of VPN apps that expose users to serious privacy and security vulnerabilities, such as use of insecure VPN tunneling protocols, as well as IPv6 and DNS traffic leakage. We also report on a number of apps actively performing TLS interception. Of particular concern are instances of apps that inject JavaScript programs for tracking, advertising, and for redirecting e-commerce traffic to external partners.
In today's enterprise networks, there are many ways for a determined attacker to obtain a foothold, bypass current protection technologies, and attack the intended target. Over several years we have developed the Self-shielding Dynamic Network Architecture (SDNA) technology, which prevents an attacker from targeting, entering, or spreading through an enterprise network by adding dynamics that present a changing view of the network over space and time. SDNA was developed with the support of government sponsored research and development and corporate internal resources. The SDNA technology was purchased by Cryptonite, LLC in 2015 and has been developed into a robust product offering called Cryptonite NXT. In this paper, we describe the journey and lessons learned along the course of feasibility demonstration, technology development, security testing, productization, and deployment in a production network.
This paper outlines a set of 10 cyber security concerns associated with Industrial Control Systems (ICS). The concerns address software and hardware development, implementation, and maintenance practices, supply chain assurance, the need for cyber forensics in ICS, a lack of awareness and training, and finally, a need for test beds which can be used to address the first 9 cited concerns. The concerns documented in this paper were developed based on the authors' combined experience conducting research in this field for the US Department of Homeland Security, the National Science Foundation, and the Department of Defense. The second half of this paper documents a virtual test bed platform which is offered as a tool to address the concerns listed in the first half of the paper. The paper discusses various types of test beds proposed in literature for ICS research, provides an overview of the virtual test bed platform developed by the authors, and lists future works required to extend the existing test beds to serve as a development platform.
A major component of modern vehicles is the infotainment system, which interfaces with its drivers and passengers. Other mobile devices, such as handheld phones and laptops, can relay information to the embedded infotainment system through Bluetooth and vehicle WiFi. The ability to extract information from these systems would help forensic analysts determine the general contents that is stored in an infotainment system. Based off the data that is extracted, this would help determine what stored information is relevant to law enforcement agencies and what information is non-essential when it comes to solving criminal activities relating to the vehicle itself. This would overall solidify the Intelligent Transport System and Vehicular Ad Hoc Network infrastructure in combating crime through the use of vehicle forensics. Additionally, determining the content of these systems will allow forensic analysts to know if they can determine anything about the end-user directly and/or indirectly.
Modern OS kernels including Windows, Linux, and Mac OS all have adopted kernel Address Space Layout Randomization (ASLR), which shifts the base address of kernel code and data into different locations in different runs. Consequently, when performing introspection or forensic analysis of kernel memory, we cannot use any pre-determined addresses to interpret the kernel events. Instead, we must derandomize the address space layout and use the new addresses. However, few efforts have been made to derandomize the kernel address space and yet there are many questions left such as which approach is more efficient and robust. Therefore, we present the first systematic study of how to derandomize a kernel when given a memory snapshot of a running kernel instance. Unlike the derandomization approaches used in traditional memory exploits in which only remote access is available, with introspection and forensics applications, we can use all the information available in kernel memory to generate signatures and derandomize the ASLR. In other words, there exists a large volume of solutions for this problem. As such, in this paper we examine a number of typical approaches to generate strong signatures from both kernel code and data based on the insight of how kernel code and data is updated, and compare them from efficiency (in terms of simplicity, speed etc.) and robustness (e.g., whether the approach is hard to be evaded or forged) perspective. In particular, we have designed four approaches including brute-force code scanning, patched code signature generation, unpatched code signature generation, and read-only pointer based approach, according to the intrinsic behavior of kernel code and data with respect to kernel ASLR. We have gained encouraging results for each of these approaches and the corresponding experimental results are reported in this paper.
Social Networking is fundamentally shifting the way we communicate, sharing idea and form opinions. All people try to use social media for there need, people from every age group are involved in social media site or e-commerce site. Nowadays almost every illegal activity is happened using the social network and instant messages. It means that present system is not capable to found all suspicious words. In this paper, we provided a brief description of problem and review on the different framework developed so far. Propose a better system which can be indentify criminal activity through social networking more efficiently. Use Ontology Based Information Extraction (OBIE) technique to identify domain of word and Association Rule mining to generate rules. Heuristic method checks in user database for malicious users according to predefine elements and Naïve Bayes method is use to identify the context behind the message or post. The experimental result is used for further action on victim by cyber crime department.