Biblio

Found 7504 results

Filters: Keyword is Metrics
2017-08-18
Armitage, William D., Gauvin, William, Sheffield, Adam.  2016.  Design and Launch of an Intensive Cybersecurity Program for Military Veterans. Proceedings of the 17th Annual Conference on Information Technology Education. :40–45.

The demand for trained cybersecurity operators is growing more quickly than traditional programs in higher education can fill. At the same time, unemployment for returning military veterans has become a nationally discussed problem. We describe the design and launch of New Skills for a New Fight (NSNF), an intensive, one-year program to train military veterans for the cybersecurity field. This non-traditional program, which leverages experience that veterans gained in military service, includes recruitment and selection, a base of knowledge in the form of four university courses in a simultaneous cohort mode, a period of hands-on cybersecurity training, industry certifications and a practical internship in a Security Operations Center (SOC). Twenty veterans entered this pilot program in January of 2016, and will complete in less than a year's time. Initially funded by a global financial services company, the program provides veterans with an expense-free preparation for an entry-level cybersecurity job.

Blair, Jean, Sobiesk, Edward, Ekstrom, Joseph J., Parrish, Allen.  2016.  What is Information Technology's Role in Cybersecurity? Proceedings of the 17th Annual Conference on Information Technology Education. :46–47.

This panel will discuss and debate what role(s) the information technology discipline should have in cybersecurity. Diverse viewpoints will be considered including current and potential ACM curricular recommendations, current and potential ABET and NSA accreditation criteria, the emerging cybersecurity discipline(s), consideration of government frameworks, the need for a multi-disciplinary approach to cybersecurity, and what aspects of cybersecurity should be under information technology's purview.

Burley, Diana, Bishop, Matt, Hawthorne, Elizabeth, Kaza, Siddharth, Buck, Scott, Futcher, Lynn.  2016.  Special Session: ACM Joint Task Force on Cyber Education. Proceedings of the 47th ACM Technical Symposium on Computing Science Education. :234–235.

In this special session, members of the ACM Joint Task Force on Cyber Education to Develop Undergraduate Curricular Guidance will provide an overview of the task force mission, objectives, and work plan. After the overview, task force members will engage session participants in the curricular development process.

Lakhdhar, Yosra, Rekhis, Slim, Boudriga, Noureddine.  2016.  An Approach To A Graph-Based Active Cyber Defense Model. Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi Media. :261–268.

Securing cyber systems is a major concern as security attacks become more and more sophisticated. In this paper we develop a novel graph-based Active Cyber Defense (ACD) model to proactively respond to cyber attacks. The proposed model is based on the use of a semantically rich graph to describe cyber systems, the types of interconnections between them, and security-related data useful for developing active defense strategies. The developed model takes into consideration the probabilistic nature of cyber attacks and their degree of complexity. In this context, analytics are provided to proactively test the impact of increases in vulnerabilities and threats on the system, analyze the consequent behavior of cyber systems and security solutions, and decide on the security state of the whole cyber system. Our model integrates in the same framework decisions made by cyber defenders based on their expertise and knowledge, and decisions that are automatically generated using security analytic rules.

Ji, Shouling, Li, Weiqing, Srivatsa, Mudhakar, He, Jing Selena, Beyah, Raheem.  2016.  General Graph Data De-Anonymization: From Mobility Traces to Social Networks. ACM Trans. Inf. Syst. Secur. 18:12:1–12:29.

When people use social applications and services, their privacy faces a potentially serious threat. In this article, we present a novel, robust, and effective de-anonymization attack on mobility trace data and social data. First, we design a Unified Similarity (US) measurement, which takes into account local and global structural characteristics of the data, information obtained from auxiliary data, and knowledge inherited from ongoing de-anonymization results. By analyzing the measurement on real datasets, we find that some data can potentially be de-anonymized accurately while the rest can be de-anonymized only at a coarse granularity. Utilizing this property, we present a US-based De-Anonymization (DA) framework, which iteratively de-anonymizes data with an accuracy guarantee. Then, to de-anonymize large-scale data without knowledge of the overlap size between the anonymized data and the auxiliary data, we generalize DA to an Adaptive De-Anonymization (ADA) framework. By smartly working on two core matching subgraphs, ADA achieves high de-anonymization accuracy and reduces computational overhead. Finally, we examine the presented de-anonymization attack on three well-known mobility traces, St Andrews, Infocom06, and Smallblue, and three social datasets, ArnetMiner, Google+, and Facebook. The experimental results demonstrate that the presented de-anonymization framework is very effective and robust to noise. The source code and employed datasets are now publicly available at SecGraph [2015].

Pei, Kexin, Gu, Zhongshu, Saltaformaggio, Brendan, Ma, Shiqing, Wang, Fei, Zhang, Zhiwei, Si, Luo, Zhang, Xiangyu, Xu, Dongyan.  2016.  HERCULE: Attack Story Reconstruction via Community Discovery on Correlated Log Graph. Proceedings of the 32nd Annual Conference on Computer Security Applications. :583–595.

Advanced cyber attacks consist of multiple stages aimed at being stealthy and elusive. Such attack patterns leave their footprints spatio-temporally dispersed across many different logs in victim machines. However, existing log-mining intrusion analysis systems typically target only a single type of log to discover evidence of an attack and therefore fail to exploit fundamental inter-log connections. The output of such single-log analysis can hardly reveal the complete attack story for complex, multi-stage attacks. Additionally, some existing approaches require heavyweight system instrumentation, which makes them impractical to deploy in real production environments. To address these problems, we present HERCULE, an automated multi-stage log-based intrusion analysis system. Inspired by graph analytics research in social network analysis, we model multi-stage intrusion analysis as a community discovery problem. HERCULE builds multi-dimensional weighted graphs by correlating log entries across multiple lightweight logs that are readily available on commodity systems. From these, HERCULE discovers any "attack communities" embedded within the graphs. Our evaluation with 15 well known APT attack families demonstrates that HERCULE can reconstruct attack behaviors from a spectrum of cyber attacks that involve multiple stages with high accuracy and low false positive rates.
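
The core correlation-and-clustering idea can be sketched in a few lines. The following Python fragment is only an illustration under assumed inputs: it links toy log entries that share a field such as an IP address or process ID into a weighted graph and extracts communities with an off-the-shelf modularity algorithm; HERCULE's actual log parsers, edge weights, and community-discovery method are not reproduced here.

    # Illustrative sketch only: correlate entries from different logs into one
    # weighted graph and extract "communities" of related events.
    # The log formats, shared fields, and weights are hypothetical simplifications.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy log entries from three different logs on the same machine.
    log_entries = [
        {"id": "fw1",  "log": "firewall", "ip": "10.0.0.5", "pid": None},
        {"id": "fw2",  "log": "firewall", "ip": "10.0.0.9", "pid": None},
        {"id": "ps1",  "log": "process",  "ip": "10.0.0.5", "pid": 4242},
        {"id": "ps2",  "log": "process",  "ip": None,       "pid": 4242},
        {"id": "dns1", "log": "dns",      "ip": "10.0.0.9", "pid": None},
    ]

    G = nx.Graph()
    G.add_nodes_from(e["id"] for e in log_entries)

    # Connect two entries whenever they share a non-empty field value;
    # the edge weight counts how many fields they share.
    for i, a in enumerate(log_entries):
        for b in log_entries[i + 1:]:
            shared = sum(1 for f in ("ip", "pid")
                         if a[f] is not None and a[f] == b[f])
            if shared:
                G.add_edge(a["id"], b["id"], weight=shared)

    # Each community groups log entries that plausibly belong to one activity.
    for community in greedy_modularity_communities(G, weight="weight"):
        print(sorted(community))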

Cook, Kyle, Shaw, Thomas, Hawrylak, Peter, Hale, John.  2016.  Scalable Attack Graph Generation. Proceedings of the 11th Annual Cyber and Information Security Research Conference. :21:1–21:4.

Attack graphs are a powerful modeling technique with which to explore the attack surface of a system. However, they can be difficult to generate due to the exponential growth of the state space, which often makes exhaustive search impractical. This paper discusses an approach for generating large attack graphs with an emphasis on scalable generation over a distributed system. First, a serial algorithm is presented, highlighting bottlenecks and opportunities to exploit inherent concurrency in the generation process. Then a strategy to parallelize this process is presented. Finally, we discuss plans for future work to implement the parallel algorithm using a hybrid distributed/shared memory programming model on a heterogeneous compute node cluster.
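
As a point of reference for the serial baseline, the sketch below shows exhaustive attack graph generation as a breadth-first expansion of attacker states. The exploits, preconditions, and postconditions are invented toy values, and none of the paper's parallelization strategy is reflected here.

    # Minimal sketch of exhaustive attack graph generation as breadth-first
    # state-space expansion. States and exploits are toy values; the paper's
    # serial and parallel algorithms are not reproduced.
    from collections import deque

    # An exploit fires when its preconditions hold and adds its postcondition.
    exploits = [
        {"pre": {"net_access"},                "post": "user_on_web"},
        {"pre": {"user_on_web"},               "post": "root_on_web"},
        {"pre": {"root_on_web", "net_access"}, "post": "user_on_db"},
    ]

    initial = frozenset({"net_access"})
    states, edges = {initial}, []
    queue = deque([initial])

    while queue:                       # classic BFS over attacker states
        state = queue.popleft()
        for ex in exploits:
            if ex["pre"] <= state and ex["post"] not in state:
                nxt = frozenset(state | {ex["post"]})
                edges.append((state, ex["post"], nxt))
                if nxt not in states:  # the visited set bounds the blow-up
                    states.add(nxt)
                    queue.append(nxt)

    print(len(states), "states,", len(edges), "transitions")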

2017-08-02
Chabanne, Hervé, Keuffer, Julien, Lescuyer, Roch.  2016.  Study of a Verifiable Biometric Matching. Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security. :183–184.

In this paper, we apply verifiable computing techniques to biometric matching. The purpose of verifiable computation is to return the result of a computation along with a proof that the calculations were correctly performed. We adapt the sum-check protocol and present a system that performs verifiable biometric matching in the setting of fast border control. This is a work in progress, and we focus on verifying an inner product. We then give some experimental results from its implementation. Verifiable computation here strengthens the authentication phase by bringing into the process a proof that the biometric verification has been correctly performed.
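
To make the verified inner product concrete, the following self-contained Python sketch runs the sum-check protocol over the multilinear extensions of two vectors in a small prime field. It is a toy of the protocol mechanics only: the verifier's final check uses the raw vectors, whereas a verifiable biometric matcher would rely on commitments, and nothing here reflects the authors' border-control system.

    # Toy sum-check proof that S = <a, b> over a prime field, using multilinear
    # extensions of a and b. Illustrative only: a real verifiable-computation
    # system would check the final evaluations against commitments, not raw data.
    import random

    P = 2_147_483_647            # prime modulus (2^31 - 1)
    K = 3                        # vectors have length 2^K
    N = 1 << K

    a = [random.randrange(P) for _ in range(N)]
    b = [random.randrange(P) for _ in range(N)]
    claim = sum(x * y for x, y in zip(a, b)) % P   # prover's claimed inner product

    def fold(tab, r):
        """Fix the current variable of a multilinear-extension table to r."""
        half = len(tab) // 2
        return [((1 - r) * tab[j] + r * tab[j + half]) % P for j in range(half)]

    def mle_eval(vec, rs):
        """Verifier's independent evaluation of the multilinear extension at rs."""
        tab = vec
        for r in rs:
            tab = fold(tab, r)
        return tab[0]

    A, B, current, rs = a[:], b[:], claim, []
    inv2 = pow(2, P - 2, P)
    for _ in range(K):
        half = len(A) // 2
        # Prover: evaluate the degree-2 round polynomial g at 0, 1, 2.
        g = []
        for t in (0, 1, 2):
            s = 0
            for j in range(half):
                s += (((1 - t) * A[j] + t * A[j + half]) *
                      ((1 - t) * B[j] + t * B[j + half]))
            g.append(s % P)
        # Verifier: consistency check g(0) + g(1) == current claim.
        assert (g[0] + g[1]) % P == current
        r = random.randrange(P)
        rs.append(r)
        # New claim is g(r), with g interpolated from its values at 0, 1, 2:
        # g(r) = g0*(r-1)(r-2)/2 - g1*r(r-2) + g2*r(r-1)/2   (Lagrange, mod P).
        current = (g[0] * (r - 1) * (r - 2) % P * inv2 -
                   g[1] * r * (r - 2) +
                   g[2] * r * (r - 1) % P * inv2) % P
        A, B = fold(A, r), fold(B, r)

    # Final check: the claim has been reduced to A(rs) * B(rs).
    assert current == mle_eval(a, rs) * mle_eval(b, rs) % P
    print("inner product verified:", claim)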

Sapkal, Shubhangi, Deshmukh, R. R..  2016.  Biometric Template Protection with Fuzzy Vault and Fuzzy Commitment. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies. :60:1–60:6.

Conventional security methods such as passwords and ID cards are now rapidly being replaced by biometrics for identifying a person. Biometrics uses physiological or behavioral characteristics of a person. The use of biometrics raises critical privacy and security concerns that, due to the noisy nature of biometrics, cannot be addressed with standard cryptographic methods. The loss of an enrollment biometric to an attacker is a security hazard because it may allow the attacker to gain unauthorized access to the system. A biometric template can be stolen, and an intruder can then access the biometric system using fake input. Hence, it becomes essential to design a biometric system with a secure template, so that if the template in an application is compromised, the biometric signal itself is not lost forever and a new template can be issued. One way is to combine biometrics with cryptography, or to use transformed data instead of the original biometric template. Traditional cryptographic methods, however, are not directly applicable to biometrics because of intra-class variation. Biometric cryptosystems can apply fuzzy vault, fuzzy commitment, helper data, and secure sketch techniques, whereas cancelable biometrics uses distorting transforms, Bio-Hashing, and Bio-Encoding techniques. In this paper, a biometric cryptosystem is presented with fuzzy vault and fuzzy commitment techniques for a fingerprint recognition system.
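
As a rough illustration of the fuzzy commitment primitive mentioned above (not the paper's fingerprint system), the sketch below XORs a random codeword with the biometric bit string at enrollment and lets a toy error-correcting code absorb the intra-class variation at verification. The repetition code, bit lengths, and hash choice are arbitrary stand-ins.

    # Minimal fuzzy-commitment sketch with a 3x repetition code standing in for a
    # real error-correcting code; bit lengths and the hash choice are arbitrary.
    import hashlib
    import secrets

    R = 3                                  # repetition factor of the toy ECC

    def encode(bits):                      # repetition-code encoder
        return [b for b in bits for _ in range(R)]

    def decode(bits):                      # majority-vote decoder
        return [int(sum(bits[i:i + R]) > R // 2) for i in range(0, len(bits), R)]

    def enroll(biometric_bits):
        key = [secrets.randbelow(2) for _ in range(len(biometric_bits) // R)]
        codeword = encode(key)
        helper = [c ^ b for c, b in zip(codeword, biometric_bits)]   # commitment
        tag = hashlib.sha256(bytes(key)).hexdigest()                 # stored hash
        return helper, tag                 # neither reveals the biometric directly

    def verify(helper, tag, noisy_bits):
        codeword = [h ^ b for h, b in zip(helper, noisy_bits)]       # noisy codeword
        key = decode(codeword)             # the ECC absorbs intra-class variation
        return hashlib.sha256(bytes(key)).hexdigest() == tag

    template = [secrets.randbelow(2) for _ in range(30)]             # enrollment scan
    helper, tag = enroll(template)
    probe = template[:]                    # fresh scan of the same finger
    probe[4] ^= 1                          # with a single flipped bit
    print(verify(helper, tag, probe))      # True: the repetition code corrects it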

Dürmuth, Markus, Oswald, David, Pastewka, Niklas.  2016.  Side-Channel Attacks on Fingerprint Matching Algorithms. Proceedings of the 6th International Workshop on Trustworthy Embedded Devices. :3–13.

Biometric authentication schemes are frequently used to establish the identity of a user. Often, a trusted hardware device is used to decide whether a provided biometric feature is sufficiently close to the features stored by the legitimate user during enrollment. In this paper, we address the question of whether the stored features can be extracted with side-channel attacks. We consider several models for the types of leakage that are relevant specifically to fingerprint verification, and show results for attacks against the Bozorth3 and a custom matching algorithm. This work shows an interesting path for future research on the susceptibility of biometric algorithms to side-channel attacks.

Xue, Wanli, Luo, Chengwen, Rana, Rajib, Hu, Wen, Seneviratne, Aruna.  2016.  CScrypt: A Compressive-Sensing-Based Encryption Engine for the Internet of Things: Demo Abstract. Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM. :286–287.

The Internet of Things (IoT) connects the physical world seamlessly and provides tremendous opportunities to a wide range of applications. However, potential risks exist when an IoT system collects local sensor data and uploads it to the Cloud. Private data leakage can be severe with a curious database administrator or malicious hackers who compromise the Cloud. In this demo, we address the problem of guaranteeing user data privacy and security using a compressive-sensing-based cryptographic method. We present CScrypt, a compressive-sensing-based encryption engine for Cloud-enabled IoT systems that secures the interaction between IoT devices and the Cloud. Our system exploits the fact that each individual's biometric data can be trained into a unique dictionary, which can be used as an encryption key while simultaneously compressing the original data. We will demonstrate a functioning prototype of our system using a live data stream at the conference.
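
A minimal sketch of the underlying principle, under several simplifying assumptions: a secret key seeds the random sensing matrix, so compressive measurements double as ciphertext, and only a holder of the key can rebuild the matrix and recover the sparse signal. The user-specific learned dictionary of CScrypt is replaced here by a canonical basis, recovery uses a small orthogonal matching pursuit loop, and all dimensions are arbitrary.

    # Sketch of compressive-sensing-based encryption: a secret key seeds the
    # random sensing matrix, so measurements double as ciphertext. Dimensions,
    # sparsity, and the identity sparsifying basis are arbitrary simplifications.
    import numpy as np

    n, m, k = 64, 32, 4                     # signal length, measurements, sparsity

    def sensing_matrix(key):                # only key holders can rebuild it
        rng = np.random.default_rng(key)
        return rng.standard_normal((m, n)) / np.sqrt(m)

    def encrypt(x, key):
        return sensing_matrix(key) @ x      # compression and "encryption" in one step

    def decrypt(y, key):                    # orthogonal matching pursuit recovery
        A = sensing_matrix(key)
        support, residual = [], y.copy()
        for _ in range(k):
            support.append(int(np.argmax(np.abs(A.T @ residual))))
            coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
            residual = y - A[:, support] @ coef
        x_hat = np.zeros(n)
        x_hat[support] = coef
        return x_hat

    rng = np.random.default_rng(1)
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # sparse sensor signal
    y = encrypt(x, key=1234)
    print(np.allclose(decrypt(y, key=1234), x, atol=1e-6))        # True with the right key
    print(np.allclose(decrypt(y, key=9999), x, atol=1e-6))        # False with a wrong key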

Khalaf, Emad Taha, Mohammed, Muamer N., Sulaiman, Norrozila.  2016.  Iris Template Protection Based on Enhanced Hill Cipher. Proceedings of the 2016 International Conference on Communication and Information Systems. :53–57.

Biometrics is used to identify an authorized person based on specific physiological or behavioral features. Template protection is a crucial requirement when designing an authentication system, since the template could be modified by an attacker. The Hill Cipher, a block cipher and symmetric-key algorithm with several advantages such as simplicity, high speed, and high throughput, can be used to protect a biometric template. Unfortunately, the Hill Cipher also has some disadvantages: it operates on small block sizes, it is very simple and thus vulnerable to exhaustive key search and known-plaintext attacks, and the key matrix must be invertible. This paper proposes an enhancement to overcome these drawbacks of the Hill Cipher by using a large, random key with large data blocks, besides overcoming the invertible key matrix problem. The efficiency of the encryption has been evaluated using the Normalized Correlation Coefficient (NCC) and running time.
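
For readers unfamiliar with the baseline scheme, a minimal classical Hill cipher over the 26-letter alphabet is sketched below with a textbook 2x2 key. The paper's enhancement (large random keys, large data blocks, and the workaround for the invertible key matrix requirement) is not reproduced.

    # Minimal classical Hill cipher over a 26-letter alphabet, illustrating the
    # scheme the paper enhances. The 2x2 key is a textbook example; plaintext
    # length must be a multiple of the block size.
    import numpy as np

    M = 26
    KEY = np.array([[3, 3],
                    [2, 5]])                      # must be invertible mod 26

    def inv_key(key):
        det = int(round(np.linalg.det(key))) % M
        det_inv = pow(det, -1, M)                 # fails if gcd(det, 26) != 1
        adj = np.array([[ key[1, 1], -key[0, 1]],
                        [-key[1, 0],  key[0, 0]]])
        return (det_inv * adj) % M

    def hill(text, key):
        nums = [ord(c) - ord('A') for c in text]
        out = []
        for i in range(0, len(nums), 2):          # encrypt block by block
            block = np.array(nums[i:i + 2])
            out.extend((key @ block) % M)
        return ''.join(chr(int(v) + ord('A')) for v in out)

    cipher = hill("HELP", KEY)
    plain = hill(cipher, inv_key(KEY))            # decryption uses the inverse key
    print(cipher, plain)                          # -> HIAT HELP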

Gong, Neil Zhenqiang, Payer, Mathias, Moazzezi, Reza, Frank, Mario.  2016.  Forgery-Resistant Touch-based Authentication on Mobile Devices. Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security. :499–510.

Mobile devices store a diverse set of private user data and have gradually become a hub for controlling users' other personal Internet-of-Things devices. Access control on mobile devices is therefore highly important. The widely accepted solution is to protect access by asking for a password. However, password authentication is tedious, e.g., a user needs to input a password every time she wants to use the device. Moreover, existing biometrics such as face, fingerprint, and touch behaviors are vulnerable to forgery attacks. We propose a new touch-based biometric authentication system that is passive and secure against forgery attacks. In our touch-based authentication, a user's touch behaviors are a function of some random "secret". The user can subconsciously know the secret while touching the device's screen, but an attacker cannot know it at the time of attack, which makes it challenging to perform forgery attacks even if the attacker has already obtained the user's touch behaviors. We evaluate our touch-based authentication system by collecting data from 25 subjects. Results are promising: the random secrets do not influence user experience, and, for targeted forgery attacks, our system achieves Equal Error Rates (EERs) that are 0.18 smaller than those of previous touch-based authentication systems.
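
The Equal Error Rate figure can be made concrete with a small computation: sweep a decision threshold over genuine and impostor similarity scores and locate the point where the false accept and false reject rates cross. The scores below are synthetic placeholders, not data from the study.

    # Computing an Equal Error Rate (EER) from synthetic genuine/impostor scores.
    # Real systems would use behavioral-feature similarity scores instead.
    import numpy as np

    rng = np.random.default_rng(0)
    genuine  = rng.normal(0.8, 0.1, 1000)    # similarity scores for the true user
    impostor = rng.normal(0.5, 0.1, 1000)    # scores for forgery attempts

    def far_frr(threshold):
        far = np.mean(impostor >= threshold)  # impostors wrongly accepted
        frr = np.mean(genuine < threshold)    # legitimate user wrongly rejected
        return far, frr

    thresholds = np.linspace(0, 1, 1001)
    rates = np.array([far_frr(t) for t in thresholds])
    eer_idx = np.argmin(np.abs(rates[:, 0] - rates[:, 1]))
    print("EER ~", (rates[eer_idx, 0] + rates[eer_idx, 1]) / 2,
          "at threshold", thresholds[eer_idx])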

Jagadiswary, D., Saraswady, D..  2016.  Multimodal Biometric Fusion Using Image Encryption Algorithm. Proceedings of the International Conference on Informatics and Analytics. :46:1–46:5.

As India is being digitized through the Digital India initiative, the most basic unique identity for each individual is biometrics. Since India is the second most populous nation, the database that has to be maintained is enormous. Shielding that information with present techniques has been called into question. This problem can be overcome by using cryptographic algorithms in combination with biometrics. Hence, the proposed system is developed by combining multimodal biometrics (fingerprint, retina, finger vein) with a cryptographic algorithm, achieving a Genuine Acceptance Rate of 94%, a False Acceptance Rate of 1.46%, and a False Rejection Rate of 1.07%.

Puri, Gurjeet Singh, Gupta, Himanshu.  2016.  ID Based Encryption in Modern Cryptography. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies. :15:1–15:5.

Nowadays, ATMs are used for money transactions for the convenience of users by providing round-the-clock, 24/7 financial services. The bank provides a debit or credit card to its user along with a particular PIN number (known only to the bank and the user). Sometimes a user's card may be stolen, and the thief can then access confidential information such as the credit card number, cardholder name, expiry date, and CVV number, with which he or she can complete fraudulent transactions. In this paper, we introduce biometric encryption of the eye retina to enhance security over wireless and unreliable networks such as the Internet. In this method, the user can authorize a third person to make a transaction on his or her behalf using the debit or credit card. In the proposed method, the third person performs the financial transaction by providing his or her eye retina for authorization and identification.

Matsuki, Tatsuma, Matsuoka, Naoki.  2016.  A Resource Contention Analysis Framework for Diagnosis of Application Performance Anomalies in Consolidated Cloud Environments. Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering. :173–184.

Cloud services have made large contributions to the agile development and rapid revision of various applications. However, the performance of these applications is still one of the largest concerns for developers. Although many performance analysis frameworks have been created, most of them are not efficient for rapid application revisions because they require performance models, which may have to be remodeled whenever application revisions occur. We propose an analysis framework for the diagnosis of application performance anomalies. We designed our framework so that it does not require any performance models, making it efficient under rapid application revisions. It investigates the Pearson correlation and association rules between system metrics and application performance. Association rules are widely used in data mining to find relations between variables in databases. We demonstrated through an experiment and testing on a real data set that our framework can select causal metrics even when the metrics are temporally correlated, which reduces the false negatives obtained from cause diagnosis. We evaluated our framework from the perspective of the expected remaining diagnostic costs of framework users. The results indicate that it is expected to reduce diagnostic costs by up to 84.8%, compared with a method that only uses the Pearson correlation.
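
One ingredient of the framework, ranking system metrics by how strongly they correlate with application performance, can be illustrated with a toy computation. The metric names and data below are synthetic, and the association-rule mining and temporal-correlation handling described in the paper are omitted.

    # Toy version of one ingredient of the framework: rank system metrics by the
    # Pearson correlation of their time series with application response time.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    samples = 200
    response_time = rng.normal(100, 5, samples)

    metrics = {
        "cpu_util": response_time * 0.8 + rng.normal(0, 4, samples),  # causal-ish
        "disk_io":  rng.normal(50, 10, samples),                      # unrelated
        "net_rx":   rng.normal(30, 3, samples),                       # unrelated
    }

    ranked = sorted(((abs(pearsonr(series, response_time)[0]), name)
                     for name, series in metrics.items()), reverse=True)
    for corr, name in ranked:
        print(f"{name:10s} |r| = {corr:.2f}")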

Guo, Qi, Song, Yang.  2016.  Large-Scale Analysis of Viewing Behavior: Towards Measuring Satisfaction with Mobile Proactive Systems. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :579–588.

Recently, proactive systems such as Google Now and Microsoft Cortana have become increasingly popular in reforming the way users access information on mobile devices. In these systems, relevant content is presented to users based on their context without a query in the form of information cards that do not require a click to satisfy the users. As a result, prior approaches based on clicks cannot provide reliable measurements of user satisfaction with such systems. It is also unclear how much of the previous findings regarding good abandonment with reactive Web searches can be applied to these proactive systems due to the intrinsic difference in user intent, the greater variety of content types and their presentations. In this paper, we present the first large-scale analysis of viewing behavior based on the viewport (the visible fraction of a Web page) of the mobile devices, towards measuring user satisfaction with the information cards of the mobile proactive systems. In particular, we identified and analyzed a variety of factors that may influence the viewing behavior, including biases from ranking positions, the types and attributes of the information cards, and the touch interactions with the mobile devices. We show that by modeling the various factors we can better measure user satisfaction with the mobile proactive systems, enabling stronger statistical power in large-scale online A/B testing.

Stauffert, Jan-Philipp, Niebling, Florian, Latoschik, Marc Erich.  2016.  Towards Comparable Evaluation Methods and Measures for Timing Behavior of Virtual Reality Systems. Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology. :47–50.

A low latency is a fundamental timeliness requirement to reduce the potential risks of cyber sickness and to increase the effectiveness, efficiency, and user experience of Virtual Reality systems. The effects of uniform latency degradation based on mean or worst-case values are well researched. In contrast, the effects of latency jitter, i.e., the distribution pattern of latency changes over time, have largely been ignored so far, although today's consumer VR systems are extremely vulnerable in this respect. We investigate the applicability of the Walsh, generalized ESD, and modified z-score tests for the detection of outliers as one central aspect of the latency distribution. The tests are applied to well-defined test cases mimicking typical timing behavior expected from today's concurrent architectures. We introduce accompanying graphical visualization methods to inspect, analyze, and communicate the latency behavior of VR systems beyond simple mean or worst-case values. As a result, we propose a stacked modified z-score test for more detailed analysis.
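
The modified z-score test the authors adopt has a compact form: an observation is flagged when 0.6745 times its deviation from the median, divided by the median absolute deviation, exceeds a cutoff (3.5 is the commonly recommended default, not a value taken from the paper). A small sketch on synthetic frame-latency samples:

    # Modified z-score outlier test (Iglewicz & Hoaglin) applied to synthetic
    # frame-latency samples; the 3.5 cutoff is a common default.
    import numpy as np

    def modified_z_scores(samples):
        med = np.median(samples)
        mad = np.median(np.abs(samples - med))        # median absolute deviation
        return 0.6745 * (samples - med) / mad

    latencies_ms = np.array([11.2, 11.5, 10.9, 11.1, 11.4, 25.0, 11.3, 11.0, 30.2])
    scores = modified_z_scores(latencies_ms)
    outliers = latencies_ms[np.abs(scores) > 3.5]     # latency spikes (jitter)
    print(outliers)                                   # -> [25.  30.2]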

Niedermayr, Rainer, Juergens, Elmar, Wagner, Stefan.  2016.  Will My Tests Tell Me if I Break This Code? Proceedings of the International Workshop on Continuous Software Evolution and Delivery. :23–29.

Automated tests play an important role in software evolution because they can rapidly detect faults introduced during changes. In practice, code-coverage metrics are often used as criteria to evaluate the effectiveness of test suites with focus on regression faults. However, code coverage only expresses which portion of a system has been executed by tests, but not how effective the tests actually are in detecting regression faults. Our goal was to evaluate the validity of code coverage as a measure for test effectiveness. To do so, we conducted an empirical study in which we applied an extreme mutation testing approach to analyze the tests of open-source projects written in Java. We assessed the ratio of pseudo-tested methods (those tested in a way such that faults would not be detected) to all covered methods and judged their impact on the software project. The results show that the ratio of pseudo-tested methods is acceptable for unit tests but not for system tests (that execute large portions of the whole system). Therefore, we conclude that the coverage metric is only a valid effectiveness indicator for unit tests.

Hirzel, Matthias, Klaeren, Herbert.  2016.  Code Coverage for Any Kind of Test in Any Kind of Transcompiled Cross-platform Applications. Proceedings of the 2nd International Workshop on User Interface Test Automation. :1–10.

Code coverage is a widely used measure of how thoroughly an application is tested. There are many tools available for different languages. However, to the best of our knowledge, most of them focus on unit testing and ignore end-to-end tests such as UI or web tests. Furthermore, there is no support for determining the code coverage of transcompiled cross-platform applications. This kind of application is written in one language but compiled to and executed in a different programming language, and it may also run on a different platform. In this paper, we propose a new code coverage testing method that calculates the code coverage of any kind of test (unit, integration, or UI/web test) for any type of (transcompiled) application (desktop, web, or mobile). Developers obtain information about which parts of the source code are uncovered by tests. The basis of our approach is generic and may be applied to numerous programming languages based on an abstract syntax tree. We present our approach for applications developed in Java and evaluate our tool on a web application created with the Google Web Toolkit, on standard desktop applications, and on some small Java applications that use the Swing library to create user interfaces. Our results show that our tool is able to judge the code coverage of any kind of test. In particular, our tool is independent of the unit or UI/web test framework in use. The runtime performance is promising, although it is not as fast as existing tools in the area of unit testing.

Gao, Ning, Bagdouri, Mossaab, Oard, Douglas W..  2016.  Pearson Rank: A Head-Weighted Gap-Sensitive Score-Based Correlation Coefficient. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. :941–944.

One way of evaluating the reusability of a test collection is to determine whether removing the unique contributions of some system would alter the preference order between that system and others. Rank correlation measures such as Kendall's tau are often used for this purpose. Rank correlation measures are appropriate for ordinal measures in which only preference order is important, but many evaluation measures produce system scores in which both the preference order and the magnitude of the score difference are important. Such measures are referred to as interval. Pearson's rho offers one way in which correlation can be computed over results from an interval measure such that smaller errors in the gap size are preferred. When seeking to improve over existing systems, we care most about comparisons among the best systems. For that purpose we prefer head-weighted measures such as tau_AP, which is designed for ordinal data. No existing head-weighted measure fully leverages the information present in interval effectiveness measures. This paper introduces such a measure, referred to as Pearson Rank.
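
Pearson Rank itself is not reproduced here, but the contrast it builds on is easy to demonstrate: Kendall's tau looks only at the ordering of system scores, while Pearson's rho is also sensitive to the sizes of the gaps. The scores below are made up for illustration.

    # Contrasting an ordinal correlation (Kendall's tau) with an interval one
    # (Pearson's rho) on made-up system effectiveness scores from a full and a
    # reduced test collection. Pearson Rank itself is not implemented here.
    from scipy.stats import kendalltau, pearsonr

    full_collection    = [0.61, 0.58, 0.44, 0.43, 0.30]   # "ground truth" scores
    reduced_collection = [0.60, 0.59, 0.35, 0.34, 0.29]   # same order, distorted gaps

    tau, _ = kendalltau(full_collection, reduced_collection)
    rho, _ = pearsonr(full_collection, reduced_collection)
    print(f"Kendall tau = {tau:.2f}")   # 1.00: the preference order is preserved
    print(f"Pearson rho = {rho:.2f}")   # < 1: gap distortions are penalized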

Schroeder, Jan, Berger, Christian, Staron, Miroslaw, Herpel, Thomas, Knauss, Alessia.  2016.  Unveiling Anomalies and Their Impact on Software Quality in Model-based Automotive Software Revisions with Software Metrics and Domain Experts. Proceedings of the 25th International Symposium on Software Testing and Analysis. :154–164.

The validation of simulation models (e.g., of electronic control units for vehicles) in industry is becoming increasingly challenging due to their growing complexity. To systematically assess the quality of such models, software metrics seem to be promising. In this paper we explore the use of software metrics and outlier analysis as a means to assess the quality of model-based software. More specifically, we investigate how results from regression analysis applied to measurement data received from size and complexity metrics can be mapped to software quality. Using the moving averages approach, models were fit to data received from over 65,000 software revisions for 71 simulation models that represent different electronic control units of real premium vehicles. Consecutive investigations using studentized deleted residuals and Cook’s Distance revealed outliers among the measurements. From these outliers we identified a subset, which provides meaningful information (anomalies) by comparing outlier scores with expert opinions. Eight engineers were interviewed separately for outlier impact on software quality. Findings were validated in consecutive workshops. The results show correlations between outliers and their impact on four of the considered quality characteristics. They also demonstrate the applicability of this approach in industry.
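
The outlier-screening step can be illustrated in a few lines: fit a simple trend to a metric measured over revisions and flag revisions whose Cook's Distance is unusually large. The plain least-squares fit, the synthetic data, and the 4/n cutoff below are illustrative stand-ins for the paper's moving-averages models and expert validation.

    # Illustrative Cook's Distance screening for a metric measured over software
    # revisions. A least-squares trend stands in for the paper's moving-averages
    # fit, and the 4/n cutoff is only a common rule of thumb.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 60
    revision = np.arange(n, dtype=float)
    complexity = 100 + 0.5 * revision + rng.normal(0, 2, n)
    complexity[40] += 25                                  # an injected anomaly

    X = np.column_stack([np.ones(n), revision])           # design matrix
    beta, *_ = np.linalg.lstsq(X, complexity, rcond=None)
    residuals = complexity - X @ beta
    p = X.shape[1]
    mse = residuals @ residuals / (n - p)
    H = X @ np.linalg.inv(X.T @ X) @ X.T                  # hat matrix
    h = np.diag(H)

    cooks_d = residuals**2 / (p * mse) * h / (1 - h)**2
    print(np.where(cooks_d > 4 / n)[0])                   # revision 40 should appear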

Bondi, André B..  2016.  Challenges with Applying Performance Testing Methods for Systems Deployed on Shared Environments with Indeterminate Competing Workloads: Position Paper. Companion Publication for ACM/SPEC on International Conference on Performance Engineering. :41–44.

There is a tendency to move production environments from corporate-owned data centers to cloud-based services. Users who do not maintain a private production environment might not wish to maintain a private performance test environment either. The application of performance engineering methods to the development and delivery of software systems is complicated when the form and/or parameters of the target deployment environment cannot be controlled or determined. The difficulty of diagnosing the causes of performance issues during testing or production may be increased by the presence of highly variable workloads on the target platform that compete with the application of interest for resources in ways that might be hard to determine. In particular, performance tests might be conducted in virtualized environments that introduce factors influencing customer-affecting metrics (such as transaction response time) and observed resource usage. Observed resource usage metrics in virtualized environments can have different meanings from those in a native environment, and virtual machines may suffer delays in execution. We explore factors that exacerbate these complications. We argue that these complexities reinforce the case for rigorously using software performance engineering methods rather than diminishing it. We also explore possible performance testing methods for mitigating the risk associated with these complexities.

Menninghaus, Mathias, Pulvermüller, Elke.  2016.  Towards Using Code Coverage Metrics for Performance Comparison on the Implementation Level. Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering. :101–104.

The development process for new algorithms or data structures often begins with the analysis of benchmark results to identify the drawbacks of existing implementations. It ends with the comparison of old and new implementations using one or more well-established benchmarks. But however relevant, reproducible, fair, verifiable, and usable those benchmarks may be, they have certain drawbacks. On the one hand, a new implementation may be biased to provide good results on a specific benchmark. On the other hand, benchmarks are very general and often fail to identify the worst and best cases of a specific implementation. In this paper we present a new approach for the comparison of algorithms and data structures at the implementation level using code coverage. Our approach uses model checking and multi-objective evolutionary algorithms to create test cases with high code coverage. It then executes each of the given implementations with each of the test cases in order to calculate a cross coverage. Using this, it calculates a combined coverage and a weighted performance in which implementations that are not fully covered by the test cases of the other implementations are punished. These metrics can be used to compare the performance of several implementations at a much deeper level than traditional benchmarks, and they incorporate worst, best, and average cases in an equal manner. We demonstrate this approach on two example sets of algorithms and outline the next research steps required in this context, along with the greatest risks and challenges.
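
The cross-coverage idea can be made concrete with toy numbers, as in the sketch below: run every implementation against the test cases generated for every implementation, average the coverage each one reaches under the others' tests, and down-weight its measured performance accordingly. The penalty formula is an illustrative guess rather than the paper's exact metric.

    # Toy cross-coverage computation: coverage[i][j] is the coverage reached in
    # implementation i when driven by the test cases generated for implementation j.
    # The weighting formula is an illustrative guess, not the paper's metric.
    implementations = ["quicksort", "mergesort"]
    coverage = {                       # rows: executed impl, cols: test-case source
        "quicksort": {"quicksort": 0.98, "mergesort": 0.71},
        "mergesort": {"quicksort": 0.93, "mergesort": 0.97},
    }
    runtime_ms = {"quicksort": 120.0, "mergesort": 140.0}   # measured on all tests

    for impl in implementations:
        combined = sum(coverage[impl].values()) / len(implementations)
        weighted = runtime_ms[impl] / combined   # poorly covered impls are punished
        print(f"{impl}: combined coverage {combined:.2f}, "
              f"weighted runtime {weighted:.1f} ms")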

Dmitriev, Pavel, Wu, Xian.  2016.  Measuring Metrics. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :429–437.

You get what you measure, and you can't manage what you don't measure. Metrics are a powerful tool used in organizations to set goals, decide which new products and features should be released to customers, which new tests and experiments should be conducted, and how resources should be allocated. To a large extent, metrics drive the direction of an organization, and getting metrics 'right' is one of the most important and difficult problems an organization needs to solve. However, creating good metrics that capture long-term company goals is difficult. They try to capture abstract concepts such as success, delight, loyalty, engagement, life-time value, etc. How can one determine that a metric is a good one? Or, that one metric is better than another? In other words, how do we measure the quality of metrics? Can the evaluation process be automated so that anyone with an idea of a new metric can quickly evaluate it? In this paper we describe the metric evaluation system deployed at Bing, where we have been working on designing and improving metrics for over five years. We believe that by applying a data driven approach to metric evaluation we have been able to substantially improve our metrics and, as a result, ship better features and improve search experience for Bing's users.