Biblio

Filters: Keyword is metrics testing
2020-03-09
Moukahal, Lama, Zulkernine, Mohammad.  2019.  Security Vulnerability Metrics for Connected Vehicles. 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C). :17–23.

Software integration in modern vehicles is continuously expanding, as vehicle manufacturers keep adding innovative and competitive features that rely on complex software functionality. However, these features come at a cost: they amplify the security vulnerabilities in vehicles and lead to more security issues in today's automobiles. As a result, identifying vulnerable components in a vehicle software system has become crucial. Security experts need to know which components of the vehicle software system can be exploited for attacks so they can focus their testing and inspection efforts on them. Nevertheless, identifying these weak components in a vehicle's system is a challenging and costly task. In this paper, we propose security vulnerability metrics for connected vehicles that aim to assist software testers during the development life-cycle in identifying the frail links that put the vehicle at high security risk. Vulnerable function assessment can give software testers a good idea of which components in a connected vehicle need to be prioritized in order to mitigate the risk and hence secure the vehicle. The proposed metrics were applied to OpenPilot, a software system that provides an autopilot feature and has been integrated with 48 different vehicles. The application shows how the defined metrics can be effectively used to quantitatively measure the vulnerabilities of a vehicle software system.

Chhillar, Dheeraj, Sharma, Kalpana.  2019.  ACT Testbot and 4S Quality Metrics in XAAS Framework. 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). :503–509.

The purpose of this paper is to analyze all cloud-based service models and the Continuous Integration, Deployment and Delivery process, and to propose an Automated Continuous Testing and testing-as-a-service based TestBot and metrics dashboard that integrate with existing automation, bug logging, build management, configuration and test management tools. The cloud is now widely used by organizations to save the time, money and effort required to set up and maintain infrastructure and platforms. Continuous Integration and Delivery is practiced within Agile methodology to enable multiple software releases per day and to ensure that development, test and production environments can be synchronized quickly. In such an agile environment, testing tools and processes must keep pace so that overall regression testing, including functional, performance and security testing, can be done alongside build deployments in real time. To support this, we researched Continuous Testing and worked with industry professionals who architect, develop and test software products. A lot of research has been done on automating software testing so that products can be tested quickly and the overall testing process can be optimized. In this paper we propose the ACT TestBot tool and a metrics dashboard, and coin the term 4S quality metrics to quantify the quality of a software product. ACT TestBot and the metrics dashboard integrate with Continuous Integration tools, bug reporting tools, test management tools and data analytics tools to trigger automation scripts, continuously analyze application logs, open defects automatically and generate metrics reports. A defect pattern report is created to support root cause analysis and preventive action.

2020-02-17
Yee, George O. M..  2019.  Designing Good Security Metrics. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 2:580–585.

This paper begins with an introduction to security metrics, describing the need for them, followed by a discussion of their nature, including the challenges found with some security metrics used in the past. The paper then discusses what makes a good security metric and proposes a rigorous step-by-step method that can be applied both to design good security metrics and to test whether existing security metrics are good ones. Application examples are included to illustrate the method.

2019-07-01
Medeiros, N., Ivaki, N., Costa, P., Vieira, M..  2018.  An Approach for Trustworthiness Benchmarking Using Software Metrics. 2018 IEEE 23rd Pacific Rim International Symposium on Dependable Computing (PRDC). :84–93.

Trustworthiness is a paramount concern for users and customers in the selection of a software solution, especially in the context of complex and dynamic environments such as Cloud and IoT. However, assessing and benchmarking trustworthiness (worthiness of software for being trusted) is a challenging task, mainly due to the variety of application scenarios (e.g., business-critical, safety-critical), the large number of determinative quality attributes (e.g., security, performance), and, last but foremost, due to the subjective notion of trust and trustworthiness. In this paper, we present trustworthiness as a measurable notion in relative terms based on security attributes and propose an approach for the assessment and benchmarking of software. The main goal is to build a trustworthiness assessment model based on software metrics (e.g., Cyclomatic Complexity, CountLine, CBO) that can be used as indicators of software security. To demonstrate the proposed approach, we assessed and ranked several files and functions of the Mozilla Firefox project based on their trustworthiness score and conducted a survey among several software security experts in order to validate the obtained ranking. Results show that our approach is able to provide a sound ranking of the benchmarked software.
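
A minimal sketch of the kind of metric-based scoring the abstract describes, assuming hypothetical file names, weights and min-max normalization (the authors' actual model and weighting are not reproduced here): a few static code metrics per file are normalized and combined into a relative trustworthiness score used for ranking.

    # Hypothetical illustration: rank source files by a metric-based
    # trustworthiness score (lower complexity/coupling -> higher score).
    # Metric names, weights and the normalization are assumptions,
    # not the model from Medeiros et al.

    files = {
        "nsHttpChannel.cpp":   {"cyclomatic": 180, "count_line": 5200, "cbo": 35},
        "nsCookieService.cpp": {"cyclomatic": 90,  "count_line": 2100, "cbo": 18},
        "Preferences.cpp":     {"cyclomatic": 40,  "count_line": 900,  "cbo": 7},
    }
    weights = {"cyclomatic": 0.5, "count_line": 0.3, "cbo": 0.2}  # assumed

    def normalize(metric):
        values = [m[metric] for m in files.values()]
        lo, hi = min(values), max(values)
        return {f: (m[metric] - lo) / (hi - lo or 1) for f, m in files.items()}

    norm = {metric: normalize(metric) for metric in weights}

    # Higher metric values indicate lower trustworthiness, so invert the sum.
    scores = {
        f: 1.0 - sum(weights[m] * norm[m][f] for m in weights)
        for f in files
    }

    for f, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{f:22s} trustworthiness score = {s:.2f}")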

Arabsorkhi, A., Ghaffari, F..  2018.  Security Metrics: Principles and Security Assessment Methods. 2018 9th International Symposium on Telecommunications (IST). :305–310.

Nowadays, Information Technology is an important part of human life and of organizations, and organizations face IT-related security problems. To address them, they need to strengthen their security functions, which creates a need for security assessments within organizations to verify their security posture. Security standards and general metrics can be useful for measuring an organization's security; however, metrics that apply to businesses in general are not necessarily effective in a particular organization's situation. It is therefore important to select metrics and standards suited to each business in order to improve both cost and organizational security, and the selection of suitable security measures depends on an efficient way to identify them. Given the complexity of these metrics and the breadth with which they are defined, this paper, based on a comparative study and the benchmarking method, proposes a taxonomy of security metrics intended to help a business choose metrics tailored to its needs and conditions.

2018-06-20
Yang, Sen, Dong, Xin, Sun, Leilei, Zhou, Yichen, Farneth, Richard A., Xiong, Hui, Burd, Randall S., Marsic, Ivan.  2017.  A Data-driven Process Recommender Framework. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :2111–2120.

We present an approach for improving the performance of complex knowledge-based processes by providing data-driven step-by-step recommendations. Our framework uses the associations between similar historic process performances and contextual information to determine the prototypical way of enacting the process. We introduce a novel similarity metric for grouping traces into clusters that incorporates temporal information about activity performance and handles concurrent activities. Our data-driven recommender system selects the appropriate prototype performance of the process based on user-provided context attributes. Our approach for determining the prototypes discovers the commonly performed activities and their temporal relationships. We tested our system on data from three real-world medical processes and achieved recommendation accuracy up to an F1 score of 0.77 (compared to an F1 score of 0.37 using ZeroR), with 63.2% of recommended enactments being within the first five neighbors of the actual historic enactments in a set of 87 cases. Our framework works as an interactive visual analytic tool for process mining. This work shows the feasibility of a data-driven decision support system for complex knowledge-based processes.

Sundaresan, Srikanth, Allman, Mark, Dhamdhere, Amogh, Claffy, Kc.  2017.  TCP Congestion Signatures. Proceedings of the 2017 Internet Measurement Conference. :64–77.

We develop and validate Internet path measurement techniques to distinguish congestion experienced when a flow self-induces congestion in the path from when a flow is affected by an already congested path. One application of this technique is for speed tests, when the user is affected by congestion either in the last mile or in an interconnect link. This difference is important because in the latter case, the user is constrained by their service plan (i.e., what they are paying for), and in the former case, they are constrained by forces outside of their control. We exploit TCP congestion control dynamics to distinguish these cases for Internet paths that are predominantly TCP traffic. In TCP terms, we re-articulate the question: was a TCP flow bottlenecked by an already congested (possibly interconnect) link, or did it induce congestion in an otherwise idle (possibly a last-mile) link? TCP congestion control affects the round-trip time (RTT) of packets within the flow (i.e., the flow RTT): an endpoint sends packets at higher throughput, increasing the occupancy of the bottleneck buffer, thereby increasing the RTT of packets in the flow. We show that two simple, statistical metrics derived from the flow RTT during the slow start period—its coefficient of variation, and the normalized difference between the maximum and minimum RTT—can robustly identify which type of congestion the flow encounters. We use extensive controlled experiments to demonstrate that our technique works with up to 90% accuracy. We also evaluate our techniques using two unique real-world datasets of TCP throughput measurements using Measurement Lab data and the Ark platform. We find up to 99% accuracy in detecting self-induced congestion, and up to 85% accuracy in detecting external congestion. Our results can benefit regulators of interconnection markets, content providers trying to improve customer service, and users trying to understand whether poor performance is something they can fix by upgrading their service tier.
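
A minimal sketch of the two slow-start RTT statistics the abstract names (coefficient of variation and normalized max-min difference); the classification thresholds below are illustrative placeholders, not the decision rule trained in the paper.

    import statistics

    def congestion_signature(slow_start_rtts_ms, cov_threshold=0.2, range_threshold=0.5):
        """Compute the two flow-RTT statistics described in the abstract.

        slow_start_rtts_ms: RTT samples (ms) taken during TCP slow start.
        The thresholds are illustrative placeholders, not the paper's
        trained decision boundaries.
        """
        mean_rtt = statistics.mean(slow_start_rtts_ms)
        cov = statistics.stdev(slow_start_rtts_ms) / mean_rtt  # coefficient of variation
        norm_range = (max(slow_start_rtts_ms) - min(slow_start_rtts_ms)) / max(slow_start_rtts_ms)

        # Self-induced congestion fills an idle bottleneck buffer during slow
        # start, so the RTT grows noticeably; an already-congested link keeps
        # the flow RTT comparatively flat.
        if cov > cov_threshold and norm_range > range_threshold:
            label = "self-induced congestion (likely last mile)"
        else:
            label = "external congestion (already congested path)"
        return cov, norm_range, label

    print(congestion_signature([20, 24, 31, 40, 52, 65]))
    print(congestion_signature([80, 82, 79, 81, 83, 80]))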

Michael, Nicolas, Ramannavar, Nitin, Shen, Yixiao, Patil, Sheetal, Sung, Jan-Lung.  2017.  CloudPerf: A Performance Test Framework for Distributed and Dynamic Multi-Tenant Environments. Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering. :189–200.

The evolution of cloud-computing imposes many challenges on performance testing and requires not only a different approach and methodology of performance evaluation and analysis, but also specialized tools and frameworks to support such work. In traditional performance testing, typically a single workload was run against a static test configuration. The main metrics derived from such experiments included throughput, response times, and system utilization at steady-state. While this may have been sufficient in the past, where in many cases a single application was run on dedicated hardware, this approach is no longer suitable for cloud-based deployments. Whether private or public cloud, such environments typically host a variety of applications on distributed shared hardware resources, simultaneously accessed by a large number of tenants running heterogeneous workloads. The number of tenants as well as their activity and resource needs dynamically change over time, and the cloud infrastructure reacts to this by reallocating existing or provisioning new resources. Besides metrics such as the number of tenants and overall resource utilization, performance testing in the cloud must be able to answer many more questions: How is the quality of service of a tenant impacted by the constantly changing activity of other tenants? How long does it take the cloud infrastructure to react to changes in demand, and what is the effect on tenants while it does so? How well are service level agreements met? What is the resource consumption of individual tenants? How can global performance metrics on application- and system-level in a distributed system be correlated to an individual tenant's perceived performance? In this paper we present CloudPerf, a performance test framework specifically designed for distributed and dynamic multi-tenant environments, capable of answering all of the above questions, and more. CloudPerf consists of a distributed harness, a protocol-independent load generator and workload modeling framework, an extensible statistics framework with live-monitoring and post-analysis tools, interfaces for cloud deployment operations, and a rich set of both low-level as well as high-level workloads from different domains.

Fehlmann, Thomas, Kranich, Eberhard.  2017.  Autonomous Real-time Software & Systems Testing. Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement. :54–63.

For the Internet of Things (IoT), for automotive safety, or for data protection, legal compliance requires testing the impact of any action before allowing it to occur. However, system boundaries change at runtime. When a new, previously unknown device joins an IoT orchestra, when an autonomous car meets another, or in truck platooning, the original base system expands and needs to be tested before it can make decisions with the potential to harm humans. This paper explains the theory behind, and outlines the implementation approach for, a framework for autonomous real-time testing of a software-based system while in operation, with an example from the IoT.

Holterbach, Thomas, Aben, Emile, Pelsser, Cristel, Bush, Randy, Vanbever, Laurent.  2017.  Measurement Vantage Point Selection Using A Similarity Metric. Proceedings of the Applied Networking Research Workshop. :1–3.

It is a challenge to select the most appropriate vantage points in a measurement platform with a wide selection. RIPE Atlas [2], for example, currently has over 9600 active measurement vantage points, with selections based on AS, country, etc. A user is limited in how many vantage points they can use in a measurement. This is not only due to limitations the measurement platform imposes; data from a large number of vantage points would also produce a large volume to analyse and store. So it makes sense to optimize for a minimal set of vantage points with a maximum chance of observing the phenomenon in which the user is interested. Network operators often need to debug with only limited information about the problem ("Our network is slow for users in France!"). A minimal set of measurements that allows testing through a wide diversity of networks could be a valuable add-on to the tools available to network operators. Given platforms with numerous vantage points, we have the luxury of testing a large set of end-customer outgoing paths. A diversity metric would allow selection of the most dissimilar vantage points, so as to explore from as many different angles as possible even with a limited probing budget. If one finds an interesting network phenomenon, one could use the similarity metric to advantage by selecting the vantage points most similar to the one exhibiting the phenomenon, to validate the phenomenon from multiple vantage points. We propose a novel means of selecting vantage points, based not on categorical properties such as origin AS or geographic location, but rather on topological (dis)similarity between vantage points. We describe a similarity metric across RIPE Atlas probes, and show how it performs better for the purpose of topology discovery than the default probe selection mechanism built into RIPE Atlas.
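
A minimal sketch of dissimilarity-driven probe selection under assumed inputs (the paper's actual topological similarity metric is not reproduced): given a pairwise similarity matrix over probes, greedily pick the probe least similar to those already chosen until the probing budget is spent.

    import numpy as np

    def pick_dissimilar_probes(similarity, budget, seed=0):
        """Greedy selection of `budget` probes from a pairwise similarity
        matrix (1.0 = identical paths). How the matrix is built, e.g. from
        traceroute path overlap, is an assumption here, not the paper's metric.
        """
        n = similarity.shape[0]
        chosen = [seed]
        while len(chosen) < budget:
            remaining = [i for i in range(n) if i not in chosen]
            # Pick the probe whose maximum similarity to the chosen set is smallest.
            nxt = min(remaining, key=lambda i: max(similarity[i, j] for j in chosen))
            chosen.append(nxt)
        return chosen

    # Toy 5-probe similarity matrix (symmetric, 1.0 on the diagonal).
    sim = np.array([
        [1.0, 0.9, 0.2, 0.3, 0.8],
        [0.9, 1.0, 0.1, 0.4, 0.7],
        [0.2, 0.1, 1.0, 0.6, 0.2],
        [0.3, 0.4, 0.6, 1.0, 0.3],
        [0.8, 0.7, 0.2, 0.3, 1.0],
    ])
    print(pick_dissimilar_probes(sim, budget=3))   # -> [0, 2, 3]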

Dmitriev, Pavel, Gupta, Somit, Kim, Dong Woo, Vaz, Garnet.  2017.  A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :1427–1436.

Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested in web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics [1] [2] [3]. Having good metrics, however, is not enough. In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment's outcome, which if deployed could hurt the business by millions of dollars. Inspired by Steven Goodman's twelve p-value misconceptions [4], in this paper, we share twelve common metric interpretation pitfalls which we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid the pitfall. With this paper, we aim to increase the experimenters' awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.

2017-08-02
Matsuki, Tatsuma, Matsuoka, Naoki.  2016.  A Resource Contention Analysis Framework for Diagnosis of Application Performance Anomalies in Consolidated Cloud Environments. Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering. :173–184.

Cloud services have made large contributions to the agile development and rapid revision of various applications. However, the performance of these applications is still one of the largest concerns for developers. Although many performance analysis frameworks have been created, most of them are not efficient under rapid application revisions because they require performance models, which may have to be remodeled whenever an application revision occurs. We propose an analysis framework for diagnosing application performance anomalies. We designed our framework so that it does not require any performance models, making it efficient under rapid application revisions. It investigates the Pearson correlation and association rules between system metrics and application performance. Association rules are widely used in data mining to find relations between variables in databases. We demonstrated through an experiment and testing on a real data set that our framework can select causal metrics even when the metrics are temporally correlated, which reduces the false negatives obtained from cause diagnosis. We evaluated our framework from the perspective of the expected remaining diagnostic costs of framework users. The results indicate that it is expected to reduce diagnostic costs by up to 84.8%, compared with a method that uses only the Pearson correlation.
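
A minimal sketch of the correlation-screening step the abstract mentions, with made-up metric names and an arbitrary cutoff; the framework's association-rule mining and its handling of temporally correlated metrics are not shown.

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical monitoring data: one response-time series and several
    # system-metric series sampled at the same instants.
    response_time = np.array([110, 120, 180, 240, 150, 130, 260, 300, 140, 125])
    system_metrics = {
        "cpu_steal_pct":   np.array([2, 3, 9, 14, 5, 4, 15, 18, 4, 3]),
        "disk_iops":       np.array([400, 380, 420, 390, 410, 405, 395, 400, 415, 385]),
        "neighbor_vm_cpu": np.array([20, 25, 60, 80, 35, 30, 85, 95, 28, 22]),
    }

    # Flag metrics strongly correlated with the performance anomaly.
    for name, series in system_metrics.items():
        r, p = pearsonr(series, response_time)
        flag = "candidate cause" if abs(r) > 0.7 and p < 0.05 else "unlikely"
        print(f"{name:15s} r={r:+.2f} p={p:.3f} -> {flag}")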

Guo, Qi, Song, Yang.  2016.  Large-Scale Analysis of Viewing Behavior: Towards Measuring Satisfaction with Mobile Proactive Systems. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :579–588.

Recently, proactive systems such as Google Now and Microsoft Cortana have become increasingly popular in reforming the way users access information on mobile devices. In these systems, relevant content is presented to users based on their context without a query in the form of information cards that do not require a click to satisfy the users. As a result, prior approaches based on clicks cannot provide reliable measurements of user satisfaction with such systems. It is also unclear how much of the previous findings regarding good abandonment with reactive Web searches can be applied to these proactive systems due to the intrinsic difference in user intent, the greater variety of content types and their presentations. In this paper, we present the first large-scale analysis of viewing behavior based on the viewport (the visible fraction of a Web page) of the mobile devices, towards measuring user satisfaction with the information cards of the mobile proactive systems. In particular, we identified and analyzed a variety of factors that may influence the viewing behavior, including biases from ranking positions, the types and attributes of the information cards, and the touch interactions with the mobile devices. We show that by modeling the various factors we can better measure user satisfaction with the mobile proactive systems, enabling stronger statistical power in large-scale online A/B testing.

Stauffert, Jan-Philipp, Niebling, Florian, Latoschik, Marc Erich.  2016.  Towards Comparable Evaluation Methods and Measures for Timing Behavior of Virtual Reality Systems. Proceedings of the 22Nd ACM Conference on Virtual Reality Software and Technology. :47–50.

A low latency is a fundamental timeliness requirement to reduce the potential risks of cyber sickness and to increase the effectiveness, efficiency, and user experience of Virtual Reality systems. The effects of uniform latency degradation based on mean or worst-case values are well researched. In contrast, the effects of latency jitter, i.e., the distribution pattern of latency changes over time, have largely been ignored so far, although today's consumer VR systems are extremely vulnerable in this respect. We investigate the applicability of the Walsh, generalized ESD, and modified z-score tests for the detection of outliers as one central aspect of the latency distribution. The tests are applied to well-defined test cases mimicking typical timing behavior expected from today's concurrent architectures. We introduce accompanying graphical visualization methods to inspect, analyze and communicate the latency behavior of VR systems beyond simple mean or worst-case values. As a result, we propose a stacked modified z-score test for more detailed analysis.
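
A minimal sketch of the modified z-score outlier test the paper builds on (median/MAD based, using the conventional 0.6745 factor and 3.5 cutoff); the paper's stacked variant and the Walsh and generalized ESD tests are not shown.

    import numpy as np

    def modified_zscore_outliers(latencies_ms, threshold=3.5):
        """Flag latency samples whose modified z-score exceeds the threshold.

        M_i = 0.6745 * (x_i - median) / MAD, the usual robust formulation;
        the 3.5 cutoff is the conventional choice, not a value from the paper.
        """
        x = np.asarray(latencies_ms, dtype=float)
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        m = 0.6745 * (x - med) / mad
        return x[np.abs(m) > threshold]

    # Frame-to-photon latencies (ms) with two jitter spikes.
    samples = [11.2, 11.4, 11.1, 11.3, 11.5, 24.9, 11.2, 11.4, 31.0, 11.3]
    print(modified_zscore_outliers(samples))   # -> [24.9 31. ]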

Niedermayr, Rainer, Juergens, Elmar, Wagner, Stefan.  2016.  Will My Tests Tell Me if I Break This Code? Proceedings of the International Workshop on Continuous Software Evolution and Delivery. :23–29.

Automated tests play an important role in software evolution because they can rapidly detect faults introduced during changes. In practice, code-coverage metrics are often used as criteria to evaluate the effectiveness of test suites with focus on regression faults. However, code coverage only expresses which portion of a system has been executed by tests, but not how effective the tests actually are in detecting regression faults. Our goal was to evaluate the validity of code coverage as a measure for test effectiveness. To do so, we conducted an empirical study in which we applied an extreme mutation testing approach to analyze the tests of open-source projects written in Java. We assessed the ratio of pseudo-tested methods (those tested in a way such that faults would not be detected) to all covered methods and judged their impact on the software project. The results show that the ratio of pseudo-tested methods is acceptable for unit tests but not for system tests (that execute large portions of the whole system). Therefore, we conclude that the coverage metric is only a valid effectiveness indicator for unit tests.
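
A minimal sketch of the extreme-mutation idea behind the study, with a hypothetical result table: a covered method counts as pseudo-tested if the whole test suite still passes after its body is emptied, and the ratio over all covered methods is the indicator discussed in the abstract.

    # Hypothetical outcomes of running the suite with each covered method's
    # body removed (True = at least one test failed, i.e. the method is
    # genuinely tested). Method names are made up for illustration.
    suite_fails_when_body_removed = {
        "Order.total":      True,
        "Order.addItem":    True,
        "Invoice.render":   False,   # tests still pass -> pseudo-tested
        "Tax.rateFor":      True,
        "Report.exportCsv": False,   # pseudo-tested
    }

    pseudo_tested = [m for m, failed in suite_fails_when_body_removed.items() if not failed]
    ratio = len(pseudo_tested) / len(suite_fails_when_body_removed)

    print(f"pseudo-tested methods: {pseudo_tested}")
    print(f"pseudo-tested ratio:   {ratio:.0%} of covered methods")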

Hirzel, Matthias, Klaeren, Herbert.  2016.  Code Coverage for Any Kind of Test in Any Kind of Transcompiled Cross-platform Applications. Proceedings of the 2Nd International Workshop on User Interface Test Automation. :1–10.

Code coverage is a widely used measure to determine how thoroughly an application is tested. There are many tools available for different languages. However, to the best of our knowledge, most of them focus on unit testing and ignore end-to-end tests such as UI or web tests. Furthermore, there is no support for determining the code coverage of transcompiled cross-platform applications. This kind of application is written in one language, but compiled to and executed in a different programming language, and it may run on a different platform. In this paper, we propose a new code coverage testing method that calculates the code coverage of any kind of test (unit, integration, or UI/web test) for any type of (transcompiled) application (desktop, web or mobile). Developers obtain information about which parts of the source code are uncovered by tests. The basis of our approach is generic and may be applied to numerous programming languages based on an abstract syntax tree. We present our approach for applications of any kind developed in Java and evaluate our tool on a web application created with Google Web Toolkit, on standard desktop applications, and on some small Java applications that use the Swing library to create user interfaces. Our results show that our tool is able to judge the code coverage of any kind of test. In particular, our tool is independent of the unit or UI/web test framework in use. The runtime performance is promising, although it is not as fast as existing tools in the area of unit testing.

Gao, Ning, Bagdouri, Mossaab, Oard, Douglas W..  2016.  Pearson Rank: A Head-Weighted Gap-Sensitive Score-Based Correlation Coefficient. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. :941–944.

One way of evaluating the reusability of a test collection is to determine whether removing the unique contributions of some system would alter the preference order between that system and others. Rank correlation measures such as Kendall's tau are often used for this purpose. Rank correlation measures are appropriate for ordinal measures in which only preference order is important, but many evaluation measures produce system scores in which both the preference order and the magnitude of the score difference are important. Such measures are referred to as interval. Pearson's rho offers one way in which correlation can be computed over results from an interval measure such that smaller errors in the gap size are preferred. When seeking to improve over existing systems, we care most about comparisons among the best systems. For that purpose we prefer head-weighted measures such as tau_AP, which is designed for ordinal data. No present head-weighted measure fully leverages the information present in interval effectiveness measures. This paper introduces such a measure, referred to as Pearson Rank.
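
A brief illustration of the ordinal-vs-interval distinction the abstract draws, using standard Kendall's tau and Pearson's rho from SciPy on made-up system scores; the paper's Pearson Rank coefficient itself is not reproduced here.

    from scipy.stats import kendalltau, pearsonr

    # Hypothetical MAP scores for six systems on the full collection vs. on a
    # reduced collection. The ordering is identical, but the gaps differ.
    full_collection    = [0.42, 0.41, 0.35, 0.30, 0.22, 0.10]
    reduced_collection = [0.44, 0.28, 0.27, 0.26, 0.25, 0.09]

    tau, _ = kendalltau(full_collection, reduced_collection)
    rho, _ = pearsonr(full_collection, reduced_collection)

    # Kendall's tau sees only the (unchanged) preference order; Pearson's rho
    # also reacts to the distorted score gaps between systems.
    print(f"Kendall tau = {tau:.2f}")   # 1.00
    print(f"Pearson rho = {rho:.2f}")   # < 1.00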

Schroeder, Jan, Berger, Christian, Staron, Miroslaw, Herpel, Thomas, Knauss, Alessia.  2016.  Unveiling Anomalies and Their Impact on Software Quality in Model-based Automotive Software Revisions with Software Metrics and Domain Experts. Proceedings of the 25th International Symposium on Software Testing and Analysis. :154–164.

The validation of simulation models (e.g., of electronic control units for vehicles) in industry is becoming increasingly challenging due to their growing complexity. To systematically assess the quality of such models, software metrics seem to be promising. In this paper we explore the use of software metrics and outlier analysis as a means to assess the quality of model-based software. More specifically, we investigate how results from regression analysis applied to measurement data received from size and complexity metrics can be mapped to software quality. Using the moving averages approach, models were fit to data received from over 65,000 software revisions for 71 simulation models that represent different electronic control units of real premium vehicles. Consecutive investigations using studentized deleted residuals and Cook’s Distance revealed outliers among the measurements. From these outliers we identified a subset, which provides meaningful information (anomalies) by comparing outlier scores with expert opinions. Eight engineers were interviewed separately for outlier impact on software quality. Findings were validated in consecutive workshops. The results show correlations between outliers and their impact on four of the considered quality characteristics. They also demonstrate the applicability of this approach in industry.
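
A minimal sketch of the outlier-detection step described above, using statsmodels' OLS influence diagnostics on synthetic metric data; the moving-average model fitting, the 71 real simulation models, and the expert validation are not reproduced, and the cutoffs are common rules of thumb rather than the paper's values.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Synthetic stand-in for one simulation model's history: complexity metric
    # vs. revision number, with one anomalous revision injected at index 40.
    revisions = np.arange(60)
    complexity = 50 + 0.8 * revisions + rng.normal(0, 2, size=60)
    complexity[40] += 25   # injected anomaly

    model = sm.OLS(complexity, sm.add_constant(revisions)).fit()
    influence = model.get_influence()

    cooks_d, _ = influence.cooks_distance
    studentized = influence.resid_studentized_external

    # Rules of thumb (assumptions, not the paper's exact cutoffs):
    # Cook's distance above 4/n, or |studentized deleted residual| above 3.
    n = len(revisions)
    outliers = np.where((cooks_d > 4 / n) & (np.abs(studentized) > 3))[0]
    print("suspicious revisions:", outliers)   # expected to include 40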

Bondi, André B..  2016.  Challenges with Applying Performance Testing Methods for Systems Deployed on Shared Environments with Indeterminate Competing Workloads: Position Paper. Companion Publication for ACM/SPEC on International Conference on Performance Engineering. :41–44.

There is a tendency to move production environments from corporate-owned data centers to cloud-based services. Users who do not maintain a private production environment might not wish to maintain a private performance test environment either. The application of performance engineering methods to the development and delivery of software systems is complicated when the form and or parameters of the target deployment environment cannot be controlled or determined. The difficulty of diagnosing the causes of performance issues during testing or production may be increased by the presence of highly variable workloads on the target platform that compete with the application of interest for resources in ways that might be hard to determine. In particular, performance tests might be conducted in virtualized environments that introduce factors influencing customer-affecting metrics (such as transaction response time) and observed resource usage. Observed resource usage metrics in virtualized environments can have different meanings from those in a native environment. Virtual machines may suffer delays in execution. We explore factors that exacerbate these complications. We argue that these complexities reinforce the case for rigorously using software performance engineering methods rather than diminishing it. We also explore possible performance testing methods for mitigating the risk associated with these complexities.

Menninghaus, Mathias, Pulvermüller, Elke.  2016.  Towards Using Code Coverage Metrics for Performance Comparison on the Implementation Level. Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering. :101–104.

The development process for new algorithms or data structures often begins with the analysis of benchmark results to identify the drawbacks of existing implementations, and it ends with the comparison of old and new implementations using one or more well-established benchmarks. But however relevant, reproducible, fair, verifiable and usable those benchmarks may be, they have certain drawbacks. On the one hand, a new implementation may be biased to provide good results for a specific benchmark. On the other hand, benchmarks are very general and often fail to identify the worst and best cases of a specific implementation. In this paper we present a new approach for the comparison of algorithms and data structures at the implementation level using code coverage. Our approach uses model checking and multi-objective evolutionary algorithms to create test cases with high code coverage. It then executes each of the given implementations with each of the test cases in order to calculate a cross coverage. Using this, it calculates a combined coverage and weighted performance in which implementations that are not fully covered by the test cases of the other implementations are penalized. These metrics can be used to compare the performance of several implementations at a much deeper level than traditional benchmarks, and they incorporate worst, best and average cases in an equal manner. We demonstrate this approach on two example sets of algorithms and outline the next research steps required in this context, along with the greatest risks and challenges.
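
A minimal sketch of one plausible way to combine the cross-coverage and runtime figures described above; the abstract does not give the exact combination or weighting, so the data and formulas below are assumptions: each implementation's runtime is penalized by how poorly the other implementations' test cases cover it.

    # Hypothetical data for two implementations A and B.
    # coverage[x][y]: fraction of implementation x covered by the test cases
    # generated for implementation y; runtime[x]: mean runtime (ms) of x over
    # all test cases. All numbers and formulas are illustrative assumptions.
    coverage = {
        "A": {"A": 0.97, "B": 0.78},
        "B": {"A": 0.90, "B": 0.95},
    }
    runtime = {"A": 120.0, "B": 95.0}

    for impl in coverage:
        cross = [coverage[impl][other] for other in coverage if other != impl]
        combined_coverage = min(coverage[impl][impl], *cross)
        # Penalize implementations whose behavior the foreign test cases miss:
        # uncovered code may hide unmeasured worst cases.
        weighted_performance = runtime[impl] / combined_coverage
        print(f"{impl}: combined coverage {combined_coverage:.2f}, "
              f"weighted performance {weighted_performance:.1f} ms")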

Dmitriev, Pavel, Wu, Xian.  2016.  Measuring Metrics. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :429–437.

You get what you measure, and you can't manage what you don't measure. Metrics are a powerful tool used in organizations to set goals, decide which new products and features should be released to customers, which new tests and experiments should be conducted, and how resources should be allocated. To a large extent, metrics drive the direction of an organization, and getting metrics 'right' is one of the most important and difficult problems an organization needs to solve. However, creating good metrics that capture long-term company goals is difficult. They try to capture abstract concepts such as success, delight, loyalty, engagement, life-time value, etc. How can one determine that a metric is a good one? Or, that one metric is better than another? In other words, how do we measure the quality of metrics? Can the evaluation process be automated so that anyone with an idea of a new metric can quickly evaluate it? In this paper we describe the metric evaluation system deployed at Bing, where we have been working on designing and improving metrics for over five years. We believe that by applying a data driven approach to metric evaluation we have been able to substantially improve our metrics and, as a result, ship better features and improve search experience for Bing's users.

Jin, Wuxia, Liu, Ting, Qu, Yu, Chi, Jianlei, Cui, Di, Zheng, Qinghua.  2016.  Dynamic Cohesion Measurement for Distributed System.

Instead of single-server software systems built for powerful machines, software is shifting from large single-server systems to multi-server systems such as distributed systems. This change introduces a new challenge for software quality measurement, since current software analysis methods for single-server software cannot observe and assess the correlation among components on different nodes. In this paper, a new dynamic cohesion approach is proposed for distributed systems. We extend the Calling Network model to distributed systems by differentiating the methods of components deployed on different nodes. Two new cohesion metrics are proposed to describe the correlation at the component level, extending the cohesion metric of single-server software systems. The experiments, conducted on a distributed system (the Netflix RSS Reader), show how to trace the various system functions accomplished across three nodes, how to abstract dynamic behaviors among different nodes using our model, and how to evaluate software cohesion in a distributed system.