Biblio

Found 112 results

Filters: Keyword is pubcrawl170201
2017-03-07
Summers, Cameron, Tronel, Greg, Cramer, Jason, Vartakavi, Aneesh, Popp, Phillip.  2016.  GNMID14: A Collection of 110 Million Global Music Identification Matches. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. :693–696.

A new dataset is presented, composed of music identification matches from Gracenote, a leading global music metadata company. Matches from January 1, 2014 to December 31, 2014 have been curated and made available as a public dataset called Gracenote Music Identification 2014, or GNMID14, at the following address: https://developer.gracenote.com/mid2014. This collection is the first significant music identification dataset and one of the largest music-related datasets available, containing more than 110M matches in 224 countries for 3M unique tracks and 509K unique artists. It features geotemporal information (i.e., country and match date) as well as genre and mood metadata. In this paper, we characterize the dataset and demonstrate its utility for Information Retrieval (IR) research.
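
As an editorial illustration of how a geotemporal match log like GNMID14 might be characterized, the following sketch aggregates matches per country and month. The column names (match_date, country, track_id) and the CSV layout are assumptions, not the published schema; see the dataset page above for the actual format.

```python
import pandas as pd

def summarize_matches(csv_path: str) -> pd.DataFrame:
    """Aggregate match counts and unique tracks per country and month."""
    df = pd.read_csv(csv_path, parse_dates=["match_date"])
    df["month"] = df["match_date"].dt.to_period("M")
    return (df.groupby(["country", "month"])
              .agg(matches=("track_id", "size"),
                   unique_tracks=("track_id", "nunique"))
              .reset_index())

# Example usage, assuming a local CSV extract of the dataset:
# print(summarize_matches("gnmid14_sample.csv").head())
```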

Schild, Christopher-J., Schultz, Simone.  2016.  Linking Deutsche Bundesbank Company Data Using Machine-Learning-Based Classification: Extended Abstract. Proceedings of the Second International Workshop on Data Science for Macro-Modeling. :10:1–10:3.

We present a process for linking various Deutsche Bundesbank data sources on companies based on a semi-automatic classification. The linkage process involves data cleaning and harmonization, blocking, construction of comparison features, as well as training and testing a statistical classification model on a "ground-truth" subset of known matches and non-matches. The evaluation of our method shows that the process limits the need for manual classification to a small percentage of ambiguously classified match candidates.
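
A minimal sketch of such a linkage pipeline follows, assuming toy field names (name, zip, city) and a logistic-regression classifier as a stand-in for the paper's unspecified statistical model; the cleaning rules, blocking key, and comparison features are illustrative only.

```python
import re
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def normalize(name: str) -> str:
    """Rough harmonization: lowercase, drop common legal forms and punctuation."""
    name = re.sub(r"\b(gmbh|ag|kg|se|co)\b", " ", name.lower())
    return re.sub(r"[^a-z0-9]+", " ", name).strip()

def block_key(record: dict) -> str:
    """Blocking: only records sharing this coarse key become candidate pairs."""
    return record["zip"][:2]

def compare(a: dict, b: dict) -> list:
    """Comparison features for one candidate pair."""
    return [SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio(),
            float(a["zip"] == b["zip"]),
            float(a["city"].lower() == b["city"].lower())]

# Train on a labelled "ground-truth" subset of known matches / non-matches.
ground_truth = [
    ({"name": "Muster GmbH", "zip": "60311", "city": "Frankfurt"},
     {"name": "Muster", "zip": "60311", "city": "Frankfurt"}, 1),
    ({"name": "Muster GmbH", "zip": "60311", "city": "Frankfurt"},
     {"name": "Beispiel AG", "zip": "60329", "city": "Frankfurt"}, 0),
]
X = [compare(a, b) for a, b, _ in ground_truth]
y = [label for _, _, label in ground_truth]
model = LogisticRegression().fit(X, y)
```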

Kiran, Indra, Guha, Tanaya, Pandey, Gaurav.  2016.  Blind Image Quality Assessment Using Subspace Alignment. Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. :91:1–91:6.

This paper addresses the problem of estimating the quality of an image as it would be perceived by a human. A well-accepted approach to assessing the perceptual quality of an image is to quantify its loss of structural information. We propose a blind image quality assessment method that aims at quantifying structural information loss in a given (possibly distorted) image by comparing its structures with those extracted from a database of clean images. We first construct a subspace from the clean natural images using (i) principal component analysis (PCA), and (ii) overcomplete dictionary learning with a sparsity constraint. While PCA provides mathematical convenience, an overcomplete dictionary is known to capture the perceptually important structures resembling the simple cells in the primary visual cortex. The subspace learned from the clean images is called the source subspace. Similarly, a subspace, called the target subspace, is learned from the distorted image. In order to quantify the structural information loss, we use a subspace alignment technique which transforms the target subspace into the source subspace by optimizing over a transformation matrix. This transformation matrix is subsequently used to measure the global and local (patch-based) quality scores of the distorted image. The quality scores obtained by the proposed method are shown to correlate well with the subjective scores obtained from human annotators. Our method achieves competitive results when evaluated on three benchmark databases.
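
The subspace-alignment step can be illustrated with PCA bases: for an orthonormal source basis S and target basis T, the closed-form alignment M = S^T T minimizes ||S M - T||_F, and the residual grows with distortion. The score below is a simple proxy for exposition, not the paper's actual global/local quality measure.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_basis(patches: np.ndarray, k: int = 16) -> np.ndarray:
    """Return a (d, k) orthonormal basis learned from (n, d) image patches."""
    return PCA(n_components=k).fit(patches).components_.T

def alignment_residual(clean_patches, distorted_patches, k=16):
    S = pca_basis(clean_patches, k)        # source subspace (clean images)
    T = pca_basis(distorted_patches, k)    # target subspace (distorted image)
    M = S.T @ T                            # closed-form minimizer of ||S M - T||_F
    # If the distorted structures still lie in the clean subspace, S @ M
    # reproduces T; the residual grows as distortion destroys structure.
    return float(np.linalg.norm(S @ M - T, "fro"))

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 64))         # stand-ins for vectorized 8x8 patches
noisy = clean[:200] + rng.normal(scale=0.5, size=(200, 64))
print(alignment_residual(clean, noisy))
```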

Petrić, Jean, Bowes, David, Hall, Tracy, Christianson, Bruce, Baddoo, Nathan.  2016.  The Jinx on the NASA Software Defect Data Sets. Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering. :13:1–13:5.

Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013, Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets, making this data more reliable to use. Objective: We have now found additional rules necessary for removing problematic data which were not identified by Shepperd et al. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning with Shepperd et al.'s rules and apply our new rules to remove this erroneous data. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
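
For illustration only, the following generic sanity checks show the kind of rule-based screening such cleaning involves; they are not Shepperd et al.'s rules nor the new rules proposed in this paper, and the metric column names are assumptions about the MDP module tables.

```python
import pandas as pd

def flag_suspect_rows(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows violating simple consistency checks."""
    suspect = df.duplicated()                                  # identical cases
    if "LOC_TOTAL" in df.columns:
        suspect |= df["LOC_TOTAL"] <= 0                        # implausible module size
    if {"LOC_COMMENTS", "LOC_TOTAL"} <= set(df.columns):
        suspect |= df["LOC_COMMENTS"] > df["LOC_TOTAL"]        # a part exceeds the whole
    return suspect

# df = pd.read_csv("nasa_mdp_module_data.csv")
# clean_df = df[~flag_suspect_rows(df)]
```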

Agrawal, Divy, Ba, Lamine, Berti-Equille, Laure, Chawla, Sanjay, Elmagarmid, Ahmed, Hammady, Hossam, Idris, Yasser, Kaoudi, Zoi, Khayyat, Zuhair, Kruse, Sebastian et al..  2016.  Rheem: Enabling Multi-Platform Task Execution. Proceedings of the 2016 International Conference on Management of Data. :2069–2072.

Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of Rheem by using real-world scenarios from three different applications, namely, machine learning, data cleaning, and data fusion.

Vasek, Marie, Weeden, Matthew, Moore, Tyler.  2016.  Measuring the Impact of Sharing Abuse Data with Web Hosting Providers. Proceedings of the 2016 ACM on Workshop on Information Sharing and Collaborative Security. :71–80.

Sharing incident data among Internet operators is widely seen as an important strategy in combating cybercrime. However, little work has been done to quantify the positive benefits of such sharing. To that end, we report on an observational study of URLs blacklisted for distributing malware that the non-profit anti-malware organization StopBadware shared with requesting web hosting providers. Our dataset comprises over 28,000 URLs shared with 41 organizations between 2010 and 2015. We show that sharing has an immediate effect of cleaning the reported URLs and reducing the likelihood that they will be recompromised; despite this, we find that long-lived malware takes much longer to clean, even after being reported. Furthermore, we find limited evidence that one-time sharing of malware data improves the malware cleanup response of all providers over the long term. Instead, some providers improve while others worsen.

Chatlani, Neeraj, Myers, Daniel S..  2016.  A Curiosity-Driven System for Developing Coding Literacy (Abstract Only). Proceedings of the 47th ACM Technical Symposium on Computing Science Education. :695–695.

Coding literacy is the ability to understand a written computer program and interpret its functionality and output. Literacy is a valuable skill for programmers at all levels, because understanding written code requires developing and applying mental models of program execution. Previous work has shown that explicit instruction in program literacy is beneficial for new computer science students and aids the development of algorithmic thinking. This poster summarizes the authors' work-in-progress developing COLT: the Coding Literacy Trainer, a web-based adaptive tutorial system that provides instruction in the fundamentals of coding literacy and program interpretation to new computer science students. In addition to its pedagogical applications, COLT serves as a development platform for a novel theoretical foundation for adaptive teaching systems based on the concept of intrinsic curiosity. Inspired by the work of Lee et al. in the field of developmental robotics, a curiosity-driven system explores its complete knowledge environment in a way that continually maximizes its learning progress. Thus, learners are driven to explore areas where they are currently making the greatest advances, while avoiding regions of the knowledge space that are either too simple to be interesting or too challenging to be approachable at the current time. The poster summarizes the theoretical background and implementation of the COLT system in a clear, easy-to-read format. A web-based version of COLT is currently under active development and slated for an open-source release in the spring of 2016.

Golab, Wojciech, Ramaraju, Aditya.  2016.  Recoverable Mutual Exclusion: [Extended Abstract]. Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing. :65–74.

Mutex locks have traditionally been the most common mechanism for protecting shared data structures in parallel programs. However, the robustness of such locks against process failures has not been studied thoroughly. Most (user-level) mutex algorithms are designed around the assumption that processes are reliable, meaning that a process may not fail while executing the lock acquisition and release code, or while inside the critical section. If such a failure does occur, then the liveness properties of a conventional mutex lock may cease to hold until the application or operating system intervenes by cleaning up the internal structure of the lock. For example, a process that is attempting to acquire an otherwise starvation-free mutex may be blocked forever waiting for a failed process to release the critical section. Adding to the difficulty, if the failed process recovers and attempts to acquire the same mutex again without appropriate cleanup, then the mutex may become corrupted to the point where it loses safety, notably the mutual exclusion property. We address this challenge by formalizing the problem of recoverable mutual exclusion, and proposing several solutions that vary both in their assumptions regarding hardware support for synchronization, and in their time complexity. Compared to known solutions, our algorithms are more robust as they do not restrict where or when a process may crash, and provide stricter guarantees in terms of time complexity, which we define in terms of remote memory references.
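
A toy sketch (not one of the paper's algorithms) of the basic ingredient a recoverable lock needs: the lock word records its owner, so a process that crashed inside the critical section can, after restarting, recognize its own ownership, repair shared state, and then release. A Python threading.Lock stands in here for an atomic compare-and-swap.

```python
import threading

class RecoverableLock:
    """Toy recoverable lock: ownership is recorded so it survives a 'crash'."""
    UNOWNED = None

    def __init__(self):
        self._owner = self.UNOWNED
        self._guard = threading.Lock()       # stands in for a hardware CAS

    def try_acquire(self, pid: int) -> bool:
        with self._guard:                    # CAS(owner: UNOWNED -> pid)
            if self._owner is self.UNOWNED:
                self._owner = pid
                return True
            return False

    def release(self, pid: int) -> None:
        with self._guard:
            assert self._owner == pid, "release by non-owner"
            self._owner = self.UNOWNED

    def holds_after_restart(self, pid: int) -> bool:
        """After a crash/restart, pid checks whether it still owns the lock and
        therefore must repair the critical section before releasing."""
        with self._guard:
            return self._owner == pid
```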

Lin, Xiaofeng, Chen, Yu, Li, Xiaodong, Mao, Junjie, He, Jiaquan, Xu, Wei, Shi, Yuanchun.  2016.  Scalable Kernel TCP Design and Implementation for Short-Lived Connections. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. :339–352.

With the rapid growth of network bandwidth, increases in CPU cores on a single machine, and application API models demanding more short-lived connections, a scalable TCP stack is performance-critical. Although many clean-slate designs have been proposed, production environments still call for a bottom-up parallel TCP stack design that is backward-compatible with existing applications. We present Fastsocket, a BSD Socket-compatible and scalable kernel socket design, which achieves table-level connection partition in the TCP stack and guarantees connection locality for both passive and active connections. The Fastsocket architecture is a ground-up partition design, from NIC interrupts all the way up to applications, which naturally eliminates various lock contentions in the entire stack. Moreover, Fastsocket maintains the full functionality of the kernel TCP stack and a BSD-socket-compatible API, and thus applications need no modifications. Our evaluations show that Fastsocket achieves a speedup of 20.4x on a 24-core machine under a workload of short-lived connections, outperforming the state-of-the-art Linux kernel TCP implementations. When scaling up to 24 CPU cores, Fastsocket increases the throughput of Nginx and HAProxy by 267% and 621% respectively compared with the base Linux kernel. We also demonstrate that Fastsocket can achieve scalability and preserve the BSD socket API at the same time. Fastsocket is already deployed in the production environment of Sina WeiBo, serving 50 million daily active users and billions of requests per day.
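
Fastsocket's partitioning is implemented inside the kernel, but a rough user-space flavour of per-core connection partitioning can be sketched with SO_REUSEPORT (Linux 3.9+): each worker process owns its own listen socket and the kernel spreads incoming short-lived connections across them. This is only an analogy, not Fastsocket's design.

```python
import os
import socket

def worker(port: int = 8080) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # per-process listener
    srv.bind(("0.0.0.0", port))
    srv.listen(128)
    while True:
        conn, _ = srv.accept()               # kernel picks which listener gets it
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()                         # short-lived connection

if __name__ == "__main__":
    for _ in range((os.cpu_count() or 1) - 1):
        if os.fork() == 0:                   # child: stop forking, just serve
            break
    worker()                                 # one listener per process/core
```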

Igarashi, Takeo, Shono, Naoyuki, Kin, Taichi, Saito, Toki.  2016.  Interactive Volume Segmentation with Threshold Field Painting. Proceedings of the 29th Annual Symposium on User Interface Software and Technology. :403–413.

An interactive method for segmentation and isosurface extraction of medical volume data is proposed. In conventional methods, users decompose a volume into multiple regions iteratively, segment each region using a threshold, and then manually clean the segmentation result by removing clutter in each region. However, this is tedious and requires many mouse operations from different camera views. We propose an alternative approach whereby the user simply applies painting operations to the volume using tools commonly seen in painting systems, such as flood fill and brushes. This significantly reduces the number of mouse and camera control operations. Our technical contribution is in the introduction of the threshold field, which assigns spatially-varying threshold values to individual voxels. This generalizes discrete decomposition of a volume into regions and segmentation using a constant threshold in each region, thereby offering a much more flexible and efficient workflow. This paper describes the details of the user interaction and its implementation. Furthermore, the results of a user study are discussed. The results indicate that the proposed method can be a few times faster than a conventional method.
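
The threshold-field idea is easy to state in code: every voxel carries its own threshold, painting tools edit that field, and segmentation is a per-voxel comparison. The spherical brush below is an assumed tool shape for illustration, not the paper's exact toolset.

```python
import numpy as np

def paint_threshold(field, center, radius, value):
    """Set the per-voxel threshold to `value` inside a spherical brush footprint."""
    grid = np.indices(field.shape)                         # (3, Z, Y, X) coordinates
    dist2 = sum((g - c) ** 2 for g, c in zip(grid, center))
    field[dist2 <= radius ** 2] = value

def segment(volume, threshold_field):
    """A voxel is foreground wherever its intensity meets its own threshold."""
    return volume >= threshold_field

volume = np.random.default_rng(0).random((64, 64, 64))     # stand-in CT/MR volume
thresholds = np.full(volume.shape, 0.7)                    # global default threshold
paint_threshold(thresholds, center=(32, 32, 32), radius=10, value=0.4)
mask = segment(volume, thresholds)                         # boolean segmentation
print(mask.sum(), "voxels selected")
```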

Yashiro, Hisashi, Terai, Masaaki, Yoshida, Ryuji, Iga, Shin-ichi, Minami, Kazuo, Tomita, Hirofumi.  2016.  Performance Analysis and Optimization of Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on the K Computer and TSUBAME2.5. Proceedings of the Platform for Advanced Scientific Computing Conference. :3:1–3:8.

We summarize the optimization and performance evaluation of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on two different types of supercomputers: the K computer and TSUBAME2.5. First, we evaluated and improved several kernels extracted from the model code on the K computer. We did not significantly change the loop and data ordering, making sufficient use of the features of the K computer, such as the hardware-aided thread barrier mechanism and the relatively high bandwidth of the memory, i.e., a 0.5 Byte/FLOP ratio. Loop optimizations and code cleaning for a reduction in memory transfer contributed to a speed-up of the model execution time. The main loop of NICAM sustained 0.87 PFLOPS with 81,920 nodes on the K computer. For GPU-based calculations, we applied OpenACC to the dynamical core of NICAM. The performance and scalability were evaluated using the TSUBAME2.5 supercomputer. We achieved good performance results, which showed efficient use of the memory throughput performance of the GPU as well as good weak scalability. A dry dynamical core experiment was carried out using 2560 GPUs, which achieved 60 TFLOPS of sustained performance.

Wang, Xi, Sun, Zhenfeng, Zhang, Wenqiang, Zhou, Yu, Jiang, Yu-Gang.  2016.  Matching User Photos to Online Products with Robust Deep Features. Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. :7–14.

This paper focuses on a practically very important problem of matching a real-world product photo to exactly the same item(s) in online shopping sites. The task is extremely challenging because the user photos (i.e., the queries in this scenario) are often captured in uncontrolled environments, while the product images in online shops are mostly taken by professionals with clean backgrounds and perfect lighting conditions. To tackle the problem, we study deep network architectures and training schemes, with the goal of learning a robust deep feature representation that is able to bridge the domain gap between the user photos and the online product images. Our contributions are two-fold. First, we propose an alternative of the popular contrastive loss used in siamese deep networks, namely robust contrastive loss, where we "relax" the penalty on positive pairs to alleviate over-fitting. Second, a multi-task fine-tuning approach is introduced to learn a better feature representation, which not only incorporates knowledge from the provided training photo pairs, but also explores additional information from the large ImageNet dataset to regularize the fine-tuning procedure. Experiments on two challenging real-world datasets demonstrate that both the robust contrastive loss and the multi-task fine-tuning approach are effective, leading to very promising results with a time cost suitable for real-time retrieval.
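
One plausible reading of the relaxed positive-pair penalty is sketched below: matching pairs are only penalized once their distance exceeds a small slack margin, while non-matching pairs keep the usual hinge on a larger margin. The exact form and margin values used in the paper may differ.

```python
import torch

def robust_contrastive_loss(d, y, neg_margin=1.0, pos_slack=0.2):
    """d: pairwise distances; y: 1 for matching pairs, 0 for non-matching."""
    pos = y * torch.clamp(d - pos_slack, min=0.0) ** 2         # relaxed pull on positives
    neg = (1 - y) * torch.clamp(neg_margin - d, min=0.0) ** 2  # standard push on negatives
    return (pos + neg).mean()

d = torch.tensor([0.1, 0.5, 1.4])   # distances between user photo / product image pairs
y = torch.tensor([1.0, 1.0, 0.0])   # first two pairs show the same product
print(robust_contrastive_loss(d, y))
```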

Chen, Yu-Ting, Cong, Jason, Fang, Zhenman, Zhou, Peipei.  2016.  ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture (Abstract Only). Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. :281–281.

Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains. In this paper we design and implement the ARAPrototyper to enable rapid design space explorations for ARAs in real silicon and reduce the tedious prototyping efforts. First, ARAPrototyper provides a reusable baseline prototype with a highly customizable memory system, including interconnect between accelerators and buffers, interconnect between buffers and last-level cache (LLC) or DRAM, coherency choice at LLC or DRAM, and address translation support. To provide more insights into performance analysis, ARAPrototyper adds several performance counters on the accelerator side and leverages existing performance counters on the CPU side. Second, ARAPrototyper provides a clean interface to quickly integrate a user's own accelerators written in high-level synthesis (HLS) code. Then, an ARA prototype can be automatically generated and mapped to a Xilinx Zynq SoC. To quickly develop applications that run seamlessly on the ARA prototype, ARAPrototyper provides a system software stack and abstracts the accelerators as software libraries for application developers. Our results demonstrate that ARAPrototyper enables a wide range of design space explorations for ARAs at manageable prototyping efforts and 4,000 to 10,000X faster evaluation time than full-system simulations. We believe that ARAPrototyper can be an attractive alternative for ARA design and evaluation.

Ahmed, Sadia.  2016.  Time and Frequency Domain Analysis and Measurement Results of Varying Acoustic Signal to Determine Water Pollutants in the Rio Grande River. Proceedings of the 11th ACM International Conference on Underwater Networks & Systems. :30:1–30:2.

Water covers about three-fourths of the earth's surface and is directly and indirectly polluted in many ways. Therefore, it is of vital importance to monitor water pollution levels effectively and regularly. It is a well-known fact that changes in the water medium and its parameters directly affect the propagation of an acoustic signal through it. As a result, time and frequency domain analysis of an acoustic signal propagating through water can be a valuable indicator of water pollution. Preliminary investigative results on determining water contaminants using an acoustic signal in an indoor laboratory tank environment were presented in [1]. This paper presents an extended abstract of the continuing research involving a time and frequency domain analysis of an acoustic signal in the presence of three water pollutants, namely fertilizer, household detergent, and pesticide. A measurement will be conducted in the Rio Grande River, Espanola, NM, at three different locations by transmitting a single pulse through the water at different depths and distances. The same measurement will be conducted in a tank with clean water and in a tank with the three pollutants added separately. The three sets of received signals from the three measurements will be compared to each other. The sets of received signals from the measurement results will be compared to the simulated time and frequency domain response of the acoustic signal for validation. To the best knowledge of the author(s), utilizing an acoustic signal and its properties to determine water pollutants using the proposed method is a new approach.
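
A sketch of the kind of time/frequency comparison described, using synthetic stand-ins for the received pulses; the sample rate, pulse frequency, and attenuation/noise levels are arbitrary assumptions rather than measured values.

```python
import numpy as np

fs = 96_000                                      # sample rate in Hz (assumed)
t = np.arange(0, 0.01, 1 / fs)
pulse = np.sin(2 * np.pi * 20_000 * t)           # 20 kHz test pulse (assumed)

rng = np.random.default_rng(1)
clean_rx = 0.8 * pulse + 0.01 * rng.standard_normal(t.size)
polluted_rx = 0.5 * pulse + 0.05 * rng.standard_normal(t.size)  # extra loss + scatter

def spectrum(x):
    return np.abs(np.fft.rfft(x)) / x.size

freqs = np.fft.rfftfreq(t.size, 1 / fs)
peak = spectrum(clean_rx).argmax()
attenuation_db = 20 * np.log10(spectrum(clean_rx)[peak] / spectrum(polluted_rx)[peak])
print(f"pulse peak at {freqs[peak]:.0f} Hz, "
      f"attenuation relative to clean water: {attenuation_db:.1f} dB")
```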

Kannao, Raghvendra, Guha, Prithwijit.  2016.  Generic TV Advertisement Detection Using Progressively Balanced Perceptron Trees. Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. :8:1–8:8.

Automatic detection of TV advertisements is of paramount importance for various media monitoring agencies. Existing works in this domain have mostly focused on news channels using news-specific features. Most commercial products use near-copy detection algorithms instead of generic advertisement classification. A generic detector needs to handle the inter-class and intra-class imbalances present in the data due to variability in content aired across channels and frequent repetition of advertisements. Imbalances present in the data make classifiers biased towards one of the classes and thus require special treatment. We propose to use a tree of perceptrons to solve this problem. The training data available for each perceptron node is balanced using cluster-based over-sampling and Tomek link cleaning as we traverse the tree downwards. The trained perceptron node then passes the original unbalanced data to its children. This process is repeated recursively till we reach the leaf nodes. We call this new algorithm the "Progressively Balanced Perceptron Tree". We have also contributed a TV advertisements dataset consisting of 250 hours of videos recorded from five non-news TV channels of different genres. Experiments on this dataset have shown that the proposed approach has comparatively superior and balanced performance with respect to six baseline methods. Our proposal generalizes well across channels, works with varying training data sizes, and achieved a top F1-score of 97% in detecting advertisements.
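
Per-node balancing can be sketched as follows for a single node; the paper uses cluster-based over-sampling, for which plain SMOTE combined with Tomek-link cleaning (imbalanced-learn's SMOTETomek) stands in here, and the recursive tree construction is omitted.

```python
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

# Heavily imbalanced stand-in data: ~5% "advertisement" frames vs 95% "programme".
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X, y)  # over-sample + Tomek clean
node = Perceptron(random_state=0).fit(X_bal, y_bal)           # train this tree node

# In the full algorithm, the original unbalanced (X, y) would now be routed by
# this node's decision to its children and the balancing repeated recursively.
print(f"node accuracy on original data: {node.score(X, y):.3f}")
```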

Espinosa, Floren Alexis T., Guerrero III, Guillermo Gohan E., Vea, Larry A..  2016.  Modeling Free-form Handwriting Gesture User Authentication for Android Smartphones. Proceedings of the International Conference on Mobile Software Engineering and Systems. :3–6.

Smartphones nowadays are customized to help users with their daily tasks such as storing important data or making transactions through the internet. With the sensitivity of the data involved, authentication mechanisms such as fixed-text passwords, PINs, or unlock patterns are used to safeguard these data against intruders. However, these mechanisms carry the risk of security threats such as cracking or shoulder surfing. To enhance mobile and/or information security, this study aimed to develop free-form handwriting gesture user authentication for smartphones. It also tried to discover the static and dynamic handwriting features that significantly influence the recognition of a legitimate user. The experiment was conducted by asking thirty (30) individuals to draw or swipe their desired free-form security pattern with a fingertip ten (10) times. These patterns were then cleaned and processed, and seven (7) static and eleven (11) dynamic handwriting features were extracted. By means of the Neural Network classifier of the RapidMiner data mining tool, these features were used to develop, validate, and test a model for user authentication. The model showed a very promising recognition rate of 96.67%. The model was further tested through a prototype, and it still gave a very satisfactory result.
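
A hedged sketch of the modelling step: handcrafted static/dynamic features computed from a fingertip trace, fed to a neural-network classifier. The paper used RapidMiner's Neural Net operator and seven static plus eleven dynamic features; scikit-learn's MLPClassifier and the two toy features below are stand-ins, trained on synthetic traces.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def gesture_features(points: np.ndarray, timestamps: np.ndarray) -> np.ndarray:
    """points: (n, 2) fingertip trace; timestamps: (n,) seconds."""
    seg = np.diff(points, axis=0)
    path_len = np.linalg.norm(seg, axis=1).sum()              # static: total path length
    mean_speed = path_len / (timestamps[-1] - timestamps[0])  # dynamic: average speed
    return np.array([path_len, mean_speed])

def random_trace(rng, n=50):
    pts = np.cumsum(rng.normal(scale=5.0, size=(n, 2)), axis=0)  # synthetic fingertip path
    return pts, np.linspace(0.0, 1.0, n)

rng = np.random.default_rng(0)
X = np.array([gesture_features(*random_trace(rng)) for _ in range(60)])
y = np.repeat(np.arange(6), 10)                               # 6 "users", 10 samples each
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                      random_state=0).fit(X, y)
```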

Fletcher, Kathryn.  2016.  Developing Best Practices for Qualtrics Administration. Proceedings of the 2016 ACM on SIGUCCS Annual Conference. :89–94.

In 2013, West Virginia University consolidated a few individually purchased college and individual licenses for Qualtrics survey software into a single campus-wide license that includes all of our colleges and regional campuses, to be implemented as a campus standard and enterprise solution. Due to some staff reorganizations over the past two years, the other Qualtrics brand administrators at WVU and I are all new to this administrative role. In this paper, I plan to share lessons that I learned while (1) participating in developing and documenting new business processes, (2) transitioning to serve as the main brand administrator, (3) cleaning up user accounts that had not been actively managed for years, and (4) working with the Qualtrics vendor, local group administrators, my IT colleagues, and campus users as we refine a set of best practices for product usage and administration. Although this paper discusses a campus-wide implementation of Qualtrics survey software, I feel that the lessons I learned during this process could be extrapolated to the development of best practices for other products or IT services.

West, Ruth, Kajihara, Meghan, Parola, Max, Hays, Kathryn, Hillard, Luke, Carlew, Anne, Deutsch, Jeremey, Lane, Brandon, Holloway, Michelle, John, Brendan et al..  2016.  Eliciting Tacit Expertise in 3D Volume Segmentation. Proceedings of the 9th International Symposium on Visual Information Communication and Interaction. :59–66.

The output of 3D volume segmentation is crucial to a wide range of endeavors. Producing accurate segmentations often proves to be both inefficient and challenging, in part due to limitations in imaging data quality (contrast and resolution), and because of ambiguity in the data that can only be resolved with higher-level knowledge of the structure and the context wherein it resides. Automatic and semi-automatic approaches are improving, but in many cases still fail or require substantial manual clean-up or intervention. Expert manual segmentation and review is therefore still the gold standard for many applications. Unfortunately, existing tools (both custom-made and commercial) are often designed around the underlying algorithm, not the best method for expressing higher-level intention. Our goal is to analyze manual (or semi-automatic) segmentation to gain a better understanding of both low-level (perceptual tasks and actions) and high-level decision making. This can be used to produce segmentation tools that are more accurate, efficient, and easier to use. Questioning or observation alone is insufficient to capture this information, so we utilize a hybrid capture protocol that blends observation, surveys, and eye tracking. We then developed, and validated, data coding schemes capable of discerning low-level actions and overall task structures.

Inoue, Jun, Kiselyov, Oleg, Kameyama, Yukiyoshi.  2016.  Staging Beyond Terms: Prospects and Challenges. Proceedings of the 2016 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. :103–108.

Staging is a program generation paradigm with a clean, well-investigated semantics which statically ensures that the generated code is always well-typed and well-scoped. Staging is often used for specializing programs to the known properties or parts of data to improve efficiency, but so far it has been limited to generating terms. This short paper describes our ongoing work on extending staging, with its strong safety guarantees, to the generation of non-terms, focusing on ML-style modules. The purpose is to map out the promises and challenges, and then to pose a question to solicit the community's expertise in evaluating how essential our extensions are for the purpose of applying staging beyond the realm of terms. We demonstrate our extensions' use in specializing functor applications to eliminate their (currently large) overhead in OCaml. We explain the challenges that those extensions bring in and identify a promising line of attack. Unexpectedly, however, it turns out that we can avoid module generation altogether by representing modules, possibly containing abstract types, as polymorphic records. With the help of first-class modules, module specialization reduces to ordinary term specialization, which can be done with conventional staging. The extent to which this hack generalizes is unclear. Thus we have a question for the community: is there a compelling use case for module generation? With these insights and questions, we offer a starting point for a long-term program in the next stage of staging research.

Talbot, Jeremie, Piretti, Mark, Singleton, Kevin, Hessler, Mark.  2016.  Designing an Interaction with an Octopus. ACM SIGGRAPH 2016 Talks. :43:1–43:2.

In Pixar's Finding Dory, we are introduced to a new character: Hank the Octopus. This is a very different character from any Pixar has been asked to animate before. Our directors demanded both precise control and graceful, clean silhouettes. The reference artwork we were given showed complex curves between arms and body without any disjointed shapes or breaks in form. Video of an octopus in motion reveals an infinitely malleable creature capable of an enormous shape language. This art direction required a small group of TDs to create a control scheme that was sensible, flexible, and offered a new level of control in order for animators to bring Hank to life. We had to think deeply about everything from the tips of the fingers all the way through how the tentacles connect to the mouth corners and eye sockets. Each of these issues raised concerns around design, deformation, and finally how the end user can manipulate such complexity effectively.

Imajo, Tomoaki, Sumiya, Kazutoshi, Ushiama, Taketoshi.  2016.  An SNS Based on Implicit Beneficial Social Relations in A Regional Community. Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication. :47:1–47:7.

In this paper, we propose a novel Social Networking Service (SNS) for a regional community. The purpose of the SNS is to support and encourage people by making them aware of beneficial social relations in the real world. Conventional SNSs can hardly deal with beneficial social relations, because they are implicit and dynamic. The proposed SNS is designed to provide positive information for two types of people: people who do community voluntary work, such as cleaning, as contributors, and people who receive benefits from that work as beneficiaries. This paper introduces the basic scheme of the SNS for beneficial social relations and evaluates the effectiveness of our scheme based on the results of experimental studies. The experimental results show that users of our SNS tend to consider information about voluntary work valuable if it has been performed in their living area, which suggests that our proposed SNS would work well in a regional community.

Kim, Kunho, Giles, C. Lee.  2016.  Financial Entity Record Linkage with Random Forests. Proceedings of the Second International Workshop on Data Science for Macro-Modeling. :13:1–13:2.

Record linkage refers to the task of finding the same entity across different databases. We propose a machine learning based record linkage algorithm for financial entity databases. Record linkage on financial databases is essential for information integration on a given financial entity, since those databases do not have a common unified identifier. Our algorithm works in two steps to determine whether a pair of records refers to the same entity or not. First, we check with proposed rules whether the record pair can be exactly matched after cleaning the entity name and address. Second, inspired by earlier work on author name disambiguation, we train a binary Random Forest classifier to decide the linkage. To reduce and scale the computation, this process is done only for candidate pairs selected by a proposed heuristic. Initial evaluation of precision, recall, and F1 measures on two different linking tasks in the Financial Entity Identification and Information Integration (FEIII) Challenge shows promising results.
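
The two-step decision can be sketched as a rule check followed by a classifier fallback; the cleaning function, comparison features, and toy training pairs below are placeholders rather than the FEIII submission's.

```python
import re
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def clean(text: str) -> str:
    return re.sub(r"[^a-z0-9]", "", text.lower())

def featurize(a: dict, b: dict) -> list:
    return [SequenceMatcher(None, clean(a["name"]), clean(b["name"])).ratio(),
            float(a["address"].split()[-1] == b["address"].split()[-1])]

# Tiny stand-in training set of labelled candidate pairs (1 = match, 0 = non-match).
pairs = [({"name": "ACME Bank NA", "address": "1 Main St New York"},
          {"name": "Acme Bank, N.A.", "address": "1 Main Street New York"}, 1),
         ({"name": "ACME Bank NA", "address": "1 Main St New York"},
          {"name": "Zeta Credit Union", "address": "9 Elm Rd Boston"}, 0)]
clf = RandomForestClassifier(random_state=0).fit(
    [featurize(a, b) for a, b, _ in pairs], [label for _, _, label in pairs])

def is_same_entity(a: dict, b: dict) -> bool:
    # Step 1: rule-based exact match after cleaning name and address.
    if clean(a["name"]) == clean(b["name"]) and clean(a["address"]) == clean(b["address"]):
        return True
    # Step 2: otherwise let the binary Random Forest decide for this candidate pair.
    return bool(clf.predict([featurize(a, b)])[0])

print(is_same_entity(*pairs[0][:2]), is_same_entity(*pairs[1][:2]))
```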

Erete, Sheena, Ryou, Emily, Smith, Geoff, Fassett, Khristina Marie, Duda, Sarah.  2016.  Storytelling with Data: Examining the Use of Data by Non-Profit Organizations. Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. :1273–1283.

Despite the growing promotion of the “open data” movement, the collection, cleaning, management, interpretation, and dissemination of open data are laborious and cost-intensive, particularly for non-profits with limited resources. In this paper, we describe how non-profit organizations (NPOs) use open data, building on prior literature that focuses on understanding the challenges that NPOs face. Based on 15 interviews of staff from 10 NPOs, our results suggest that NPOs use data to develop narratives to build a case for support from grantors and other stakeholders. We then present empirical results based on the usage of a data portal we created, which suggest that technologies should be designed not only to make data accessible, but also to facilitate communication and support relationships between expert data analysts and NPOs.

Chung, Yeounoh, Mortensen, Michael Lind, Binnig, Carsten, Kraska, Tim.  2016.  Estimating the Impact of Unknown Unknowns on Aggregate Query Results. Proceedings of the 2016 International Conference on Management of Data. :861–876.

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
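
The core intuition, that overlap between independently collected sources reveals how much is missing, can be illustrated with a Lincoln-Petersen-style capture-recapture estimate; the paper's estimators are more refined and handle many sources, so this is only a minimal illustration.

```python
def estimate_total(source_a: set, source_b: set) -> float:
    """Lincoln-Petersen estimate of the total population from two samples."""
    overlap = len(source_a & source_b)
    if overlap == 0:
        raise ValueError("no overlap between sources: total cannot be estimated")
    return len(source_a) * len(source_b) / overlap

a = {"acme", "globex", "initech", "umbrella", "hooli"}   # entities seen by source A
b = {"acme", "globex", "stark", "wayne"}                 # entities seen by source B
n_hat = estimate_total(a, b)                             # 5 * 4 / 2 = 10 estimated entities
unknown_unknowns = n_hat - len(a | b)                    # ~3 entities missed by both
print(n_hat, unknown_unknowns)
```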

Alfano, Marco, Lenzitti, Biagio, Lo Bosco, Giosuè, Taibi, Davide.  2016.  A Framework for Opening Data and Creating Advanced Services in the Health and Social Fields. Proceedings of the 17th International Conference on Computer Systems and Technologies 2016. :57–64.

Open data is publicly available data that can be universally and readily accessed, used, and redistributed. Open data holds particular potential in the health and social sectors but, presently, health and social data are often published in a 'closed' format. There are different tools that allow one to 'open' data and to clean, structure, and process them in order to build advanced services, but, unfortunately, there is no single tool that can be used to perform all of these different tasks. We believe that the availability of Open Data in the health and social fields should be greatly increased and that a way of creating new health and social services should be provided. In this paper, we present a framework that allows the creation of health and social Open Data from whatever is available on the web and the easy construction of advanced services based on those data.