Biblio

Found 112 results

Filters: Keyword is pubcrawl170201
2017-03-07
Agnihotri, Lalitha, Mojarad, Shirin, Lewkow, Nicholas, Essa, Alfred.  2016.  Educational Data Mining with Python and Apache Spark: A Hands-on Tutorial. Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. :507–508.

An enormous amount of educational data has been accumulated through Massive Open Online Courses (MOOCs), as well as commercial and non-commercial learning platforms. This is in addition to the educational data released by the US government since 2012 to facilitate disruption in education by making data freely available. The high volume, variety and velocity of collected data necessitate the use of big data tools and storage systems such as distributed databases for storage and Apache Spark for analysis. This tutorial will introduce researchers and faculty to real-world applications involving data mining and predictive analytics in learning sciences. In addition, the tutorial will introduce the statistics required to validate and accurately report results. Topics will cover how big data is being used to transform education. Specifically, we will demonstrate how exploratory data analysis, data mining, predictive analytics, machine learning, and visualization techniques are being applied to educational big data to improve learning and scale insights derived from millions of students' records. The tutorial will be held over a half day and will be hands-on with pre-posted material. Due to the interdisciplinary nature of the work, the tutorial appeals to researchers from a wide range of backgrounds including big data, predictive analytics, learning sciences, educational data mining, and, in general, those interested in how big data analytics can transform learning. As a prerequisite, attendees are required to have familiarity with at least one programming language.
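
To give a flavor of the hands-on material, a minimal PySpark sketch of the kind of exploratory analysis the tutorial covers might look like the following (the CSV file name and the student_id, score, and minutes_active columns are illustrative assumptions, not the tutorial's actual dataset):

```python
# Minimal PySpark sketch: exploratory analysis over hypothetical student records.
# Column names (student_id, score, minutes_active) are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("edm-tutorial-sketch").getOrCreate()

# Load a (hypothetical) CSV of per-student activity records.
df = spark.read.csv("student_records.csv", header=True, inferSchema=True)

# Simple exploratory aggregates: average score and time-on-task per student.
summary = (df.groupBy("student_id")
             .agg(F.avg("score").alias("avg_score"),
                  F.sum("minutes_active").alias("total_minutes")))

summary.orderBy(F.desc("avg_score")).show(10)
spark.stop()
```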

Madaio, Michael, Chen, Shang-Tse, Haimson, Oliver L., Zhang, Wenwen, Cheng, Xiang, Hinds-Aldrich, Matthew, Chau, Duen Horng, Dilkina, Bistra.  2016.  Firebird: Predicting Fire Risk and Prioritizing Fire Inspections in Atlanta. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :185–194.

The Atlanta Fire Rescue Department (AFRD), like many municipal fire departments, actively works to reduce fire risk by inspecting commercial properties for potential hazards and fire code violations. However, AFRD's fire inspection practices relied on tradition and intuition, with no existing data-driven process for prioritizing fire inspections or identifying new properties requiring inspection. In collaboration with AFRD, we developed the Firebird framework to help municipal fire departments identify and prioritize commercial property fire inspections, using machine learning, geocoding, and information visualization. Firebird computes fire risk scores for over 5,000 buildings in the city, with true positive rates of up to 71% in predicting fires. It has identified 6,096 new potential commercial properties to inspect, based on AFRD's criteria for inspection. Furthermore, through an interactive map, Firebird integrates and visualizes fire incidents, property information and risk scores to help AFRD make informed decisions about fire inspections. Firebird has already begun to make a positive impact at both local and national levels. It is improving AFRD's inspection processes and Atlanta residents' safety, and was highlighted by the National Fire Protection Association (NFPA) as a best practice for using data to inform fire inspections.
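
As a rough illustration of risk-based inspection prioritization, the sketch below fits a classifier on historical outcomes and ranks uninspected properties by predicted fire risk; the synthetic features and the random-forest choice are assumptions for illustration, not Firebird's actual model:

```python
# Schematic risk-scoring sketch (features and model choice are illustrative,
# not necessarily Firebird's exact pipeline).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Hypothetical per-property features: building age, floor area, prior violations, ...
X_hist = rng.normal(size=(3000, 4))
y_hist = (X_hist[:, 2] + rng.normal(scale=0.7, size=3000) > 1.0).astype(int)  # past fire yes/no

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_hist, y_hist)

X_new = rng.normal(size=(500, 4))                 # properties not yet inspected
risk = model.predict_proba(X_new)[:, 1]           # fire-risk score per property
priority = np.argsort(-risk)[:20]                 # top-20 properties to inspect first
print("inspect first (property indices):", priority)
```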

DiScala, Michael, Abadi, Daniel J..  2016.  Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data. Proceedings of the 2016 International Conference on Management of Data. :295–310.

Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
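
A much-simplified sketch of the underlying idea, flattening nested key-value records into normalized relational tables with de-duplicated entities (the order/customer/items schema is a made-up example, not the paper's algorithm):

```python
# Minimal sketch of flattening denormalized, nested records into relational tables.
records = [
    {"order_id": 1, "customer": {"name": "Ada", "city": "Paris"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 2, "customer": {"name": "Bob", "city": "Oslo"},
     "items": [{"sku": "A1", "qty": 5}]},
]

orders, customers, items = [], {}, []
for rec in records:
    cust = rec["customer"]
    # De-duplicate the nested "customer" entity into its own table.
    cust_id = customers.setdefault((cust["name"], cust["city"]), len(customers) + 1)
    orders.append({"order_id": rec["order_id"], "customer_id": cust_id})
    # Promote the repeated "items" array to a child table with a foreign key.
    for item in rec["items"]:
        items.append({"order_id": rec["order_id"], **item})

customer_table = [{"customer_id": v, "name": k[0], "city": k[1]} for k, v in customers.items()]
print(orders, customer_table, items, sep="\n")
```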

Zhang, Ce, Shin, Jaeho, Ré, Christopher, Cafarella, Michael, Niu, Feng.  2016.  Extracting Databases from Dark Data with DeepDive. Proceedings of the 2016 International Conference on Management of Data. :847–859.

DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data — scientific papers, Web classified ads, customer service notes, and so on — were instead in a relational database, it would give analysts access to a massive and highly-valuable new set of "big data" to exploit. DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.

Lau, Billy Pik Lik, Chaturvedi, Tanmay, Ng, Benny Kai Kiat, Li, Kai, Hasala, Marakkalage S., Yuen, Chau.  2016.  Spatial and Temporal Analysis of Urban Space Utilization with Renewable Wireless Sensor Network. Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. :133–142.

Space utilization is an important element for a smart city in determining how well public spaces are being utilized. Such information could also provide valuable feedback to urban developers on the factors that impact space utilization. The spatial and temporal information for space utilization can be studied and further analyzed to generate insights about a particular space. In our research context, these elements are translated into part of big data and the Internet of Things (IoT) to eliminate the need for on-site investigation. However, there are a number of challenges for large-scale deployment, e.g., hardware cost, computation capability, communication bandwidth, scalability, data fragmentation, and resident privacy. In this paper, we design and prototype a Renewable Wireless Sensor Network (RWSN), which addresses the aforementioned challenges. Finally, results of the analysis based on the initial data collected are presented.

Pohjalainen, Jouni, Ringeval, Fabien, Zhang, Zixing, Schuller, Björn.  2016.  Spectral and Cepstral Audio Noise Reduction Techniques in Speech Emotion Recognition. Proceedings of the 2016 ACM on Multimedia Conference. :670–674.

Signal noise reduction can improve the performance of machine learning systems dealing with time signals such as audio. Real-life applicability of these recognition technologies requires the system to uphold its performance level in variable, challenging conditions such as noisy environments. In this contribution, we investigate audio signal denoising methods in cepstral and log-spectral domains and compare them with common implementations of standard techniques. The different approaches are first compared generally using averaged acoustic distance metrics. They are then applied to automatic recognition of spontaneous and natural emotions under simulated smartphone-recorded noisy conditions. Emotion recognition is implemented as support vector regression for continuous-valued prediction of arousal and valence on a realistic multimodal database. In the experiments, the proposed methods are found to generally outperform standard noise reduction algorithms.
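
For orientation, here is a minimal spectral-subtraction sketch in NumPy, the classic baseline that log-spectral and cepstral denoising methods are typically compared against (the array shapes and the spectral floor value are illustrative assumptions):

```python
# Minimal spectral-subtraction sketch (illustrative only; the paper compares
# several cepstral and log-spectral denoising variants against such baselines).
import numpy as np

def spectral_subtraction(frames, noise_frames, floor=0.01):
    """frames: (n_frames, n_bins) STFT magnitudes; noise_frames: noise-only magnitudes."""
    noise_est = noise_frames.mean(axis=0)        # average noise magnitude spectrum
    cleaned = frames - noise_est                 # subtract the noise estimate
    return np.maximum(cleaned, floor * frames)   # spectral floor to avoid negative magnitudes

# Toy usage with random "spectra"; real use would start from an STFT of the audio.
rng = np.random.default_rng(0)
noisy = np.abs(rng.normal(size=(100, 257)))
noise_only = np.abs(rng.normal(scale=0.3, size=(20, 257)))
denoised = spectral_subtraction(noisy, noise_only)
```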

Zarras, Apostolis, Kohls, Katharina, Dürmuth, Markus, Pöpper, Christina.  2016.  Neuralyzer: Flexible Expiration Times for the Revocation of Online Data. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. :14–25.

Once data is released to the Internet, there is little hope to successfully delete it, as it may have been duplicated, reposted, and archived in multiple places. This poses a significant threat to users' privacy and their right to permanently erase their very own data. One approach to control the implications on privacy is to assign a lifetime value to the published data and ensure that the data is no longer accessible after this point in time. However, such an approach suffers from the inability to successfully predict the right time when the data should vanish. Consequently, the author of the data can only estimate the correct time, which unfortunately can cause the premature or belated deletion of data. This paper tackles the problem of prefixed lifetimes in data deletion from a different angle and argues that alternative approaches are a desideratum for research. In our approach, we consider different criteria for when data should be deleted, such as keeping data available as long as there is sufficient interest in it, or deleting it early in cases of excessive accesses. To assist the self-destruction of data, we propose a protocol and develop a prototype, called Neuralyzer, which leverages the caching mechanisms of the Domain Name System (DNS) to ensure the successful deletion of data. Our experimental results demonstrate that our approach can completely delete published data while at the same time achieving flexible expiration times varying from a few days to several months depending on the users' interest.

Xia, Xiaoxu, Song, Wei, Chen, Fangfei, Li, Xuansong, Zhang, Pengcheng.  2016.  Effa: A ProM Plugin for Recovering Event Logs. Proceedings of the 8th Asia-Pacific Symposium on Internetware. :108–111.

While event logs generated by business processes play an increasingly significant role in business analysis, the quality of the data remains a serious problem. Automatic recovery of dirty event logs is desirable and is thus receiving more attention. However, existing methods either focus only on missing-event recovery or fall short in efficiency. To this end, we present Effa, a ProM plugin, to automatically recover event logs in light of process specifications. Based on advanced heuristics, including process decomposition and trace replaying to search for the minimum recovery, Effa achieves a balance between repairing accuracy and efficiency.

Baba, Asif Iqbal, Jaeger, Manfred, Lu, Hua, Pedersen, Torben Bach, Ku, Wei-Shinn, Xie, Xike.  2016.  Learning-Based Cleansing for Indoor RFID Data. Proceedings of the 2016 International Conference on Management of Data. :925–936.

RFID is widely used for object tracking in indoor environments, e.g., airport baggage tracking. Analyzing RFID data offers insight into the underlying tracking systems as well as the associated business processes. However, the inherent uncertainty in RFID data, including noise (cross readings) and incompleteness (missing readings), poses challenges to high-level RFID data querying and analysis. In this paper, we address these challenges by proposing a learning-based data cleansing approach that, unlike existing approaches, requires no detailed prior knowledge about the spatio-temporal properties of the indoor space and the RFID reader deployment. Requiring only minimal information about RFID deployment, the approach learns relevant knowledge from raw RFID data and uses it to cleanse the data. In particular, we model raw RFID readings as time series that are sparse because the indoor space is only partly covered by a limited number of RFID readers. We propose the Indoor RFID Multi-variate Hidden Markov Model (IR-MHMM) to capture the uncertainties of indoor RFID data as well as the correlation of moving object locations and object RFID readings. We propose three state space design methods for IR-MHMM that enable the learning of parameters while contending with raw RFID data time series. We solely use raw uncleansed RFID data for the learning of model parameters, requiring no special labeled data or ground truth. The resulting IR-MHMM based RFID data cleansing approach is able to recover missing readings and reduce cross readings with high effectiveness and efficiency, as demonstrated by extensive experimental studies with both synthetic and real data. Given enough indoor RFID data for learning, the proposed approach achieves a data cleansing accuracy comparable to or even better than state-of-the-art techniques requiring very detailed prior knowledge, making our solution superior in terms of both effectiveness and employability.
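
As background for the HMM machinery involved, here is a toy Viterbi-decoding sketch that smooths a noisy sequence of reader observations into a most-likely location sequence; the two-state model and all probabilities are made-up illustrations, not the paper's IR-MHMM:

```python
# Toy Viterbi decoding over a small HMM to smooth noisy reader observations.
# All states, transition and emission probabilities are invented for illustration.
import numpy as np

states = ["room_A", "room_B"]                    # hypothetical indoor locations
trans = np.array([[0.9, 0.1], [0.1, 0.9]])        # P(next state | state)
emit = np.array([[0.8, 0.2], [0.3, 0.7]])         # P(reader observation | state)
obs = [0, 1, 0, 0, 1]                             # observed reader IDs (noisy)

n, T = len(states), len(obs)
logp = np.full((T, n), -np.inf)
back = np.zeros((T, n), dtype=int)
logp[0] = np.log(1.0 / n) + np.log(emit[:, obs[0]])
for t in range(1, T):
    for j in range(n):
        scores = logp[t - 1] + np.log(trans[:, j]) + np.log(emit[j, obs[t]])
        back[t, j], logp[t, j] = scores.argmax(), scores.max()

# Backtrack from the best final state to recover the full path.
path = [int(logp[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
smoothed = [states[i] for i in reversed(path)]
print(smoothed)   # most likely location sequence given the noisy readings
```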

Ren, Xiang, El-Kishky, Ahmed, Ji, Heng, Han, Jiawei.  2016.  Automatic Entity Recognition and Typing in Massive Text Data. Proceedings of the 2016 International Conference on Management of Data. :2235–2239.

In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, product reviews, to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schema or hand-crafted features, they can be quickly adapted to a new domain, genre and language. We demonstrate on real datasets including various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. bio-medical domains) and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.

Ceolin, Davide, Noordegraaf, Julia, Aroyo, Lora, van Son, Chantal.  2016.  Towards Web Documents Quality Assessment for Digital Humanities Scholars. Proceedings of the 8th ACM Conference on Web Science. :315–317.

We present a framework for assessing the quality of Web documents, and a baseline of three quality dimensions: trustworthiness, objectivity and basic scholarly quality. Assessing Web document quality is a "deep data" problem necessitating approaches to handle both data size and complexity.

Schubotz, Moritz, Grigorev, Alexey, Leich, Marcus, Cohl, Howard S., Meuschke, Norman, Gipp, Bela, Youssef, Abdou S., Markl, Volker.  2016.  Semantification of Identifiers in Mathematics for Better Math Information Retrieval. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. :135–144.

Mathematical formulae are essential in science, but face challenges of ambiguity, due to the use of a small number of identifiers to represent an immense number of concepts. Corresponding to word sense disambiguation in Natural Language Processing, we disambiguate mathematical identifiers. By regarding formulae and natural text as one monolithic information source, we are able to extract the semantics of identifiers in a process we term Mathematical Language Processing (MLP). As scientific communities tend to establish standard (identifier) notations, we use the document domain to infer the actual meaning of an identifier. Therefore, we adapt the software development concept of namespaces to mathematical notation. Thus, we learn namespace definitions by clustering the MLP results and mapping those clusters to subject classification schemata. In addition, this gives fundamental insights into the usage of mathematical notations in science, technology, engineering and mathematics. Our gold standard based evaluation shows that MLP extracts relevant identifier-definitions. Moreover, we discover that identifier namespaces improve the performance of automated identifier-definition extraction, and elevate it to a level that cannot be achieved within the document context alone.
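
A deliberately naive sketch of identifier-definition pairing, the step that the MLP pipeline performs statistically over whole documents (the regex pattern and the example sentence are illustrative assumptions, not the paper's extraction method):

```python
# Toy sketch: find single-letter identifiers in a formula and pair each with a
# nearby "<identifier> denotes the <definition>" phrase in the surrounding text.
import re

text = ("Here E denotes the photon energy, h denotes the Planck constant, "
        "and f denotes the frequency.")
formula = "E = h f"

identifiers = re.findall(r"\b[A-Za-z]\b", formula)
definitions = {}
for ident in identifiers:
    m = re.search(rf"\b{ident} denotes the ([a-z ]+?)[,.]", text, re.IGNORECASE)
    if m:
        definitions[ident] = m.group(1).strip()

print(definitions)   # {'E': 'photon energy', 'h': 'Planck constant', 'f': 'frequency'}
```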

Farid, Mina, Roatis, Alexandra, Ilyas, Ihab F., Hoffmann, Hella-Franziska, Chu, Xu.  2016.  CLAMS: Bringing Quality to Data Lakes. Proceedings of the 2016 International Conference on Management of Data. :2089–2092.

With the increasing incentive of enterprises to ingest as much data as they can in what is commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, the assessment of data quality and cleaning large volumes of heterogeneous data sources become essential tasks in unveiling the value of big data. The coveted use of unstructured and semi-structured data in large volumes makes current data cleaning tools (primarily designed for relational data) not directly applicable. We present CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples). This demonstration shows how CLAMS is able to discover the constraints and the schemas they are defined on simultaneously. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data. CLAMS interacts with human experts to both validate the discovered constraints and suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and was evaluated on a real data set of 1.2 billion triples. It has been able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing huge value to the enterprise.
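
To illustrate the flavor of constraint discovery over triple data, here is a toy sketch that treats a predicate as a candidate functional constraint when it is almost always single-valued per subject, and flags violating subjects (the threshold and the employee/manager triples are made up; this is not the CLAMS algorithm):

```python
# Toy constraint discovery over RDF-style triples: if a predicate is mostly
# single-valued per subject, report it as functional and list violations.
from collections import defaultdict

triples = [
    ("emp:1", "hasManager", "emp:9"),
    ("emp:2", "hasManager", "emp:9"),
    ("emp:3", "hasManager", "emp:7"),
    ("emp:3", "hasManager", "emp:8"),   # inconsistent: two managers for emp:3
]

values = defaultdict(lambda: defaultdict(set))
for s, p, o in triples:
    values[p][s].add(o)

for predicate, per_subject in values.items():
    single_valued = sum(1 for objs in per_subject.values() if len(objs) == 1)
    if single_valued / len(per_subject) >= 0.66:   # mostly single-valued -> candidate constraint
        violations = {s: objs for s, objs in per_subject.items() if len(objs) > 1}
        print(predicate, "looks functional; violations:", violations)
```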

Liu, Yinan, Shen, Wei, Yuan, Xiaojie.  2016.  Deola: A System for Linking Author Entities in Web Document with DBLP. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. :2449–2452.

In this paper, we present Deola, an online system for author entity linking with DBLP. Unlike most existing entity linking systems, which focus on linking entities with Wikipedia and depend largely on the special features associated with Wikipedia (e.g., Wikipedia articles), Deola links author names appearing in web documents belonging to the domain of computer science with their corresponding entities in the DBLP network. This task is helpful for the enrichment of the DBLP network and the understanding of domain-specific documents. It is also challenging, due to name ambiguity and the limited knowledge existing in DBLP. Given a fragment of a domain-specific web document belonging to the domain of computer science, Deola returns the mapping entity in DBLP for each author name appearing in the input document.

Legaard, Lasse, Thomsen, Josephine Raun, Lorentzen, Christian Hannesbo, Techen, Jonas Peter.  2016.  Exploring SCI As Means of Interaction Through the Design Case of Vacuum Cleaning. Proceedings of the TEI '16: Tenth International Conference on Tangible, Embedded, and Embodied Interaction. :488–493.

This paper explores the opportunities for incorporating shape changing properties into everyday home appliances. Throughout a design research approach the vacuum cleaner is used as a design case with the overall aim of enhancing the user experience by transforming the appliance into a sensing object. Three fully functional prototypes were developed in order to illustrate how shape change can fit into the context of our homes. The shape changing functionalities are: 1) a digital power button that supports dynamic affordances, 2) an analog handle that mediates the amount of dust particles through haptic feedback and 3) a body that behaves in a lifelike manner dependent on the user treatment. We report the development and implementation of the functional prototypes as well as technical limitations and initial user reactions on the prototypes.

Ziegler, Andreas, Rothberg, Valentin, Lohmann, Daniel.  2016.  Analyzing the Impact of Feature Changes in Linux. Proceedings of the Tenth International Workshop on Variability Modelling of Software-intensive Systems. :25–32.

In a software project as large and as rapidly evolving as the Linux kernel, automated testing systems are an integral component to the development process. Extensive build and regression tests can catch potential problems in changes before they appear in a stable release. Current systems, however, do not systematically incorporate the configuration system Kconfig. In this work, we present an approach to identify relationships between configuration options. These relationships allow us to find source files which might be affected by a change to a configuration option and hence require retesting. Our findings show that the majority of configuration options only affects few files, while very few options influence almost all files in the code base. We further observe that developers sometimes value usability over clean dependency modelling, leading to counterintuitive outliers in our results.
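
A minimal sketch of the kind of option-to-file mapping this enables, scanning a kernel tree for CONFIG_* references (the directory name and the example option CONFIG_NET are assumptions; the paper's analysis additionally uses Kconfig and build-system information):

```python
# Minimal sketch of mapping configuration options to the source files that
# reference them, by scanning C sources for CONFIG_* usages.
import pathlib
import re
from collections import defaultdict

option_to_files = defaultdict(set)
pattern = re.compile(r"\bCONFIG_[A-Z0-9_]+")

for path in pathlib.Path("linux").rglob("*.c"):   # hypothetical kernel checkout
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for option in set(pattern.findall(text)):
        option_to_files[option].add(str(path))

# Files that might need retesting after a change to CONFIG_NET (illustrative option name):
print(sorted(option_to_files.get("CONFIG_NET", set()))[:10])
```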

Lappalainen, Tuomas, Virtanen, Lasse, Häkkilä, Jonna.  2016.  Experiences with Wellness Ring and Bracelet Form Factor. Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia. :351–353.

This paper explores experiences with ring and bracelet activity tracker form factors. During the first week of a 2-week field study, participants (n=6) wore non-functional mock-ups of ring and bracelet wellness trackers and provided feedback on their experiences. During the second week, participants used a commercial wellness tracking ring, which collected physical exercise and sleep data and visualized it in a mobile application. Our salient findings, based on 196 user diary entries, suggest that the ring form factor is considered beautiful, aesthetic and contributing to the wearer's image. However, the bracelet form factor is more practical for an active lifestyle, and is preferred in situations where the hands are performing tasks that require gripping objects, such as sport activities, cleaning the car, cooking and washing dishes. Users strongly identified the ring form factor as jewellery that is intended to be seen, whereas bracelets were considered hidden and inconspicuous elements of the user's ensemble.

Wang, Ju, Zhang, Lichao, Wang, Xuan, Xiong, Jie, Chen, Xiaojiang, Fang, Dingyi.  2016.  A Novel CSI Pre-processing Scheme for Device-free Localization Indoors. Proceedings of the Eighth Wireless of the Students, by the Students, and for the Students Workshop. :6–8.

Device-free indoor localization of people and objects not equipped with radios is playing a critical role in many emerging applications. This paper presents a novel channel state information (CSI) pre-processing scheme that enables accurate device-free localization indoors. The basic idea is simple: CSI is sensitive to a target's location, and by modelling the CSI measurements of multiple wireless links as a set of power-fading-based equations, the target location can be determined. However, due to rich multipath in indoor environments, the received signal strength (RSS) or even the fine-grained CSI cannot be easily modelled. We observe that even in a rich multipath environment, not all subcarriers are equally affected by multipath reflections. Our pre-processing scheme tries to identify the subcarriers not affected by multipath. Thus, CSIs on the "clean" subcarriers can be modelled and utilized for accurate localization. Extensive experiments demonstrate the effectiveness of the proposed pre-processing scheme.
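
A toy NumPy sketch of the selection idea, scoring each subcarrier by the stability of its CSI amplitude and keeping the more stable half (the coefficient-of-variation criterion and the synthetic data are illustrative assumptions, not the paper's exact method):

```python
# Toy sketch: keep only the subcarriers whose CSI amplitude is stable across
# measurements, treating the unstable ones as multipath-affected.
import numpy as np

rng = np.random.default_rng(0)
csi = np.abs(rng.normal(loc=1.0, scale=0.1, size=(200, 30)))   # (measurements, subcarriers)
csi[:, [3, 17]] += rng.normal(scale=0.8, size=(200, 2))        # two multipath-distorted subcarriers

variability = csi.std(axis=0) / csi.mean(axis=0)               # coefficient of variation per subcarrier
clean = np.where(variability < np.median(variability))[0]      # keep the more stable half
print("subcarriers kept for the localization model:", clean)
```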

Celik, Ahmet, Knaust, Alex, Milicevic, Aleksandar, Gligoric, Milos.  2016.  Build System with Lazy Retrieval for Java Projects. Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. :643–654.

In modern-day development, projects use Continuous Integration Services (CISs) to execute the build for every change in the source code. To ensure that the project remains correct and deployable, a CIS performs a clean build each time. In a clean environment, a build system needs to retrieve the project's dependencies (e.g., guava.jar). The retrieval, however, can be costly due to dependency bloat: despite a project using only a few files from each library, existing build systems still eagerly retrieve all the libraries at the beginning of the build. This paper presents a novel build system, Molly, which lazily retrieves the parts of libraries (i.e., files) that are needed during the execution of a build target. For example, the compilation target needs only the public interfaces of classes within the libraries, and the test target needs only the implementations of the classes that are invoked by the tests. Additionally, Molly generates a transfer script that retrieves parts of libraries based on prior builds. Molly's design requires that we ignore the boundaries set by the library developers and look at the files within the libraries. We implemented Molly for Java and evaluated it on 17 popular open-source projects. We show that test targets (on average) depend on only 9.97% of files in libraries. A variant of Molly speeds up retrieval by 44.28%. Furthermore, the scripts generated by Molly retrieve dependencies, on average, 93.81% faster than the Maven build system.
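
As a rough illustration of the "retrieve only what the target needs" idea, the sketch below scans Java sources for import statements to determine which class files of a dependency are actually required; this is a simplification under assumed paths, since Molly itself works on real build targets and generated transfer scripts rather than source-level imports:

```python
# Toy sketch of lazy dependency retrieval: list the library class files that a
# source tree actually imports, so only those need to be fetched.
import pathlib
import re

def needed_classes(src_dir):
    """Collect class-file paths imported by the Java sources under src_dir."""
    imports = set()
    for java_file in pathlib.Path(src_dir).rglob("*.java"):
        for line in java_file.read_text(errors="ignore").splitlines():
            m = re.match(r"\s*import\s+([\w.]+)\s*;", line)   # wildcard imports simply won't match
            if m:
                imports.add(m.group(1).replace(".", "/") + ".class")
    return imports

# A build step would then request only these entries from the dependency store,
# e.g. fetch_files("guava.jar", needed_classes("src/main/java"))  (hypothetical helper).
print(sorted(needed_classes("src/main/java"))[:10])
```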

Li, Jianshu, Zhao, Jian, Zhao, Fang, Liu, Hao, Li, Jing, Shen, Shengmei, Feng, Jiashi, Sim, Terence.  2016.  Robust Face Recognition with Deep Multi-View Representation Learning. Proceedings of the 2016 ACM on Multimedia Conference. :1068–1072.

This paper describes our proposed method targeting the MSR Image Recognition Challenge MS-Celeb-1M. The challenge is to recognize one million celebrities from their face images captured in the real world. The challenge provides a large-scale dataset crawled from the Web, which contains a large number of celebrities with many images for each subject. Given a new testing image, the challenge requires an identity for the image and the corresponding confidence score. To complete the challenge, we propose a two-stage approach consisting of data cleaning and multi-view deep representation learning. The data cleaning can effectively reduce the noise level of the training data and thus improves the performance of deep learning based face recognition models. The multi-view representation learning enables the learned face representations to be more specific and discriminative. Thus the difficulties of recognizing faces out of a huge number of subjects are substantially relieved. Our proposed method achieves a coverage of 46.1% at 95% precision on the random set and a coverage of 33.0% at 95% precision on the hard set of this challenge.

Zhang, Zhenning, Zhao, Baokang, Feng, Zhenqian, Yu, Wanrong, Wu, Chunqing.  2016.  MSN: A Mobility-enhanced Satellite Network Architecture: Poster. Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking. :465–466.

The proposed MSN architecture is intended to directly address the challenge of mobility, which refers to the motion of users as well as the dynamics of the satellite constellation. A virtual access point layer consisting of fixed virtual satellite network attachment points is superimposed over the physical topology in order to hide the mobility of satellites from the mobile endpoints. The MSN then enhances endpoint mobility by a clean separation of identity and logical network location through an identity-to-location resolution service, taking full advantage of the user's geographical location information. Moreover, an SDN-based implementation is presented to further illustrate the proposal.

Entem, E., Barthe, L., Cani, M.-P., van de Panne, M..  2016.  From Drawing to Animation-ready Vector Graphics. ACM SIGGRAPH 2016 Posters. :52:1–52:2.

We present an automatic method to build a layered vector graphics structure ready for animation from a clean-line vector drawing of an organic, smooth shape. Inspired by 3D segmentation methods, we introduce a new metric computed on the medial axis of a region to identify and quantify the visual salience of a sub-region relative to the rest. This enables us to recursively separate each region into two closed sub-regions at the location of the most salient junction. The resulting structure, layered in depth, can be used to pose and animate the drawing using a regular 2D skeleton.

Francese, Rita, Gravino, Carmine, Risi, Michele, Tortora, Genoveffa, Scanniello, Giuseppe.  2016.  Estimate Method Calls in Android Apps. Proceedings of the International Conference on Mobile Software Engineering and Systems. :13–14.

In this paper, we focus on the definition of estimators to predict method calls in Android apps. Estimation models are based on information from requirements specification documents (e.g., number of actors, number of use cases, and number of classes in the conceptual model). We have used a dataset containing information on 23 Android apps. After performing data-cleaning, we applied linear regression to build estimation models on 21 data points. Results suggest that measures gathered from requirements specification documents can be considered good predictors to estimate the number of internal calls (i.e., methods invoking other methods present in the app) and external calls (i.e., invocations to API) as well as their sum.
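
A minimal sketch of this estimation setup, regressing method-call counts on requirements-level metrics (all numbers below are synthetic placeholders, not the paper's 21-app dataset):

```python
# Minimal ordinary-least-squares sketch: requirements metrics -> method-call count.
# Data values are synthetic placeholders for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical requirements metrics per app: [actors, use cases, conceptual classes]
X = np.array([[2, 5, 8], [3, 9, 14], [1, 4, 6], [4, 12, 20], [2, 7, 10]])
y = np.array([120, 260, 90, 410, 180])      # observed number of method calls

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("predicted calls for a new app:", model.predict([[3, 8, 12]])[0])
```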

Aal, Konstantin, Mouratidis, Marios, Weibert, Anne, Wulf, Volker.  2016.  Challenges of CI Initiatives in a Political Unstable Situation - Case Study of a Computer Club in a Refugee Camp. Proceedings of the 19th International Conference on Supporting Group Work. :409–412.

This poster describes the research around computer clubs in Palestinian refugee camps and the various lessons learned during the establishment of this intervention, such as the importance of the physical infrastructure (e.g. a clean room and working hardware), soft technologies (e.g. knowledge transfer through workshops), social infrastructure (e.g. reliable partners in the refugee camp and partners from the university) and social capital (e.g. a shared vision and values among all stakeholders). These important insights can be transferred to other interventions in similar unstable environments.

Heindorf, Stefan, Potthast, Martin, Stein, Benno, Engels, Gregor.  2016.  Vandalism Detection in Wikidata. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. :327–336.

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and in various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems using it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
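
A schematic sketch of this kind of evaluation setup, training a classifier on per-revision features and reporting ROC-AUC (the synthetic features and the gradient-boosting model are illustrative assumptions, not the paper's 47-feature set or its tailored classifiers):

```python
# Schematic sketch: train a classifier on revision features and report ROC-AUC.
# Features, labels and model choice are synthetic illustrations only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical per-revision features: comment length, is_anonymous, prior edits by user, ...
X = rng.normal(size=(10000, 5))
y = (X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=1.0, size=10000) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print("ROC-AUC:", roc_auc_score(y_te, scores))
```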