Bibliography
Machine learning (ML) models are often trained on private datasets that are expensive to collect or highly sensitive, and training them consumes large amounts of computing power. The resulting models are commonly exposed through online APIs, or embedded in hardware devices that are deployed in the field or handed to end users. This gives adversaries an incentive to steal ML models as a proxy for gathering the underlying datasets. While API-based model extraction has been studied before, the theft and protection of ML models on hardware devices have so far received little attention. In this work, we examine this important aspect of the design and deployment of ML models. We show how an attacker may recover either the model or its architecture through memory probing, side channels, or crafted input attacks, and we propose (1) power-efficient obfuscation as an alternative to encryption, and (2) timing side-channel countermeasures.
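As a concrete illustration of the second countermeasure, the sketch below pads every inference call to a fixed worst-case latency, so that response time leaks nothing about the model's architecture or the input. This is a minimal Python sketch of the general technique, not the paper's implementation; `run_inference`, `WORST_CASE_LATENCY`, and the 50 ms bound are illustrative placeholders.

```python
import time

# Assumed upper bound on the device's inference time, in seconds.
WORST_CASE_LATENCY = 0.050

def run_inference(model, x):
    # Placeholder for the device's actual inference call.
    return model(x)

def constant_time_inference(model, x):
    """Answer a query in (approximately) constant wall-clock time.

    Data- and architecture-dependent timing variation is hidden by
    sleeping until the assumed worst-case latency has elapsed.
    """
    start = time.monotonic()
    result = run_inference(model, x)
    remaining = WORST_CASE_LATENCY - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)  # pad fast queries up to the fixed bound
    return result
```

The design choice is the usual one for timing countermeasures: the defender trades average latency (every query now costs the worst case) for a response time that carries no side-channel signal.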
The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level: in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. On a formal level, the gap is made evident by the lack of a role for computational concepts such as "runtime" in core statistical theory and the lack of a role for statistical concepts such as "risk" in core computational theory. I present several research vignettes aimed at bridging computation and statistics, including the problem of inference under privacy and communication constraints, and ways to exploit parallelism so as to trade off the speed and accuracy of inference.
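As one concrete instance of inference under a privacy constraint, the sketch below implements the standard Laplace mechanism for releasing an epsilon-differentially private mean of bounded data. It is a minimal illustration of the general theme rather than anything specific to the talk; the function name and parameters are assumptions.

```python
import numpy as np

def dp_mean(data, lower, upper, epsilon, rng=None):
    """Epsilon-differentially private mean via the Laplace mechanism.

    Values are clipped to [lower, upper] so that one record can change
    the mean by at most (upper - lower) / n, the L1 sensitivity; Laplace
    noise calibrated to sensitivity / epsilon then masks any individual.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(np.asarray(data, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Usage: a tighter privacy budget (smaller epsilon) means noisier answers,
# exactly the statistical-risk-versus-constraint trade-off described above.
samples = np.random.default_rng(0).normal(10.0, 2.0, size=10_000)
print(dp_mean(samples, lower=0.0, upper=20.0, epsilon=0.5))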
Semantic Big Data concerns the creation of new applications that exploit the richness and flexibility of declarative semantics combined with scalable, highly distributed data management systems. In this work, we present an application scenario in which a domain ontology, OpenRefine, and the Okkam Entity Name System enable a frictionless and scalable data integration process leading to a knowledge base for tax assessment. Further, we introduce the concept of Entiton as a flexible and efficient data model suitable for large-scale data inference and analytic tasks. We successfully tested our data processing pipeline on a real-world dataset, supporting ACI Informatica in the investigation of Vehicle Excise Duty (VED) evasion in the Aosta Valley region (Italy). In addition to useful business intelligence indicators, we implemented a distributed temporal inference engine to uncover VED evasion and circulation ban violations. The results of the integration are presented to the tax agents in a powerful Siren Solutions Kibi dashboard, enabling seamless data exploration and business intelligence.
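The abstract does not spell out the inference rules, but a minimal sketch of the kind of temporal rule such an engine might evaluate is shown below: a vehicle sighted in circulation on a date not covered by any paid excise-duty interval is flagged as a VED-evasion candidate. All class, field, and function names here are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PaidInterval:
    """A period during which excise duty is paid for a given plate."""
    plate: str
    valid_from: date
    valid_to: date

def evasion_candidates(sightings, paid_intervals):
    """Yield (plate, date) pairs where a sighting falls outside every
    paid interval for that plate -- a candidate VED-evasion event."""
    by_plate = {}
    for iv in paid_intervals:
        by_plate.setdefault(iv.plate, []).append(iv)
    for plate, seen_on in sightings:
        covered = any(iv.valid_from <= seen_on <= iv.valid_to
                      for iv in by_plate.get(plate, []))
        if not covered:
            yield plate, seen_on

# Usage: the second sighting falls after the paid interval and is flagged.
sightings = [("AB123CD", date(2015, 3, 4)), ("AB123CD", date(2016, 1, 9))]
paid = [PaidInterval("AB123CD", date(2015, 1, 1), date(2015, 12, 31))]
print(list(evasion_candidates(sightings, paid)))
```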