Biblio
Filters: Keyword is Lakes
Dataset Discovery in Data Lakes. 2020 IEEE 36th International Conference on Data Engineering (ICDE). :709–720.
2020. Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes, and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability), comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
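To make the hashing idea concrete, the sketch below shows one common way (MinHash signatures) of mapping column values into a fixed-size signature space so that signature agreement approximates set overlap, a usable proxy for joinability between a target column and lake columns. This is an illustrative assumption, not the paper's exact indexing scheme; the helper names and the toy column names are hypothetical.

```python
# Illustrative sketch only: hash column values into fixed-size MinHash
# signatures so that signature distance approximates set overlap, a proxy
# for joinability between columns. Not the paper's exact algorithm.
import hashlib


def minhash_signature(values, num_hashes=64):
    """Return a MinHash signature for a set of column values (hypothetical helper)."""
    signature = []
    for seed in range(num_hashes):
        smallest = None
        for v in values:
            h = int(hashlib.sha1(f"{seed}:{v}".encode()).hexdigest(), 16)
            if smallest is None or h < smallest:
                smallest = h
        signature.append(smallest)
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Toy usage: rank candidate lake columns against a target column.
target = minhash_signature({"alice", "bob", "carol"})
candidates = {
    "customers.name": minhash_signature({"alice", "bob", "dave"}),
    "orders.id": minhash_signature({"1001", "1002", "1003"}),
}
ranked = sorted(candidates,
                key=lambda k: estimated_similarity(target, candidates[k]),
                reverse=True)
print(ranked)  # columns most related to the target come first
```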
Enabling Cyber Security Data Sharing for Large-scale Enterprises Using Managed Security Services. 2018 IEEE Conference on Communications and Network Security (CNS). :1–7.
2018. Large enterprises and organizations from both private and public sectors typically outsource a platform solution, as part of Managed Security Services (MSSs), from third-party providers (MSSPs) to monitor and analyze their data containing cyber security information. Sharing such data among these large entities is believed to improve their effectiveness and efficiency at tackling cybercrime, via improved analytics and insights. However, MSS platform customers are currently unable or unwilling to share data among themselves for multiple reasons, including privacy and confidentiality concerns, even when they are using the same MSS platform. Therefore, any proposed mechanism or technique to address this challenge needs to ensure that sharing is achieved in a secure and controlled way. In this paper, we propose a new architecture and use-case-driven designs to enable confidential, flexible and collaborative data sharing among such organizations using the same MSS platform. The MSS platform is a complex environment where different stakeholders, including authorized MSSP personnel and customers' own users, have access to the same platform but with different types of rights and tasks. Hence, we make every effort to improve the usability of the platform in supporting sharing while keeping the existing rights and tasks intact. As an innovative and pioneering attempt to address the challenge of data sharing in the MSS platform, we hope to encourage further work so that confidential and collaborative sharing eventually happens among MSS platform customers.
Efficiency Assessment of the Steganographic Coding Method with Indirect Integration of Critical Information. 2019 IEEE International Conference on Advanced Trends in Information Theory (ATIT). :36–40.
2019. The presented method of steganographic coding and embedding of a series of bits for the hidden message is developed by modifying the digital basis (bases) of the elements of the image container. Unlike other methods, coding and embedding is accomplished by changing the elements of an image fragment and then forming code structures over the established digital representation of the structural elements of the image medium. A method for estimating quantitative indicators of the embedded critical data is presented, and the number of container bits available to the developed method of steganographic coding and embedding of critical information is estimated. The efficiency of the presented method is evaluated: the volume of embedded digital data is compared against the method based on the weight coefficients of the discrete cosine transform matrix, and the developed coding method is compared with the Koch and Zhao methods to determine the resistance of the embedded data against attacks of various types. It is determined that, for different values of the quantization coefficient, the most critical are the embedded containers of critical information that are built by changing part of the digital video data basis, depending on the size of the digital basis and the number of bits of the embedded container.
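As a point of reference for the capacity and embedding notions discussed above, the sketch below hides a bit series in the least significant bits of image container elements. This is a deliberately plain LSB scheme, not the basis-modification coding the paper describes, and all names and values are illustrative.

```python
# Generic illustration of embedding a bit series into image container
# elements via least-significant-bit substitution. NOT the paper's
# basis-modification method; shown only to make capacity/embedding concrete.
def embed_bits(pixels, bits):
    """Write each message bit into the least significant bit of a pixel value."""
    if len(bits) > len(pixels):
        raise ValueError("message exceeds container capacity")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_bits(pixels, count):
    """Recover the first `count` embedded bits from the container."""
    return [p & 1 for p in pixels[:count]]


cover = [120, 121, 130, 90, 45, 200, 17, 64]  # toy 8-element container
message = [1, 0, 1, 1]                        # bits of the hidden message
stego = embed_bits(cover, message)
assert extract_bits(stego, len(message)) == message
print(stego)  # container with the message embedded in its low-order bits
```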
Big Provenance Stream Processing for Data Intensive Computations. 2018 IEEE 14th International Conference on e-Science (e-Science). :245–255.
2018. In today's business and research landscape, data analysis consumes public and proprietary data from numerous sources and utilizes one or more popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out of order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
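The sketch below illustrates the general idea of reducing fine-grained provenance on a stream: record-level edges are collapsed into one dataset-level edge per (source, derived) pair so that forward and backward lineage remain answerable at the dataset level. The event shape and the 'dataset/record' naming are assumptions for illustration, not the paper's actual stream format or algorithm; note that this aggregation is commutative, so out-of-order arrival does not change the result.

```python
# Minimal sketch (assumed event shape, not the paper's implementation):
# collapse fine-grained provenance edges into dataset-level edges while
# keeping forward/backward lineage queryable. Because the reduction is a
# commutative aggregation, out-of-order events are handled naturally.
from collections import defaultdict


def reduce_provenance(events):
    """events: iterable of (source_record, derived_record) tuples.
    Record ids are assumed to look like 'dataset/record'; the dataset name
    is the prefix before '/'. Returns dataset-level edges with record counts."""
    edges = defaultdict(int)
    for src, dst in events:
        src_ds = src.split("/", 1)[0]
        dst_ds = dst.split("/", 1)[0]
        edges[(src_ds, dst_ds)] += 1
    return dict(edges)


stream = [
    ("raw/row42", "clean/row42"),
    ("clean/row42", "report/agg7"),
    ("raw/row43", "clean/row43"),  # arrives out of order; result is unchanged
]
print(reduce_provenance(stream))
# {('raw', 'clean'): 2, ('clean', 'report'): 1}
```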