Visible to the public Biblio

Filters: Keyword is data wrangling  [Clear All Filters]
2021-08-05
Bogatu, Alex, Fernandes, Alvaro A. A., Paton, Norman W., Konstantinou, Nikolaos.  2020.  Dataset Discovery in Data Lakes. 2020 IEEE 36th International Conference on Data Engineering (ICDE). :709—720.
Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value- adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash- based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
2021-03-29
Yilmaz, I., Masum, R., Siraj, A..  2020.  Addressing Imbalanced Data Problem with Generative Adversarial Network For Intrusion Detection. 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI). :25–30.

Machine learning techniques help to understand underlying patterns in datasets to develop defense mechanisms against cyber attacks. Multilayer Perceptron (MLP) technique is a machine learning technique used in detecting attack vs. benign data. However, it is difficult to construct any effective model when there are imbalances in the dataset that prevent proper classification of attack samples in data. In this research, we use UGR'16 dataset to conduct data wrangling initially. This technique helps to prepare a test set from the original dataset to train the neural network model effectively. We experimented with a series of inputs of varying sizes (i.e. 10000, 50000, 1 million) to observe the performance of the MLP neural network model with distribution of features over accuracy. Later, we use Generative Adversarial Network (GAN) model that produces samples of different attack labels (e.g. blacklist, anomaly spam, ssh scan) for balancing the dataset. These samples are generated based on data from the UGR'16 dataset. Further experiments with MLP neural network model shows that a balanced attack sample dataset, made possible with GAN, produces more accurate results than an imbalanced one.

2017-11-03
Baravalle, A., Lopez, M. S., Lee, S. W..  2016.  Mining the Dark Web: Drugs and Fake Ids. 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). :350–356.
In the last years, governmental bodies have been futilely trying to fight against dark web marketplaces. Shortly after the closing of "The Silk Road" by the FBI and Europol in 2013, new successors have been established. Through the combination of cryptocurrencies and nonstandard communication protocols and tools, agents can anonymously trade in a marketplace for illegal items without leaving any record. This paper presents a research carried out to gain insights on the products and services sold within one of the larger marketplaces for drugs, fake ids and weapons on the Internet, Agora. Our work sheds a light on the nature of the market, there is a clear preponderance of drugs, which accounts for nearly 80% of the total items on sale. The ready availability of counterfeit documents, while they make up for a much smaller percentage of the market, raises worries. Finally, the role of organized crime within Agora is discussed and presented.