Visible to the public Biblio

Filters: Keyword is Data Science  [Clear All Filters]
2022-01-31
Zhao, Rui.  2021.  The Vulnerability of the Neural Networks Against Adversarial Examples in Deep Learning Algorithms. 2021 2nd International Conference on Computing and Data Science (CDS). :287–295.
With the further development in the fields of computer vision, network security, natural language processing and so on so forth, deep learning technology gradually exposed certain security risks. The existing deep learning algorithms cannot effectively describe the essential characteristics of data, making the algorithm unable to give the correct result in the face of malicious input. Based on current security threats faced by deep learning, this paper introduces the problem of adversarial examples in deep learning, sorts out the existing attack and defense methods of black box and white box, and classifies them. It briefly describes the application of some adversarial examples in different scenarios in recent years, compares several defense technologies of adversarial examples, and finally summarizes the problems in this research field and prospects its future development. This paper introduces the common white box attack methods in detail, and further compares the similarities and differences between the attack of black and white boxes. Correspondingly, the author also introduces the defense methods, and analyzes the performance of these methods against the black and white box attack.
2022-01-10
Zhang, Qixin.  2021.  An Overview and Analysis of Hybrid Encryption: The Combination of Symmetric Encryption and Asymmetric Encryption. 2021 2nd International Conference on Computing and Data Science (CDS). :616–622.
In the current scenario, various forms of information are spread everywhere, especially through the Internet. A lot of valuable information is contained in the dissemination, so security issues have always attracted attention. With the emergence of cryptographic algorithms, information security has been further improved. Generally, cryptography encryption is divided into symmetric encryption and asymmetric encryption. Although symmetric encryption has a very fast computation speed and is beneficial to encrypt a large amount of data, the security is not as high as asymmetric encryption. The same pair of keys used in symmetric algorithms leads to security threats. Thus, if the key can be protected, the security could be improved. Using an asymmetric algorithm to protect the key and encrypting the message with a symmetric algorithm would be a good choice. This paper will review security issues in the information transmission and the method of hybrid encryption algorithms that will be widely used in the future. Also, the various characteristics of algorithms in different systems and some typical cases of hybrid encryption will be reviewed and analyzed to showcase the reinforcement by combining algorithms. Hybrid encryption algorithms will improve the security of the transmission without causing more other problems. Additionally, the way how the encryption algorithms combine to strength the security will be discussed with the aid of an example.
2021-11-29
Nazemi, Kawa, Klepsch, Maike J., Burkhardt, Dirk, Kaupp, Lukas.  2020.  Comparison of Full-Text Articles and Abstracts for Visual Trend Analytics through Natural Language Processing. 2020 24th International Conference Information Visualisation (IV). :360–367.
Scientific publications are an essential resource for detecting emerging trends and innovations in a very early stage, by far earlier than patents may allow. Thereby Visual Analytics systems enable a deep analysis by applying commonly unsupervised machine learning methods and investigating a mass amount of data. A main question from the Visual Analytics viewpoint in this context is, do abstracts of scientific publications provide a similar analysis capability compared to their corresponding full-texts? This would allow to extract a mass amount of text documents in a much faster manner. We compare in this paper the topic extraction methods LSI and LDA by using full text articles and their corresponding abstracts to obtain which method and which data are better suited for a Visual Analytics system for Technology and Corporate Foresight. Based on a easy replicable natural language processing approach, we further investigate the impact of lemmatization for LDA and LSI. The comparison will be performed qualitative and quantitative to gather both, the human perception in visual systems and coherence values. Based on an application scenario a visual trend analytics system illustrates the outcomes.
2021-03-29
Roy, S., Dey, D., Saha, M., Chatterjee, K., Banerjee, S..  2020.  Implementation of Fuzzy Logic Control in Predictive Analysis and Real Time Monitoring of Optimum Crop Cultivation : Fuzzy Logic Control In Optimum Crop Cultivation. 2020 10th International Conference on Cloud Computing, Data Science Engineering (Confluence). :6—11.

In this article, the writers suggested a scheme for analyzing the optimum crop cultivation based on Fuzzy Logic Network (Implementation of Fuzzy Logic Control in Predictive Analysis and Real Time Monitoring of Optimum Crop Cultivation) knowledge. The Fuzzy system is Fuzzy Logic's set. By using the soil, temperature, sunshine, precipitation and altitude value, the scheme can calculate the output of a certain crop. By using this scheme, the writers hope farmers can boost f arm output. This, thus will have an enormous effect on alleviating economical deficiency, strengthening rate of employment, the improvement of human resources and food security.

2021-02-01
Jin, N..  2020.  CNN-Based Image Style Transfer and Its Applications. 2020 International Conference on Computing and Data Science (CDS). :387–390.
Convolutional neural network is a variant of deep neural network. It is widely used to extract image features and used in image classification, image generation and style transfer. In the style transfer task, we can extract the content from one picture through the convolutional neural network, extract the style from another picture, and generate a new picture by the combination. In this paper, we show the general steps of image style transfer based on convolutional neural networks through a specific example, and discuss the future possible applications.
2020-11-23
Li, W., Zhu, H., Zhou, X., Shimizu, S., Xin, M., Jin, Q..  2018.  A Novel Personalized Recommendation Algorithm Based on Trust Relevancy Degree. 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress(DASC/PiCom/DataCom/CyberSciTech). :418–422.
The rapid development of the Internet and ecommerce has brought a lot of convenience to people's life. Personalized recommendation technology provides users with services that they may be interested according to users' information such as personal characteristics and historical behaviors. The research of personalized recommendation has been a hot point of data mining and social networks. In this paper, we focus on resolving the problem of data sparsity based on users' rating data and social network information, introduce a set of new measures for social trust and propose a novel personalized recommendation algorithm based on matrix factorization combining trust relevancy. Our experiments were performed on the Dianping datasets. The results show that our algorithm outperforms traditional approaches in terms of accuracy and stability.
2020-09-04
Song, Chengru, Xu, Changqiao, Yang, Shujie, Zhou, Zan, Gong, Changhui.  2019.  A Black-Box Approach to Generate Adversarial Examples Against Deep Neural Networks for High Dimensional Input. 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC). :473—479.
Generating adversarial samples is gathering much attention as an intuitive approach to evaluate the robustness of learning models. Extensive recent works have demonstrated that numerous advanced image classifiers are defenseless to adversarial perturbations in the white-box setting. However, the white-box setting assumes attackers to have prior knowledge of model parameters, which are generally inaccessible in real world cases. In this paper, we concentrate on the hard-label black-box setting where attackers can only pose queries to probe the model parameters responsible for classifying different images. Therefore, the issue is converted into minimizing non-continuous function. A black-box approach is proposed to address both massive queries and the non-continuous step function problem by applying a combination of a linear fine-grained search, Fibonacci search, and a zeroth order optimization algorithm. However, the input dimension of a image is so high that the estimation of gradient is noisy. Hence, we adopt a zeroth-order optimization method in high dimensions. The approach converts calculation of gradient into a linear regression model and extracts dimensions that are more significant. Experimental results illustrate that our approach can relatively reduce the amount of queries and effectively accelerate convergence of the optimization method.
2020-08-28
Huang, Angus F.M., Chi-Wei, Yang, Tai, Hsiao-Chi, Chuan, Yang, Huang, Jay J.C., Liao, Yu-Han.  2019.  Suspicious Network Event Recognition Using Modified Stacking Ensemble Machine Learning. 2019 IEEE International Conference on Big Data (Big Data). :5873—5880.
This study aims to detect genuine suspicious events and false alarms within a dataset of network traffic alerts. The rapid development of cloud computing and artificial intelligence-oriented automatic services have enabled a large amount of data and information to be transmitted among network nodes. However, the amount of cyber-threats, cyberattacks, and network intrusions have increased in various domains of network environments. Based on the fields of data science and machine learning, this paper proposes a series of solutions involving data preprocessing, exploratory data analysis, new features creation, features selection, ensemble learning, models construction, and verification to identify suspicious network events. This paper proposes a modified form of stacking ensemble machine learning which includes AdaBoost, Neural Networks, Random Forest, LightGBM, and Extremely Randomised Trees (Extra Trees) to realise a high-performance classification. A suspicious network event recognition dataset for a security operations centre, which uses real network log observations from the 2019 IEEE BigData Cup Challenge, is used as an experimental dataset. This paper investigates the possibility of integrating big-data analytics, machine learning, and data science to improve intelligent cybersecurity.
2020-03-30
Miao, Hui, Deshpande, Amol.  2019.  Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization. 2019 IEEE 35th International Conference on Data Engineering (ICDE). :1710–1713.
Increasingly modern data science platforms today have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the resulting "provenance graphs" are quite verbose and evolving; further, it is very difficult for the users to compose queries and utilize this valuable information just using standard graph query model. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users get insight about the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than state-of-the-art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. We implement the operators on top of Neo4j, evaluate our query techniques extensively, and show the effectiveness and efficiency of the proposed methods.
2019-12-16
Fast, Ethan, Chen, Binbin, Mendelsohn, Julia, Bassen, Jonathan, Bernstein, Michael S..  2018.  Iris: A Conversational Agent for Complex Tasks. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. :473:1–473:12.
Today, most conversational agents are limited to simple tasks supported by standalone commands, such as getting directions or scheduling an appointment. To support more complex tasks, agents must be able to generalize from and combine the commands they already understand. This paper presents a new approach to designing conversational agents inspired by linguistic theory, where agents can execute complex requests interactively by combining commands through nested conversations. We demonstrate this approach in Iris, an agent that can perform open-ended data science tasks such as lexical analysis and predictive modeling. To power Iris, we have created a domain-specific language that transforms Python functions into combinable automata and regulates their combinations through a type system. Running a user study to examine the strengths and limitations of our approach, we find that data scientists completed a modeling task 2.6 times faster with Iris than with Jupyter Notebook.
2019-10-30
Belkin, Maxim, Haas, Roland, Arnold, Galen Wesley, Leong, Hon Wai, Huerta, Eliu A., Lesny, David, Neubauer, Mark.  2018.  Container Solutions for HPC Systems: A Case Study of Using Shifter on Blue Waters. Proceedings of the Practice and Experience on Advanced Research Computing. :43:1-43:8.

Software container solutions have revolutionized application development approaches by enabling lightweight platform abstractions within the so-called "containers." Several solutions are being actively developed in attempts to bring the benefits of containers to high-performance computing systems with their stringent security demands on the one hand and fundamental resource sharing requirements on the other. In this paper, we discuss the benefits and short-comings of such solutions when deployed on real HPC systems and applied to production scientific applications. We highlight use cases that are either enabled by or significantly benefit from such solutions. We discuss the efforts by HPC system administrators and support staff to support users of these type of workloads on HPC systems not initially designed with these workloads in mind focusing on NCSA's Blue Waters system.

2019-10-28
Huang, Jingwei.  2018.  From Big Data to Knowledge: Issues of Provenance, Trust, and Scientific Computing Integrity. 2018 IEEE International Conference on Big Data (Big Data). :2197–2205.
This paper addresses the nature of data and knowledge, the relation between them, the variety of views as a characteristic of Big Data regarding that data may come from many different sources/views from different viewpoints, and the associated essential issues of data provenance, knowledge provenance, scientific computing integrity, and trust in the data science process. Towards the direction of data-intensive science and engineering, it is of paramount importance to ensure Scientific Computing Integrity (SCI). A failure of SCI may be caused by malicious attacks, natural environmental changes, faults of scientists, operations mistakes, faults of supporting systems, faults of processes, and errors in the data or theories on which a research relies. The complexity of scientific workflows and large provenance graphs as well as various causes for SCI failures make ensuring SCI extremely difficult. Provenance and trust play critical role in evaluating SCI. This paper reports our progress in building a model for provenance-based trust reasoning about SCI.
2019-06-10
Farooq, H. M., Otaibi, N. M..  2018.  Optimal Machine Learning Algorithms for Cyber Threat Detection. 2018 UKSim-AMSS 20th International Conference on Computer Modelling and Simulation (UKSim). :32-37.

With the exponential hike in cyber threats, organizations are now striving for better data mining techniques in order to analyze security logs received from their IT infrastructures to ensure effective and automated cyber threat detection. Machine Learning (ML) based analytics for security machine data is the next emerging trend in cyber security, aimed at mining security data to uncover advanced targeted cyber threats actors and minimizing the operational overheads of maintaining static correlation rules. However, selection of optimal machine learning algorithm for security log analytics still remains an impeding factor against the success of data science in cyber security due to the risk of large number of false-positive detections, especially in the case of large-scale or global Security Operations Center (SOC) environments. This fact brings a dire need for an efficient machine learning based cyber threat detection model, capable of minimizing the false detection rates. In this paper, we are proposing optimal machine learning algorithms with their implementation framework based on analytical and empirical evaluations of gathered results, while using various prediction, classification and forecasting algorithms.

2019-03-22
Dooley, Rion, Brandt, Steven R., Fonner, John.  2018.  The Agave Platform: An Open, Science-as-a-Service Platform for Digital Science. Proceedings of the Practice and Experience on Advanced Research Computing. :28:1-28:8.

The Agave Platform first appeared in 2011 as a pilot project for the iPlant Collaborative [11]. In its first two years, Foundation saw over 40% growth per month, supporting 1000+ clients, 600+ applications, 4 HPC systems at 3 centers across the US. It also gained users outside of plant biology. To better serve the needs of the general open science community, we rewrote Foundation as a scalable, cloud native application and named it the Agave Platform. In this paper we present the Agave Platform, a Science-as-a-Service (ScaaS) platform for reproducible science. We provide a brief history and technical overview of the project, and highlight three case studies leveraging the platform to create synergistic value for their users.

2019-03-06
Leung, C. K., Hoi, C. S. H., Pazdor, A. G. M., Wodi, B. H., Cuzzocrea, A..  2018.  Privacy-Preserving Frequent Pattern Mining from Big Uncertain Data. 2018 IEEE International Conference on Big Data (Big Data). :5101-5110.
As we are living in the era of big data, high volumes of wide varieties of data which may be of different veracity (e.g., precise data, imprecise and uncertain data) are easily generated or collected at a high velocity in many real-life applications. Embedded in these big data is valuable knowledge and useful information, which can be discovered by big data science solutions. As a popular data science task, frequent pattern mining aims to discover implicit, previously unknown and potentially useful information and valuable knowledge in terms of sets of frequently co-occurring merchandise items and/or events. Many of the existing frequent pattern mining algorithms use a transaction-centric mining approach to find frequent patterns from precise data. However, there are situations in which an item-centric mining approach is more appropriate, and there are also situations in which data are imprecise and uncertain. Hence, in this paper, we present an item-centric algorithm for mining frequent patterns from big uncertain data. In recent years, big data have been gaining the attention from the research community as driven by relevant technological innovations (e.g., clouds) and novel paradigms (e.g., social networks). As big data are typically published online to support knowledge management and fruition processes, these big data are usually handled by multiple owners with possible secure multi-part computation issues. Thus, privacy and security of big data has become a fundamental problem in this research context. In this paper, we present, not only an item-centric algorithm for mining frequent patterns from big uncertain data, but also a privacy-preserving algorithm. In other words, we present- in this paper-a privacy-preserving item-centric algorithm for mining frequent patterns from big uncertain data. Results of our analytical and empirical evaluation show the effectiveness of our algorithm in mining frequent patterns from big uncertain data in a privacy-preserving manner.
2017-12-28
Tane, E., Fujigaki, Y..  2017.  Cross-Disciplinary Survey on \#34;Data Science \#34; Field Development: Historical Analysis from 1600s-2000s. 2017 Portland International Conference on Management of Engineering and Technology (PICMET). :1–10.

For the last several decades, the rapid development of information technology and computer performance accelerates generation, transportation and accumulation of digital data, it came to be called "Big Data". In this context, researchers and companies are eager to utilize the data to create new values or manage a wide range of issues, and much focus is being placed on "Data Science" to extract useful information (knowledge) from digital data. Data Science has been developed from several independent fields such as Mathematics/Operations Research, Computer Science, Data Engineering, Visualization and Statistics since 1800s. In addition, Artificial Intelligence converges on this stream recent years. On the other hand, the national projects have been established to utilize data for society with concerns surrounding the security and privacy. In this paper, through detailed analysis on history of this field, processes of development and integration among related fields are discussed as well as comparative aspects between Japan and the United States. This paper also includes a brief discussion of future directions.

Danesh, W., Rahman, M..  2017.  Linear regression based multi-state logic decomposition approach for efficient hardware implementation. 2017 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). :153–154.

Multi-state logic presents a promising avenue for more-than-Moore scaling, since efficient implementation of multi-valued logic (MVL) can significantly reduce switching and interconnection requirements and result in significant benefits compared to binary CMOS. So far, traditional approaches lag behind binary CMOS due to: (a) reliance on logic decomposition approaches [4][5][6] that result in many multi-valued minterms [4], complex polynomials [5], and decision diagrams [6], which are difficult to implement, and (b) emulation of multi-valued computation and communication through binary switches and medium that require data conversion, and large circuits. In this paper, we propose a fundamentally different approach for MVL decomposition, merging concepts from data science and nanoelectronics to tackle the problems, (a) First, we do linear regression on all inputs and outputs of a multivalued function, and find an expression that fits most input and output combinations. For unmatched combinations, we do successive regressions to find linear expressions. Next, using our novel visual pattern matching technique, we find conditions based on input and output conditions to select each expression. These expressions along with associated selection criteria ensure that for all possible inputs of a specific function, correct output can be reached. Our selection of regression model to find linear expressions, coefficients and conditions allow efficient hardware implementation. We discuss an approach for solving problem (b) and show an example of quaternary sum circuit. Our estimates show 65.6% saving of switching components compared with a 4-bit CMOS adder.

2017-05-19
Menzies, Tim.  2016.  How Not to Do It: Anti-patterns for Data Science in Software Engineering. Proceedings of the 38th International Conference on Software Engineering Companion. :887–887.

Many books and papers describe how to do data science. While those texts are useful, it can also be important to reflect on anti-patterns; i.e. common classes of errors seen when large communities of researchers and commercial software engineers use, and misuse data mining tools. This technical briefing will present those errors and show how to avoid them.

2017-03-07
Madaio, Michael, Chen, Shang-Tse, Haimson, Oliver L., Zhang, Wenwen, Cheng, Xiang, Hinds-Aldrich, Matthew, Chau, Duen Horng, Dilkina, Bistra.  2016.  Firebird: Predicting Fire Risk and Prioritizing Fire Inspections in Atlanta. Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :185–194.

The Atlanta Fire Rescue Department (AFRD), like many municipal fire departments, actively works to reduce fire risk by inspecting commercial properties for potential hazards and fire code violations. However, AFRD's fire inspection practices relied on tradition and intuition, with no existing data-driven process for prioritizing fire inspections or identifying new properties requiring inspection. In collaboration with AFRD, we developed the Firebird framework to help municipal fire departments identify and prioritize commercial property fire inspections, using machine learning, geocoding, and information visualization. Firebird computes fire risk scores for over 5,000 buildings in the city, with true positive rates of up to 71% in predicting fires. It has identified 6,096 new potential commercial properties to inspect, based on AFRD's criteria for inspection. Furthermore, through an interactive map, Firebird integrates and visualizes fire incidents, property information and risk scores to help AFRD make informed decisions about fire inspections. Firebird has already begun to make positive impact at both local and national levels. It is improving AFRD's inspection processes and Atlanta residents' safety, and was highlighted by National Fire Protection Association (NFPA) as a best practice for using data to inform fire inspections.

2015-05-05
Mukkamala, R.R., Hussain, A., Vatrapu, R..  2014.  Towards a Set Theoretical Approach to Big Data Analytics. Big Data (BigData Congress), 2014 IEEE International Congress on. :629-636.

Formal methods, models and tools for social big data analytics are largely limited to graph theoretical approaches such as social network analysis (SNA) informed by relational sociology. There are no other unified modeling approaches to social big data that integrate the conceptual, formal and software realms. In this paper, we first present and discuss a theory and conceptual model of social data. Second, we outline a formal model based on set theory and discuss the semantics of the formal model with a real-world social data example from Facebook. Third, we briefly present and discuss the Social Data Analytics Tool (SODATO) that realizes the conceptual model in software and provisions social data analysis based on the conceptual and formal models. Fourth and last, based on the formal model and sentiment analysis of text, we present a method for profiling of artifacts and actors and apply this technique to the data analysis of big social data collected from Facebook page of the fast fashion company, H&M.