Data Imputation Techniques: An Empirical Study using Chronic Kidney Disease and Life Expectancy Datasets

Title: Data Imputation Techniques: An Empirical Study using Chronic Kidney Disease and Life Expectancy Datasets
Publication Type: Conference Paper
Year of Publication: 2022
Authors: Reddy Sankepally, Sainath; Kosaraju, Nishoak; Mallikharjuna Rao, K
Conference Name: 2022 International Conference on Innovative Trends in Information Technology (ICITIIT)
Keywords: Categorical data, chronic kidney disease, column-wise deletion, data deletion, data imputation, Data visualization, distortion, end of distribution imputation, frequent category imputation, information technology, life expectancy, list-wise deletion, Market research, mean imputation, median imputation, Medical services, missing category imputation, missing data, numerical data, pubcrawl, Real-time Systems, Scalability, Task Analysis
Abstract: Data is a collection of information from real-world activities. The file in which such data is stored, after being transformed into a form that machines can process, is generally known as a data set. In the real world, many data sets are incomplete and contain various types of noise; missing values are one such kind. Thus, imputing these missing values is one of the significant tasks of data pre-processing. This paper deals with two real-time health care data sets, namely the life expectancy (LE) dataset and the chronic kidney disease (CKD) dataset, which are very different in nature. This paper provides insights into various data imputation techniques for filling missing values by analyzing them. In data imputation, it is very common to impute missing values with measures of central tendency such as the mean, median, or mode, which can represent the central value of a distribution, but choosing the appropriate one is a real challenge. To the best of our knowledge, this is the first paper to provide a complete analysis of the impact of basic data imputation techniques on various data distributions, which can be classified by the size of the data set, number of missing values, type of data (categorical/numerical), etc. This paper compares the original data distribution with the data distribution after each imputation in terms of skewness, outliers, and various descriptive statistics.
DOI: 10.1109/ICITIIT54346.2022.9744211
Citation Key: reddy_sankepally_data_2022
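
Illustrative sketch (not from the paper): the abstract describes filling missing values with the mean, median, or mode and then comparing the distribution before and after imputation, for example by skewness. The Python below is a minimal illustration of that idea under stated assumptions. The function names `impute` and `skewness`, the toy `column` values, and the use of `None` to mark a missing entry are all hypothetical and are not taken from the paper or from the LE or CKD datasets.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with a central-tendency value of the observed data."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        # Also works for categorical (string) data: most frequent category.
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def skewness(values):
    """Population skewness: third central moment divided by std**3."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed toy column with two missing entries.
column = [1.0, 2.0, 2.0, 3.0, None, 4.0, None, 20.0]
observed = [v for v in column if v is not None]

# Compare skewness of the observed values with skewness after each imputation.
for strategy in ("mean", "median", "mode"):
    filled = impute(column, strategy)
    print(strategy, round(skewness(observed), 3), round(skewness(filled), 3))
```

On a skewed column like this one, the mean is pulled toward the outlier while the median and mode are not, which is one reason the choice of fill value can change the shape of the imputed distribution.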