Fair-SSL: Building fair ML Software with less data

Submitted by grigby1 on Fri, 02/03/2023 - 4:56pm

Title	Fair-SSL: Building fair ML Software with less data
Publication Type	Conference Paper
Year of Publication	2022
Authors	Chakraborty, Joymallya, Majumder, Suvodeep, Tu, Huy
Conference Name	2022 IEEE/ACM International Workshop on Equitable Data & Technology (FairWare)
Keywords	Buildings, Conferences, Ethics, Ethics in Software Engineering, Human Behavior, machine learning, Machine Learning with and for SE, Metrics, pubcrawl, resilience, Resiliency, Scalability, Semisupervised learning, SSL Trust Models, Training, Training data
Abstract	Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most prior software engineering work concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly used supervised approaches to achieve fairness. However, in the real world, getting data with trustworthy ground truth is challenging, and ground truth can itself contain human bias. Semi-supervised learning is a technique where labeled data is used incrementally to generate pseudo-labels for the rest of the data (and then all that data is used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute, as proposed by Chakraborty et al. in FSE 2021. Finally, a classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves performance similar to that of three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work where semi-supervised techniques are used to fight against ethical bias in SE ML models. To facilitate open science and replication, all our source code and datasets are publicly available at https://github.com/joymallyac/FairSSL. CCS Concepts: • Software and its engineering → Software creation and management; • Computing methodologies → Machine learning. ACM Reference Format: Joymallya Chakraborty, Suvodeep Majumder, and Huy Tu. 2022. Fair-SSL: Building fair ML Software with less data. In International Workshop on Equitable Data and Technology (FairWare '22), May 9, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3524491.3527305
DOI	10.1145/3524491.3527305
Citation Key	chakraborty_fair-ssl_2022
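
The pipeline summarized in the abstract (pseudo-label the unlabeled data starting from a 10% labeled seed, then balance by class and protected attribute) can be illustrated with a minimal, standard-library sketch. Everything here is an assumption made for illustration and is not taken from the authors' repository: the function names, the toy data, the nearest-centroid base learner, and the use of plain duplication for balancing in place of the synthetic data generation the abstract describes.

```python
import random
from collections import Counter, defaultdict

def centroid(points):
    """Mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labeled, unlabeled, batch=5):
    """Self-training with a nearest-centroid learner (an illustrative stand-in).

    labeled: list of (features, label); unlabeled: list of features.
    Each round pseudo-labels the `batch` unlabeled points with the largest
    margin between their two nearest class centroids, then refits.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        by_class = defaultdict(list)
        for x, y in labeled:
            by_class[y].append(x)
        cents = {y: centroid(xs) for y, xs in by_class.items()}
        scored = []
        for x in pool:
            d = sorted((sq_dist(x, c), y) for y, c in cents.items())
            margin = d[1][0] - d[0][0] if len(d) > 1 else 0.0
            scored.append((margin, x, d[0][1]))
        scored.sort(key=lambda t: t[0], reverse=True)
        labeled.extend((x, y) for _, x, y in scored[:batch])
        pool = [x for _, x, _ in scored[batch:]]
    return labeled

def balance(rows, protected_index, rng):
    """Oversample every (label, protected value) subgroup to the largest size.

    Assumption: rows are duplicated at random here; the abstract describes
    generating new synthetic points instead.
    """
    groups = defaultdict(list)
    for x, y in rows:
        groups[(y, x[protected_index])].append((x, y))
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choice(g) for _ in range(target - len(g)))
    return balanced

# Toy data (hypothetical): features are (protected attribute, score).
rng = random.Random(0)
data = []
for i in range(200):
    y = i % 2
    protected = 1 if i % 5 == 0 else 0  # minority protected group
    score = rng.gauss(0.8 if y else 0.2, 0.1)
    data.append(((protected, score), y))

labeled = data[:20]                      # 10% of rows keep their labels
unlabeled = [x for x, _ in data[20:]]    # labels hidden for the rest

pseudo = self_train(labeled, unlabeled)
balanced = balance(pseudo, protected_index=0, rng=rng)
group_sizes = Counter((y, x[0]) for x, y in balanced)
```

After running, `group_sizes` holds one count per (label, protected value) subgroup, and all counts are equal. A classifier would then be trained on `balanced` and evaluated on held-out test data, as the abstract outlines.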
