Low-Cost Breaking of a Unique Chinese Language CAPTCHA Using Curriculum Learning and Clustering

Submitted by aekwall on Mon, 04/01/2019 - 10:08am

Title	Low-Cost Breaking of a Unique Chinese Language CAPTCHA Using Curriculum Learning and Clustering
Publication Type	Conference Paper
Year of Publication	2018
Authors	Stein, G., Peng, Q.
Conference Name	2018 IEEE International Conference on Electro/Information Technology (EIT)
Date Published	May
Keywords	automated access, CAPTCHA, captchas, Chinese language question-and-answer Website, CNN, composability, convolution, convolutional neural network, convolutional neural networks, correct response, curriculum clustering, curriculum learning, feature map, feedforward neural nets, Human Behavior, human user, image distortion, inverted character, inverted characters, inverted symbols, Kernel, learning (artificial intelligence), low-cost breaking, machine learning methods, Microsoft Windows, natural language processing, OCR software, optical character recognition, pattern clustering, potential training methods, pubcrawl, security, Task Analysis, text analysis, text distortion, text-based CAPTCHAs focus, Training, transcription tasks, unique Chinese language CAPTCHA, web services, Web sites
Abstract	Text-based CAPTCHAs are still commonly used to attempt to prevent automated access to web services. By displaying an image of distorted text, they attempt to create a challenge image that OCR software cannot interpret correctly but for which a human user can easily determine the correct response. This work focuses on a CAPTCHA used by a popular Chinese language question-and-answer website and how resilient it is to modern machine learning methods. While the majority of text-based CAPTCHAs focus on transcription tasks, the CAPTCHA solved in this work is based on localization of inverted symbols in a distorted image. A convolutional neural network (CNN) was created to evaluate the likelihood of a region in the image belonging to an inverted character. It is used with a feature map and clustering to identify potential locations of inverted characters. Training of the CNN was performed using curriculum learning and compared to other potential training methods. The proposed method was able to determine the correct response in 95.2% of cases of a simulated CAPTCHA and 67.6% on a set of real CAPTCHAs. Potential methods to increase the difficulty of the CAPTCHA and the success rate of the automated solver are considered.
URL	https://ieeexplore.ieee.org/document/8500113
DOI	10.1109/EIT.2018.8500113
Citation Key	stein_low-cost_2018