User identification through command history analysis

Title: User identification through command history analysis
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Khosmood, F., Nico, P.L., Woolery, J.
Conference Name: Computational Intelligence in Cyber Security (CICS), 2014 IEEE Symposium on
Date Published: Dec
Keywords: Accuracy, Cal Poly command history corpus, command history analysis, command-line behavior, command-line history, Computer science, Decision trees, editor war, Entropy, feature extraction, feature set, History, Information analysis, learning (artificial intelligence), natural language authorship attribution literature, natural language processing, Schonlau corpus, Schonlau masquerading corpus, Schonlau variant, Semantics, standard algorithm, Unix, Unix user, user configuration, user corpora, user identification, user profile corpus
Abstract

As any veteran of the editor wars can attest, Unix users can be fiercely and irrationally attached to the commands they use and the manner in which they use them. In this work, we investigate the problem of identifying users out of a large set of candidates (25-97) through their command-line histories. Using standard algorithms and feature sets inspired by natural language authorship attribution literature, we demonstrate conclusively that individual users can be identified with a high degree of accuracy through their command-line behavior. Further, we report on the best performing feature combinations, from the many thousands that are possible, both in terms of accuracy and generality. We validate our work by experimenting on three user corpora comprising data gathered over three decades at three distinct locations. These are the Greenberg user profile corpus (168 users), Schonlau masquerading corpus (50 users) and Cal Poly command history corpus (97 users). The first two are well known corpora published in 1991 and 2001 respectively. The last is developed by the authors in a year-long study in 2014 and represents the most recent corpus of its kind. For a 50 user configuration, we find feature sets that can successfully identify users with over 90% accuracy on the Cal Poly, Greenberg and one variant of the Schonlau corpus, and over 87% on the other Schonlau variant.
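
To make the general idea concrete, here is a minimal sketch of attributing a command-line session to a user, assuming each session is available as a list of command names. It is not the authors' method: the abstract does not specify the feature sets or algorithms used, so the command unigram and bigram counts and the nearest-profile rule by cosine similarity are stand-ins chosen for brevity. The names `ngram_counts`, `build_profiles`, `identify` and the toy data are all hypothetical.

```python
from collections import Counter
import math


def ngram_counts(commands):
    """Count command unigrams (strings) and bigrams (tuples) in one session."""
    counts = Counter(commands)
    counts.update(zip(commands, commands[1:]))
    return counts


def build_profiles(training):
    """Sum n-gram counts over each user's sessions.

    training: dict mapping user name -> list of sessions,
    where each session is a list of command names.
    """
    profiles = {}
    for user, sessions in training.items():
        total = Counter()
        for session in sessions:
            total.update(ngram_counts(session))
        profiles[user] = total
    return profiles


def cosine(u, v):
    """Cosine similarity between two Counter feature vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def identify(profiles, session):
    """Return the user whose profile is most similar to the session."""
    query = ngram_counts(session)
    return max(profiles, key=lambda user: cosine(profiles[user], query))


# Toy data (invented for illustration, not drawn from any corpus).
training = {
    "alice": [["vim", "make", "ls", "vim", "make"], ["vim", "git", "make"]],
    "bob": [["emacs", "ls", "cd", "ls", "emacs"], ["emacs", "gcc", "ls"]],
}
profiles = build_profiles(training)
guess = identify(profiles, ["vim", "make", "vim"])
```

A real experiment on corpora of the size described above would need held-out sessions per user and a proper classifier comparison; this sketch only shows the shape of the feature extraction and matching steps.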

DOI: 10.1109/CICYBS.2014.7013363
Citation Key: 7013363