Who Wrote This? Textual Modeling with Authorship Attribution in Big Data
Title | Who Wrote This? Textual Modeling with Authorship Attribution in Big Data |
Publication Type | Conference Paper |
Year of Publication | 2014 |
Authors | Pratanwanich, N., Lio, P. |
Conference Name | Data Mining Workshop (ICDMW), 2014 IEEE International Conference on |
Date Published | Dec |
Keywords | Analytical models, area under curve, AUC, author-topic model, authorship attribution, authorship learning, authorship prediction, Bayesian inference, Big Data, Computational modeling, Data models, dimension reduction, Dirichlet distribution, High dimensional textual data, information discovery, Mathematical model, meta data, meta-data, multiple-author documents, Predictive models, Probabilistic topic models, receiver operating characteristic curve, ROC curve, SAT model, supervised author-topic model, text analysis, textual modeling, topic representations, topic-based generative models, Training, unsupervised AT model, unsupervised learning, unsupervised learning technique, Vectors |
Abstract | By representing large corpora with concise and meaningful elements, topic-based generative models aim to reduce the dimensionality and capture the content of documents. These techniques originally analyzed only the words in documents, but their extensions now accommodate meta-data such as authorship information, which has proven useful for textual modeling. Learning authorship matters for extracting author interests and for assigning authors to anonymous texts. The Author-Topic (AT) model, an unsupervised learning technique, successfully exploits authorship information to model both documents and author interests using topic representations. However, the AT model makes the simplifying assumption that each author contributes equally to a multiple-author document. To overcome this limitation, we assume that authors contribute to a document in different degrees, modeled with a Dirichlet distribution. This automatically transforms the unsupervised AT model into the Supervised Author-Topic (SAT) model, which adds the novel capability of authorship prediction on anonymous texts. The SAT model outperforms the AT model at identifying the authors of documents written by either single or multiple authors, with a better Receiver Operating Characteristic (ROC) curve and a significantly higher Area Under Curve (AUC). The SAT model not only achieves performance competitive with state-of-the-art techniques, e.g. random forests, but also retains the characteristics of unsupervised models for information discovery, i.e. word distributions of topics, author interests, and author contributions. |
DOI | 10.1109/ICDMW.2014.140 |
Citation Key | 7022657 |
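
The abstract's central idea, per-document author contributions drawn from a Dirichlet distribution rather than assumed equal, can be illustrated with a small generative sketch. This is one plausible reading of the abstract, not the paper's actual model or inference procedure: the function names, the symmetric `alpha` prior, and the toy `author_topics` and `topic_words` tables are all hypothetical.

```python
import random

def sample_contributions(n_authors, alpha=1.0, rng=random):
    """Draw one symmetric Dirichlet(alpha) sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_authors)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(authors, author_topics, topic_words, n_words,
                      alpha=1.0, seed=0):
    """Generate word ids for one multi-author document (illustrative only).

    Assumption: each word picks an author in proportion to that author's
    Dirichlet-drawn contribution, then a topic from the author's topic
    distribution, then a word from the topic's word distribution.
    A uniform author choice here would correspond to the equal-contribution
    assumption the abstract attributes to the AT model.
    """
    rng = random.Random(seed)
    contributions = sample_contributions(len(authors), alpha, rng)
    words = []
    for _ in range(n_words):
        author = rng.choices(authors, weights=contributions)[0]
        topic = rng.choices(range(len(topic_words)),
                            weights=author_topics[author])[0]
        word = rng.choices(range(len(topic_words[topic])),
                           weights=topic_words[topic])[0]
        words.append(word)
    return contributions, words

# Toy setup: two authors, two topics, a three-word vocabulary (all made up).
author_topics = {"a": [0.9, 0.1], "b": [0.2, 0.8]}
topic_words = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
contributions, words = generate_document(
    ["a", "b"], author_topics, topic_words, n_words=20
)
```

The returned `contributions` list holds one weight per author and sums to 1; skewed weights mean one author's topics dominate the generated words.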