Title | Classification Between Machine Translated Text and Original Text By Part Of Speech Tagging Representation |
Publication Type | Conference Paper |
Year of Publication | 2020 |
Authors | Piazza, Nancirose |
Conference Name | 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) |
Date Published | October |
Keywords | Artificial neural networks, composability, Dictionaries, Human Behavior, human factors, Indexes, Metrics, Numerical models, part of speech tagging, pubcrawl, Scalability, tagging, text analytics, Training data, trigram representation, Vocabulary, word embedding, Zipf’s Law |
Abstract | Classifiers that distinguish machine-translated text from original text often tokenize on the vocabulary of the corpora. With N-grams larger than unigrams, one can create a model that estimates a decision boundary based on the word-frequency probability distribution; however, this approach is exponentially expensive because of high dimensionality and sparsity. Instead, we represent samples of the corpora by their part-of-speech tags, which form a significantly smaller vocabulary. With fewer possible trigrams, we can create a model based on the trigram frequency probability distribution. In this paper, we explore less conventional techniques for handling documents, dictionaries, and the like. |
DOI | 10.1109/DSAA49011.2020.00092 |
Citation Key | piazza_classification_2020 |
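
The representation described in the abstract can be illustrated with a minimal sketch. It assumes each sample has already been part-of-speech tagged; the function name `pos_trigram_distribution`, the tag labels, and the sample sequence are hypothetical and are not taken from the paper. It only shows how a trigram frequency probability distribution over tags might be computed for one sample, not the author's actual pipeline or classifier.

```python
from collections import Counter
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]


def pos_trigram_distribution(tags: List[str]) -> Dict[Trigram, float]:
    """Return the relative frequency of each consecutive POS-tag trigram.

    `tags` is the tag sequence of one sample. The tagger and tag set are
    assumptions; any tagger that yields one tag per token would do.
    """
    # Slide a window of width 3 over the tag sequence.
    trigrams = list(zip(tags, tags[1:], tags[2:]))
    if not trigrams:
        # Fewer than three tags: no trigram can be formed.
        return {}
    counts = Counter(trigrams)
    total = len(trigrams)
    # Normalise counts so the values sum to 1.
    return {trigram: n / total for trigram, n in counts.items()}


# Hypothetical hand-tagged sample (illustrative tags, not data from the paper).
sample_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
dist = pos_trigram_distribution(sample_tags)

for trigram, probability in sorted(dist.items()):
    print(trigram, round(probability, 3))
```

With a tag set of size T there are at most T^3 distinct trigrams, whereas a word vocabulary of size V allows up to V^3, which is why the tag-based representation stays far smaller and less sparse than a word-based one.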