Title | Correction of Spaces in Persian Sentences for Tokenization |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Panahandeh, Mahnaz, Ghanbari, Shirin |
Conference Name | 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI) |
Date Published | February |
Keywords | automatic analysis, Computational modeling, decision making, exponential growth, full-spaces, half-spaces, Human Behavior, Internet, natural language processing, natural language tasks, normalization, online text, Persian, Persian language, Persian sentences, preprocessing tools, pubcrawl, Resiliency, Scalability, social media services, social networking (online), space, Standards, Task Analysis, text analysis, textual data, textual preprocessing, Tokenization, Tools, user comments, Web 2.0, word identification, word vocabulary |
Abstract | The exponential growth of the Internet and its users and the emergence of Web 2.0 have caused a large volume of textual data to be created. Automatic analysis of such data can be used in making decisions. As online text is created by different producers with different styles of writing, preprocessing is a necessity prior to any natural language task. An essential part of textual preprocessing prior to the recognition of the word vocabulary is normalization, which includes the correction of spaces; in the Persian language in particular, this covers both full-spaces between words and half-spaces. A review of user comments within social media services shows that in many cases users do not adhere to the grammatical rules for inserting both forms of spaces, which increases the complexity of identifying words and hence reduces the accuracy of further processing on the text. In this study, current issues in the normalization and tokenization of preprocessing tools for the Persian language are examined, and a method for identifying and correcting the separation of words and the correction of spaces is proposed. The results obtained, compared to leading preprocessing tools, highlight the significance of the proposed methodology. |
DOI | 10.1109/KBEI.2019.8734954 |
Citation Key | panahandeh_correction_2019 |
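
To make the abstract's distinction between full-spaces and half-spaces concrete, the following is a minimal sketch of rule-based space correction. It is not the authors' method, which this record does not describe. It assumes that the half-space is encoded as the zero-width non-joiner (U+200C), and the function name `correct_spaces` and the two tiny affix lists are hypothetical, chosen only for illustration.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, assumed here to encode the Persian half-space

# Hypothetical, deliberately tiny affix lists (illustration only, not from the paper).
PREFIXES = ("می", "نمی")   # verbal prefixes usually attached with a half-space
SUFFIXES = ("ها", "های")   # plural suffixes usually attached with a half-space


def correct_spaces(text: str) -> str:
    """Collapse repeated full-spaces and replace the full-space next to a
    listed affix with a half-space."""
    # Step 1: normalize runs of full-spaces to a single space.
    text = re.sub(r" {2,}", " ", text).strip()

    # Step 2: a standalone prefix followed by a space and another word
    # is joined to that word with a half-space.
    prefix_pattern = r"(?<!\S)(" + "|".join(PREFIXES) + r") (?=\S)"
    text = re.sub(prefix_pattern, r"\1" + ZWNJ, text)

    # Step 3: a standalone suffix preceded by a word and a space
    # is joined to that word with a half-space.
    suffix_pattern = r"(?<=\S) (" + "|".join(SUFFIXES) + r")(?!\S)"
    text = re.sub(suffix_pattern, ZWNJ + r"\1", text)

    return text


if __name__ == "__main__":
    for sample in ("می روم", "کتاب  ها"):
        # repr() makes the otherwise invisible U+200C visible as an escape.
        print(repr(correct_spaces(sample)))
```

Fixed affix rules like these over-correct in ambiguous cases (for example, "می" can also be a standalone word), which is one reason a practical tool would likely need a vocabulary or statistical evidence in addition to patterns.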