Biblio

Found 100 results

Filters: Keyword is classification  [Clear All Filters]
2019-05-01
Nadeem, Humaira, Rabbani, Imran Mujaddid, Aslam, Muhammad, M, Martinez Enriquez A..  2018.  KNN-Fuzzy Classification for Cloud Service Selection. Proceedings of the 2Nd International Conference on Future Networks and Distributed Systems. :66:1-66:8.

Cloud computing is an emerging technology that provides services to its users via Internet. It also allows sharing of resources there by reducing cost, money and space. With the popularity of cloud and its advantages, the trend of information industry shifting towards cloud services is increasing tremendously. Different cloud service providers are there on internet to provide services to the users. These services provided have certain parameters to provide better usage. It is difficult for the users to select a cloud service that is best suited to their requirements. Our proposed approach is based on data mining classification technique with fuzzy logic. Proposed algorithm uses cloud service design factors (security, agility and assurance etc.) and international standards to suggest the cloud service. The main objective of this research is to enable the end cloud users to choose best service as per their requirements and meeting international standards. We test our system with major cloud provider Google, Microsoft and Amazon.

2019-02-21
Feng, W., Chen, Z., Fu, Y..  2018.  Autoencoder Classification Algorithm Based on Swam Intelligence Optimization. 2018 17th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES). :238–241.
BP algorithm used by autoencoder classification algorithm. But the BP algorithm is not only complicated and inefficient, but sometimes falls into local optimum. This makes autoencoder classification algorithm are not very good. So in this paper we combie Quantum Particle Swarm Optimization (QPSO) and autoencoder classification algorithm. QPSO used to optimize the weight of autoencoder neural network and the parameter of softmax. This method has been tested on some database, and the experimental result shows that this method has got good results.
2019-06-10
Kornish, D., Geary, J., Sansing, V., Ezekiel, S., Pearlstein, L., Njilla, L..  2018.  Malware Classification Using Deep Convolutional Neural Networks. 2018 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). :1-6.

In recent years, deep convolution neural networks (DCNNs) have won many contests in machine learning, object detection, and pattern recognition. Furthermore, deep learning techniques achieved exceptional performance in image classification, reaching accuracy levels beyond human capability. Malware variants from similar categories often contain similarities due to code reuse. Converting malware samples into images can cause these patterns to manifest as image features, which can be exploited for DCNN classification. Techniques for converting malware binaries into images for visualization and classification have been reported in the literature, and while these methods do reach a high level of classification accuracy on training datasets, they tend to be vulnerable to overfitting and perform poorly on previously unseen samples. In this paper, we explore and document a variety of techniques for representing malware binaries as images with the goal of discovering a format best suited for deep learning. We implement a database for malware binaries from several families, stored in hexadecimal format. These malware samples are converted into images using various approaches and are used to train a neural network to recognize visual patterns in the input and classify malware based on the feature vectors. Each image type is assessed using a variety of learning models, such as transfer learning with existing DCNN architectures and feature extraction for support vector machine classifier training. Each technique is evaluated in terms of classification accuracy, result consistency, and time per trial. Our preliminary results indicate that improved image representation has the potential to enable more effective classification of new malware.

2018-04-11
Spanos, Georgios, Angelis, Lefteris, Toloudis, Dimitrios.  2017.  Assessment of Vulnerability Severity Using Text Mining. Proceedings of the 21st Pan-Hellenic Conference on Informatics. :49:1–49:6.

Software1 vulnerabilities are closely associated with information systems security, a major and critical field in today's technology. Vulnerabilities constitute a constant and increasing threat for various aspects of everyday life, especially for safety and economy, since the social impact from the problems that they cause is complicated and often unpredictable. Although there is an entire research branch in software engineering that deals with the identification and elimination of vulnerabilities, the growing complexity of software products and the variability of software production procedures are factors contributing to the ongoing occurrence of vulnerabilities, Hence, another area that is being developed in parallel focuses on the study and management of the vulnerabilities that have already been reported and registered in databases. The information contained in such databases includes, a textual description and a number of metrics related to vulnerabilities. The purpose of this paper is to investigate to what extend the assessment of the vulnerability severity can be inferred directly from the corresponding textual description, or in other words, to examine the informative power of the description with respect to the vulnerability severity. For this purpose, text mining techniques, i.e. text analysis and three different classification methods (decision trees, neural networks and support vector machines) were employed. The application of text mining to a sample of 70,678 vulnerabilities from a public data source shows that the description itself is a reliable and highly accurate source of information for vulnerability prioritization.

2018-03-19
Ghosh, Shalini, Das, Ariyam, Porras, Phil, Yegneswaran, Vinod, Gehani, Ashish.  2017.  Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :1793–1802.

Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components – (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.

2018-09-12
Sachdeva, A., Kapoor, R., Sharma, A., Mishra, A..  2017.  Categorical Classification and Deletion of Spam Images on Smartphones Using Image Processing and Machine Learning. 2017 International Conference on Machine Learning and Data Science (MLDS). :23–30.

We regularly use communication apps like Facebook and WhatsApp on our smartphones, and the exchange of media, particularly images, has grown at an exponential rate. There are over 3 billion images shared every day on Whatsapp alone. In such a scenario, the management of images on a mobile device has become highly inefficient, and this leads to problems like low storage, manual deletion of images, disorganization etc. In this paper, we present a solution to tackle these issues by automatically classifying every image on a smartphone into a set of predefined categories, thereby segregating spam images from them, allowing the user to delete them seamlessly.

2018-04-02
Essra, A., Sitompul, O. S., Nasution, B. Benyamin, Rahmat, R. F..  2017.  Hierarchical Graph Neuron Scheme in Classifying Intrusion Attack. 2017 4th International Conference on Computer Applications and Information Processing Technology (CAIPT). :1–6.

Hierarchical Graph Neuron (HGN) is an extension of network-centric algorithm called Graph Neuron (GN), which is used to perform parallel distributed pattern recognition. In this research, HGN scheme is used to classify intrusion attacks in computer networks. Patterns of intrusion attacks are preprocessed in three steps: selecting attributes using information gain attribute evaluation, discretizing the selected attributes using entropy-based discretization supervised method, and selecting the training data using K-Means clustering algorithm. After the preprocessing stage, the HGN scheme is then deployed to classify intrusion attack using the KDD Cup 99 dataset. The results of the classification are measured in terms of accuracy rate, detection rate, false positive rate and true negative rate. The test result shows that the HGN scheme is promising and stable in classifying the intrusion attack patterns with accuracy rate reaches 96.27%, detection rate reaches 99.20%, true negative rate below 15.73%, and false positive rate as low as 0.80%.

2018-03-19
Thankaraj, A., Nair, A. J., Vasudevan, N., Pathari, V..  2017.  Misclassifications: The Missing Link. 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). :1719–1722.

The notion of style is pivotal to literature. The choice of a certain writing style moulds and enhances the overall character of a book. Stylometry uses statistical methods to analyze literary style. This work aims to build a recommendation system based on the similarity in stylometric cues of various authors. The problem at hand is in close proximity to the author attribution problem. It follows a supervised approach with an initial corpus of books labelled with their respective authors as training set and generate recommendations based on the misclassified books. Results in book similarity are substantiated by domain experts.

2018-12-03
Ennajjar, Ibtissam, Tabii, Youness, Benkaddour, Abdelhamid.  2017.  Securing Data in Cloud Computing by Classification. Proceedings of the 2Nd International Conference on Big Data, Cloud and Applications. :49:1–49:5.

Cloud computing is a wide architecture based on diverse models for providing different services of software and hardware. Cloud computing paradigm attracts different users because of its several benefits such as high resource elasticity, expense reduction, scalability and simplicity which provide significant preserving in terms of investment and work force. However, the new approaches introduced by the cloud, related to computation outsourcing, distributed resources, multi-tenancy concept, high dynamism of the model, data warehousing and the nontransparent style of cloud increase the security and privacy concerns and makes building and handling trust among cloud service providers and consumers a critical security challenge. This paper proposes a new approach to improve security of data in cloud computing. It suggests a classification model to categorize data before being introduced into a suitable encryption system according to the category. Since data in cloud has not the same sensitivity level, encrypting it with the same algorithms can lead to a lack of security or of resources. By this method we try to optimize the resources consumption and the computation cost while ensuring data confidentiality.

2018-06-20
Luo, J. S., Lo, D. C. T..  2017.  Binary malware image classification using machine learning with local binary pattern. 2017 IEEE International Conference on Big Data (Big Data). :4664–4667.

Malware classification is a critical part in the cyber-security. Traditional methodologies for the malware classification typically use static analysis and dynamic analysis to identify malware. In this paper, a malware classification methodology based on its binary image and extracting local binary pattern (LBP) features is proposed. First, malware images are reorganized into 3 by 3 grids which is mainly used to extract LBP feature. Second, the LBP is implemented on the malware images to extract features in that it is useful in pattern or texture classification. Finally, Tensorflow, a library for machine learning, is applied to classify malware images with the LBP feature. Performance comparison results among different classifiers with different image descriptors such as GIST, a spatial envelop, and the LBP demonstrate that our proposed approach outperforms others.

2018-05-14
2018-11-14
Repp, P..  2017.  Diagnostics and Assessment of the Industrial Network Security Expert System. 2017 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM). :1–5.
The paper dwells on the design of a diagnostic system and expert assessment of the significance of threats to the security of industrial networks. The proposed system is based on a new cyber-attacks classification and presupposes the existence of two structural blocks: the industrial network virtual model based on the scan selected nodal points and the generator of cyber-attacks sets. The diagnostic and expert assessment quality is improved by the use of the Markov chains or the Monte Carlo numerical method. The numerical algorithm of generating cyber-attacks sets is based on the LP$\tau$-sequence.
2017-12-28
Liu, H., Ditzler, G..  2017.  A fast information-theoretic approximation of joint mutual information feature selection. 2017 International Joint Conference on Neural Networks (IJCNN). :4610–4617.

Feature selection is an important step in data analysis to address the curse of dimensionality. Such dimensionality reduction techniques are particularly important when if a classification is required and the model scales in polynomial time with the size of the feature (e.g., some applications include genomics, life sciences, cyber-security, etc.). Feature selection is the process of finding the minimum subset of features that allows for the maximum predictive power. Many of the state-of-the-art information-theoretic feature selection approaches use a greedy forward search; however, there are concerns with the search in regards to the efficiency and optimality. A unified framework was recently presented for information-theoretic feature selection that tied together many of the works in over the past twenty years. The work showed that joint mutual information maximization (JMI) is generally the best options; however, the complexity of greedy search for JMI scales quadratically and it is infeasible on high dimensional datasets. In this contribution, we propose a fast approximation of JMI based on information theory. Our approach takes advantage of decomposing the calculations within JMI to speed up a typical greedy search. We benchmarked the proposed approach against JMI on several UCI datasets, and we demonstrate that the proposed approach returns feature sets that are highly consistent with JMI, while decreasing the run time required to perform feature selection.

2018-06-20
Kosmidis, Konstantinos, Kalloniatis, Christos.  2017.  Machine Learning and Images for Malware Detection and Classification. Proceedings of the 21st Pan-Hellenic Conference on Informatics. :5:1–5:6.

Detecting malicious code with exact match on collected datasets is becoming a large-scale identification problem due to the existence of new malware variants. Being able to promptly and accurately identify new attacks enables security experts to respond effectively. My proposal is to develop an automated framework for identification of unknown vulnerabilities by leveraging current neural network techniques. This has a significant and immediate value for the security field, as current anti-virus software is typically able to recognize the malware type only after its infection, and preventive measures are limited. Artificial Intelligence plays a major role in automatic malware classification: numerous machine-learning methods, both supervised and unsupervised, have been researched to try classifying malware into families based on features acquired by static and dynamic analysis. The value of automated identification is clear, as feature engineering is both a time-consuming and time-sensitive task, with new malware studied while being observed in the wild.

2017-11-20
Regainia, L., Salva, S., Ecuhcurs, C..  2016.  A classification methodology for security patterns to help fix software weaknesses. 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA). :1–8.

Security patterns are generic solutions that can be applied since early stages of software life to overcome recurrent security weaknesses. Their generic nature and growing number make their choice difficult, even for experts in system design. To help them on the pattern choice, this paper proposes a semi-automatic methodology of classification and the classification itself, which exposes relationships among software weaknesses, security principles and security patterns. It expresses which patterns remove a given weakness with respect to the security principles that have to be addressed to fix the weakness. The methodology is based on seven steps, which anatomize patterns and weaknesses into set of more precise sub-properties that are associated through a hierarchical organization of security principles. These steps provide the detailed justifications of the resulting classification and allow its upgrade. Without loss of generality, this classification has been established for Web applications and covers 185 software weaknesses, 26 security patterns and 66 security principles. Research supported by the industrial chair on Digital Confidence (http://confiance-numerique.clermont-universite.fr/index-en.html).

2017-09-05
Siddiqui, Sana, Khan, Muhammad Salman, Ferens, Ken, Kinsner, Witold.  2016.  Detecting Advanced Persistent Threats Using Fractal Dimension Based Machine Learning Classification. Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics. :64–69.

Advanced Persistent Threats (APTs) are a new breed of internet based smart threats, which can go undetected with the existing state of-the-art internet traffic monitoring and protection systems. With the evolution of internet and cloud computing, a new generation of smart APT attacks has also evolved and signature based threat detection systems are proving to be futile and insufficient. One of the essential strategies in detecting APTs is to continuously monitor and analyze various features of a TCP/IP connection, such as the number of transferred packets, the total count of the bytes exchanged, the duration of the TCP/IP connections, and details of the number of packet flows. The current threat detection approaches make extensive use of machine learning algorithms that utilize statistical and behavioral knowledge of the traffic. However, the performance of these algorithms is far from satisfactory in terms of reducing false negatives and false positives simultaneously. Mostly, current algorithms focus on reducing false positives, only. This paper presents a fractal based anomaly classification mechanism, with the goal of reducing both false positives and false negatives, simultaneously. A comparison of the proposed fractal based method with a traditional Euclidean based machine learning algorithm (k-NN) shows that the proposed method significantly outperforms the traditional approach by reducing false positive and false negative rates, simultaneously, while improving the overall classification rates.

2017-11-27
Kuze, N., Ishikura, S., Yagi, T., Chiba, D., Murata, M..  2016.  Detection of vulnerability scanning using features of collective accesses based on information collected from multiple honeypots. NOMS 2016 - 2016 IEEE/IFIP Network Operations and Management Symposium. :1067–1072.

Attacks against websites are increasing rapidly with the expansion of web services. An increasing number of diversified web services make it difficult to prevent such attacks due to many known vulnerabilities in websites. To overcome this problem, it is necessary to collect the most recent attacks using decoy web honeypots and to implement countermeasures against malicious threats. Web honeypots collect not only malicious accesses by attackers but also benign accesses such as those by web search crawlers. Thus, it is essential to develop a means of automatically identifying malicious accesses from mixed collected data including both malicious and benign accesses. Specifically, detecting vulnerability scanning, which is a preliminary process, is important for preventing attacks. In this study, we focused on classification of accesses for web crawling and vulnerability scanning since these accesses are too similar to be identified. We propose a feature vector including features of collective accesses, e.g., intervals of request arrivals and the dispersion of source port numbers, obtained with multiple honeypots deployed in different networks for classification. Through evaluation using data collected from 37 honeypots in a real network, we show that features of collective accesses are advantageous for vulnerability scanning and crawler classification.

2017-09-15
Ahmadi, Mansour, Ulyanov, Dmitry, Semenov, Stanislav, Trofimov, Mikhail, Giacinto, Giorgio.  2016.  Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. :183–194.

Modern malware is designed with mutation characteristics, namely polymorphism and metamorphism, which causes an enormous growth in the number of variants of malware samples. Categorization of malware samples on the basis of their behaviors is essential for the computer security community, because they receive huge number of malware everyday, and the signature extraction process is usually based on malicious parts characterizing malware families. Microsoft released a malware classification challenge in 2015 with a huge dataset of near 0.5 terabytes of data, containing more than 20K malware samples. The analysis of this dataset inspired the development of a novel paradigm that is effective in categorizing malware variants into their actual family groups. This paradigm is presented and discussed in the present paper, where emphasis has been given to the phases related to the extraction, and selection of a set of novel features for the effective representation of malware samples. Features can be grouped according to different characteristics of malware behavior, and their fusion is performed according to a per-class weighting paradigm. The proposed method achieved a very high accuracy (\$\textbackslashapprox\$ 0.998) on the Microsoft Malware Challenge dataset.

Tripp, Omer, Pistoia, Marco, Ferrara, Pietro, Rubin, Julia.  2016.  Pinpointing Mobile Malware Using Code Analysis. Proceedings of the International Conference on Mobile Software Engineering and Systems. :275–276.

Mobile malware has recently become an acute problem. Existing solutions either base static reasoning on syntactic properties, such as exception handlers or configuration fields, or compute data-flow reachability over the program, which leads to scalability challenges. We explore a new and complementary category of features, which strikes a middleground between the above two categories. This new category focuses on security-relevant operations (communcation, lifecycle, etc) –- and in particular, their multiplicity and happens-before order –- as a means to distinguish between malicious and benign applications. Computing these features requires semantic, yet lightweight, modeling of the program's behavior. We have created a malware detection system for Android, MassDroid, that collects traces of security-relevant operations from the call graph via a scalable form of data-flow analysis. These are reduced to happens-before and multiplicity features, then fed into a supervised learning engine to obtain a malicious/benign classification. MassDroid also embodies a novel reporting interface, containing pointers into the code that serve as evidence supporting the determination. We have applied MassDroid to 35,000 Android apps from the wild. The results are highly encouraging with an F-score of 95% in standard testing, and textgreater90% when applied to previously unseen malware signatures. MassDroid is also efficient, requiring about two minutes per app. MassDroid is publicly available as a cloud service for malware detection.

2017-03-07
Farag, Mohamed, Nakate, Pranav, Fox, Edward A..  2016.  Big Data Processing of School Shooting Archives. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. :271–272.

Web archives about school shootings consist of webpages that may or may not be relevant to the events of interest. There are 3 main goals of this work; first is to clean the webpages, which involves getting rid of the stop words and non-relevant parts of a webpage. The second goal is to select just webpages relevant to the events of interest. The third goal is to upload the cleaned and relevant webpages to Apache Solr so that they are easily accessible. We show the details of all the steps required to achieve these goals. The results show that representative Web archives are noisy, with 2% - 40% relevant content. By cleaning the archives, we aid researchers to focus on relevant content for their analysis.

2017-10-03
Majumder, Abhishek, Deb, Subhrajyoti, Roy, Sudipta.  2016.  Classification and Performance Analysis of Intra-domain Mobility Management Schemes for Wireless Mesh Network. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies. :113:1–113:6.

Nowadays Wireless Mesh Networks (WMNs) has come up with a promising solution for modern wireless communications. But, one of the major problems with WMN is the mobility of the Mesh Clients (MCs). To offer seamless connectivity to the MCs, their mobility management is necessary. During mobility management one of the major concerns is the communication overhead incurred during handoff of the MCs. For addressing this concern, many schemes have been proposed by the researchers. In this paper, a classification of the existing intra domain mobility management schemes has been presented. The schemes have been numerically analyzed. Finally, their performance has been analyzed and compared with respect to handoff cost considering different mobility rates of the MCs.

2015-01-13
John Slankas, Maria Riaz, Jason King, Laurie Williams.  2014.  Discovering Security Requirements from Natural Language. 36th International Conference on Software Engineering.

Project documentation often contains security-relevant statements that are indicative of the security requirements of a system. However these statements may not be explicitly specified or straightforward to locate. At best, requirements analysts manually extract applicable security requirements from project documents. However, security requirements that are not explicitly stated may not be considered during implementation. The goal of this research is to aid requirements analysts in generating security requirements through identifying securityrelevant statements in project documentation and providing context-specific templates to generate security requirements. First, we identify the most prevalent security objectives from software security literature. To identify security-relevant statements in project documentation, we propose a tool-based process to classify statements as related to zero or more security objectives. We then develop a set of context-specific templates to help translate the security objectives of each statement into explicit sets of security functional requirements. We evaluate our process on six documents from the electronic healthcare software industry, identifying 46% of statements as implicitly or explicitly related to security. Our classification approach identified security objectives with a precision of .82 and recall of .79. From our total set of classified statements, we extracted 16 context-specific templates that identify 41 reusable security requirements.

2015-05-04
Swati, K., Patankar, A.J..  2014.  Effective personalized mobile search using KNN. Data Science Engineering (ICDSE), 2014 International Conference on. :157-160.

Effective Personalized Mobile Search Using KNN, implements an architecture to improve user's personalization effectiveness over large set of data maintaining security of the data. User preferences are gathered through clickthrough data. Clickthrough data obtained is sent to the server in encrypted form. Clickthrough data obtained is classified into content concepts and location concepts. To improve classification and minimize processing time, KNN(K Nearest Neighborhood) algorithm is used. Preferences identified(location and content) are merged to provide effective preferences to the user. System make use of four entropies to balance weight between content concepts and location concepts. System implements client server architecture. Role of client is to collect user queries and to maintain them in files for future reference. User preference privacy is ensured through privacy parameters and also through encryption techniques. Server is responsible to carry out the tasks like training, reranking of the search results obtained and the concept extraction. Experiments are carried out on Android based mobile. Results obtained through experiments show that system significantly gives improved results over previous algorithm for the large set of data maintaining security.

2015-05-05
Jandel, M., Svenson, P., Johansson, R..  2014.  Fusing restricted information. Information Fusion (FUSION), 2014 17th International Conference on. :1-9.

Information fusion deals with the integration and merging of data and information from multiple (heterogeneous) sources. In many cases, the information that needs to be fused has security classification. The result of the fusion process is then by necessity restricted with the strictest information security classification of the inputs. This has severe drawbacks and limits the possible dissemination of the fusion results. It leads to decreased situational awareness: the organization knows information that would enable a better situation picture, but since parts of the information is restricted, it is not possible to distribute the most correct situational information. In this paper, we take steps towards defining fusion and data mining processes that can be used even when all the underlying data that was used cannot be disseminated. The method we propose here could be used to produce a classifier where all the sensitive information has been removed and where it can be shown that an antagonist cannot even in principle obtain knowledge about the classified information by using the classifier or situation picture.
 

Lomotey, R.K., Deters, R..  2014.  Terms Mining in Document-Based NoSQL: Response to Unstructured Data. Big Data (BigData Congress), 2014 IEEE International Congress on. :661-668.

Unstructured data mining has become topical recently due to the availability of high-dimensional and voluminous digital content (known as "Big Data") across the enterprise spectrum. The Relational Database Management Systems (RDBMS) have been employed over the past decades for content storage and management, but, the ever-growing heterogeneity in today's data calls for a new storage approach. Thus, the NoSQL database has emerged as the preferred storage facility nowadays since the facility supports unstructured data storage. This creates the need to explore efficient data mining techniques from such NoSQL systems since the available tools and frameworks which are designed for RDBMS are often not directly applicable. In this paper, we focused on topics and terms mining, based on clustering, in document-based NoSQL. This is achieved by adapting the architectural design of an analytics-as-a-service framework and the proposal of the Viterbi algorithm to enhance the accuracy of the terms classification in the system. The results from the pilot testing of our work show higher accuracy in comparison to some previously proposed techniques such as the parallel search.