Biblio

Found 365 results

Filters: Keyword is Support vector machines

2019-07-01

Clemente, C. J., Jaafar, F., Malik, Y.. 2018. Is Predicting Software Security Bugs Using Deep Learning Better Than the Traditional Machine Learning Algorithms? 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). :95–102.

Software insecurity is being identified as one of the leading causes of security breaches. In this paper, we revisited one of the strategies in solving software insecurity, which is the use of software quality metrics. We utilized a multilayer deep feedforward network in examining whether there is a combination of metrics that can predict the appearance of security-related bugs. We also applied the traditional machine learning algorithms such as decision tree, random forest, naïve bayes, and support vector machines and compared the results with that of the Deep Learning technique. The results have successfully demonstrated that it was possible to develop an effective predictive model to forecast software insecurity based on the software metrics and using Deep Learning. All the models generated have shown an accuracy of more than sixty percent with Deep Learning leading the list. This finding proved that utilizing Deep Learning methods and a combination of software metrics can be tapped to create a better forecasting model thereby aiding software developers in predicting security bugs.

Perez, R. Lopez, Adamsky, F., Soua, R., Engel, T.. 2018. Machine Learning for Reliable Network Attack Detection in SCADA Systems. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :633–638.

Critical Infrastructures (CIs) use Supervisory Control And Data Acquisition (SCADA) systems for remote control and monitoring. Sophisticated security measures are needed to address malicious intrusions, which are steadily increasing in number and variety due to the massive spread of connectivity and standardisation of open SCADA protocols. Traditional Intrusion Detection Systems (IDSs) cannot detect attacks that are not already present in their databases. Therefore, in this paper, we assess Machine Learning (ML) for intrusion detection in SCADA systems using a real data set collected from a gas pipeline system and provided by the Mississippi State University (MSU). The contribution of this paper is two-fold: 1) The evaluation of four techniques for missing data estimation and two techniques for data normalization, 2) The performances of Support Vector Machine (SVM), and Random Forest (RF) are assessed in terms of accuracy, precision, recall and F1 score for intrusion detection. Two cases are differentiated: binary and categorical classifications. Our experiments reveal that RF detects intrusions effectively, with an F1 score of > 99% in each case.
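The four evaluation measures named in this abstract can be computed from prediction counts. The following is a minimal sketch on made-up labels, not the paper's data or code; the function name `binary_metrics` and the toy label lists are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels (1 = intrusion, 0 = normal traffic).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For the categorical (multi-class) case, the same function could be called once per class with a different `positive` value and the results averaged; whether the paper averages this way is not stated in the abstract.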

2019-06-10

Roseline, S. A., Geetha, S.. 2018. Intelligent Malware Detection Using Oblique Random Forest Paradigm. 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI). :330-336.

With the increase in the popularity of computerized online applications, the analysis, and detection of a growing number of newly discovered stealthy malware poses a significant challenge to the security community. Signature-based and behavior-based detection techniques are becoming inefficient in detecting new unknown malware. Machine learning solutions are employed to counter such intelligent malware and allow performing more comprehensive malware detection. This capability leads to an automatic analysis of malware behavior. The proposed oblique random forest ensemble learning technique is efficient for malware classification. The effectiveness of the proposed method is demonstrated with three malware classification datasets from various sources. The results are compared with other variants of decision tree learning models. The proposed system performs better than the existing system in terms of classification accuracy and false positive rate.

Udayakumar, N., Saglani, V. J., Cupta, A. V., Subbulakshmi, T.. 2018. Malware Classification Using Machine Learning Algorithms. 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI). :1-9.

Lately, we are facing the Malware crisis due to various types of malware or malicious programs or scripts available in the huge virtual world - the Internet. But, what is malware? Malware can be a malicious software or a program or a script which can be harmful to the user's computer. These malicious programs can perform a variety of functions, including stealing, encrypting or deleting sensitive data, altering or hijacking core computing functions and monitoring users' computer activity without their permission. There are various entry points for these programs and scripts in the user environment, but only one way to remove them is to find them and kick them out of the system which isn't an easy job as these small piece of script or code can be anywhere in the user system. This paper involves the understanding of different types of malware and how we will use Machine Learning to detect these malwares.

Kalash, M., Rochan, M., Mohammed, N., Bruce, N. D. B., Wang, Y., Iqbal, F.. 2018. Malware Classification with Deep Convolutional Neural Networks. 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS). :1-5.

In this paper, we propose a deep learning framework for malware classification. There has been a huge increase in the volume of malware in recent years which poses a serious security threat to financial institutions, businesses and individuals. In order to combat the proliferation of malware, new strategies are essential to quickly identify and classify malware samples so that their behavior can be analyzed. Machine learning approaches are becoming popular for classifying malware, however, most of the existing machine learning methods for malware classification use shallow learning algorithms (e.g. SVM). Recently, Convolutional Neural Networks (CNN), a deep learning approach, have shown superior performance compared to traditional learning algorithms, especially in tasks such as image classification. Motivated by this success, we propose a CNN-based architecture to classify malware samples. We convert malware binaries to grayscale images and subsequently train a CNN for classification. Experiments on two challenging malware classification datasets, Malimg and Microsoft malware, demonstrate that our method achieves better than the state-of-the-art performance. The proposed method achieves 98.52% and 99.97% accuracy on the Malimg and Microsoft datasets respectively.
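The binary-to-grayscale conversion mentioned in this abstract can be illustrated by treating each byte as one pixel intensity. This is only a sketch under stated assumptions: the fixed row width, the zero-padding of the last row, and the name `bytes_to_grayscale` are illustrative choices, not details taken from the paper.

```python
def bytes_to_grayscale(data: bytes, width: int = 16):
    """Reshape raw bytes into rows of `width` pixels (values 0-255).

    The final row is zero-padded. Both the width and the padding rule
    are assumptions made for this illustration.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Round the length up to the next multiple of `width`.
    padded_len = -(-len(data) // width) * width
    padded = data + bytes(padded_len - len(data))
    return [list(padded[i:i + width]) for i in range(0, padded_len, width)]


# Ten stand-in bytes instead of a real binary.
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
for row in image:
    print(row)
```

The resulting list of rows could then be handed to an image or array library before training a CNN; that step is omitted here.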

Kornish, D., Geary, J., Sansing, V., Ezekiel, S., Pearlstein, L., Njilla, L.. 2018. Malware Classification Using Deep Convolutional Neural Networks. 2018 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). :1-6.

In recent years, deep convolution neural networks (DCNNs) have won many contests in machine learning, object detection, and pattern recognition. Furthermore, deep learning techniques achieved exceptional performance in image classification, reaching accuracy levels beyond human capability. Malware variants from similar categories often contain similarities due to code reuse. Converting malware samples into images can cause these patterns to manifest as image features, which can be exploited for DCNN classification. Techniques for converting malware binaries into images for visualization and classification have been reported in the literature, and while these methods do reach a high level of classification accuracy on training datasets, they tend to be vulnerable to overfitting and perform poorly on previously unseen samples. In this paper, we explore and document a variety of techniques for representing malware binaries as images with the goal of discovering a format best suited for deep learning. We implement a database for malware binaries from several families, stored in hexadecimal format. These malware samples are converted into images using various approaches and are used to train a neural network to recognize visual patterns in the input and classify malware based on the feature vectors. Each image type is assessed using a variety of learning models, such as transfer learning with existing DCNN architectures and feature extraction for support vector machine classifier training. Each technique is evaluated in terms of classification accuracy, result consistency, and time per trial. Our preliminary results indicate that improved image representation has the potential to enable more effective classification of new malware.

Sokolov, A. N., Pyatnitsky, I. A., Alabugin, S. K.. 2018. Research of Classical Machine Learning Methods and Deep Learning Models Effectiveness in Detecting Anomalies of Industrial Control System. 2018 Global Smart Industry Conference (GloSIC). :1-6.

Modern industrial control systems (ICS) have increasingly become victims of cyber attacks in recent years. These attacks are hard to detect and their consequences can be catastrophic. Cyber attacks can cause anomalies in the work of the ICS and its technological equipment. The presence of mutual interference and noises in this equipment significantly complicates anomaly detection. Moreover, the traditional means of protection, which are used in corporate solutions, require updating with each change in the structure of the industrial process. An approach based on machine learning for anomaly detection was used to overcome these problems. It complements traditional methods and allows one to detect signal correlations and use them for anomaly detection. The Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation dataset was analyzed as an example of an industrial process. In the course of the research, correlations between the signals of the sensors were detected and preliminary data processing was carried out. Algorithms from the most common techniques of machine learning (decision trees, linear algorithms, support vector machines) and deep learning models (neural networks) were investigated for the industrial process anomaly detection task. It is shown that linear algorithms are the least demanding on computational resources, but they do not achieve an acceptable result and allow a significant number of errors. Decision tree-based algorithms provided acceptable accuracy, but the amount of RAM required for their operation grows polynomially with the training sample volume. The deep neural networks provided the greatest accuracy, but they require considerable computing power for internal calculations.

Eziama, E., Jaimes, L. M. S., James, A., Nwizege, K. S., Balador, A., Tepe, K.. 2018. Machine Learning-Based Recommendation Trust Model for Machine-to-Machine Communication. 2018 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). :1-6.

The Machine Type Communication Devices (MTCDs) are usually based on Internet Protocol (IP), which can cause billions of connected objects to be part of the Internet. The enormous amount of data coming from these devices is quite heterogeneous in nature, which can lead to security issues, such as injection attacks, ballot stuffing, and bad mouthing. Consequently, this work considers machine learning trust evaluation as an effective and accurate option for solving the issues associated with security threats. In this paper, a comparative analysis is carried out with five different machine learning approaches: Naive Bayes (NB), Decision Tree (DT), Linear and Radial Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF). As a critical element of the research, the recommendations consider different Machine-to-Machine (M2M) communication nodes with regard to their ability to identify malicious and honest information. To validate the performances of these models, two trust computation measures were used: Receiver Operating Characteristics (ROCs), Precision and Recall. The malicious data was formulated in Matlab. A scenario was created where 50% of the information was modified to be malicious. The malicious nodes were varied in the ranges of 10%, 20%, 30%, 40%, and the results were carefully analyzed.

Basomingera, R., Choi, Y.. 2019. Route Cache Based SVM Classifier for Intrusion Detection of Control Packet Attacks in Mobile Ad-Hoc Networks. 2019 International Conference on Information Networking (ICOIN). :31–36.

For the security of mobile ad-hoc networks (MANETs), a group of wireless mobile nodes needs to cooperate by forwarding packets, to implement an intrusion detection system (IDS). Some of the current IDS implementations in a clustered MANET have designed mobile nodes to wait until the cluster head is elected before scanning the network and thus nodes may be, unfortunately, exposed to several control packet attacks by which nodes identify falsified routes to reach other nodes. In order to detect control packet attacks such as route falsification, we design a route cache sharing mechanism for a non-clustered network where all one-hop routing data are collected by each node for a cooperative host-based detection. The cooperative host-based detection system uses a Support Vector Machine classifier and achieves a detection rate of around 95%. By successfully detecting the route falsification attacks, nodes are given the capability to avoid other attacks such as black-hole and gray-hole, which are in many cases a result of a successful route falsification attack.

2019-03-28

Subasi, A., Al-Marwani, K., Alghamdi, R., Kwairanga, A., Qaisar, S. M., Al-Nory, M., Rambo, K. A.. 2018. Intrusion Detection in Smart Grid Using Data Mining Techniques. 2018 21st Saudi Computer Society National Computer Conference (NCC). :1-6.

The rapid growth of population and industrialization has paved the way for the use of technologies like the Internet of Things (IoT). Innovations in Information and Communication Technologies (ICT) carry with them many challenges to our expectations of privacy and security. Smart environments make use of security devices and smart appliances, sensors and energy meters. New requirements in security and privacy are driven by the massive growth in the number of devices connected to the IoT, which increases concerns about security and privacy. The most ubiquitous threats to the security of smart grids (SG) arise from physical damage to infrastructure, data destruction, malware, DoS, and intrusions. Intrusion detection covers illegitimate access to information and attacks that physically disrupt the availability of servers. This work proposes an intrusion detection system using data mining techniques for the smart grid environment. The results showed that the proposed random forest method, with a total classification accuracy of 98.94%, F-measure of 0.989, area under the ROC curve (AUC) of 0.999, and kappa value of 0.9865, outperforms the other classification methods. In addition, the feasibility of our method has been successfully demonstrated by comparison with other classification techniques such as ANN, k-NN, SVM and Rotation Forest.
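The kappa value reported in this abstract is a chance-corrected agreement score. As a hedged illustration, Cohen's kappa can be computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels below are invented, and the abstract does not state which kappa variant was used, so this is an assumption.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal frequencies, per class.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c]
                   for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case (a single class everywhere); returning 1.0
        # is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 8 of 10 predictions agree with the truth.
y_true = ["attack"] * 5 + ["normal"] * 5
y_pred = ["attack"] * 4 + ["normal"] + ["attack"] + ["normal"] * 4
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 3))
```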

2019-03-25

Mamdouh, M., Elrukhsi, M. A. I., Khattab, A.. 2018. Securing the Internet of Things and Wireless Sensor Networks via Machine Learning: A Survey. 2018 International Conference on Computer and Applications (ICCA). :215–218.

The Internet of Things (IoT) is the network where physical devices, sensors, appliances and other different objects can communicate with each other without the need for human intervention. Wireless Sensor Networks (WSNs) are main building blocks of the IoT. Both the IoT and WSNs have many critical and non-critical applications that touch almost every aspect of our modern life. Unfortunately, these networks are prone to various types of security threats. Therefore, the security of IoT and WSNs became crucial. Furthermore, the resource limitations of the devices used in these networks complicate the problem. One of the most recent and effective approaches to address such challenges is machine learning. Machine learning inspires many solutions to secure the IoT and WSNs. In this paper, we survey the different threats that can attack both IoT and WSNs and the machine learning techniques developed to counter them.

2019-03-15

Kim, D., Shin, D., Shin, D.. 2018. Unauthorized Access Point Detection Using Machine Learning Algorithms for Information Protection. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :1876-1878.

With the frequent use of Wi-Fi and hotspots that provide a wireless Internet environment, awareness and threats to wireless AP (Access Point) security are steadily increasing. Especially when using unauthorized APs in company, government and military facilities, there is a high possibility of being subjected to various viruses and hacking attacks. It is necessary to detect unauthorized APs for protection of information. In this paper, we use RTT (Round Trip Time) value data set to detect authorized and unauthorized APs in wired / wireless integrated environment, analyze them using machine learning algorithms including SVM (Support Vector Machine), C4.5, KNN (K Nearest Neighbors) and MLP (Multilayer Perceptron). Overall, KNN shows the highest accuracy.
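To make the KNN step concrete, here is a minimal nearest-neighbour vote over a single RTT feature. The RTT values, the labels, the choice k = 3, and the function name `knn_predict` are all invented for illustration; the abstract does not give the feature layout or parameters actually used.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` (an RTT in ms) by majority vote of the k nearest
    training samples. `train` is a list of (rtt_ms, label) pairs."""
    if not train or k <= 0:
        raise ValueError("need training data and a positive k")
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical measurements, not taken from the paper.
train = [
    (2.1, "authorized"), (2.4, "authorized"), (2.0, "authorized"),
    (7.9, "unauthorized"), (8.3, "unauthorized"), (9.1, "unauthorized"),
]
print(knn_predict(train, 8.0))
print(knn_predict(train, 2.2))
```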

Lin, W., Lin, H., Wang, P., Wu, B., Tsai, J.. 2018. Using Convolutional Neural Networks to Network Intrusion Detection for Cyber Threats. 2018 IEEE International Conference on Applied System Invention (ICASI). :1107-1110.

In practice, defenders need a more efficient network detection approach that can quickly learn new network behavioural features for network intrusion detection. In many applications, Deep Learning techniques have been confirmed to outperform classic approaches. Accordingly, this study focused on network intrusion detection using convolutional neural networks (CNNs) based on LeNet-5 to classify the network threats. The experiment results show that the prediction accuracy of intrusion detection goes up to 99.65% with more than 10,000 samples. The overall accuracy rate is 97.53%.

Deliu, I., Leichter, C., Franke, K.. 2018. Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation. 2018 IEEE International Conference on Big Data (Big Data). :5008-5013.

Traditional security controls, such as firewalls, anti-virus and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, when manual analysis is used to extract CTI from non-traditional sources, it is a time consuming, error-prone and resource intensive process. We address these issues by using a hybrid Machine Learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats such as leaked credentials, malicious proxy servers, malware that evades AV detection, etc. The results demonstrate our method is an effective means for quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness.

2019-03-06

Xing, Z., Liu, L., Li, S., Liu, Y.. 2018. Analysis of Radiation Effects for Monitoring Circuit Based on Deep Belief Network and Support Vector Method. 2018 Prognostics and System Health Management Conference (PHM-Chongqing). :511-516.

The monitoring circuit is widely applied in radiation environment and it is of significance to study the circuit reliability with the radiation effects. In this paper, an intelligent analysis method based on Deep Belief Network (DBN) and Support Vector Method is proposed according to the radiation experiments analysis of the monitoring circuit. The Total Ionizing Dose (TID) of the monitoring circuit is used to identify the circuit degradation trend. Firstly, the output waveforms of the monitoring circuit are obtained by radiating with the different TID. Subsequently, the Deep Belief Network Model is trained to extract the features of the circuit signal. Finally, the Support Vector Machine (SVM) and Support Vector Regression (SVR) are applied to classify and predict the remaining useful life (RUL) of the monitoring circuit. According to the experimental results, the performance of DBN-SVM exceeds DBN method for feature extraction and classification, and SVR is effective for predicting the degradation.

2019-02-25

Popovac, M., Karanovic, M., Sladojevic, S., Arsenovic, M., Anderla, A.. 2018. Convolutional Neural Network Based SMS Spam Detection. 2018 26th Telecommunications Forum (TELFOR). :1–4.

SMS spam refers to undesired text messages. Machine Learning methods for anti-spam filters have been noticeably effective in categorizing spam messages. The dataset used in this research is known as Tiago's dataset. A crucial step in the experiment was data preprocessing, which involved reducing text to lower case, tokenization, and removing stopwords. A Convolutional Neural Network was the proposed method for classification. The model's overall accuracy was 98.4%. The obtained model can be used as a tool in many applications.
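The three preprocessing steps named in this abstract (lowercasing, tokenization, stopword removal) can be sketched in a few lines. The stopword list and the tokenization pattern below are small illustrative assumptions, not the ones used in the paper.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "for", "and", "of", "your"}


def preprocess(message, stopwords=STOPWORDS):
    """Lowercase, tokenize on runs of letters or digits, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [t for t in tokens if t not in stopwords]


tokens = preprocess("WINNER! You have won a FREE prize. Call 0800 to claim.")
print(tokens)
```

The token list would then be mapped to integer indices or embeddings before being fed to a convolutional network; that stage is outside this sketch.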

Vishagini, V., Rajan, A. K.. 2018. An Improved Spam Detection Method with Weighted Support Vector Machine. 2018 International Conference on Data Science and Engineering (ICDSE). :1–5.

Email is the most admired method of exchanging messages using the Internet. One of the intimidations to email users is to detect the spam they receive. This can be addressed using different detection and filtering techniques. Machine learning algorithms, especially Support Vector Machine (SVM), can play a vital role in spam detection. We propose the use of weighted SVM for spam filtering using weight variables obtained by KFCM algorithm. The weight variables reflect the importance of different classes. The misclassification of emails is reduced by the growth of weight value. We evaluate the impact of spam detection using SVM, WSVM with KPCM and WSVM with KFCM. UCI Repository SMS Spam base dataset is used for our experimentation.
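One common way to express a weighted SVM is to scale each sample's hinge loss by a weight s_i, giving the objective 0.5 * ||w||^2 + C * sum_i s_i * max(0, 1 - y_i (w . x_i + b)). The sketch below evaluates that objective for a linear model. It is an assumption that the paper uses this form, and the weights here are hand-picked rather than derived from KFCM.

```python
def weighted_hinge_objective(w, b, X, y, sample_weights, C=1.0):
    """Evaluate 0.5*||w||^2 + C * sum_i s_i * max(0, 1 - y_i*(w.x_i + b)).

    Labels in `y` must be +1 or -1. The weights in `sample_weights` are
    supplied by the caller (hand-picked in the example below).
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    total = 0.0
    for xi, yi, si in zip(X, y, sample_weights):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += si * max(0.0, 1.0 - margin)
    return regularizer + C * total


# Hypothetical 2-D data; the third sample counts double.
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0]]
y = [1, 1, -1]
uniform = weighted_hinge_objective(w, b, X, y, [1.0, 1.0, 1.0])
weighted = weighted_hinge_objective(w, b, X, y, [1.0, 1.0, 2.0])
print(uniform, weighted)
```

Raising a sample's weight raises the cost of misclassifying it, which is consistent with the abstract's remark that misclassification falls as the weight value grows.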

2019-02-18

Fukushima, Keishiro, Nakamura, Toru, Ikeda, Daisuke, Kiyomoto, Shinsaku. 2018. Challenges in Classifying Privacy Policies by Machine Learning with Word-based Features. Proceedings of the 2nd International Conference on Cryptography, Security and Privacy. :62–66.

In this paper, we discuss challenges when we try to automatically classify privacy policies using machine learning with words as the features. Since it is difficult for the general public to understand privacy policies, it is necessary to support them in doing so. To this end, the authors believe that machine learning is one of the promising ways because users can grasp the meaning of policies through outputs by a machine learning algorithm. Our final goal is to develop a system which automatically translates privacy policies into privacy labels [1]. Toward this goal, we classify sentences in privacy policies with category labels, using popular machine learning algorithms, such as a naive Bayes classifier. We choose these algorithms because we could use trained classifiers to evaluate keywords appropriate for privacy labels. Therefore, we adopt words as the features of those algorithms. Experimental results show about 85% accuracy. We think that much higher accuracy is necessary to achieve our final goal. By changing learning settings, we identified one reason for the low accuracy: privacy policies include many sentences which are not direct descriptions of information about categories. Such sentences seem redundant, but they may be essential in legal documents in order to prevent misinterpretation. Thus, it is important for machine learning algorithms to handle these redundant sentences appropriately.

Gu, Bin, Yuan, Xiao-Tong, Chen, Songcan, Huang, Heng. 2018. New Incremental Learning Algorithm for Semi-Supervised Support Vector Machine. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. :1475–1484.

Semi-supervised learning is especially important in data mining applications because it can make use of plentiful unlabeled data to train the high-quality learning models. Semi-Supervised Support Vector Machine (S3VM) is a powerful semi-supervised learning model. However, the high computational cost and non-convexity severely impede the S3VM method in large-scale applications. Although several learning algorithms were proposed for S3VM, scaling up S3VM is still an open problem. To address this challenging problem, in this paper, we propose a new incremental learning algorithm to scale up S3VM (IL-S3VM) based on the path following technique in the framework of Difference of Convex (DC) programming. The traditional DC programming based algorithms need multiple outer loops and are not suitable for incremental learning, and traditional path following algorithms are limited to convex problems. Our new IL-S3VM algorithm based on the path-following technique can directly update the solution of S3VM to converge to a local minimum within one outer loop so that the efficient incremental learning can be achieved. More importantly, we provide the finite convergence analysis for our new algorithm. To the best of our knowledge, our new IL-S3VM algorithm is the first efficient path following algorithm for a non-convex problem (i.e., S3VM) with local minimum convergence guarantee. Experimental results on a variety of benchmark datasets not only confirm the finite convergence of IL-S3VM, but also show a huge reduction of computational time compared with existing batch and incremental learning algorithms, while retaining the similar generalization performance.

Wu, KuanTing, Chou, ShingHua, Chen, ShyhWei, Tsai, ChingTsorng, Yuan, ShyanMing. 2018. Application of Machine Learning to Identify Counterfeit Website. Proceedings of the 20th International Conference on Information Integration and Web-based Applications & Services. :321–324.

In recent years the prevalence of fraudulent websites has become more severe than before. Fraudulent ecommerce websites that sell counterfeit goods not only cause financial damage to consumers but also have a great impact on the Internet industry. Currently, there is no effective way to confront these websites. In this paper, we aim to achieve three goals: find the characteristics of counterfeit websites, train models for classifying ecommerce websites and provide a service to help consumers distinguish counterfeit websites from legitimate ones.

Oka, Daisuke, Balage, Don Hiroshan Lakmal, Motegi, Kazuhiro, Kobayashi, Yasuhiro, Shiraishi, Yoichi. 2018. A Combination of Support Vector Machine and Heuristics in On-line Non-Destructive Inspection System. Proceedings of the 2018 International Conference on Machine Learning and Machine Intelligence. :45–49.

This paper deals with an on-line non-destructive inspection system by using hammering sounds based on the combination of support vector machine and a heuristic algorithm. In machine learning algorithms, the perfect performance is hard to attain and it is newly suggested that a heuristic algorithm redeeming this insufficiency is connected to the support vector machine as a post-process. The experimental results show that the combination of support vector machine and the heuristic algorithm attains 100% detection of defective pieces with 18.4% of erroneous determination of non-defective pieces within the upper limit of given processing time.

Zhu, Mengeheng, Shi, Hong. 2018. A Novel Support Vector Machine Algorithm for Missing Data. Proceedings of the 2nd International Conference on Innovation in Artificial Intelligence. :48–53.

The missing data problem often occurs in data analysis. The most common way to solve this problem is imputation. But imputation methods are only suitable for dealing with a low proportion of missing data, when assuming that missing data satisfies MCAR (Missing Completely at Random) or MAR (Missing at Random). In this paper, considering the reasons for missing data, we propose a novel support vector machine method using a new kernel function to solve the problem with a relatively large proportion of missing data. This method makes full use of observed data to reduce the error caused by filling a large number of missing values. We validate our method on 4 data sets from UCI Repository of Machine Learning. The accuracy, F-score, Kappa statistics and recall are used to evaluate the performance. Experimental results show that our method achieves significant improvement in terms of classification results compared with common imputation methods, even when the proportion of missing data is high.

Wu, Siyan, Tong, Xiaojun, Wang, Wei, Xin, Guodong, Wang, Bailing, Zhou, Qi. 2018. Website Defacements Detection Based on Support Vector Machine Classification Method. Proceedings of the 2018 International Conference on Computing and Data Engineering. :62–66.

Website defacements can inflict significant harm on the website owner through the loss of reputation, the loss of money, or the leakage of information. Due to the complexity and diversity of all kinds of web application systems, especially a lack of necessary security maintenance, website defacements increased year by year. In this paper, we focus on detecting whether the website has been defaced by extracting website features and website embedded trojan features. We use three kinds of classification learning algorithms which include Gradient Boosting Decision Tree (GBDT), Random Forest (RF) and Support Vector Machine (SVM) to do the classification experiments, and experimental results show that Support Vector Machine classifier performed better than two other classifiers. It can achieve an overall accuracy of 95%-96% in detecting website defacements.

Xu, Bowen, Shirani, Amirreza, Lo, David, Alipour, Mohammad Amin. 2018. Prediction of Relatedness in Stack Overflow: Deep Learning vs. SVM: A Reproducibility Study. Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. :21:1–21:10.

Background: Xu et al. used a deep neural network (DNN) technique to classify the degree of relatedness between two knowledge units (question-answer threads) on Stack Overflow. More recently, extending Xu et al.'s work, Fu and Menzies proposed a simpler classification technique based on a fine-tuned support vector machine (SVM) that achieves similar performance but in a much shorter time. Thus, they suggested that researchers need to compare their sophisticated methods against simpler alternatives. Aim: The aim of this work is to replicate the previous studies and further investigate the validity of Fu and Menzies' claim by evaluating the DNN- and SVM-based approaches on a larger dataset. We also compare the effectiveness of these two approaches against SimBow, a lightweight SVM-based method that was previously used for general community question-answering. Method: We (1) collect a large dataset containing knowledge units from Stack Overflow, (2) show the value of the new dataset addressing shortcomings of the original one, (3) re-evaluate both the DNN- and SVM-based approaches on the new dataset, and (4) compare the performance of the two approaches against that of SimBow. Results: We find that: (1) there are several limitations in the original dataset used in the previous studies, (2) effectiveness of both Xu et al.'s and Fu and Menzies' approaches (as measured using F1-score) drops sharply on the new dataset, (3) similar to the previous finding, performance of SVM-based approaches (Fu and Menzies' approach and SimBow) is slightly better than the DNN-based approach, (4) contrary to the previous findings, Fu and Menzies' approach runs much slower than the DNN-based approach on the larger dataset - its runtime grows sharply with increase in dataset size, and (5) SimBow outperforms both Xu et al.'s and Fu and Menzies' approaches in terms of runtime. Conclusion: We conclude that, for this task, simpler approaches based on SVM perform adequately well. We also illustrate the challenges brought by the increased size of the dataset and show the benefit of a lightweight SVM-based approach for this task.

Lu, Yunmei, Yan, Mingyuan, Han, Meng, Zhang, Qingliang, Zhang, Yanqing. 2018. Privacy Preserving Multiclass Classification for Horizontally Distributed Data. Proceedings of the 19th Annual SIG Conference on Information Technology Education. :165–165.

With the advent of the era of big data, applying data mining techniques on assembling data from multiple parties (or sources) has become a leading trend. In this work, a Privacy Preserving Multiclass Classification (PPM2C) method is proposed. Experimental results show that PPM2C is workable and stable.