Biblio

Filters: Keyword is Twitter
2021-08-11
Alshaikh, Mansour, Zohdy, Mohamed.  2020.  Sentiment Analysis for Smartphone Operating System: Privacy and Security on Twitter Data. 2020 IEEE International Conference on Electro Information Technology (EIT). :366—369.
The aim of the study was to investigate the privacy and security of user data on Twitter. More than two million relevant tweets collected over a span of two years were used to gather the essential information. In addition, we classify the sentiment of Twitter data by presenting the results of a machine learning approach based on the Naive Bayes algorithm. Although this algorithm is time consuming compared to the listing method, it can lead to relatively effective estimation. The tweets are extracted, pre-processed, and then categorized into neutral, negative, and positive sentiments. By applying the chosen methodology, the study identifies the most effective mobile operating systems according to the sentiments of social media users. Additionally, the application of the algorithm needs to meet the privacy and security needs of Twitter users in order to optimize the use of social media intelligence. The approach helps in assessing the competitive intelligence of Twitter data and, simultaneously, the challenges posed by the privacy and security of user content and its contextual information. The findings of the empirical research show that users are more concerned about the privacy and security of iOS compared to Android and Windows Phone.
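
A minimal sketch (not from the paper) of the Naive Bayes sentiment-classification step described above; the CSV file name and its text/sentiment columns are assumptions for illustration:

```python
# Sketch only: Naive Bayes sentiment classification of tweets with scikit-learn.
# The file "labelled_tweets.csv" and its columns are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

tweets = pd.read_csv("labelled_tweets.csv")          # columns: text, sentiment
X_train, X_test, y_train, y_test = train_test_split(
    tweets["text"], tweets["sentiment"], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                      MultinomialNB())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```
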
2021-03-04
Carlini, N., Farid, H..  2020.  Evading Deepfake-Image Detectors with White- and Black-Box Attacks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). :2804—2813.

It is now possible to synthesize highly realistic images of people who do not exist. Such content has, for example, been implicated in the creation of fraudulent social media profiles responsible for disinformation campaigns. Significant efforts are, therefore, being deployed to detect synthetically-generated content. One popular forensic approach trains a neural network to distinguish real from synthetic content. We show that such forensic classifiers are vulnerable to a range of attacks that reduce the classifier to near-0% accuracy. We develop five attack case studies on a state-of-the-art classifier that achieves an area under the ROC curve (AUC) of 0.95 on almost all existing image generators, when only trained on one generator. With full access to the classifier, we can flip the lowest bit of each pixel in an image to reduce the classifier's AUC to 0.0005; perturb 1% of the image area to reduce the classifier's AUC to 0.08; or add a single noise pattern in the synthesizer's latent space to reduce the classifier's AUC to 0.17. We also develop a black-box attack that, with no access to the target classifier, reduces the AUC to 0.22. These attacks reveal significant vulnerabilities of certain image-forensic classifiers.
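
A rough sketch (not the authors' code) of the "flip the lowest bit of each pixel" white-box idea, assuming a differentiable classifier that returns a synthetic-image score; the `classifier` interface is hypothetical:

```python
# Sketch only: nudge each pixel by +/-1 against the gradient of the classifier's
# synthetic-image score, i.e. a one-bit-magnitude per-pixel perturbation.
import torch

def lsb_flip_attack(classifier, image_uint8):
    """image_uint8: (C, H, W) tensor with integer values in [0, 255]."""
    x = image_uint8.float().div(255.0).unsqueeze(0).requires_grad_(True)
    score = classifier(x).sum()                  # assumed "probability synthetic"
    score.backward()
    step = -torch.sign(x.grad.squeeze(0))        # move each pixel to lower the score
    attacked = (image_uint8.float() + step).clamp(0, 255)
    return attacked.to(torch.uint8)
```
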

2021-02-23
Krohmer, D., Schotten, H. D..  2020.  Decentralized Identifier Distribution for Moving Target Defense and Beyond. 2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA). :1—8.

In this work, we propose a novel approach for decentralized identifier distribution and synchronization in networks. The protocol generates network entity identifiers composed of timestamps and cryptographically secure random values with a significant reduction of collision probability. The distribution is inspired by Universally Unique Identifiers and timestamp-based concurrency control algorithms originating from database applications. We defined fundamental requirements for the distribution, including: uniqueness, accuracy of distribution, optimal timing behavior, scalability, small impact on network load for different operation modes, and overall compliance with common network security objectives. An implementation of the proposed approach is evaluated and the results are presented. Originally designed for a domain of proactive defense strategies known as Moving Target Defense, the general architecture of the protocol enables arbitrary applications where identifier distribution in networks has to be decentralized, rapid and secure.
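
A minimal sketch (an assumption, not the authors' protocol) of generating an identifier from a timestamp plus a cryptographically secure random value:

```python
# Sketch only: identifier = timestamp part (ordering) + CSPRNG part (collision resistance).
import secrets
import time

def generate_identifier(random_bits: int = 80) -> str:
    timestamp_ns = time.time_ns()                 # concurrency-control style ordering
    random_part = secrets.randbits(random_bits)   # cryptographically secure randomness
    return f"{timestamp_ns:016x}-{random_part:0{random_bits // 4}x}"

print(generate_identifier())
```
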

2021-02-15
Drakopoulos, G., Giotopoulos, K., Giannoukou, I., Sioutas, S..  2020.  Unsupervised Discovery Of Semantically Aware Communities With Tensor Kruskal Decomposition: A Case Study In Twitter. 2020 15th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP). :1–8.
Substantial empirical evidence, including the success of synthetic graph generation models as well as of analytical methodologies, suggests that large, real graphs have a recursive community structure. The latter results, at least in part, in other important properties of these graphs such as low diameter, high clustering coefficient values, heavy degree distribution tail, and clustered graph spectrum. Notice that this structure need not be official or moderated like Facebook groups; it can also take an ad hoc and unofficial form depending on the functionality of the social network under study, for instance the follow relationship on Twitter or the connections between news aggregators on Reddit. Community discovery is paramount in numerous applications such as political campaigns, digital marketing, crowdfunding, and fact checking. Here a tensor representation for Twitter subgraphs is proposed which takes into consideration both the follow-follower relationships and the coherency in hashtags. Community structure discovery then reduces to the computation of the Tucker tensor decomposition, a higher order counterpart of the well-known unsupervised learning method of singular value decomposition (SVD). Tucker decomposition clearly outperforms the SVD in terms of finding a more compact community size distribution in experiments done in Julia on a Twitter subgraph. This can be attributed to the facts that the proposed methodology combines both structural and functional Twitter elements and that hashtags carry an increased semantic weight in comparison to ordinary tweets.
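
The paper's experiments were in Julia; the Python/tensorly sketch below, with its tensor construction and ranks, is an illustrative assumption showing community discovery via Tucker decomposition of a user x user x hashtag tensor:

```python
# Sketch only: Tucker decomposition of a synthetic follower/hashtag tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

n_users, n_hashtags = 500, 40
# T[i, j, h] could count interactions of user i with user j under hashtag h (assumed).
T = np.random.poisson(0.05, size=(n_users, n_users, n_hashtags)).astype(float)

core, factors = tucker(tl.tensor(T), rank=[10, 10, 5])
user_embedding = factors[0]                   # latent community coordinates per user
communities = user_embedding.argmax(axis=1)   # crude community assignment per user
```
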
2020-11-04
Khurana, N., Mittal, S., Piplai, A., Joshi, A..  2019.  Preventing Poisoning Attacks On AI Based Threat Intelligence Systems. 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP). :1—6.

As AI systems become more ubiquitous, securing them becomes an emerging challenge. Over the years, with the surge in online social media use and the data available for analysis, AI systems have been built to extract, represent and use this information. The credibility of this information extracted from open sources, however, can often be questionable. Malicious or incorrect information can cause a loss of money, reputation, and resources; and in certain situations, pose a threat to human life. In this paper, we use an ensembled semi-supervised approach to determine the credibility of Reddit posts by estimating their reputation score to ensure the validity of information ingested by AI systems. We demonstrate our approach in the cybersecurity domain, where security analysts utilize these systems to determine possible threats by analyzing the data scattered on social media websites, forums, blogs, etc.
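
A minimal sketch (not the authors' system) of a semi-supervised credibility classifier in the spirit described above; the features, labels and threshold are assumptions:

```python
# Sketch only: self-training from a small labelled seed set to score post credibility.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# X: feature vectors for posts; y: 1 = credible, 0 = not credible, -1 = unlabelled.
X = np.random.rand(1000, 20)
y = np.full(1000, -1)
y[:100] = np.random.randint(0, 2, 100)        # only a small labelled seed set

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
model.fit(X, y)
credibility_scores = model.predict_proba(X)[:, 1]   # reputation-style score per post
```
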

2020-11-02
Kralevska, Katina, Gligoroski, Danilo, Jensen, Rune E., Øverby, Harald.  2018.  HashTag Erasure Codes: From Theory to Practice. IEEE Transactions on Big Data. 4:516—529.
Minimum-Storage Regenerating (MSR) codes have emerged as a viable alternative to Reed-Solomon (RS) codes as they minimize the repair bandwidth while they are still optimal in terms of reliability and storage overhead. Although several MSR constructions exist, so far they have not been practically implemented, mainly due to the large number of I/O operations. In this paper, we analyze high-rate MDS codes that are simultaneously optimized in terms of storage, reliability, I/O operations, and repair bandwidth for single and multiple failures of the systematic nodes. The codes were recently introduced in [1] without any specific name. Due to the resemblance between the hashtag sign # and the procedure of the code construction, we call them in this paper HashTag Erasure Codes (HTECs). HTECs provide the lowest data-read and data-transfer, and thus the lowest repair time, for an arbitrary sub-packetization level α, where α ≤ r^⌈k/r⌉, among all existing MDS codes for distributed storage including MSR codes. The repair process is linear and highly parallel. Additionally, we show that HTECs are the first high-rate MDS codes that reduce the repair bandwidth for more than one failure. Practical implementations of HTECs in Hadoop release 3.0.0-alpha2 demonstrate their great potential.
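
As a concrete illustration of the sub-packetization bound quoted above (the code parameters n = 14, k = 10, r = 4 are assumptions, not taken from the paper):

```latex
\alpha \le r^{\lceil k/r \rceil} = 4^{\lceil 10/4 \rceil} = 4^{3} = 64
```
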
2020-10-12
Foreman, Zackary, Bekman, Thomas, Augustine, Thomas, Jafarian, Haadi.  2019.  PAVSS: Privacy Assessment Vulnerability Scoring System. 2019 International Conference on Computational Science and Computational Intelligence (CSCI). :160–165.
Currently, the guidelines for business entities to collect and use consumer information from online sources are guided by the Fair Information Practice Principles set forth by the Federal Trade Commission in the United States. These guidelines are inadequate, outdated, and provide little protection for consumers. Moreover, there are many techniques to anonymize the stored data that was collected by large companies and governments. However, what does not exist is a framework that is capable of evaluating and scoring the effects of this information in the event of a data breach. In this work, a framework for scoring and evaluating the vulnerability of private data is presented. This framework is created to be used in parallel with currently adopted frameworks that are used to score and evaluate other areas of deficiencies within the software, including CVSS and CWSS. It is dubbed the Privacy Assessment Vulnerability Scoring System (PAVSS) and quantifies the privacy-breach vulnerability an individual takes on when using an online platform. This framework is based on a set of hypotheses about user behavior, inherent properties of an online platform, and the usefulness of available data in performing a cyber attack. The weight each of these metrics has within our model is determined by surveying cybersecurity experts. Finally, we test the validity of our user-behavior based hypotheses, and indirectly our model, by analyzing user posts from a large Twitter data set.
2020-05-22
Devarakonda, Ranjeet, Giansiracusa, Michael, Kumar, Jitendra.  2018.  Machine Learning and Social Media to Mine and Disseminate Big Scientific Data. 2018 IEEE International Conference on Big Data (Big Data). :5312—5315.

One of the challenges in supplying the communities with wider access to scientific databases is the need for knowledge of database languages like Structured Query Language (SQL). Although the SQL language has been published in many forms, not everybody is able to write SQL queries. Another challenge is that it might not be practical to make the public aware of the structure of databases. There is a need for novice users to query relational databases using their natural language. To solve this problem, many natural language interfaces to structured databases have been developed. The goal is to provide a more intuitive method for generating database queries and delivering responses. Through social media, which makes it possible to interact with a wide section of the population, and with the help of natural language processing, researchers at the Atmospheric Radiation Measurement (ARM) Data Center at Oak Ridge National Laboratory (ORNL) have developed a concept to enable easy search and retrieval of data from several environmental data centers for the scientific community. Using a machine learning framework that maps natural language text to thousands of datasets, instruments, variables, and data streams, the prototype system would allow users to request data through Twitter and receive a link (via tweet) to applicable data results on the project's search catalog tailored to their keywords. This automated identification of relevant data from various petascale archives at ORNL could increase convenience, access, and use of the project's data by the broader community. In this paper we discuss how some data-intensive projects at ORNL are using innovative ways to help in data discovery.

2020-05-15
Jeyasudha, J., Usha, G..  2018.  Detection of Spammers in the Reconnaissance Phase by Machine Learning Techniques. 2018 3rd International Conference on Inventive Computation Technologies (ICICT). :216—220.

Reconnaissance is the phase in which attackers identify their targets and decide how to collect information from professional social networks, which can then be used to select and exploit targeted employees in order to penetrate an organization. Here, a framework is proposed for the early detection of attackers in the reconnaissance phase by highlighting the common characteristic behavior among attackers in professional social networks, and by creating artificial honeypot profiles within the organizational social network that can be used to detect a potential incoming threat. By analyzing a dataset of social network profiles in combination with machine learning techniques, a DspamRPfast model is proposed for building a classifier system that predicts the probability of a profile being fake or malicious and filters such profiles out using XGBoost, achieving faster classification and a greater accuracy of 84.8%.
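
A hedged sketch (not the DspamRPfast implementation) of training an XGBoost classifier to flag fake or malicious profiles; the profile features and labels are assumptions:

```python
# Sketch only: XGBoost classifier over assumed profile features.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = np.random.rand(2000, 15)                  # e.g. connection counts, activity rates
y = np.random.randint(0, 2, 2000)             # 1 = fake/malicious profile (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```
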

2020-05-08
Huang, Yifan, Chung, Wingyan, Tang, Xinlin.  2018.  A Temporal Recurrent Neural Network Approach to Detecting Market Anomaly Attacks. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :160—162.

In recent years, the spreading of malicious social media messages about financial stocks has threatened the security of the financial market. A market anomaly attack is an illegal practice in the stock or commodities markets that induces investors to make purchase or sale decisions based on false information. Identifying these threats from noisy social media datasets remains challenging because of the long time sequences in these social media postings, the ambiguous textual context, and the difficulty traditional deep learning approaches have in handling both temporal and text-dependent data such as financial social media messages. This research developed a temporal recurrent neural network (TRNN) approach to capturing both time and text sequence dependencies for intelligent detection of market anomalies. We tested the approach by using financial social media posts about U.S. technology companies and their stock returns. Compared with traditional neural network approaches, TRNN was found to classify abnormal returns more efficiently and effectively.
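
A hedged sketch (not the paper's TRNN) of a recurrent classifier over a time-ordered sequence of message embeddings per stock; the dimensions and data are illustrative assumptions:

```python
# Sketch only: LSTM over daily tweet-embedding sequences, predicting abnormal vs. normal return.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)   # abnormal vs. normal return

    def forward(self, x):                      # x: (batch, time_steps, embed_dim)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])

model = SequenceClassifier()
batch = torch.randn(8, 30, 128)                # 8 stocks, 30 days of message embeddings
logits = model(batch)
```
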

Dionísio, Nuno, Alves, Fernando, Ferreira, Pedro M., Bessani, Alysson.  2019.  Cyberthreat Detection from Twitter using Deep Neural Networks. 2019 International Joint Conference on Neural Networks (IJCNN). :1—8.

To be prepared against cyberattacks, most organizations resort to security information and event management systems to monitor their infrastructures. These systems depend on the timeliness and relevance of the latest updates, patches and threats provided by cyberthreat intelligence feeds. Open source intelligence platforms, namely social media networks such as Twitter, are capable of aggregating a vast amount of cybersecurity-related sources. To process such information streams, we require scalable and efficient tools capable of identifying and summarizing relevant information for specified assets. This paper presents the processing pipeline of a novel tool that uses deep neural networks to process cybersecurity information received from Twitter. A convolutional neural network identifies tweets containing security-related information relevant to assets in an IT infrastructure. Then, a bidirectional long short-term memory network extracts named entities from these tweets to form a security alert or to fill an indicator of compromise. The proposed pipeline achieves an average 94% true positive rate and 91% true negative rate for the classification task and an average F1-score of 92% for the named entity recognition task, across three case study infrastructures.
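
A hedged sketch of the first (tweet-classification) stage only, not the authors' pipeline; the vocabulary size, embedding size and tokenization are assumptions:

```python
# Sketch only: small convolutional classifier that flags security-relevant tweets.
import torch
import torch.nn as nn

class TweetCNN(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=100, n_filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.head = nn.Linear(n_filters, 2)        # relevant vs. not relevant

    def forward(self, token_ids):                  # (batch, seq_len) of token ids
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values
        return self.head(x)

model = TweetCNN()
dummy_batch = torch.randint(0, 20000, (16, 40))    # 16 tweets of 40 tokens
print(model(dummy_batch).shape)                    # torch.Size([16, 2])
```

The second stage described in the abstract, named entity extraction with a bidirectional LSTM, would consume the tweets this classifier keeps.
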

Chaudhary, Anshika, Mittal, Himangi, Arora, Anuja.  2019.  Anomaly Detection using Graph Neural Networks. 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). :346—350.

Conventional methods for anomaly detection include techniques based on clustering, proximity or classification. With rapidly growing social networks, outliers or anomalies find ingenious ways to obscure themselves in the network, making the conventional techniques inefficient. In this paper, we utilize the ability of deep learning over topological characteristics of a social network to detect anomalies in an email network and a Twitter network. We present a model, a Graph Neural Network, which is applied to social connection graphs to detect anomalies. Combinations of various social network statistical measures are taken into account to study the graph structure and the functioning of the anomalous nodes by employing deep neural networks on them. The hidden layer of the neural network plays an important role in finding the impact of statistical measure combinations in anomaly detection.
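
A hedged sketch (not the paper's model) of computing per-node statistical measures with networkx and scoring anomalies with a small neural network; the graph, measures and labels are assumptions:

```python
# Sketch only: graph statistical measures fed into a small MLP for anomaly scoring.
import networkx as nx
import numpy as np
from sklearn.neural_network import MLPClassifier

G = nx.barabasi_albert_graph(500, 3)           # placeholder social connection graph
degree = dict(G.degree())
clustering = nx.clustering(G)
betweenness = nx.betweenness_centrality(G)

X = np.array([[degree[n], clustering[n], betweenness[n]] for n in G.nodes()])
y = np.random.randint(0, 2, len(X))            # placeholder anomaly labels

clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500)
clf.fit(X, y)
anomaly_scores = clf.predict_proba(X)[:, 1]
```
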

2019-09-23
Suriarachchi, I., Withana, S., Plale, B..  2018.  Big Provenance Stream Processing for Data Intensive Computations. 2018 IEEE 14th International Conference on e-Science (e-Science). :245–255.
In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes one or more of the popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out-of-order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
2019-08-05
Nabipourshiri, Rouzbeh, Abu-Salih, Bilal, Wongthongtham, Pornpit.  2018.  Tree-Based Classification to Users' Trustworthiness in OSNs. Proceedings of the 2018 10th International Conference on Computer and Automation Engineering. :190-194.

In the light of the information revolution and the propagation of big social data, the dissemination of misleading information is certainly difficult to control. This is due to the rapid and intensive flow of information through unconfirmed sources under propaganda and tendentious rumors. This causes confusion and loss of trust between individuals and groups, and even between governments and their citizens. It necessitates a consolidation of efforts to stop the penetration of false information by developing theoretical and practical methodologies that aim to measure the credibility of users of these virtual platforms. This paper presents an approach to domain-based prediction of users' trustworthiness in Online Social Networks (OSNs). By incorporating three machine learning algorithms, the experimental results verify the applicability of the proposed approach to classify and predict domain-based trustworthy users of OSNs.

2019-03-04
Aborisade, O., Anwar, M..  2018.  Classification for Authorship of Tweets by Comparing Logistic Regression and Naive Bayes Classifiers. 2018 IEEE International Conference on Information Reuse and Integration (IRI). :269–276.

At a time when all it takes to open a Twitter account is a mobile phone, the act of authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources has been observed to travel much faster and farther than real news. Hence, we need valid measures to identify authors of misinformation to avert these consequences. Researchers propose different authorship attribution techniques to approach this kind of problem. However, because tweets are made up of only 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify authors of tweets by comparing machine learning methods like logistic regression and naive Bayes. The processes of this application are fetching of tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the text classification for authorship process using machine learning techniques. In total, there were 46,895 tweets used as both training and testing data, and unique features specific to Twitter were extracted. Several steps were performed in the pre-processing phase, including removal of short texts, removal of stop-words and punctuation, and tokenizing and stemming of texts. This approach transforms the pre-processed data into a set of feature vectors in Python. Logistic regression and naive Bayes algorithms were applied to the set of feature vectors for the training and testing of the classifier. The logistic regression based classifier gave the highest accuracy of 91.1% compared to the naive Bayes classifier with 89.8%.
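
A minimal sketch (not the authors' code) of comparing logistic regression and naive Bayes for authorship attribution over TF-IDF features; the data file and its columns are assumptions:

```python
# Sketch only: compare two classifiers for tweet authorship attribution.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

tweets = pd.read_csv("tweets.csv")             # assumed columns: text, author
X_tr, X_te, y_tr, y_te = train_test_split(
    tweets["text"], tweets["author"], test_size=0.2, random_state=1)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", MultinomialNB())]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(X_tr, y_tr)
    print(name, "accuracy:", pipe.score(X_te, y_te))
```
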

2019-02-25
Cornelissen, Laurenz A., Barnett, Richard J, Schoonwinkel, Petrus, Eichstadt, Brent D., Magodla, Hluma B..  2018.  A Network Topology Approach to Bot Classification. Proceedings of the Annual Conference of the South African Institute of Computer Scientists and Information Technologists. :79-88.
Automated social agents, or bots, are increasingly becoming a problem on social media platforms. There is a growing body of literature and multiple tools to aid in the detection of such agents on online social networking platforms. We propose that the social network topology of a user would be sufficient to determine whether the user is an automated agent or a human. To test this, we use a publicly available dataset containing users on Twitter labelled as either automated social agent or human. Using an unsupervised machine learning approach, we obtain a detection accuracy rate of 70%.
Xu, H., Hu, L., Liu, P., Xiao, Y., Wang, W., Dayal, J., Wang, Q., Tang, Y..  2018.  Oases: An Online Scalable Spam Detection System for Social Networks. 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). :98–105.
Web-based social networks enable new community-based opportunities for participants to engage, share their thoughts, and interact with each other. These related activities, such as searching and advertising, are threatened by spammers, content polluters, and malware disseminators. We propose a scalable spam detection system, termed Oases, for uncovering social spam in social networks using an online and scalable approach. The novelty of our design lies in two key components: (1) a decentralized DHT-based tree overlay deployment for harvesting and uncovering deceptive spam from social communities; and (2) a progressive aggregation tree for aggregating the properties of these spam posts for creating new spam classifiers to actively filter out new spam. We design and implement the prototype of Oases and discuss the design considerations of the proposed approach. Our large-scale experiments using real-world Twitter data demonstrate scalability, attractive load-balancing, and graceful efficiency in online spam detection for social networks.
2019-02-18
Zhang, X., Xie, H., Lui, J. C. S..  2018.  Sybil Detection in Social-Activity Networks: Modeling, Algorithms and Evaluations. 2018 IEEE 26th International Conference on Network Protocols (ICNP). :44–54.

Detecting fake accounts (sybils) in online social networks (OSNs) is vital to protect OSN operators and their users from various malicious activities. Typical graph-based sybil detection (a mainstream methodology) assumes that sybils can make friends with only a limited (or small) number of honest users. However, recent evidence showed that this assumption does not hold in real-world OSNs, leading to low detection accuracy. To address this challenge, we explore users' activities to assist sybil detection. The intuition is that honest users are much more selective in choosing whom to interact with than whom to befriend. We first develop the social and activity network (SAN), a two-layer hyper-graph that unifies users' friendships and their activities, to fully utilize users' activities. We also propose a more practical sybil attack model, where sybils can launch both friendship attacks and activity attacks. We then design Sybil SAN to detect sybils via coupling three random walk-based algorithms on the SAN, and prove the convergence of Sybil SAN. We develop an efficient iterative algorithm to compute the detection metric for Sybil SAN, and derive the number of rounds needed to guarantee the convergence. We use "matrix perturbation theory" to bound the detection error when sybils launch many friendship attacks and activity attacks. Extensive experiments on both synthetic and real-world datasets show that Sybil SAN is highly robust against sybil attacks, and can detect sybils accurately under practical scenarios, where current state-of-the-art sybil defenses have low accuracy.

2018-11-28
Ghelani, Nimesh, Mohammed, Salman, Wang, Shine, Lin, Jimmy.  2017.  Event Detection on Curated Tweet Streams. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. :1325–1328.

We present a system for identifying interesting social media posts on Twitter and delivering them to users' mobile devices in real time as push notifications. In our problem formulation, users are interested in broad topics such as politics, sports, and entertainment: our system processes tweets in real time to identify relevant, novel, and salient content. There are three interesting aspects to our work: First, instead of attempting to tame the cacophony of unfiltered tweets, we exploit a smaller, but still sizeable, collection of curated tweet streams corresponding to the Twitter accounts of different media outlets. Second, we apply distant supervision to extract topic labels from curated streams that have a specific focus, which can then be leveraged to build high-quality topic classifiers essentially "for free". Finally, our system delivers content via Twitter direct messages, supporting in situ interactions modeled after conversations with intelligent agents. These ideas are demonstrated in an end-to-end working prototype.

2018-10-26
Al-Janabi, Mohammed, Quincey, Ed de, Andras, Peter.  2017.  Using Supervised Machine Learning Algorithms to Detect Suspicious URLs in Online Social Networks. Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. :1104–1111.

The increasing volume of malicious content in social networks requires automated methods to detect and eliminate such content. This paper describes a supervised machine learning classification model that has been built to detect the distribution of malicious content in online social networks (OSNs). Multisource features have been used to detect social network posts that contain malicious Uniform Resource Locators (URLs). These URLs could direct users to websites that contain malicious content, drive-by download attacks, phishing, spam, and scams. For the data collection stage, the Twitter streaming application programming interface (API) was used and VirusTotal was used for labelling the dataset. A random forest classification model was used with a combination of features derived from a range of sources. The random forest model without any tuning and feature selection produced a recall value of 0.89. After further investigation and applying parameter tuning and feature selection methods, however, we were able to improve the classifier performance to 0.92 in recall.
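
A minimal sketch (not the authors' model) of a random forest over URL and account features evaluated by recall; the feature set and labels are assumptions:

```python
# Sketch only: random forest for malicious-URL post detection, scored by recall.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X = np.random.rand(5000, 12)        # e.g. URL length, domain age, account features (assumed)
y = np.random.randint(0, 2, 5000)   # 1 = malicious, e.g. as labelled via VirusTotal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
clf = RandomForestClassifier(n_estimators=300, random_state=7)
clf.fit(X_tr, y_tr)
print("recall:", recall_score(y_te, clf.predict(X_te)))
```
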

2018-03-19
Vougioukas, Michail, Androutsopoulos, Ion, Paliouras, Georgios.  2017.  A Personalized Global Filter To Predict Retweets. Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. :393–394.

Information shared on Twitter is ever increasing and users (recipients) are overwhelmed by the number of tweets they receive, many of which are of no interest. Filters that estimate the interest of each incoming post can alleviate this problem, for example by allowing users to sort incoming posts by predicted interest (e.g., "top stories" vs. "most recent" in Facebook). Global and personal filters have been used to detect interesting posts in social networks. Global filters are trained on large collections of posts and reactions to posts (e.g., retweets), aiming to predict how interesting a post is for a broad audience. In contrast, personal filters are trained on posts received by a particular user and the reactions of the particular user. Personal filters can provide recommendations tailored to a particular user's interests, which may not coincide with the interests of the majority of users that global filters are trained to predict. On the other hand, global filters are typically trained on much larger datasets compared to personal filters. Hence, global filters may work better in practice, especially with new users, for which personal filters may have very few training instances (the "cold start" problem). Following Uysal and Croft, we devised a hybrid approach that combines the strengths of both global and personal filters. As in global filters, we train a single system on a large, multi-user collection of tweets. Each tweet, however, is represented as a feature vector with a number of user-specific features.

2017-11-03
Preotiuc-Pietro, Daniel, Carpenter, Jordan, Giorgi, Salvatore, Ungar, Lyle.  2016.  Studying the Dark Triad of Personality Through Twitter Behavior. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :761–770.
Research into the darker traits of human nature is growing in interest especially in the context of increased social media usage. This allows users to express themselves to a wider online audience. We study the extent to which the standard model of dark personality – the dark triad – consisting of narcissism, psychopathy and Machiavellianism, is related to observable Twitter behavior such as platform usage, posted text and profile image choice. Our results show that we can map various behaviors to psychological theory and study new aspects related to social media usage. Finally, we build a machine learning algorithm that predicts the dark triad of personality in out-of-sample users with reliable accuracy.
2017-08-22
Bouchlaghem, Rihab, Elkhelifi, Aymen, Faiz, Rim.  2016.  A Machine Learning Approach For Classifying Sentiments in Arabic Tweets. Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics. :24:1–24:6.

Nowadays, sentiment analysis methods are becoming more and more popular, especially with the growing number of social media platform users. In this context, this paper presents a sentiment analysis approach which can faithfully translate the sentiment orientation of Arabic Twitter posts, based on a novel data representation and machine learning techniques. The proposed approach applies a wide range of features: lexical, surface-form, syntactic, etc. We also made use of lexicon features inferred from two Arabic sentiment word lexicons. To build our supervised sentiment analysis system, we use several standard classification methods (Support Vector Machines, K-Nearest Neighbour, Naïve Bayes, Decision Trees, Random Forest) known for their effectiveness on such classification problems. In our study, the Support Vector Machines classifier outperforms the other supervised algorithms in Arabic Twitter sentiment analysis. Via ablation experiments, we show the positive impact of lexicon-based features on providing higher prediction performance.

2017-08-02
Moran, Sean, McCreadie, Richard, Macdonald, Craig, Ounis, Iadh.  2016.  Enhancing First Story Detection Using Word Embeddings. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. :821–824.

In this paper we show how word embeddings can be used to increase the effectiveness of a state-of-the-art Locality Sensitive Hashing (LSH) based first story detection (FSD) system over a standard tweet corpus. Vocabulary mismatch, in which related tweets use different words, is a serious hindrance to the effectiveness of a modern FSD system. In this case, a tweet could be flagged as a first story even if a related tweet, which uses different but synonymous words, was already returned as a first story. In this work, we propose a novel approach to mitigate this problem of lexical variation, based on tweet expansion. In particular, we propose to expand tweets with semantically related paraphrases identified via automatically mined word embeddings over a background tweet corpus. Through experimentation on a large data stream comprised of 50 million tweets, we show that FSD effectiveness can be improved by 9.5% over a state-of-the-art FSD system.
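
A hedged sketch (not the paper's system) of expanding a tweet with semantically related terms mined from word embeddings over a background corpus; the embedding model, toy corpus and expansion size are assumptions:

```python
# Sketch only: word2vec-based tweet expansion prior to first story detection.
from gensim.models import Word2Vec

# Background corpus: a list of tokenized tweets (a tiny toy corpus here).
background = [["apple", "releases", "new", "iphone"],
              ["earthquake", "hits", "coastal", "city"]] * 500
w2v = Word2Vec(background, vector_size=100, window=5, min_count=1)

def expand_tweet(tokens, topn=2):
    expanded = list(tokens)
    for tok in tokens:
        if tok in w2v.wv:
            expanded += [w for w, _ in w2v.wv.most_similar(tok, topn=topn)]
    return expanded

print(expand_tweet(["earthquake", "hits", "city"]))
```
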

2017-03-07
Burnap, P., Javed, A., Rana, O. F., Awan, M. S..  2015.  Real-time classification of malicious URLs on Twitter using machine activity data. 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). :970–977.

Massive online social networks with hundreds of millions of active users are increasingly being used by cyber criminals to spread malicious software (malware) to exploit vulnerabilities on the machines of users for personal gain. Twitter is particularly susceptible to such activity as, with its 140 character limit, it is common for people to include URLs in their tweets to link to more detailed information, evidence, news reports and so on. URLs are often shortened so the endpoint is not obvious before a person clicks the link. Cyber criminals can exploit this to propagate malicious URLs on Twitter, for which the endpoint is a malicious server that performs unwanted actions on the person's machine. This is known as a drive-by-download. In this paper we develop a machine classification system to distinguish between malicious and benign URLs within seconds of the URL being clicked (i.e. `real-time'). We train the classifier using machine activity logs created while interacting with URLs extracted from Twitter data collected during a large global event - the Superbowl - and test it using data from another large sporting event - the Cricket World Cup. The results show that machine activity logs produce precision performances of up to 0.975 on training data from the first event and 0.747 on test data from the second event. Furthermore, we examine the properties of the learned model to explain the relationship between machine activity and malicious software behaviour, and build a learning curve for the classifier to illustrate that very small samples of training data can be used with only a small detriment to performance.