Visible to the public Biblio

Filters: Keyword is NLP  [Clear All Filters]
2023-06-29
Abbas, Qamber, Zeshan, Muhammad Umar, Asif, Muhammad.  2022.  A CNN-RNN Based Fake News Detection Model Using Deep Learning. 2022 International Seminar on Computer Science and Engineering Technology (SCSET). :40–45.

False news has become widespread in the last decade in political, economic, and social dimensions. This has been aided by the deep entrenchment of social media networking in these dimensions. Facebook and Twitter have been known to influence the behavior of people significantly. People rely on news/information posted on their favorite social media sites to make purchase decisions. Also, news posted on mainstream and social media platforms has a significant impact on a particular country’s economic stability and social tranquility. Therefore, there is a need to develop a deceptive system that evaluates the news to avoid the repercussions resulting from the rapid dispersion of fake news on social media platforms and other online platforms. To achieve this, the proposed system uses the preprocessing stage results to assign specific vectors to words. Each vector assigned to a word represents an intrinsic characteristic of the word. The resulting word vectors are then applied to RNN models before proceeding to the LSTM model. The output of the LSTM is used to determine whether the news article/piece is fake or otherwise.

2023-04-14
Johri, Era, Dharod, Leesa, Joshi, Rasika, Kulkarni, Shreya, Kundle, Vaibhavi.  2022.  Video Captcha Proposition based on VQA, NLP, Deep Learning and Computer Vision. 2022 5th International Conference on Advances in Science and Technology (ICAST). :196–200.
Visual Question Answering or VQA is a technique used in diverse domains ranging from simple visual questions and answers on short videos to security. Here in this paper, we talk about the video captcha that will be deployed for user authentication. Randomly any short video of length 10 to 20 seconds will be displayed and automated questions and answers will be generated by the system using AI and ML. Automated Programs have maliciously affected gateways such as login, registering etc. Therefore, in today's environment it is necessary to deploy such security programs that can recognize the objects in a video and generate automated MCQs real time that can be of context like the object movements, color, background etc. The features in the video highlighted will be recorded for generating MCQs based on the short videos. These videos can be random in nature. They can be taken from any official websites or even from your own local computer with prior permission from the user. The format of the video must be kept as constant every time and must be cross checked before flashing it to the user. Once our system identifies the captcha and determines the authenticity of a user, the other website in which the user wants to login, can skip the step of captcha verification as it will be done by our system. A session will be maintained for the user, eliminating the hassle of authenticating themselves again and again for no reason. Once the video will be flashed for an IP address and if the answers marked by the user for the current video captcha are correct, we will add the information like the IP address, the video and the questions in our database to avoid repeating the same captcha for the same IP address. In this paper, we proposed the methodology of execution of the aforementioned and will discuss the benefits and limitations of video captcha along with the visual questions and answering.
2023-03-03
Tiwari, Aditya, Sengar, Neha, Yadav, Vrinda.  2022.  Next Word Prediction Using Deep Learning. 2022 IEEE Global Conference on Computing, Power and Communication Technologies (GlobConPT). :1–6.
Next Word Prediction involves guessing the next word which is most likely to come after the current word. The system suggests a few words. A user can choose a word according to their choice from a list of suggested word by system. It increases typing speed and reduces keystrokes of the user. It is also useful for disabled people to enter text slowly and for those who are not good with spellings. Previous studies focused on prediction of the next word in different languages. Some of them are Bangla, Assamese, Ukraine, Kurdish, English, and Hindi. According to Census 2011, 43.63% of the Indian population uses Hindi, the national language of India. In this work, deep learning techniques are proposed to predict the next word in Hindi language. The paper uses Long Short Term Memory and Bidirectional Long Short Term Memory as the base neural network architecture. The model proposed in this work outperformed the existing approaches and achieved the best accuracy among neural network based approaches on IITB English-Hindi parallel corpus.
2022-09-09
Raafat, Maryam A., El-Wakil, Rania Abdel-Fattah, Atia, Ayman.  2021.  Comparative study for Stylometric analysis techniques for authorship attribution. 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). :176—181.
A text is a meaningful source of information. Capturing the right patterns in written text gives metrics to measure and infer to what extent this text belongs or is relevant to a specific author. This research aims to introduce a new feature that goes more in deep in the language structure. The feature introduced is based on an attempt to differentiate stylistic changes among authors according to the different sentence structure each author uses. The study showed the effect of introducing this new feature to machine learning models to enhance their performance. It was found that the prediction of authors was enhanced by adding sentence structure as an additional feature as the f1\_scores increased by 0.3% and when normalizing the data and adding the feature it increased by 5%.
Teodorescu, Horia-Nicolai.  2021.  Applying Chemical Linguistics and Stylometry for Deriving an Author’s Scientific Profile. 2021 International Symposium on Signals, Circuits and Systems (ISSCS). :1—4.
The study exercises computational linguistics, specifically chemical linguistics methods for profiling an author. We analyze the vocabulary and the style of the titles of the most visible works of Cristofor I. Simionescu, an internationally well-known chemist, for detecting specific patterns of his research interests and methods. Somewhat surprisingly, while the tools used are elementary and there is only a small number of words in the analysis, some interesting details emerged about the work of the analyzed personality. Some of these aspects were confirmed by experts in the field. We believe this is the first study aiming to author profiling in chemical linguistics, moreover the first to question the usefulness of Google Scholar for author profiling.
2022-03-10
Pölöskei, István.  2021.  Continuous natural language processing pipeline strategy. 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI). :000221—000224.
Natural language processing (NLP) is a division of artificial intelligence. The constructed model's quality is entirely reliant on the training dataset's quality. A data streaming pipeline is an adhesive application, completing a managed connection from data sources to machine learning methods. The recommended NLP pipeline composition has well-defined procedures. The implemented message broker design is a usual apparatus for delivering events. It makes it achievable to construct a robust training dataset for machine learning use-case and serve the model's input. The reconstructed dataset is a valid input for the machine learning processes. Based on the data pipeline's product, the model recreation and redeployment can be scheduled automatically.
2022-02-24
Paudel, Upakar, Dolan, Andy, Majumdar, Suryadipta, Ray, Indrakshi.  2021.  Context-Aware IoT Device Functionality Extraction from Specifications for Ensuring Consumer Security. 2021 IEEE Conference on Communications and Network Security (CNS). :155–163.
Internet of Thing (IoT) devices are being widely used in smart homes and organizations. An IoT device has some intended purposes, but may also have hidden functionalities. Typically, the device is installed in a home or an organization and the network traffic associated with the device is captured and analyzed to infer high-level functionality to the extent possible. However, such analysis is dynamic in nature, and requires the installation of the device and access to network data which is often hard to get for privacy and confidentiality reasons. We propose an alternative static approach which can infer the functionality of a device from vendor materials using Natural Language Processing (NLP) techniques. Information about IoT device functionality can be used in various applications, one of which is ensuring security in a smart home. We demonstrate how security policies associated with device functionality in a smart home can be formally represented using the NIST Next Generation Access Control (NGAC) model and automatically analyzed using Alloy, which is a formal verification tool. This will provide assurance to the consumer that these devices will be compliant to the home or organizational policy even before they have been purchased.
2021-05-05
Poudyal, Subash, Dasgupta, Dipankar.  2020.  AI-Powered Ransomware Detection Framework. 2020 IEEE Symposium Series on Computational Intelligence (SSCI). :1154—1161.

Ransomware attacks are taking advantage of the ongoing pandemics and attacking the vulnerable systems in business, health sector, education, insurance, bank, and government sectors. Various approaches have been proposed to combat ransomware, but the dynamic nature of malware writers often bypasses the security checkpoints. There are commercial tools available in the market for ransomware analysis and detection, but their performance is questionable. This paper aims at proposing an AI-based ransomware detection framework and designing a detection tool (AIRaD) using a combination of both static and dynamic malware analysis techniques. Dynamic binary instrumentation is done using PIN tool, function call trace is analyzed leveraging Cuckoo sandbox and Ghidra. Features extracted at DLL, function call, and assembly level are processed with NLP, association rule mining techniques and fed to different machine learning classifiers. Support vector machine and Adaboost with J48 algorithms achieved the highest accuracy of 99.54% with 0.005 false-positive rates for a multi-level combined term frequency approach.

2021-04-27
Saganowski, S..  2020.  A Three-Stage Machine Learning Network Security Solution for Public Entities. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :1097–1104.
In the era of universal digitization, ensuring network and data security is extremely important. As a part of the Regional Center for Cybersecurity initiative, a three-stage machine learning network security solution is being developed and will be deployed in March 2021. The solution consists of prevention, monitoring, and curation stages. As prevention, we utilize Natural Language Processing to extract the security-related information from social media, news portals, and darknet. A deep learning architecture is used to monitor the network in real-time and detect any abnormal traffic. A combination of regular expressions, pattern recognition, and heuristics are applied to the abuse reports to automatically identify intrusions that passed other security solutions. The lessons learned from the ongoing development of the system, alongside the results, extensive analysis, and discussion is provided. Additionally, a cybersecurity-related corpus is described and published within this work.
2021-02-22
Si, Y., Zhou, W., Gai, J..  2020.  Research and Implementation of Data Extraction Method Based on NLP. 2020 IEEE 14th International Conference on Anti-counterfeiting, Security, and Identification (ASID). :11–15.
In order to accurately extract the data from unstructured Chinese text, this paper proposes a rule-based method based on natural language processing and regular expression. This method makes use of the language expression rules of the data in the text and other related knowledge to form the feature word lists and rule template to match the text. Experimental results show that the accuracy of the designed algorithm is 94.09%.
2020-12-28
Yang, H., Huang, L., Luo, C., Yu, Q..  2020.  Research on Intelligent Security Protection of Privacy Data in Government Cyberspace. 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :284—288.

Based on the analysis of the difficulties and pain points of privacy protection in the opening and sharing of government data, this paper proposes a new method for intelligent discovery and protection of structured and unstructured privacy data. Based on the improvement of the existing government data masking process, this method introduces the technologies of NLP and machine learning, studies the intelligent discovery of sensitive data, the automatic recommendation of masking algorithm and the full automatic execution following the improved masking process. In addition, the dynamic masking and static masking prototype with text and database as data source are designed and implemented with agent-based intelligent masking middleware. The results show that the recognition range and protection efficiency of government privacy data, especially government unstructured text have been significantly improved.

2020-08-28
Perry, Lior, Shapira, Bracha, Puzis, Rami.  2019.  NO-DOUBT: Attack Attribution Based On Threat Intelligence Reports. 2019 IEEE International Conference on Intelligence and Security Informatics (ISI). :80—85.

The task of attack attribution, i.e., identifying the entity responsible for an attack, is complicated and usually requires the involvement of an experienced security expert. Prior attempts to automate attack attribution apply various machine learning techniques on features extracted from the malware's code and behavior in order to identify other similar malware whose authors are known. However, the same malware can be reused by multiple actors, and the actor who performed an attack using a malware might differ from the malware's author. Moreover, information collected during an incident may contain many clues about the identity of the attacker in addition to the malware used. In this paper, we propose a method of attack attribution based on textual analysis of threat intelligence reports, using state of the art algorithms and models from the fields of machine learning and natural language processing (NLP). We have developed a new text representation algorithm which captures the context of the words and requires minimal feature engineering. Our approach relies on vector space representation of incident reports derived from a small collection of labeled reports and a large corpus of general security literature. Both datasets have been made available to the research community. Experimental results show that the proposed representation can attribute attacks more accurately than the baselines' representations. In addition, we show how the proposed approach can be used to identify novel previously unseen threat actors and identify similarities between known threat actors.

2020-01-27
Teodorescu, Horia-Nicolai, Bolea, Speranta Cecilia.  2019.  Text Sectioning Based on Stylometric Distances. 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). :1–6.
This article continues the stylometric study started in a previous one; we focus on stylometric distances between text segments and the use of these distances in text sectioning based on maximizing the distances between text parts. We refine the method previously introduced and improve on the results. Applications include the automation of stylistic analysis of texts, with implication on text summarization, historical analysis, and authorship analysis.
2019-03-04
Husari, G., Niu, X., Chu, B., Al-Shaer, E..  2018.  Using Entropy and Mutual Information to Extract Threat Actions from Cyber Threat Intelligence. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :1–6.
With the rapid growth of the cyber attacks, cyber threat intelligence (CTI) sharing becomes essential for providing advance threat notice and enabling timely response to cyber attacks. Our goal in this paper is to develop an approach to extract low-level cyber threat actions from publicly available CTI sources in an automated manner to enable timely defense decision making. Specifically, we innovatively and successfully used the metrics of entropy and mutual information from Information Theory to analyze the text in the cybersecurity domain. Combined with some basic NLP techniques, our framework, called ActionMiner has achieved higher precision and recall than the state-of-the-art Stanford typed dependency parser, which usually works well in general English but not cybersecurity texts.
2019-02-22
Petrík, Juraj, Chudá, Daniela.  2018.  Source Code Authorship Approaches Natural Language Processing. Proceedings of the 19th International Conference on Computer Systems and Technologies. :58-61.

This paper proposed method for source code authorship attribution using modern natural language processing methods. Our method based on text embedding with convolutional recurrent neural network reaches 94.5% accuracy within 500 authors in one dataset, which outperformed many state of the art models for authorship attribution. Our approach is dealing with source code as with natural language texts, so it is potentially programming language independent with more potential of future improving.

2019-01-21
Ayoade, G., Chandra, S., Khan, L., Hamlen, K., Thuraisingham, B..  2018.  Automated Threat Report Classification over Multi-Source Data. 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC). :236–245.

With an increase in targeted attacks such as advanced persistent threats (APTs), enterprise system defenders require comprehensive frameworks that allow them to collaborate and evaluate their defense systems against such attacks. MITRE has developed a framework which includes a database of different kill-chains, tactics, techniques, and procedures that attackers employ to perform these attacks. In this work, we leverage natural language processing techniques to extract attacker actions from threat report documents generated by different organizations and automatically classify them into standardized tactics and techniques, while providing relevant mitigation advisories for each attack. A naïve method to achieve this is by training a machine learning model to predict labels that associate the reports with relevant categories. In practice, however, sufficient labeled data for model training is not always readily available, so that training and test data come from different sources, resulting in bias. A naïve model would typically underperform in such a situation. We address this major challenge by incorporating an importance weighting scheme called bias correction that efficiently utilizes available labeled data, given threat reports, whose categories are to be automatically predicted. We empirically evaluated our approach on 18,257 real-world threat reports generated between year 2000 and 2018 from various computer security organizations to demonstrate its superiority by comparing its performance with an existing approach.

2018-01-10
Buber, E., Dırı, B., Sahingoz, O. K..  2017.  Detecting phishing attacks from URL by using NLP techniques. 2017 International Conference on Computer Science and Engineering (UBMK). :337–342.

Nowadays, cyber attacks affect many institutions and individuals, and they result in a serious financial loss for them. Phishing Attack is one of the most common types of cyber attacks which is aimed at exploiting people's weaknesses to obtain confidential information about them. This type of cyber attack threats almost all internet users and institutions. To reduce the financial loss caused by this type of attacks, there is a need for awareness of the users as well as applications with the ability to detect them. In the last quarter of 2016, Turkey appears to be second behind China with an impact rate of approximately 43% in the Phishing Attack Analysis report between 45 countries. In this study, firstly, the characteristics of this type of attack are explained, and then a machine learning based system is proposed to detect them. In the proposed system, some features were extracted by using Natural Language Processing (NLP) techniques. The system was implemented by examining URLs used in Phishing Attacks before opening them with using some extracted features. Many tests have been applied to the created system, and it is seen that the best algorithm among the tested ones is the Random Forest algorithm with a success rate of 89.9%.

Zheng, Y., Shi, Y., Guo, K., Li, W., Zhu, L..  2017.  Enhanced word embedding with multiple prototypes. 2017 4th International Conference on Industrial Economics System and Industrial Security Engineering (IEIS). :1–5.

Word representation is one of the basic word repressentation methods in natural language processing, which mapped a word into a dense real-valued vector space based on a hypothesis: words with similar context have similar meanings. Models like NNLM, C&W, CBOW, Skip-gram have been designed for word embeddings learning, and get widely used in many NLP tasks. However, these models assume that one word had only one semantics meaning which is contrary to the real language rules. In this paper we pro-pose a new word unit with multiple meanings and an algorithm to distinguish them by it's context. This new unit can be embedded in most language models and get series of efficient representations by learning variable embeddings. We evaluate a new model MCBOW that integrate CBOW with our word unit on word similarity evaluation task and some downstream experiments, the result indicated our new model can learn different meanings of a word and get a better result on some other tasks.

2017-12-12
Ktob, A., Li, Z..  2017.  The Arabic Knowledge Graph: Opportunities and Challenges. 2017 IEEE 11th International Conference on Semantic Computing (ICSC). :48–52.

Semantic Web has brought forth the idea of computing with knowledge, hence, attributing the ability of thinking to machines. Knowledge Graphs represent a major advancement in the construction of the Web of Data where machines are context-aware when answering users' queries. The English Knowledge Graph was a milestone realized by Google in 2012. Even though it is a useful source of information for English users and applications, it does not offer much for the Arabic users and applications. In this paper, we investigated the different challenges and opportunities prone to the life-cycle of the construction of the Arabic Knowledge Graph (AKG) while following some best practices and techniques. Additionally, this work suggests some potential solutions to these challenges. The proprietary factor of data creates a major problem in the way of harvesting this latter. Moreover, when the Arabic data is openly available, it is generally in an unstructured form which requires further processing. The complexity of the Arabic language itself creates a further problem for any automatic or semi-automatic extraction processes. Therefore, the usage of NLP techniques is a feasible solution. Some preliminary results are presented later in this paper. The AKG has very promising outcomes for the Semantic Web in general and the Arabic community in particular. The goal of the Arabic Knowledge Graph is mainly the integration of the different isolated datasets available on the Web. Later, it can be used in both the academic (by providing a large dataset for many different research fields and enhance discovery) and commercial sectors (by improving search engines, providing metadata, interlinking businesses).

2017-05-30
Jadhao, Ankita R., Agrawal, Avinash J..  2016.  A Digital Forensics Investigation Model for Social Networking Site. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies. :130:1–130:4.

Social Networking is fundamentally shifting the way we communicate, sharing idea and form opinions. All people try to use social media for there need, people from every age group are involved in social media site or e-commerce site. Nowadays almost every illegal activity is happened using the social network and instant messages. It means that present system is not capable to found all suspicious words. In this paper, we provided a brief description of problem and review on the different framework developed so far. Propose a better system which can be indentify criminal activity through social networking more efficiently. Use Ontology Based Information Extraction (OBIE) technique to identify domain of word and Association Rule mining to generate rules. Heuristic method checks in user database for malicious users according to predefine elements and Naïve Bayes method is use to identify the context behind the message or post. The experimental result is used for further action on victim by cyber crime department.