A brief introduction to text mining tasks in the cybersecurity domain
The previous post gave an introductory view of how text mining can support cybersecurity activities and introduced some activities that can be automated through it. You can access the post here.
This second post summarizes two text mining tasks, text classification and information extraction, and illustrates how each one applies to the cybersecurity domain. Text mining comprises other tasks as well, and you can read more about each one in [1].
According to [2], text classification is the text mining task most frequently applied in the cybersecurity domain. Text classification is a supervised learning task: a model learns from labeled examples and then predicts the label of new text based on known features or behaviors. For example, a phishing detection solution can apply text classification to flag a message as malicious when it shares characteristics with known phishing messages.
Another example of text classification in the cybersecurity domain is Data Leakage Prevention (DLP). Before it can analyze an electronic document, a DLP solution needs to learn which information the organization considers confidential. Once the solution has learned the features of confidential documents, it examines new documents and assigns each one a label indicating whether it should be considered confidential. If the analysis establishes that a document is confidential, the solution triggers a specific action. You can find an example of an experiment using text classification to prevent data leakage in [3].
Despite the predominance of text classification in cybersecurity-related academic papers, implementing it in the real world (business solutions) is not simple. As mentioned above, text classification needs a training phase in which machine learning models, including neural networks, learn the features of the information. This phase demands a dataset containing many examples, each associated with a class label. For example, to train a DLP solution, a confidential or not-confidential label must be assigned to each document. For a phishing solution, the training phase demands many email messages, each labeled as phishing or not.
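To make the training phase concrete, here is a minimal sketch of a supervised text classifier for phishing detection. It assumes a multinomial Naive Bayes model with add-one smoothing, which is only one of many possible algorithms and is not taken from the papers cited in this post. The class name, the function names, and the six training messages are invented for illustration; a real solution would need a far larger labeled dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Training phase: count documents per class and words per class.
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Prediction phase: pick the class with the highest log-probability.
        tokens = tokenize(text)
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / self.total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                # Smoothed log likelihood of the token given the class.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled dataset: every example carries a class label.
TRAINING_MESSAGES = [
    ("Your account is suspended, verify your password now", "phishing"),
    ("Urgent: confirm your bank details to avoid account closure", "phishing"),
    ("Click this link to claim your prize and verify your identity", "phishing"),
    ("Meeting moved to 3pm, see the updated agenda attached", "legitimate"),
    ("Here are the quarterly report slides for review", "legitimate"),
    ("Lunch on Friday? Let me know what time works", "legitimate"),
]

texts = [text for text, _ in TRAINING_MESSAGES]
labels = [label for _, label in TRAINING_MESSAGES]
classifier = NaiveBayesTextClassifier().fit(texts, labels)

print(classifier.predict("Please verify your account password urgently"))
```

The same structure would apply to a DLP scenario by swapping the messages for documents and the labels for confidential and not-confidential. The sketch also hints at why labeling is costly: the model only knows the words it has seen in labeled examples.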
Each solution brings its own challenges tied to the cybersecurity domain, such as the evolution of phishing over time or the many different features that can mark a document as confidential. Beyond these domain challenges, others are related directly to text mining and machine learning. We will address them in future posts.
Another text mining task frequently applied in the cybersecurity domain is information extraction. This task comprises two sub-tasks: Named Entity Recognition (NER) and relation extraction. This post addresses the NER sub-task, which aims to identify specific entities in unstructured text. The traditional text mining literature uses general entities such as persons and locations as examples, but the same sub-task can recognize cybersecurity-related entities such as threat agents, threats, or organizations targeted by cyberattacks.
Its goal makes NER applicable to multiple cybersecurity activities. A classic example is threat intelligence, since the area needs to identify valuable information from numerous sources to produce intelligence. Using NER, an organization can identify the vulnerabilities, exploits, and types of attacks used by cybercriminals. Recognizing these entities makes it possible to produce intelligence that helps cybersecurity teams act proactively.
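As a rough illustration of entity recognition in a threat report, the sketch below uses regular expressions and a small name list instead of a trained model. This is a rule-based stand-in: production NER systems typically rely on statistical or neural sequence models trained on annotated text. The entity types, the threat-actor names, the CVE identifier, and the report sentence are all invented for this example; the IP address comes from a range reserved for documentation.

```python
import re

# Hypothetical list of threat-actor names, invented for illustration only.
KNOWN_THREAT_ACTORS = ["Crimson Falcon", "ShadowLynx"]

# One pattern per entity type. These rules are assumptions, not a standard.
ENTITY_PATTERNS = {
    "VULNERABILITY": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "THREAT_ACTOR": re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in KNOWN_THREAT_ACTORS) + r")\b"
    ),
}


def recognize_entities(text):
    """Return (entity_type, matched_text, start_offset) tuples sorted by position."""
    entities = []
    for entity_type, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((entity_type, match.group(), match.start()))
    return sorted(entities, key=lambda entity: entity[2])


# Made-up sentence in the style of a threat intelligence report.
report = (
    "Researchers attribute the campaign to ShadowLynx, which exploited "
    "CVE-2099-0001 and used 203.0.113.7 as a command-and-control server."
)

entities = recognize_entities(report)
for entity_type, value, offset in entities:
    print(f"{entity_type:14} {value} (offset {offset})")
```

Rules like these only find entities that follow a fixed format or appear in a known list. Recognizing a previously unseen threat agent or a freely described attack technique is where learned models, and the labeled data they require, come into play.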
This post introduced two text mining tasks and how they relate to cybersecurity activities. It also presented challenges in integrating the two domains. Discussing these challenges is important to make clear that text mining solutions can help cybersecurity teams in many ways, but they are not a silver bullet. We will address the challenges in depth in future posts; for now, we want you to know that they exist, and that it takes effort (and study!) to find models that produce favorable results to support cybersecurity activities.
REFERENCES:
[1] MINER, Gary; et al. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier, 2012.
[2] IGNACZAK, Luciano; et al. Text Mining in Cybersecurity: A Systematic Literature Review. ACM Computing Surveys. 2021.
[3] HUANG, J.-W.; CHIANG, C.-W.; CHANG, J.-W. Email security level classification of imbalanced data using artificial neural network: The real case in a world-leading enterprise. 2018.