Classify Cyber Crime Offenses Using Confusion Matrix

rishabhsharma
7 min read · Jun 5, 2021

Internet usage has grown rapidly, particularly over the last decade. However, as the Internet becomes a part of day-to-day activities, cybercrime is also on the rise, and the number of incidents keeps increasing. According to a 2020 Cybersecurity Ventures report, cybercrime will cost nearly $6 trillion per annum by 2021. Cybercriminals use networked computing devices as their primary means of reaching victims’ devices, and they profit in terms of finance, publicity and more by exploiting vulnerabilities in those systems. Evaluating cybercrime attacks and providing protective measures manually, using existing technical approaches and investigations, has often failed to control them.

Therefore, this study proposes a flexible computational tool that uses machine learning techniques to analyze state-wise cybercrime rates in a country and to classify cybercrimes. Security analytics, combined with data analytics approaches, helps us analyze and classify offenses from India-based integrated data that may be either structured or unstructured. The main strength of this work is its test results, in which the offenses are classified with 99 percent accuracy.

Proposed Methodology

At present, no generalized framework is available to categorize cybercrime offenses by extracting features from the cases. In the present work, data analysis and machine learning are incorporated to build a cybercrime detection and analytics system. The design and implementation of the proposed system use classification, clustering and supervised learning algorithms.

Proposed approach to analyze cybercrime incidents

Information Gathering

In the reconnaissance phase, the integrated data (structured and unstructured) are collected from Kaggle and CERT-In. The integrated data are stored as raw data in the database.

The table depicts sample data from the dataset considered in the proposed model.

Sample dataset of the proposed model

Preprocessing

Only the feature extraction process takes place in this phase. It converts high-dimensional data into low-dimensional data.

Prediction Analysis

In the prediction analysis step, the cybercrime data are analyzed and used to predict which crime occurs most in a particular year at a particular location. Such predictions can help reduce the incidence of cybercrime.
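As a rough illustration of the question this step answers, the sketch below counts offenses per (year, location) pair and reports the most frequent one. The toy records, the placeholder state names and the `most_common_crime` helper are all hypothetical; they are not taken from the dataset or the model used in this work, and plain frequency counting stands in for a trained model.

```python
from collections import Counter, defaultdict

# Hypothetical toy records of (year, location, cybercrime); not real data.
records = [
    (2015, "State A", "ID theft"),
    (2015, "State A", "ID theft"),
    (2015, "State A", "hacking"),
    (2016, "State B", "copyright attack"),
    (2016, "State B", "hacking"),
    (2016, "State B", "hacking"),
]

def most_common_crime(rows):
    """Return {(year, location): (crime, count)} for the most frequent crime."""
    counts = defaultdict(Counter)
    for year, location, crime in rows:
        counts[(year, location)][crime] += 1
    return {key: counter.most_common(1)[0] for key, counter in counts.items()}

summary = most_common_crime(records)
for (year, location), (crime, count) in sorted(summary.items()):
    print(year, location, crime, count)
```

A real pipeline would replace the frequency count with a classifier trained on the extracted features, but the grouping by year and location would look much the same.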

Results and Analysis

The proposed system is designed and developed using data from sources such as Kaggle and CERT-In. The dataset consists of more than 2000 records with eight attributes: incident, offender, victim, harm, year, location, age of the offender and cybercrime. Incidents that occurred in India during 2012–2017 were considered. These records are used to construct and test the proposed computational system.

The figure shows the overall occurrence of cybercrime incidents in India during the period considered. It demonstrates that identity (ID) theft occurs more often than the other two attacks, so countermeasures against ID theft deserve priority. Copyright attacks and hacking are also frequent.

Overall occurrence of cybercrime in India
Precision recall and f-1 score for the proposed model

The figure above shows the precision, recall and F1 score for our model. These are derived from the confusion matrix obtained for our model, from which we can also get the average accuracy rate for the predicted cybercrimes. Let’s now understand the confusion matrix.

What is a Confusion Matrix and why do we need it?

“Confusion Matrix is a performance measurement for machine learning classification where output can be two or more classes.”

For a binary classifier, it is a table with 4 different combinations of predicted and actual values.

It is extremely useful for measuring recall, precision, specificity and accuracy. Its true positive and false positive rates are also the quantities plotted on the ROC curve, which the AUC summarizes.

We can use the confusion matrix to calculate various metrics:

1. Accuracy: The values of the confusion matrix are used to calculate the accuracy of the model. It is the ratio of all correct predictions to all predictions (total values).

Accuracy = (TP + TN)/(TP + TN + FP + FN)

2. Precision: It is the ratio of correctly predicted positive samples to the total number of samples predicted as positive. A high precision score means that few of the model's positive predictions are wrong.

Precision = TP / (TP + FP)

3. Recall: It is the ratio of correctly predicted positive samples to all the samples that are actually positive. It is also termed the sensitivity of the model.

Recall = (True positives / all actual positives) = TP / (TP + FN)

4. F1 score: It is calculated as the harmonic mean of precision and recall, so it depends on true positives, false positives and false negatives, but not on true negatives. The F1 score is often preferred over accuracy as a measure of classifier performance, particularly when the classes are imbalanced.

F1 Score = 2 × (precision × recall) / (precision + recall)
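The four metrics can be computed directly from the four counts. The sketch below is a minimal illustration, not the code used for the proposed model: the `classification_metrics` helper and the example counts are hypothetical, and F1 is taken as the harmonic mean, 2 × precision × recall / (precision + recall).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, tn=50, fp=10, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 90/100, precision is 40/50 and recall is 40/40, which shows how a model can have perfect recall while its precision lags behind.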

The figure depicts the confusion matrix for our model when the training size was 0.8 and the test size was 0.2. From it, we know how many cases are classified correctly and how many are classified incorrectly. In other words, we can read off the true negatives, true positives, false negatives and false positives produced by the model.

Confusion matrix for the proposed model
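To make the reading of such a matrix concrete, here is a small sketch that builds a multi-class confusion matrix and derives the per-class counts from it. The labels reuse the three offenses named earlier, but the actual and predicted lists are hypothetical stand-ins for a test split, and the helper names are illustrative rather than part of the proposed system.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Build a matrix whose rows are actual classes and columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]

def one_vs_rest_counts(matrix, index):
    """Derive (TP, FP, FN, TN) for one class from a multi-class matrix."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                 # rest of the row: missed cases
    fp = sum(row[index] for row in matrix) - tp  # rest of the column: wrong hits
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

labels = ["ID theft", "copyright attack", "hacking"]
# Hypothetical test-set labels and model predictions.
actual = ["ID theft", "ID theft", "ID theft", "hacking", "hacking", "copyright attack"]
predicted = ["ID theft", "ID theft", "hacking", "hacking", "hacking", "copyright attack"]

matrix = confusion_matrix(actual, predicted, labels)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal entries
incorrect = len(actual) - correct
print(matrix, correct, incorrect)
```

The diagonal holds the correctly classified cases; every off-diagonal entry is a misclassification that counts as a false negative for its row class and a false positive for its column class.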

Anomaly-based intrusion detection systems (IDS) built on different machine learning algorithms are widely used. An IDS is usually evaluated by its ability to make accurate predictions of attacks. For a binary classifier IDS, four outcomes are possible: attacks correctly predicted as attacks (TP) or incorrectly predicted as normal (FN), and normal traffic correctly predicted as normal (TN) or incorrectly predicted as attack (FP).

The confusion matrix maintains information about actual and predicted classes. Intrusion detection systems mainly discriminate between two classes: the attack class (malicious, threatening or abnormal data) and the normal class (normal data points). Different machine learning algorithms may be used to build an anomaly-based IDS, which aims to profile normal behavior in order to discriminate between normal activity and intrusions.

Let’s understand the terms used here:

The four instances TP, TN, FP and FN are counted according to the relation between the predicted and actual classes. They are arranged in the 2x2 confusion matrix as shown in the table, which has two classes of connection points, “Normal” and “Attack”.

  • True Positive (TP) : attack detected when it is actually an attack.
  • True Negative (TN) : normal detected when it is actually normal.
  • False Positive (FP) (Type I Error): attack detected when it is actually normal (False Alarm).
  • False Negative (FN) (Type II Error): normal detected when it is actually an attack.
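The four definitions translate into a simple counting loop. The sketch below assumes string labels "attack" and "normal" and uses hypothetical example data; `ids_outcomes` is an illustrative helper, not part of any IDS library.

```python
def ids_outcomes(actual, predicted, positive="attack"):
    """Count TP, TN, FP and FN, treating `positive` as the attack class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted attack: correct if it really was one, otherwise FP.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted normal: an actual attack here is an FN.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical connection labels and predictions.
actual = ["attack", "attack", "normal", "normal", "normal", "attack"]
predicted = ["attack", "normal", "normal", "attack", "normal", "attack"]

outcomes = ids_outcomes(actual, predicted)
print(outcomes)
```

Here the second connection is an attack predicted as normal (FN) and the fourth is normal traffic predicted as an attack (FP).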

Type I Error : False Positive (FP)

Type I error: False Alarm (False Positive)

This type of error is not very dangerous, as our system is safe in reality but the model predicted an attack. The security team gets notified and checks for any malicious activity, which costs time but causes no direct harm.

Type II Error : False Negative (FN)

Type II error (False Negative)

This type of error is the most dangerous. In such cases our system predicts that we are safe and secure with no attack, but a cyber attack actually takes place. No notification reaches the security team, so nothing can be done to prevent it.

Conclusions and Future Scope

In the present world, cybercrime offenses are happening at an alarming rate. As the use of the Internet increases, many offenders use it as a means of communication in order to commit crimes. The framework developed in our work is essential to the creation of a model that can support analytics regarding the identification, detection and classification of integrated cybercrime offenses (structured and unstructured). The main focus of our work is to find the attacks that take advantage of security vulnerabilities and to analyze these attacks using machine learning techniques.

Thanks for reading !!!

⚡Keep Learning Keep Growing!
