ISSN ONLINE(2320-9801) PRINT (2320-9798)

All submissions of the EM system will be redirected to Online Manuscript Submission System. Authors are requested to submit articles directly to Online Manuscript Submission System of respective journal.

Network Intrusion Detection System Using Genetic Algorithm and Fuzzy Logic

Mostaque Md. Morshedur Hassan
Assistant Professor, Department of Computer Science and IT, Lalit Chandra Bharali College, Guwahati, India
Related article at Pubmed, Scholar Google

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

These days Intrusion Detection System (IDS) which is defined as a solution of system security is employed to identify the abnormal activities in a computer system or network. So far different approaches have been utilized in intrusion detections, but unluckily any of the systems is not entirely ideal. Hence, the hunt of improved method goes on. In this progression, here I have designed an Intrusion Detection System (IDS), by applying genetic algorithm (GA) and fuzzy logic to efficiently detect various types of the intrusive activities within a network. The proposed fuzzy logic-based system could be able to detect the intrusive activities of the computer networks as the rule base holds a better set of rules. The experiments and evaluations of the proposed intrusion detection system are performed with the KDD Cup 99 intrusion detection benchmark dataset. The experimental results clearly show that the proposed system achieved higher accuracy rate in identifying whether the records are normal or abnormal ones and obtained reasonable detection rate.

Keywords

Intrusion Detection System (IDS), Anomaly Based Intrusion Detection, Genetic Algorithm, Fuzzy Logic, KDD Cup 99 Dataset.

INTRODUCTION

Intrusion detection is designed to monitor the malicious activities ([1], [2], [3],[4]) occurring in a computer system or network inside or outside and analysing them for signs of possible incidents, which are violations or forthcoming threats of violation of computer security policies, acceptable utilized policies, or standard security practices. Intrusion incidents to computer systems are increasing because of the commercialization of the internet and local networks [5] and new automated hacking tools. Computer systems are turning out to be more and more susceptible to attack, due to its extended network connectivity.
Nowadays, networked computer systems play an increasingly important role in our society and its economy. They have become the targets of a wide array of malicious attacks that invariably turn into actual intrusions. This is the reason computer security has become an essential concern for network administrators. Too often, intrusions cause havoc inside LANs and the time and cost to repair the damage can grow to extreme proportions. Instead of using passive measures to fix and patch security holes once they have been exploited, it is more effective to adopt a proactive approach to intrusions.
Intrusion Detection Systems (IDS) are primarily focused on identifying possible incidents, logging information about them, attempting to stop them, and reporting them to security administrators [6] in real-time, or near real-time, and those that process audit data with some delay (non-real-time). The latter approach would in turn delay the time of detection. In addition, organizations use IDSs for other purposes, such as identifying problems with security policies, documenting existing threats, and deterring individuals from violating security policies. IDSs have become a necessary addition to the security infrastructure of nearly every organization.
IDSs typically record information related to observed events, notify security administrators of important observed events, and produce reports. Many IDSs can also respond to a detected threat by attempting to prevent it from succeeding. They use several response techniques, which involve the IDS stopping the attack itself, changing the security environment (e.g., reconfiguring a firewall), or changing the attack’s content. A typical Intrusion Detection System is shown in figure 1.
One of the major problems faced by IDS is huge number of false positive alerts, i.e. alerts that are mistakenly classified normal traffic as security violations. A perfect IDS does not generate false or irrelevant alarms. In practice, signature based IDS found to produce more false alarms than expected. This is because of the very general signatures and lack of built in verification tool to validate the success of the attack. The huge amount of false positives in the alert log makes the process of taking remedial action for the true positives, i.e. successful attacks, delayed and labour intensive.
My goal is to detect novel attacks by unauthorized users in network traffic. I consider an attack to be novel if the vulnerability is unknown to the target's owner or administrator, even if the attack is generally known and patches and detection tests are available. I am primarily interested in four types of remotely launched attacks: probes, denial of service (DOS), U2R and R2L. A DoS attack is a type of attack in which the hacker makes a computing or memory resources too busy or too full to serve legitimate networking requests and hence denying users access to a machine e.g. apache, smurf, neptune, ping of death, back, mail bomb, UDP storm etc. are all DoS attacks. A remote to user (U2R) attack is an attack in which a user sends packets to a machine over the internet, which he/she does not have access to in order to expose the machines vulnerabilities and exploit privileges which a local user would have on the computer e.g. xlock, guest, xnsnoop, phf, sendmail dictionary etc. A R2L attacks are exploitations in which the hacker starts off on the system with a normal user account and attempts to abuse vulnerabilities in the system in order to gain super user privileges e.g. perl, xterm. A probing is an attack in which the hacker scans a machine or a networking device in order to determine weaknesses or vulnerabilities that may later be exploited so as to compromise the system. This technique is commonly used in data mining e.g. saint, portsweep, mscan, nmap etc.

RELATED WORKS

The normal and abnormal behaviours [1] in networked computers are hard to predict, as the boundaries cannot be well defined. This prediction process usually generates fake alarms in many anomaly based intrusion detection systems.
With the introduction of fuzzy logic, the fake alarm rate in determining intrusive behaviour can be reduced, where a set of fuzzy rules is used to define the normal and abnormal behaviour [1] in a computer network. This system proposed a technique to generate fuzzy rules that are able to detect malicious activities and some specific intrusions.
This system presented an approach for the performance of generated fuzzy rules in classifying different types of intrusions.
In this system, I explained the attack modes and point to the impact of this attack and its threats. From an attacker’s perspective, I analyse each of the attack’s modes, benefits and suitable conditions and think how to improve the attack by introducing the concept of fuzzy logic-based technique.
Fuzzy set theory was introduced by Zadeh [10] in 1965 and it was specifically designed mathematically represent uncertainty and vagueness with formalized logical tools for dealing with the imprecision inherent in many real world problems.
Hassan [4], Baruah ([7], [8]), Neog and Sut [9] have forwarded an extended definition of fuzzy set which enables us to define the complement of a fuzzy set. Our proposed system agrees with them as this new definition satisfies all the properties regarding the complement of a fuzzy set.
Gong [2] presented an implementation of GA based approach to Network Intrusion Detection using GA and showed software implementation. The approach derived a set of classification rules and utilizes a support-confidence framework to judge fitness function.
Xia, Hariri and Yousif [3] used GA to detect anomalous network behaviours based on information theory ([17], [18]). Some network features can be identified with network attacks based on mutual information between network features and type of intrusions and then using these features a linear structure rule and also a GA is derived. The approach of using mutual information and resulting linear rule seems very effective because of the reduced complexity and higher detection rate. The only problem is that it considered only the discrete features.
Abdullah [6] showed a GA based performance evaluation algorithm to network intrusion detection. The approach uses information theory for filtering the traffic data.
Lu and Traore [12] used historical network dataset using GP to derive a set of classification [17]. They used support-confidence framework as the fitness function and accurately classified several network intrusions. But their use of genetic programming made the implementation procedure very difficult and also for training procedure more data and time is required.
Goyal and Kumar [13] described a GA based algorithm to classify all types of smurf attack using the training dataset with false positive rate is very low (at 0.2%) and detection rate is almost 100%.
Li [14] described a method using GA to detect anomalous network intrusion ([17], [18]). The approach includes both quantitative and categorical features of network data for deriving classification rules. However, the inclusion of quantitative feature can increase detection rate but no experimental results are available.

INTRUSION DETECTION OVERVIEW

The following sections give a short overview of networking attacks, classifications and various components of Intrusion Detection System.
A. NETWORKING ATTACKS
This section is an overview of the four major categories of networking attacks. Every attack on a network can comfortably be placed into one of these groups [15] –
• Denial of Service (DoS): A DoS attack is a type of attack in which the hacker makes a computing or memory resources too busy or too full to serve legitimate networking requests and hence denying users access to a machine e.g. apache, smurf, neptune, ping of death, back, mail bomb, UDP storm etc. are all DoS attacks.
• Remote to User Attacks (R2L): A remote to user attack is an attack in which a user sends packets to a machine over the internet, which he/she does not have access to in order to expose the machines vulnerabilities and exploit privileges which a local user would have on the computer e.g. xlock, guest, xnsnoop, phf, sendmail dictionary etc.
• User to Root Attacks (U2R): These attacks are exploitations in which the hacker starts off on the system with a normal user account and attempts to abuse vulnerabilities in the system in order to gain super user privileges e.g. perl, xterm.
• Probing: Probing is an attack in which the hacker scans a machine or a networking device in order to determine weaknesses or vulnerabilities that may later be exploited so as to compromise the system. This technique is commonly used in data mining e.g. saint, portsweep, mscan, nmap etc.
B. CLASSIFICATION OF INTRUSION DETECTION
Intrusions Detection can be classified into two main categories. They are as follow:
• Host Based Intrusion Detection: HIDSs evaluate information found on a single or multiple host systems, including contents of operating systems, system and application files ([11], [16]).
• Network Based Intrusion Detection: NIDSs evaluate information captured from network communications, analysing the stream of packets which travel across the network ([11], [16]).
C. COMPONENTS OF INTRUSION DETECTION SYSTEM
An intrusion detection system normally consists of three functional components [17]. The first component of an intrusion detection system, also known as the event generator, is a data source. Data sources can be categorized into four categories namely Host-based monitors, Network-based monitors, Application-based monitors and Target-based monitors.
The second component of an intrusion detection system is known as the analysis engine. This component takes information from the data source and examines the data for symptoms of attacks or other policy violations. The analysis engine can use one or both of the following analysis approaches:
• Misuse/Signature-Based Detection: This type of detection engine detects intrusions that follow Ill-known patterns of attacks (or signatures) that exploit known software vulnerabilities ([18], [19]). The main limitation of this approach is that it only looks for the known weaknesses and may not care about detecting unknown future intrusions [20].
• Anomaly/Statistical Detection: An anomaly based detection engine will search for something rare or unusual [20]. They analyses system event streams, using statistical techniques to find patterns of activity that appear to be abnormal. The primary disadvantages of this system are that they are highly expensive and they can recognize an intrusive behaviour as normal behaviour because of insufficient data
• The third component of an intrusion detection system is the response manager. In basic terms, the response manager will only act when inaccuracies (possible intrusion attacks) are found on the system, by informing someone or something in the form of a response.

GENETIC BASED IDS

A. GENETIC ALGORITHM OVERVIEW
A Genetic Algorithm (GA) is a programming technique that uses biological evolution as a problem solving strategy [21]. It is based on Darwinian’s principle of evolution and survival of fittest to optimize a population of candidate solutions towards a predefined fitness [14].
The proposed GA based intrusion detection system contains two modules where each works in a different stage. In the training stage, a set of classification rules are generated from network audit data using the GA in an offline environment. In the intrusion detection stage, the generated rules are used to classify incoming network connections in the real-time environment. Once the rules are generated, the intrusion detection system becomes simple, experienced and efficient one.
GA uses an evolution and natural selection that uses a chromosome-like data structure and evolve the chromosomes using selection, recombination and mutation operators [14]. The process usually begins with randomly generated population of chromosomes, which represent all possible solution of a problem that are considered candidate solutions. From each chromosome different positions are encoded as bits, characters or numbers. These positions could be referred to as genes. An evaluation function is used to calculate the decency of each chromosome according to the desired solution; this function is known as “Fitness Function”. During the process of evaluation “Crossover” is used to simulate natural reproduction and “Mutation” is used to mutation of species [14]. For survival and combination the selection of chromosomes is partial towards the fittest chromosomes.
When I use GA for solving various problems three factors will have vital impact on the effectiveness of the algorithm and also of the applications [2]. They are: i) the fitness function; ii) the representation of individuals; and iii) the GA parameters. The determination of these factors often depends on implementation of the system. In the following sections, I focus our discussions on deriving the set of rules using Genetic Algorithm.
B. FUZZY LOGIC
It has been shown by Baruah [7] that a fuzzy number [a, b, c] is defined with reference to a membership function μ(x) lying between 0 and 1, a ≤ x ≤ c. Further, he has extended this definition in the following way. Let μ1(x) and μ2(x) be two functions, 0 ≤ μ2(x) ≤ μ1(x) ≤ 1. He has concluded μ1(x) the fuzzy membership function, and μ2(x) a reference function, such that (μ1(x) – μ2(x)) is the fuzzy membership value for any x. Finally he has characterized such a fuzzy number by {x, μ1(x), μ2(x); x ∈ Ω}.
The complement of μx is always counted from the ground level in Zadehian’s theory [10], whereas it actually counted from the level if it is not as zero that is the surface value is not always zero. If other than zero, the problem arises and then we have to count the membership value from the surface for the complement of μx. Thus I could conclude the following statement –
Complement of μx = 1 for the entire level
Membership value for the complement of μx = 1- μx
My system forwarded a definition of complement of an extended fuzzy set where the fuzzy reference function is not always zero. The definition of complement of a fuzzy set proposed by Hassan [4], Baruah ([7], [8]), Neog and Sut [9] could be seen a particular case of what I am giving. I shall use Baruah’s definition of the complement of a normal fuzzy set in my article.
In the two classes’ classification problem, there are two classes where every object should be classified. These classes are called positive (abnormal) and negative (normal). The data set used by the learning algorithms consists of a set of objects, each object with n+1 attributes. The first n attributes define the object characteristics (monitored parameters) and the last attribute defines the class that the object belongs to the classification attribute.
A fuzzy classifier for solving the two class classification problem is a set of two rules, one for the normal class and other for the abnormal class, where the condition part is defined using only the monitored parameters and the conclusion part is an atomic expression for the classification attribute.
C. FLOWCHART
Figure 2 shows the operations of a general genetic algorithm according to which GA is implemented in our system.
algorithm
algorithm
E. FITNESS FUNCTION
The authors in ([1], [4]) used the fuzzy confusion matrix to calculate the fitness of a chromosome. In the fuzzy confusion matrix the fuzzy truth degree of the condition represented by the chromosome and the fuzzy negation operator are used directly. In our case, the fitness of a chromosome for the abnormal class is evaluated according to the following set of equations:
equation

IDS IMPLEMENTATION

To implement the algorithm and to evaluate the performance of the system, I have used the standard dataset used in KDD Cup 1999 “Computer network intrusion detection” competition.
A. KDD CUP SAMPLE DATASET
For the implementation of the proposed algorithm, I used the KDD 99 intrusion detection datasets which are based on the 1998 DARPA initiative, which provides designers of intrusion detection systems (IDS) with a benchmark on which to evaluate different methodologies ([22], [25]). Hence, a simulation is being made of a factitious military network with three ‘target’ machines running various operating systems and services. They also used three additional machines to spoof different IP addresses for generate network traffic.
A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows from a source IP address to a target IP address under some well defined protocol ([22], [23], [25]). It results in 41 features for each connection.
Finally, there is a sniffer that records all network traffic using the TCP dump format [25]. The total simulated period is seven weeks. Normal connections are created to profile that expected in a military network and attacks fall into one of four categories: User to Root; Remote to Local; Denial of Service; and Probe.
The KDD 99 intrusion detection benchmark consists of different components [24]:
kddcup.data; kddcup.data_10_percent; kddcup.newtestdata_10_percent_unlabeled;
kddcup.testdata.unlabeled; kddcup.testdata.unlabeled_10_percent; corrected.
I have used “kddcup.data_10_percent” as training dataset and “corrected” as testing dataset. In this case the training set consists of 494,021 records among which 97,280 are normal connection records, while the test set contains 311,029 records among which 60,593 are normal connection records. Table 1 shows the distribution of each intrusion type in the training and the test set.
B. IMPLEMENTATION PROCEDURE
In the calculation phase, I have made 23 groups of chromosomes according to training data. There were 23 (22+1) groups for each of attack and normal types presented in training data. Number of chromosomes in each group is variable and depends on the number of data and relationship among data in that group. Total number of chromosomes in all groups Ire tried to keep in reasonable level to optimize time consumption in testing phase.
In the testing/detection phase, for each test data, an initial population is made using the data and occurring mutation in different features. This population is compared with each chromosomes prepared in training phase. Portion of population, which are more loosely related with all training data than others, are removed. Crossover and mutation occurs in rest of the population which becomes the population of new generation. The process runs until the generation size comes down to 1 (one). The group of the chromosome which is closest relative of only surviving chromosome of test data is returned as the predicted type.
For the implementation, I have taken both continuous and discrete values from the extracted features of the datasets.

EXPERIMENT RESULTS AND ANALYSIS

A. TRAINING AND TESTING DATA
The KDD 99 intrusion detection datasets [24] is broadly used to evaluate IDSs. In this study, two subsets were extracted from the 1998 DARPA data and used as the training and testing datasets. Each record of the datasets consists of 9 network features and 1 manually assigned record type. Nine network features have been used in the GA [14], which are connection duration, protocol, flag, su_attempted, is_guest_login, same_srv_rate, dst_host_same_srv_rate, dst_host_srv_count, and count.
The record type indicates whether a record is a normal network connection or a particular network intrusion. Most network packets in the selected datasets are normal, and four kinds of network attacks are present: dos, probe, u2r, and r2l.
B. EXPERIMENTS
In the experiment, the system was trained with the training dataset, and the default fitness function and the GA parameters were used, i.e., w1=0.2, w2=0.5, w3=0.3, 10 genes of a chromosome, 2000 generations, 250 initial rules in the population, crossover rate of 0.5, two-point crossover, and mutation rate of 0.02. When the training process was finished, the top 15 best quality rules were taken as the final classification rules. The rules were then used to classify the training data and the testing data respectively.
The experimental results show that the proposed method resulted good detection rates when using the generated rules to classify the training data itself. The detection rates could be higher if the fitness function and the GA parameters were chosen more appropriately. The results are presented in Table 2.
For simplified evaluation of our system, besides the classical accuracy measure, I have used two standard metrics of detection rate and false positive rate developed for network intrusions derived in ([4], [26]). Table 3 shows these standard metrics.
Detection rate for each data type can be seen from figure 3.
equation

CONCLUSION AND FUTURE WORK

In this paper, a method of applying genetic algorithms with fuzzy logic is presented for network intrusion detection system to efficiently detect various types of network intrusions. To implement and measure the performance of the system I carried out a number of experiments using the standard KDD Cup 99 benchmark dataset and obtained reasonable detection rate. To measure the fitness of a chromosome I used the fuzzy confusion matrix where the fuzzy membership value and fuzzy membership function for the complement of a fuzzy set are two different concepts because the surface value is not always counted from the ground level. The proposed detection system can upload and update new rules to the systems as the new intrusions become known. Therefore, it is cost effective and adaptive. The method suffers from two aspects. Firstly, it generates false alarms which are very serious problem for IDS. Secondly, for high dimensional data, it is hard to generate rules that cover up all the attributes.

ACKNOWLEDGEMENT

I would like to extend our sincere thanks and gratefulness to H.K. Baruah, Professor, Department of Statistics, Gauhati University, Guwahati, India for his kind help and guidance in preparing this article.

Tables at a glance

Table icon Table icon Table icon
Table 1 Table 2 Table 3
 

Figures at a glance

Figure 1 Figure 2 Figure 3
Figure 1 Figure 2 Figure 3
 

References