INTRODUCTION

Technological advances in telecommunications boosts the creation of new services attracting more users, including malicious users known as intruders. Intruders execute unauthorized activities using telecommunications services, causing millions in losses and damaging the prestige of the affected company. The data to be analyzed in intrusion detection scenarios have particular characteristics; for example: high volume of instances, high number of features, systems under attack (this causes the intruders to modify their attacks in order to avoid detection), multiclass categorization problem and data are generated at high speeds.

There are several techniques for intrusion detection ^{2}. Some of the commonly used techniques are based on rule evaluation, automatic rule generation (ARG), anomalies detection, social networks analysis, Bayesian networks, neural networks, or hybrid systems. Such techniques can be based on supervised approach ^{3} (requires a supervised training collection) or unsupervised approach ^{4} (there is no a priori knowledge). Our proposal is based on the supervised approach.

Early intrusion detection reduces damages to the affected companies. Rule evaluation is one of the commonly used techniques, since it can be applied in real time, achieving high effectiveness in already known intrusion behaviors.

However, the techniques based on rule evaluation present some problems. One of them is that the analyst must manually define the specific rules for each class, which is an extremely complex and time consuming task ^{19}. Another problem is that they fail to detect slight changes in data ^{1}, which could indicate the occurrence of criminal activity. In order to solve the above problems, several techniques based on ARG have been proposed.

In this paper, an automatic rule generation method for intrusion detection based on frequent subgraph mining is presented. Our approach consists of two algorithms: Rule Generation (RG) algorithm and Rule Filtering (RF) algorithm. Initially, the RG algorithm processes the frequent subgraphs extracted from the training set and generates rules from them. Then, the RF algorithm performs a filtering process by selecting the best rules. Finally, new instances are classified according to a set of filtered rules, using a novel evaluation strategy.

BACKGROUND

This section describes the basic concept of rule, as well as the related work. In addition, the method used to extract patterns using frequent subgraph mining is presented.

The rule definition

Let _{
F1, F2,...Fk
} be sets whose elements are known as features. The integer *n, k* >1, represents the number of features that characterize the data to be processed. Indeed, all rule discovery algorithms process sequences or stream of instances, where each instance _{
n = (f1, f2, fk)
} is a vector belonging to the universe _{
U = F1 x
} F_{2} x ... x F_{k}, where f_{i} ∈_{
Fi
} for every 1 ≤ *i* ≤ *k*.

Let π_{1}, π_{2}, ..., π_{k} be functions such that π: *U* →_{
Fi
} so that given an instance _{
n = (f1, f2,
} ...,f_{k}) ∈ *U* to π (*n*) = _{
fi,
} for each 1 ≤ i ≤ *k.* These functions in the literature are known as projection functions.

A condition in the universe *U* is a m: *U* →_{
Fi
} {0, 1}, defined from three elements π ⊕ and f_{i} where π is a projection function, ⊕ is a comparison operator valid for ƒ_{
i
} elements (if ƒ_{
i
} is a numerical set ⊕ ∈ {<, *≤, >, ≥, =* ...}), and _{
fi ∈ Fi.
} An instance *n* satisfies the condition *m* when the comparison between π (n) and _{
fi
} performed by ⊕ returns true, in that case *m* (n) = 1, otherwise *m* (n) = 0.

A formula in the universe *U* is also a function that takes values {0, 1} which is defined recursively based on the following principles:

- If *m* is a condition in *U* then *ρ = m* is a formula.

- If *g* is a formula in *U* then *ρ =* (g) is another formula, so that *ρ* (n) = *g* (n) for all *n ∈ U.*

- Given two formulas _{
U, g1
} , g_{2}, we have:

• _{
ρ = g1
} and g_{2}, is a formula, where *ρ* (n) _{
= g1
} (n) ∧ g_{2} (n) for all *n ∈ U.*

*• ρ* = g_{1} ∨ g_{2} is a formula where *ρ* (n) = g_{1} (n) ∨ g_{2} (n) for all *n ∈ U.*

A rule in the universe *U* x *C,* is represented by , where C = {c1, c2,...c1} is a set of classes with l being an integer l ≥ 2, a formula in *U* and is an equality condition defined on the corresponding component classes (C). For simplicity, a rule is usually represented as *(r,cS*, where *c¡.* The intuitive meaning of a rule is that the fulfillment of the premise in an instance implies that such instance belongs to the class c_{
i
} .

For example, let it assume that we have a rule *r* in the form: **
if
** (

_{ f1 }≥ 0.2 andf < 5.0) andf

_{3}=

*"http" then"attack";*where

*r*f ≥ 0.2 and f

_{1}≤ 5.0) and

_{ f3 "http", }as well as

*c*

_{1}=

*"attack".*Considering and c

_{1}, an instance

*n*= (0.3, "http",

*"UDP")*is labeled as

*"attack"*, since is fulfilled in

*n.*

Related work

The ARG techniques ^{5} are useful for dealing with various challenges in many application contexts, especially for intrusion detection or fraud detection. These techniques provide a better understanding to analysts, since the outputs are defined as expressions in an understandable language ^{5}, unlike other ones considered as black boxes. The ARG methods discussed in this paper can be clustered into four categories: greedy algorithms, conceptual clustering, technique to obtain fuzzy rules and decision trees.

The greedy algorithms (AQ21 ^{7}, X2R ^{8}, RIPPER ^{9}, PART ^{6}, AntMiner ^{10}) ^{5} perform an iterative process over the training collection. The first step consists in selecting a set of instances from the training collection to generate rules. After the rules are created, a filtering process is executed, and the filtered rules are added to a rule set. Subsequently, the selected instances above are removed from the training collection and the iterative process repeats until there are no more training instances. Finally, a filtering rule set is obtained.

In general, the rules, obtained from ARG techniques, generally define unambiguous boundaries for the features of the data to be analyzed. The unambiguous boundaries can often be fraudulent activities laying, but not exceeding, the threshold. The FURIA algorithm ^{11} presents an interesting proposal for dealing with this situation. This approach performs an adaptation of the RIPPER algorithm, allowing the computation of a first rule set. The obtained rules are transformed into fuzzy rules, by processing the conditions associated to numerical attributes.

There are several algorithms based on creating rules coming from decision trees, for example C4.5 ^{12}, C5.0 ^{13} and CART ^{14}. A rule is generated for each leaf, following the path from the root to leaf. The first step of these algorithms is to generate a decision tree. The criteria used to construct the decision tree can vary depending on the proposal ^{5}. A pruning process is applied to the generated a decision tree, and a collection of trees is obtained. According to the used method in the previous step, the collection may comprise one or more trees. Afterwards, an optimization procedure is performed in order to select the optimal tree from the collection. Finally, a rule set from the selected tree is obtained.

The method presented in ^{15} combines conceptual clustering and natural induction to discover rules. In this method, the data associated to taxes are clustered by a conceptual algorithm. Then, the fraudulent instances of each cluster are used as training collection to learn new rules. The output contains simple rules describing regular and fraudulent instances.

A framework based on frequent subgraph mining. Recently, a framework for intrusion detection based on frequent subgraph mining was reported in literature ^{18}. During the training stage, this framework processes the training collection through two modules to get frequent subgraphs (see Figure 1).

The preprocessing module is the first to be executed. This step is performed to reduce the dimensionality of the training collection and relabel the values of each selected feature. The relabeling algorithm assigns a unique numerical label for each non numerical feature value. Furthermore, it creates a range of numerical values which are represented by a unique numerical label. These labels are assigned to each numerical feature value. Then, the redundant instances are removed. This module allows obtaining a smaller and quality data collection to improve the effectiveness in the patterns extraction process.

The reduced collection is then processed by the graph mining module. In this module, each instance is represented as a star graph ^{18}. In a graphical representation (see Figure 2), a star graph has a central vertex representing the instance itself and a vertex for each corresponding feature _{
fi
} connected to the center. The edge label is the feature index *i,* and the vertex label is the value of the corresponding feature. Finally, frequent subgraphs (patterns) are obtained which will represent the pattern set P.

In this paper, we use the first two modules of the training stage in this framework as a basis to obtain a pattern set *P* for rule generation.

RULE GENERATOR

In this section, the proposed automatic rule generator and the rule evaluation strategy are introduced. The proposed method comprises of two algorithms: RG algorithm and RF algorithm. Both algorithms are discussed below as well the rule evaluation strategy.

RG algorithm

After obtaining a pattern set *P* using the framework described above, the RG algorithm is performed. This algorithm starts with an iterative process (see Figure 3, line 2). For each pattern *p* in pattern set *P,* the class *c* with higher occurrence of *p* is selected by the function "Search-Most-Covered-Class" (see Figure 3, line 3).

Then, the percentage of occurrences _{
cp
} of *p* in *c* is calculated as in the function "Calculate-Percent-Covered-Class" (see Figure 3, line 4); where the variable _{
oc
} represents the number of occurrences of *p* in *c*, and the variable _{
ot
} represents the total number of occurrences of *p* in all classes.

If the percentage of occurrences _{
cp
} is greater than the occurrence threshold _{
L1,
} then a new rule *r* is created by the function "Create-Rule" according to the pattern*p* and the class *c* (see Figure 3, line 6). L_{1} can take values within range [0, 1].

Here a pattern is represented as a rule. For example, for the pattern *p* shown in the Figure 4, assuming that the most covered class *c* was "DoS", the label "601" indicates the range [0.7, 1.0], and the label "271" represents the protocol "tcp", the generated rule *r* would be: *if* (≥ 0.7 _{
and f3
} ≤ 1.0) _{
and f4
} = "tcp" then *"DoS".*

The created rule *r* is added to the rule set *R*. Once all the iterations have finished, the rule set *R* containing all the generated rules is returned.

RF algorithm

The obtained rule set *R* is processed by the RF algorithm (see Figure 5). This algorithm requires the entire training collection _{
Dt
} from where the pattern set *P* was obtained. Furthermore, it takes a precision threshold L_{2} defined by an analyst, which can take values within range [0, 1]. We use the entire training collection, because the used framework to generate the patterns set *P,* reduces the training collection. Therefore, the fact of using the entire training collection provides more information to compute the precision value *r precision* of the rule r.

This algorithm starts with an iterative process (see Figure 5, lines 1-6). For each instance d, where _{
d ∈ Dt,
} each rule *r* is evaluated, where *r* ∈ *R* (see Figure 5, lines 2-5). In the Calculate-Precision function (see Figure 5, line 3), if *r* covers *d*, then *r.precision* value

The variable r.precision represents the number of instances correctly covered by r, and the variable r.total represents the total number of covered instances by r. After calculate r.procision value, the rule r with its new precision value and the new number of instances correctly covered is updated in the rule set R.

Once all the instances in Dt are processed, another iterative process starts in order to select the most representative rules (see Figure 5, lines 8-21). For each rule r, its precision value r.precision is compared against the precision threshold L2. If r.precision exceeds L2, then the function Exist (see Figure 5, line 10) search in the filtered rule set Rf, a rule r1 (where r1 ∈ Rf) that its premise is the same as that of . If there is not a rule *r*
_{1}, then *r* is added to R_{1}. But, if there exist a rule r_{1}, then a comparison is performed between precision values. In case of *r.procision* value is greater than _{
r1.procision,
} then r_{1} is removed from _{
Rf,
} and *r* is added to _{
Rf.
}

When the iterative process concludes, the next step is to generalize rules in _{
Rf
} using the function Generalize-Rules (see Figure 5, line 22). This function searches for rules that share at least one same conditions and define the same class. When it finds two rules with these characteristics a merger takes place.

Such merger is performed by keeping the identical conditions and logically adding those different. For example, assume we have two rules: the generated rule *r* from the presented pattern in Figure 4, and a rule r_{1}: _{
if
f3
} ≥ 0.7 _{
and
f3
} ≤ 1.0) **
and
** f

_{6}

**the result of the merger between**

*= "smtp"***"DoS";***them**r*and r

_{1}is:

**(f**

*if*_{3}≥ 0.7

_{ and f3 }≤ 1.0) and (

_{ f4 }=

_{ "tcp" or f6 }= "smtp")

*then "DoS".*This procedure not only allows us to obtain more general rules, but also reduce the final set of rules, providing more effectiveness in the rules evaluation process.

Using the Sort function (see Figure 5, line 23), the obtained rule set is sorted according to the precision of the rules. Thus, the rules with highest precision value will be the first in order. If there are two rules with the same precision value, then the one with greater number of instances correctly covered during the filtering process will go first in order. Finally, a filtered rule set _{
Rf
} with the most representative rules is returned.

Rule evaluation

Like others rule-based techniques ^{19}, our method is supplemented with a rule evaluation (RE) strategy. The goal of this strategy is to assign a label to a new instance. The RE algorithm starts with an iterative process (see Figure 6, lines 3-10). Each rule r, where *r* ∈ R_{f}, is evaluated over a new instance *d,* using Evaluate-Rule function (see Figure 6, line 4).

This function returns true if *r* covers *d.* When the variable *cover* take the value *true,* then the iterative process stops and the labe l *l* takes the value of the decision part of the rule . If no rule in *R* covers *d,* then the algorithm does not assign a specific class, and *l* takes the value "abstain". Finally, the label *l* that defines the new instance *d* is returned.

EXPERIMENT

In this section, the dataset used for evaluating the proposed method is analyzed and the experimental results achieved are presented.

KDD Cup '99 dataset

In our experiment we use the KDD '99 Dataset ^{16}, which provides connection records (instances) generated by a simulation of a military network. This dataset contains two labeled collections: training with 4, 898, 431 instances and testing with 311,029 instances. Each instance contains 41 features, which can be continuous or discrete, and is labeled as either normal or an attack, with exactly one specific type of attack.

The simulated attacks are clustered into the following four categories: surveillance and other probing (Probe), denial of service (DoS), unauthorized access to local root privileges (U2R) and unauthorized access from a remote machine (R2L). Therefore, unlike other datasets which have only one specific attack type (e.g. CAIDA "DDoS Attack 2007" dataset ^{17}), the KDD '99 dataset contains instances of four different attacks types.

Experimental results

Experiments were performed on a machine with a Quad Core processor at 2.5 GHz and 4 GB of RAM. The algorithms are implemented using Python and ANSI-C programming languages. To run the RG algorithm is required a pattern set *P* and a defined occurrence threshold *L*
_{1}. The pattern set *P* is obtained by the framework based on frequent subgraph mining above. In order to obtain a substantial number of rules to represent all existing classes, we seek a small value for occurrence threshold (in our case *L*
_{1} = 0.3). Once the input parameters are ready, we proceed to generate the rule set *R.* The total number of rules generated was |R| = 3081.

The obtained rule set *R* is processed by the RF algorithm. This step is based on preserving the rules that best describe the existing classes; to do this, a value close to 1 must be defined for the precision threshold (in our case L_{2} = 0.7). To perform the RF algorithm, the same training collection _{
Dt
} used by the framework to extract patterns most be provided. Then, the rule set *R* is significantly reduced to a filtered rule set _{
Rf,
} where |R_{f}| = 41.

With the filtered rule set, we proceed to evaluate the rules on the testing collection using RE algorithm. In order to compare our results, we process the testing collection with different classifiers using the reduced training collection (see Figure 1) to build the classification models. Some of the methods discussed in Related Work Section could not be included in the comparison because they are not publicly available.

In our experiments, we evaluate the performance of classifiers such as C5.0 and others from Weka platform ^{20}, which are: decision table/naive bayes hybrid classifier (DTNB), Id3, JRip (RIPPER), nearest-neighbor-like algorithm using non-nested generalized exemplars (NNge), PART, PRISM, Ripple-Down Rule learner (RIDOR) AND SimpleCart (CART). The achieved results

over the same KDD '99 test collection are shown in the Table 1 (our approach is represented as RE). Moreover, the reported results in ^{18} using J48graft, classification via Regression and SMO classifiers are included in the Table 1. The accuracy achieved for an attack category is computed as is shown in equation (1).

The variable _{
categoryi
} indicates the *i-th* attack category to be calculated it accuracy, *true_positive* represent the number of instances from _{
categoryi
} correctly classified during the test, and *total_category_instances* is the total number of instances labeled as _{
categoryi
} in the test collection. The accuracy for "normal" category is computed using equation (2).

The variable *true_negatine* represent the number of instances from "normal" category correctly classified during the test, and *total_normal_instances* is the total number of instances labeled as "normal" in the test collection.

From the obtained results, it can be said, that none of the filtered rules _{
Rf
} are representative of U2R and R2L attack categories. This is because, the precision value of such attacks representative rules is below the precision threshold *L*
_{2}. Moreover, our approach outperforms other classifiers in terms of accuracy in DoS attack category.

Since our approach is able to abstain, misclassifications tend to decrease (see Table 2). In Table 3, can be seen that the misclassification rate in our proposal is considerably lower than those reported by other classifiers. Another important aspect to note is the overall accuracy, given by equation (3).

In such equation, the variable *total_tp* represents the number of attack instances correctly classified during the test, and *total_test_instances* indicates the number of instances in the test collection. The achieved results shown in the Table 3, prove that our proposal with 86.76 % of overall accuracy and 1.37 % of misclassification rate in general terms outperforms other analyzed classifiers. Inclusive, if abstentions are considered as misclassification, even so, the misclassification rate with 13.24% would be lower than those reported by the analyzed classifiers.

The results achieved by our approach show that it can be applied in scenarios that require minimal misclassifications. Since it is a rule-based approach ^{19}, it can be incorporated as a module of a security system to identify online criminal activities in data streams. Also, the instances labeled as "abstain", can be used to generate patterns. Then, from these patterns new rules can be generated that define new classes, or existing classes that have changed their concept.

CONCLUSIONS

In this paper, a new method for generating rules from frequent subgraphs was presented. This proposal also includes an algorithm for filtering rules, allowing it to merge the generated rules, building generalized rules and reducing the resulting output. Our rule evaluation process included a strategy to abstain, which reduces the number of misclassification.

The experimental results show the effectiveness of our proposal for detecting intruders, improving the results obtained by other classifiers in terms of overall accuracy and misclassification rate. This represents an indicator to consider, for its possible application as part of an intrusion detection system.

As future work, we intend to extend our proposal to a dynamic method. This means that the method may be able to generate new sets of rules taking into account the previously generated rules.