Data items are grouped with reference to their similarity under the clustering process. Similarity measures are used to analyze the relationship between the transactions. Vector-based similarity models are suitable for low-dimensional data values. High-dimensional data values are clustered using subspace clustering methods. Feature selection involves identifying a subset of the most useful features that produces results compatible with the full feature set. A feature selection algorithm is constructed with efficiency and effectiveness factors in mind. Efficiency concerns the time required to find a subset of features, while effectiveness relates to the quality of the subset of features. High-dimensional data clustering and the feature selection process are carried out using the FAST clustering algorithm. The FAST algorithm is divided into two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature is selected from each cluster to form a subset of features. Features in different clusters are relatively independent. The clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. The Minimum Spanning Tree (MST) clustering method is adopted to ensure the efficiency of FAST. A feature subset selection algorithm is used to identify the features from the clusters. Transaction similarity analysis is carried out with different types of correlation measures in the feature selection process. Dynamic feature intervals can be used to distinguish features. A redundant feature filtering mechanism is used to filter similar features. A custom threshold is used to improve the cluster accuracy.
INTRODUCTION
The high dimensionality of data poses a challenge to learning tasks such as classification. In the presence of many
irrelevant features, classification algorithms tend to overfit training data. Many features can be removed without
performance deterioration. Feature selection is one effective means to remove irrelevant features. Optimal feature selection requires searching an exponentially large space, so researchers often resort to various approximations to determine relevant
features. However, a single feature can be considered irrelevant based on its correlation with the class; but when combined
with other features, it becomes very relevant. Unintentional removal of these features can result in the loss of useful
information and thus may cause poor classification performance. This is studied as attribute interaction. For example,
MONK1 is a data set involving feature interaction. There are six features in MONK1 and the target concept of MONK1 is:
(A1 = A2) or (A5 = 1). Here A1 and A2 are two interacting features. Considered individually, the correlation between A1 and
the class C is zero, measured by mutual information. Hence, A1 or A2 is irrelevant when each is individually evaluated.
However, if we combine A1 with A2, they are strongly relevant in defining the target concept. An intrinsic character of
feature interaction is its irreducibility, i.e., a feature could lose its relevance due to the absence of its interacting feature(s).
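This interaction effect can be reproduced directly by enumerating the MONK1 attribute space. The following is an illustrative sketch (not code from any benchmark loader), using the standard MONK attribute domains; the function name is ours.

```python
# A minimal sketch that reproduces the MONK1 interaction effect by
# exhaustively enumerating the attribute domains (illustrative only).
from itertools import product
from collections import Counter
from math import log2

# MONK1 attribute domains A1..A6 and the target concept (A1 == A2) or (A5 == 1)
domains = [(1, 2, 3), (1, 2, 3), (1, 2), (1, 2, 3), (1, 2, 3, 4), (1, 2)]
rows = list(product(*domains))
labels = [1 if r[0] == r[1] or r[4] == 1 else 0 for r in rows]

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from the empirical joint distribution."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

print(mutual_information([r[0] for r in rows], labels))           # ~0: A1 alone looks irrelevant
print(mutual_information([(r[0], r[1]) for r in rows], labels))   # > 0: A1 combined with A2 is relevant
```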
A clustering is essentially a set of partitions (clusters), usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:
• Hard clustering: each object either belongs to a cluster or does not
• Soft clustering: each object belongs to each cluster to a certain degree
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using
previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be
agglomerative or divisive. Agglomerative algorithms begin with each element as a separate cluster and merge them into
successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller
clusters. Two-way clustering, co-clustering or bi-clustering are the names for clusterings where not only the objects are
clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the row and columns are
clustered simultaneously. Another important distinction is whether the clustering uses symmetric or asymmetric distances.
A property of Euclidean space is that distances are symmetric. In other applications, this is not the case.
Graph-based clustering is also referred to as a hierarchical clustering scheme. Hierarchical clustering builds, or breaks up, a hierarchy of clusters. The traditional representation of this hierarchy is a tree (dendrogram), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. Cutting the tree at a given height gives a clustering at a selected precision. For example, given elements {a, b, c, d, e, f}, cutting after the second row of merges might yield the clusters {a} {b c} {d e} {f}, while cutting after the third row might yield {a} {b c} {d e f}, which is a coarser clustering with a smaller number of larger clusters.
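The same cut-height idea can be sketched with standard tools. The snippet below is a small illustrative example on synthetic two-dimensional points (SciPy assumed); the cut thresholds are arbitrary and not taken from the paper.

```python
# Illustrative sketch: cutting an agglomerative dendrogram at two heights
# to obtain finer or coarser clusterings of synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# three tight groups around x = 0, 1 and 2
points = np.vstack([rng.normal(loc=c, scale=0.02, size=(4, 2)) for c in (0.0, 1.0, 2.0)])

Z = linkage(points, method="average")              # agglomerative merge tree (dendrogram)
fine = fcluster(Z, t=0.2, criterion="distance")    # low cut: more, smaller clusters
coarse = fcluster(Z, t=1.2, criterion="distance")  # high cut: fewer, larger clusters
print(len(set(fine)), len(set(coarse)))            # typically 3 and 2 clusters for this data
```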
In the Machine Learning (ML) scientific community there is a need for rigorous and correct statistical analysis of published results, because developing or modifying algorithms is a relatively easy task. The main difficulty related to this necessity is understanding the statistics involved and knowing exactly which techniques can or cannot be applied in a given situation, that is, for a given type of results. In a paper published in JMLR, Demsar gives a group of useful guidelines for performing a correct analysis when comparing a set of classifiers over multiple data sets. Demsar recommends a set of non-parametric statistical techniques for comparing classifiers under these circumstances, given that the sample of results obtained by them does not fulfill the required conditions and is not large enough for a parametric statistical analysis. He analyzed the behavior of the proposed statistics on classification tasks and verified that they are more suitable than parametric techniques.
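As a rough illustration of the kind of non-parametric analysis Demsar recommends, the sketch below runs a Friedman test over per-data-set accuracies; all numbers are made up, and the Wilcoxon follow-up is only indicative of a post-hoc comparison, not a full analysis.

```python
# Hedged sketch: non-parametric comparison of classifiers over multiple data sets.
from scipy.stats import friedmanchisquare, wilcoxon

# accuracies of three classifiers on the same six data sets (hypothetical values)
clf_a = [0.81, 0.77, 0.90, 0.65, 0.88, 0.72]
clf_b = [0.79, 0.75, 0.91, 0.61, 0.85, 0.70]
clf_c = [0.84, 0.80, 0.93, 0.68, 0.90, 0.75]

stat, p = friedmanchisquare(clf_a, clf_b, clf_c)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# A pairwise post-hoc test (e.g., Wilcoxon signed-rank) would normally follow
# only when the Friedman test rejects the hypothesis of equal performance.
w_stat, w_p = wilcoxon(clf_a, clf_c)
print(f"Wilcoxon A vs C: statistic = {w_stat}, p = {w_p:.4f}")
```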
RELATED WORK
Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because 1) irrelevant features do not contribute to the predictive accuracy and 2) redundant features do not help in obtaining a better predictor, since they provide mostly information that is already present in other feature(s).
Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features [1], while others can eliminate the irrelevant features while also taking care of the redundant ones [2]. Our proposed FAST algorithm falls into the second group. Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, as two predictive but highly correlated features are both likely to be highly weighted. Relief-F extends Relief, enabling the method to work with noisy and incomplete data sets and to deal with multiclass problems, but it still cannot identify redundant features.
However, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms, and thus should be eliminated. FCBF [2] and CMIM [3] are examples that take redundant features into consideration. CFS is based on the hypothesis that a good feature subset is one that contains features highly correlated with the target, yet uncorrelated with each other. FCBF ([2], [4]) is a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. CMIM [3] iteratively picks features which maximize their mutual information with the class to predict, conditionally on the response of any feature already picked. Different from these algorithms, our proposed FAST algorithm employs a clustering-based method to choose features.
Recently, hierarchical clustering has been adopted in word selection in the context of text classification (e.g., [5]).
Distributional clustering has been used to cluster words into groups based either on their participation in particular
grammatical relations with other words by Pereira et al. or on the distribution of class labels associated with each word by
Baker and McCallum. As distributional clustering of words is agglomerative in nature, and results in suboptimal word clusters and high computational cost, Dhillon et al. [5] proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification. Butterworth et al. [8] proposed to cluster features using a special metric, the Barthelemy-Montjardet distance, and then to make use of the dendrogram of the resulting cluster hierarchy to choose the most relevant attributes. Unfortunately, the cluster evaluation measure based on the Barthelemy-Montjardet distance does not identify a feature subset that allows the classifiers to improve their original performance accuracy. Furthermore, even compared with other feature selection methods, the obtained accuracy is lower.
Hierarchical clustering also has been used to select features on spectral data. Van Dijck and Van Hulle [7]
proposed a hybrid filter/wrapper feature subset selection algorithm for regression. Krier et al. [6] presented a methodology
combining hierarchical constrained clustering of spectral variables and selection of clusters by mutual information. Their
feature clustering method is similar to that of Van Dijck and Van Hulle [7] except that the former forces every cluster to
contain consecutive features only. Both methods employed agglomerative hierarchical clustering to remove redundant
features.
Quite different from these hierarchical clustering-based algorithms, our proposed FAST algorithm uses a minimum spanning tree-based method to cluster features. Meanwhile, it does not assume that data points are grouped around centers or separated by a regular geometric curve. Moreover, our proposed FAST is not limited to specific types of data.
PROBLEM STATEMENT
The fast clustering-based feature selection algorithm (FAST) is used to cluster high-dimensional data and to carry out the feature selection process. The FAST algorithm is divided into two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature is selected from each cluster to form a subset of features. Features in different clusters are relatively independent. The clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. The Minimum Spanning Tree (MST) clustering method is adopted to ensure the efficiency of FAST. A feature subset selection algorithm is used to identify the features from the clusters. The following drawbacks are identified in the existing system:
• Correlation measures are not optimized
• Limited clustering accuracy
• Feature relevance is low
• The threshold is not optimized
MINIMUM SPANNING TREE CLUSTERING
The collection of bridges in a LAN can be considered a graph whose nodes are the bridges and whose edges are
the cables connecting the bridges. To break loops in the LAN while maintaining access to all LAN segments, the bridges
collectively compute a spanning tree. The spanning tree is not necessarily a minimum cost spanning tree. A network
administrator can reduce the cost of a spanning tree, if necessary, by altering some of the configuration parameters in such a
way as to affect the choice of the root of the spanning tree. The spanning tree that the bridges compute using the Spanning Tree Protocol can be determined by a well-defined set of rules.
The first algorithm for finding a minimum spanning tree was developed by Czech scientist Otakar Borůvka in
1926. Its purpose was an efficient electrical coverage of Moravia. There are now two algorithms commonly used, Prim's
algorithm and Kruskal's algorithm. All three are greedy algorithms that run in polynomial time, so the problem of finding
such trees is in P. Another greedy algorithm not as commonly used is the reverse-delete algorithm, which is the reverse of
Kruskal's algorithm. The fastest minimum spanning tree algorithm to date was developed by Bernard Chazelle and is based on Borůvka's. Its running time is O(e α(e, v)), where e is the number of edges, v is the number of vertices, and α is the classical functional inverse of the Ackermann function. The function α grows extremely slowly, so that for all practical purposes it may be considered a constant no greater than 4; thus Chazelle's algorithm takes very close to O(e) time.
More recently, research has focused on solving the minimum spanning tree problem in a highly parallelized manner. For example, the pragmatic 2003 paper "Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs" by David A. Bader and Guojing Cong demonstrates an algorithm that can compute MSTs five times faster on eight processors than an optimized sequential algorithm. Typically, parallel algorithms are based on Borůvka's algorithm; Prim's and especially Kruskal's algorithms do not scale as well to additional processors.
It has been shown by J. Michael Steele based on work by A. M. Frieze that given a complete graph on n vertices,
with edge weights chosen from a continuous random distribution f such that f'(0) > 0, as n approaches infinity the size of the
MST approaches ζ(3) / f'(0), where ζ is the Riemann zeta function. For uniform random weights the exact expected size of
the minimum spanning tree has been computed for small complete graphs.
CLUSTER BASED FEATURE SELECTION SCHEME
With the aim of choosing a subset of good features with respect to the target concepts, feature subset selection is
an effective way for reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result
comprehensibility. Many feature subset selection methods have been proposed and studied for machine learning
applications. They can be divided into four broad categories: the Embedded, Wrapper, Filter, and Hybrid approaches. The
embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning
algorithms, and therefore may be more efficient than the other three categories. Traditional machine learning algorithms
like decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, so the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational
complexity is large. The filter methods are independent of learning algorithms, with good generality. Their computational
complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods are a combination of
filter and wrapper methods using a filter method to reduce search space that will be considered by the subsequent wrapper.
They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular
learning algorithm with similar time complexity of the filter methods. The wrapper methods are computationally expensive
and tend to overfit on small training sets. The filter methods, in addition to their generality, are usually a good choice when
the number of features is very large. Thus, we will focus on the filter method in this paper.
With respect to the filter feature selection methods, the application of cluster analysis has been demonstrated to be
more effective than traditional feature selection algorithms. Pereira et al., Baker and McCallum, and Dhillon et al. employed
the distributional clustering of words to reduce the dimensionality of text data. In cluster analysis, graph-theoretic methods
have been well studied and used in many applications. Their results have, sometimes, the best agreement with human
performance. The general graph-theoretic clustering is simple: compute a neighborhood graph of instances, then delete any
edge in the graph that is much longer/shorter than its neighbors. The result is a forest and each tree in the forest represents a
cluster. In our study, we apply graph-theoretic clustering methods to features. In particular, we adopt the minimum
spanning tree (MST)-based clustering algorithms, because they do not assume that data points are grouped around centers
or separated by a regular geometric curve and have been widely used in practice.
Based on the MST method, we propose a Fast clustering bAsed feature Selection algoriThm (FAST). The FAST
algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering
methods. In the second step, the most representative feature that is strongly related to target classes is selected from each
cluster to form the final subset of features. Features in different clusters are relatively independent; the clustering-based strategy of FAST therefore has a high probability of producing a subset of useful and independent features. The proposed feature subset selection algorithm FAST was tested upon 35 publicly available image, microarray, and text data sets. The experimental results show that, compared with five other representative types of feature subset selection algorithms, the proposed algorithm not only reduces the number of features, but also improves the performance of four well-known types of classifiers.
FAST CLUSTERING AND FEATURE SELECTION PROCESS
The proposed FAST algorithm logically consists of three steps: 1) removing irrelevant features, 2) constructing an
MST from relative ones, and 3) partitioning the MST and selecting representative features. For a data set D with m features
F = {F1, F2, . . . , Fm} and class C, we compute the T-Relevance SU(Fi, C) value for each feature Fi (1 ≤ i ≤ m) in the first
step. The features whose SU(Fi, C) values are greater than a predefined threshold θ comprise the target-relevant feature subset F′ = {F′1, F′2, . . . , F′k} (k ≤ m).
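As an illustration of this first step, the sketch below computes symmetric uncertainty for discrete (or pre-discretized) features and keeps those above the threshold; the function names are ours, not from the paper, and continuous features would need discretization first.

```python
# Sketch of the irrelevant-feature-removal step, assuming discrete features.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(xs), entropy(ys)
    mi = h_x + h_y - entropy(list(zip(xs, ys)))   # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 0.0 if h_x + h_y == 0 else 2.0 * mi / (h_x + h_y)

def t_relevant_features(columns, target, threshold):
    """Indices of features whose T-Relevance SU(F_i, C) exceeds the threshold."""
    return [i for i, col in enumerate(columns)
            if symmetric_uncertainty(col, target) > threshold]
```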
In the second step, we first calculate the F-Correlation SU(F′i, F′j) value for each pair of features F′i and F′j (F′i, F′j ∈ F′ ∧ i ≠ j). Then, viewing the features F′i and F′j as vertices and SU(F′i, F′j) (i ≠ j) as the weight of the edge between vertices F′i and F′j, a weighted complete graph G = (V, E) is constructed, where V = {F′i | F′i ∈ F′ ∧ i ∈ [1, k]} and E = {(F′i, F′j) | F′i, F′j ∈ F′ ∧ i, j ∈ [1, k] ∧ i ≠ j}. As symmetric uncertainty is symmetric, the F-Correlation SU(F′i, F′j) is symmetric as well; thus, G is an undirected graph.
The complete graph G reflects the correlations among all the target-relevant features. Unfortunately, graph G has k vertices and k(k − 1)/2 edges. For high-dimensional data, it is heavily dense and the edges with different weights are strongly interweaved. Moreover, the decomposition of a complete graph is NP-hard. Thus, for graph G, we build an MST, which connects all vertices such that the sum of the weights of the edges is the minimum, using the well-known Prim algorithm. The weight of edge (F′i, F′j) is the F-Correlation SU(F′i, F′j).
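A minimal Prim sketch over the F-Correlation graph is shown below. It assumes the pairwise SU values have already been collected into a symmetric k × k matrix and is meant only to illustrate the construction, not to be an efficient implementation.

```python
# Small Prim sketch; `su` is a symmetric k x k matrix (list of lists)
# with su[i][j] = SU(F'_i, F'_j), treated as the weight of edge (i, j).
def prim_mst(su):
    k = len(su)
    in_tree = {0}
    edges = []                                    # MST edges as (i, j, weight)
    while len(in_tree) < k:
        i, j = min(((a, b) for a in in_tree for b in range(k) if b not in in_tree),
                   key=lambda e: su[e[0]][e[1]])  # cheapest edge leaving the tree
        edges.append((i, j, su[i][j]))
        in_tree.add(j)
    return edges
```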
After building the MST, in the third step, we first remove from the MST the edges {(F′i, F′j) | F′i, F′j ∈ F′ ∧ i, j ∈ [1, k] ∧ i ≠ j} whose weights are smaller than both of the T-Relevances SU(F′i, C) and SU(F′j, C). Each deletion results in two disconnected trees T1 and T2. Assuming the set of vertices in any one of the final trees to be V(T), we have the property that for each pair of vertices (F′i, F′j ∈ V(T)), SU(F′i, F′j) ≥ SU(F′i, C) ∨ SU(F′i, F′j) ≥ SU(F′j, C) always holds. From Definition 6 in the original FAST formulation, we know that this property guarantees that the features in V(T) are redundant.
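The edge-removal and tree-partition step can be sketched as follows, assuming the MST edges and per-feature T-Relevance values computed earlier; the union-find bookkeeping is our own illustrative choice for reading the resulting forest off as clusters.

```python
# Sketch of the partition step: drop an MST edge (i, j) whose weight is
# below both T-Relevances, then read each surviving tree off as one cluster.
def partition_mst(mst_edges, t_relevance, k):
    kept = [(i, j) for i, j, w in mst_edges
            if not (w < t_relevance[i] and w < t_relevance[j])]
    parent = list(range(k))                      # union-find over surviving edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x
    for i, j in kept:
        parent[find(i)] = find(j)
    clusters = {}
    for v in range(k):
        clusters.setdefault(find(v), []).append(v)
    return list(clusters.values())               # one vertex list per tree in the forest
```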
This can be illustrated by an example. Suppose the MST is generated from a complete graph G. In order to cluster the features, we first traverse all six edges, and then decide to remove the edge (F0, F4) because its weight SU(F0, F4) = 0.3 is smaller than both SU(F0, C) = 0.5 and SU(F4, C) = 0.7. This partitions the MST into two clusters denoted as V(T1) and V(T2). Each cluster is an MST as well. Take V(T1) as an example. We know that SU(F0, F1) > SU(F1, C), SU(F1, F2) > SU(F1, C) ∧ SU(F1, F2) > SU(F2, C), and SU(F1, F3) > SU(F1, C) ∧ SU(F1, F3) > SU(F3, C). We also observe that no edge exists between F0 and F2, F0 and F3, or F2 and F3. Considering that T1 is an MST, SU(F0, F2) is greater than SU(F0, F1) and SU(F1, F2), SU(F0, F3) is greater than SU(F0, F1) and SU(F1, F3), and SU(F2, F3) is greater than SU(F1, F2) and SU(F1, F3). Thus, SU(F0, F2) > SU(F0, C) ∧ SU(F0, F2) > SU(F2, C), SU(F0, F3) > SU(F0, C) ∧ SU(F0, F3) > SU(F3, C), and SU(F2, F3) > SU(F2, C) ∧ SU(F2, F3) > SU(F3, C) also hold. As the mutual information between any pair (Fi, Fj) (i, j = 0, 1, 2, 3 ∧ i ≠ j) of F0, F1, F2, and F3 is greater than the mutual information between class C and Fi or Fj, the features F0, F1, F2, and F3 are redundant.
After removing all the unnecessary edges, a forest Forest is obtained. Each tree Tj ∈ Forest represents a cluster that is denoted as V(Tj), which is the vertex set of Tj as well. As illustrated above, the features in each cluster are redundant, so for each cluster V(Tj) we choose a representative feature Fj^R whose T-Relevance SU(Fj^R, C) is the greatest. All Fj^R (j = 1 . . . |Forest|) comprise the final feature subset ∪ Fj^R.
The details of the FAST algorithm are shown in Algorithm 1.
Inputs: D(F1, F2, …, Fm, C) – the given data set; θ – the T-Relevance threshold.
Output: S – the selected feature subset
1. for i = 1 to m do
2.   T-Relevance = SU(Fi, C)
3.   if T-Relevance > θ then
4.     S = S ∪ {Fi};
5. G = NULL;
6. for each pair of features {F′i, F′j} ⊂ S do
7.   F-Correlation = SU(F′i, F′j)
8.   Add F′i and/or F′j to G with F-Correlation as the weight of the corresponding edge;
9. minSpanTree = Prim(G);
10. Forest = minSpanTree
11. for each edge Eij ∈ Forest do
12.   if SU(F′i, F′j) < SU(F′i, C) ∧ SU(F′i, F′j) < SU(F′j, C) then
13.     Forest = Forest − Eij
14. S = ∅
15. for each tree Ti ∈ Forest do
16.   Fj^R = argmax F′k ∈ Ti SU(F′k, C)
17.   S = S ∪ {Fj^R};
18. return S
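For readers who prefer an executable form, the sketch below composes the helper functions from the earlier sketches (symmetric_uncertainty, prim_mst, partition_mst) into an end-to-end version of Algorithm 1. It is illustrative only and assumes discrete-valued feature columns; it is not the authors' implementation.

```python
# Compact end-to-end sketch of Algorithm 1, reusing the helpers sketched above.
def fast_feature_selection(columns, target, theta):
    # Step 1: irrelevant feature removal (T-Relevance above theta)
    relevant = [i for i, col in enumerate(columns)
                if symmetric_uncertainty(col, target) > theta]
    t_rel = [symmetric_uncertainty(columns[i], target) for i in relevant]
    # Step 2: minimum spanning tree over the F-Correlation graph
    su = [[symmetric_uncertainty(columns[a], columns[b]) for b in relevant]
          for a in relevant]
    mst_edges = prim_mst(su)
    # Step 3: tree partition and representative feature selection
    clusters = partition_mst(mst_edges, t_rel, len(relevant))
    return [relevant[max(cluster, key=lambda v: t_rel[v])] for cluster in clusters]
```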
FEATURE SELECTION USING MST CLUSTERS
The feature selection process is improved with a set of correlation measures. Dynamic feature intervals can be used to distinguish features. A redundant feature filtering mechanism is used to filter similar features. A custom threshold is used to improve the cluster accuracy.
The graph-based clustering algorithm is designed around a Minimum Spanning Tree (MST). Correlation measures are optimized to improve the cluster results. The feature selection process is enhanced with dynamic threshold values. The system is designed with five major modules: data preprocessing, irrelevant-feature filtering, MST construction, the cluster process, and feature selection.
The data preprocessing module performs data cleaning with a missing-value assignment process. The irrelevant-feature filtering module filters irrelevant features with correlation analysis. The MST construction module constructs a Minimum Spanning Tree (MST) from the transactions. The cluster process module partitions the MST along cluster boundaries. The feature selection module fetches features from the cluster results.
Data Preprocess
Noisy data removal and missing data update operations are carried out under data preprocessing. Redundant data values are removed from the transactional data collection. An aggregation-based data substitution mechanism is used for the missing data update process. Dimensionality analysis is performed for high-dimensional data values.
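A hedged sketch of such a preprocessing step is given below, assuming the transactions are loaded into a pandas DataFrame; the file name and the choice of the column mean as the aggregate are illustrative only, not taken from the paper.

```python
# Illustrative preprocessing sketch: drop duplicate transactions and fill
# missing values with a column aggregate (here, the column mean).
import pandas as pd

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    frame = frame.drop_duplicates()                      # remove redundant records
    return frame.fillna(frame.mean(numeric_only=True))   # aggregation-based substitution

# clean = preprocess(pd.read_csv("transactions.csv"))    # hypothetical input file
```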
Irrelevant Filtering
The irrelevant filtering process is carried out to remove irrelevant features. Correlation measures are used in the relevancy analysis process. Relevancy is analyzed for all features. A threshold value is used to filter the feature values.
MST Construction
A graph-theoretic method is applied for the tree construction. The Minimum Spanning Tree (MST) is constructed with the neighborhood information. Edges that are much shorter or longer than their neighbors are removed. Partitioning the MST in this way produces a forest with a set of trees.
Cluster Process
Features are divided into clusters by using graph-theoretic clustering methods. The fast clustering algorithm is used for the data partitioning process. The Minimum Spanning Tree is used in the clustering process. Trees under the MST are separated into clusters using interval values.
Feature Selection
Feature selection is applied to obtain a subset of useful and independent features. Features in different clusters are relatively independent. Similar features are filtered from the cluster results using the redundant feature filtering method. The feature most strongly related to the target classes is selected from each cluster to form the final subset of features.
CONCLUSION
High-dimensional data values are grouped using the clustering technique. Feature selection methods are used to select key elements in the transactions. The FAST algorithm is used to select features from high-dimensional data values. Correlation measures are used to improve the feature selection process. The system achieves high feature selection quality, and the processing time of the feature selection scheme is low. The selected features can be applied in the classification process. Cluster accuracy is high in the correlation-measure-based feature selection process.
References
- G. Forman, “An Extensive Empirical Study of Feature Selection Metrics for Text Classification,” J. Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
- L. Yu and H. Liu, “Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution,” Proc. 20th Int’l Conf. Machine Learning, vol. 20, no. 2, pp. 856-863, 2003.
- F. Fleuret, “Fast Binary Feature Selection with Conditional Mutual Information,” J. Machine Learning Research, vol. 5, pp. 1531-1555, 2004.
- L. Yu and H. Liu, “Efficient Feature Selection via Analysis of Relevance and Redundancy,” J. Machine Learning Research, vol. 10, no. 5, pp. 1205-1224, 2004.
- I.S. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification,” J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
- C. Krier, D. Francois, F. Rossi, and M. Verleysen, “Feature Clustering and Mutual Information for the Selection of Variables in Spectral Data,” Proc. European Symp. Artificial Neural Networks Advances in Computational Intelligence and Learning, pp. 157-162, 2007.
- G. Van Dijck and M.M. Van Hulle, “Speeding Up the Wrapper Feature Subset Selection in Regression by Mutual Information Relevance and Redundancy Analysis,” Proc. Int’l Conf. Artificial Neural Networks, 2006.
- R. Butterworth, G. Piatetsky-Shapiro, and D.A. Simovici, “On Feature Selection through Clustering,” Proc. IEEE Fifth Int’l Conf. Data Mining, pp. 581-584, 2005.
- Qinbao Song, Jingjie Ni, and Guangtao Wang, “A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, January 2013.