Keywords

Data mining, clustering, kmeans, multiobjective, AMOSA, GenClustMOO. 
INTRODUCTION

Data mining [9] is the task of discovering interesting patterns from large amounts of data, can be stored in databases, data warehouses, or other related information repositories. Data mining is used to finding hidden information sources in a database. The overall goal of the data mining process is to extract information from a dataset and transform that process into an understandable structure for further use. Data mining is knowledge discovery from data extraction of interesting patterns of nontrivial, implicit, previously unknown value and potentially useful or knowledge from huge amount of data. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. Clustering is also called as data segmentation. In some applications clustering partitions large data sets into groups based on the similarity. Clustering is a form of learning by examples. Clustering allows us to run applications on several parallel servers (cluster nodes). Clustering is crucial for scalable enterprise applications can improve the performance by adding more nodes to the cluster. Clustering applications are used extensively in various fields such as AI, pattern recognition, economics, ecology, psychiatry and marketing etc.The kMeans is a simple learning algorithm for clustering analysis. 
Kmeans[14],[15] is one of the Partitioning based technique in which it classifies the data into k groups, which together satisfy the following two needs, each group must contain atleast one object, and each object must belong to exactly one group. Kmeans clustering is an effective algorithm to extract a given number of clusters of patterns from a training set, the cluster locations can be used to classify patterns into distinct classes. The kmeans clustering algorithm takes the input parameter as k and partitions a set of n objects into k clusters the resulting intra cluster similarity is high and the inter cluster similarity is low. Cluster similarity is measured using the mean value of the objects in a cluster called as the clusters centroid or center of gravity. In the kmeans algorithm there is also drawback that is impractical to expect a user to specify the number of clusters kvalue; it may find worse local optima. To overcome these drawbacks use the multiobjective clustering techniques. 
In this paper proposed the multiobjective clustering technique. The goal of Multi Objective clustering [5] is to find clusters in a data set by applying several clustering algorithms corresponding to different objective functions .This has satisfy the three objective functions as the partitioning based on the Euclidean distance, the total symmetry of the clusters, and the last one is the cluster connectedness and processed by using AMOSA and genetic algorithm. The Multi Objective Optimization (MOO) problem has different perspective compared with one single objective. In the singleobjective optimization there is only one global optimum, but in MOO there is a set of solutions, called the Pareto Optimal (PO) set, which are considered to be equally important which means local as well as global can be combined to form the best solution. The past work has with the number of Multi Objective Evolutionary Algorithms (MOEAs) have been evolved. The Evolutionary Algorithms (EAs) had came for solving multiobjective optimization in their populationbased technique ability to find multiple optimal solutions simultaneously. 
The Genetic Algorithm (GA) [6] is an adaptive heuristic search algorithm based on the evolutionary ideas of natural selection and genetics. They provide best optimal solutions for a given objective or fitness function. This can be formed by means of complex, large, and multimodal landscapes. In GA‘s [1] the parameters are encoded in the form of strings or chromosomes in the search spaces. A fitness function is associated with each string that called as the degree of the solution that encoded in it. GA has processes different operators or parameters like selection, crossover, and mutation are used over a number of generations for generating potentially better strings as center point. 
In Simulated Annealing (SA) [2] is popular search algorithm used for solving difficult optimization problem based on principles of statistical mechanics. A measure of the amount of domination between two solutions can be used to find by the SA. SA uses the principles of statistical mechanics based on the behavior of a large number of clusters at low temperature for finding minimum cost solutions to large optimization problems by minimizing the associated energy. In statistical mechanics investigating the ground states or lowenergy states of matter is fundamental importance. These states of clusters are achieved at very low temperatures only. AMOSA which incorporates a novel concept of amount of dominance in order to determine the acceptance of a new solution. The PO solutions are stored in an archive. In Paretodomination is based AMOSA developed to accept the certain condition of rules between the current solution and a new solution with the difference between the number of solutions that they can dominate. By these in this paper see about the GAMOO and AMOSA optimization algorithm to find the best center points based on their performance. 
AMOSA AND GENCLUSTMOO WITH KMEANS

In these there are two modules are used to get the automatic kvalue that is Clustering Using GenClustMOO and k Means, Clustering Using AMOSA and k Means. 
2.1. Clustering Using GenClustMOO and k Means

In GenClustMOO each cluster is divided into several small nonoverlapping hyper spherical shaped in subclusters. Then each cluster is represented by the centers of these individual subclusters. Taken as example a particular string encodes the centers of k number of clusters and each cluster is divided into C number of subclusters. If the data set is of dimension d, then the length of the string will be C × k × d. Initialization procedure is partly random and partly based on two different singleobjective algorithms in order to obtain a good as initial solutions. Then remaining solutions in the archive are initialized after running single linkage clustering algorithm for different values of k. Let for the ith chromosome we execute single linkage clustering algorithm with k = 3 then for each of these three clusters subcluster centers will be selected randomly from the points belonging to that particular cluster formed. These solutions works well for, when clusters present in the datasets are wellseparated. Another set of solutions in the archive are generated using the kmeans algorithm. 
Assignment of Points 
For the purpose of assignment, each subcluster is considered as a separate cluster. Let the particular string contain k number of clusters. Assignment of points is done based on the Euclidean distance condition. A data point xj is assigned to the (i, l) th sub cluster as in equation (1) C is equal to the total number of subcluster centers per cluster 

Symmetry Index 
This cluster validity index is based on a newly developed point symmetry based distance dps(x ?, c ?), which is to calculated. Let a point be x. The symmetrical or reflected point of x with respect to a particular center c is 2 ×c ? − x ?. The new cluster validity function Sym is defined as equation (2) 

Measuring the points by Validity Measures 
This [7] way of measuring the connectivity among a set of points using relative neighborhood graph. The distance between a pair of points is measured in the following way based on Euclidean distance based cluster validity index [19], Iindex third objective function is an Euclidean distance based cluster validity index, Iindex equation (3). 

Crossover Operation 
In crossover mechanism it mates with two parents to produce the new string ie.offspring in the crossover set the range 0.25 up to particular range it select the data points and form the center value from that it generates the number of clusters . 
Mutation Operation 
A new string is generated from the crossover mechanism of particular range of value as 0.25 will be selected for the next step as mutation in which it kept the range value as 0.05 .According to the mutation it changes the string one bit to zero bit and vice versa then once it satisfied the objective function it gives the center value then calculates the kmeans value. 
Selection of the Best Pareto optimal Solution 
The algorithms produce a large number of nondominated solutions based on the final Pareto optimal front in moo. Each solution provides a way of clustering the given data set. All the solutions are equally needed from the logical point of view. But the user needs only a single solution. As semisupervised method of selecting a single solution from the set of solutions is now developed. 
The class labels some of the points called as test patterns are understable to us. The proposed GenClustMOO produces a set of Pareto optimal solutions. The clustering associated with each solution from the final Pareto optimal set is used to assign the cluster labels of the test patterns based on the nearest center criterion. 
2.2. Clustering Using AMOSA and k –Means

An AMOSA is a Multi Objective version of SA; several concepts have been newly integrated. AMOSA generally uses the concept of an archive where the nondominated solutions are stored. There are two types of limits are kept on the archive size: a hard limit denoted by HL, and soft limit denoted by SL, where SL > HL. The nondominated solutions are usually get stored in the archive then it get generated. In the process, if some members of the archive get dominated by the new solutions, then these are removed. If at some point of time, the size of the archive exceeds a specified value, and then the clustering process is invoked. In AMOSA, the initial temperature is set to Tmax. Then, one of the points is randomly taken from the archive. This is called as the currentpt or the initial solution. The currentpt is disturbed to generate a new solution called as newpt and then for that objective functions are computed. The domination status of the newpt is checked with the currentpt and the generated solutions which are stored in the archive. A new solution forming the dominance quantity called the amount of domination, Δdom (a, b), between two solutions a and b is defined as follows equation (4) 
(4) 
where fi (a) and fi(b) are the ith objective values of the two solutions and Ri is the range of the objective function computed from the individuals in the population. M is the number of objectives. It should satisfied the following cases. 
Case 1: Newpt is either dominated by the currentpt or it is non dominating with respect to the currentpt, but some points in the archive dominate the newpt.The newpt is accepted as currentpt with a probability 
..................(5) 
In Equation (5) Δ[dom]avg denotes the average amount of domination of the newpt by (k + 1) points, namely, the currentpt and k points of the archive. Also, as k increases, Δ[dom]avg will increase since here the dominating points that are farther away from the newpt are contributing to its value. 
Case 2: Neither the currentpt nor the points in the archive dominate the newpt. This different in F represents the currentpt and E represents the newpt, G represents the currentpt and I represent the newpt, F represents the currentpt and I represent the newpt. for all these cases, accept the newpt as the currentpt. in the archive the any points are dominated by new points then remove them from it. Add newpt in the archive. If archive size crosses the SL, apply single linkage clustering to reduce its size to HL. 
Case 2: Neither the currentpt nor the points in the archive dominate the newpt. This different in F represents the currentpt and E represents the newpt, G represents the currentpt and I represent the newpt, F represents the currentpt and I represent the newpt. for all these cases, accept the newpt as the currentpt. in the archive the any points are dominated by new points then remove them from it. Add newpt in the archive. If archive size crosses the SL, apply single linkage clustering to reduce its size to HL. 

2.3. Algorithm

GeneticClustMOO Steps 
1. Start with the generation of set of an initial population of randomly selected solutions from the sub clusters to find the center of clusters, then go to step 2. 
2. Evaluate the fitness of all individuals with checking of validility index, symmetry then go to step 3. 
3. Repeatedly do the following: 
3.1. Select fitter individuals for reproduction 
3.2. Recombine between individuals 
3.3. Mutate individuals 
3.4. Evaluate the fitness of the modified individuals 
4. Satisfied the best solution as Pareto optimal for forming corresponding centers. 
5. Then apply kmeans and find the best center point for number of clusters to be formed. AMOSA Steps 
1. From Initial solutions different random center points generated go to step2. 
2. Apply the SA which should satisfy the dominance and non dominance of cases 
2.1 Case1: Newpt is either dominated by the currentpt or it is non dominating with respect to the currentpt. 
2.2 Case 2: Neither the currentpt nor the points in the archive dominate the newpt. 
2.3 Case 3: Newpt dominates the currentpt but k points in the archive dominate the newpt. 
3. Satisfied the best solutions as Pareto optimal for forming corresponding centers go to step 4. 
4. Then apply kmeans and find the best center point for no of clusters to be formed Kmeans Steps 
1. Arbitrarily choose k dataobjects from Pareto optimal solution (D) as initial centroids; 
2. Repeat the process until it get converge, 
2.1Assign each object in the sub cluster of group di to the cluster has the closest centroid. 
2.2Calculate new mean for each cluster; 
2.3 Until convergence criteria is met. 
EXPERIMENTAL RESULTS

3.1 .Datasets

There are 6 [4] real time datasets are used it can gives the performance analysis of existing system with the better results. 
Iris: This datasets consists of 150 data points distributed over 3 clusters. Each cluster has consists of 50 points. This dataset represents differently classify irises data sets can characterized by four feature values .It has three classes Setosa, Versicolor and Virginica. It has two classes Versicolor and Virginica have a large amount of overlap and the class Setosa is linearly separable. 
Glass: This is a glass identification data consisting of 214 instances having 9 features (an Id# feature has been removed). The study of the classification of the types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence, if it is correctly identified. There are 6 categories present in this data set. 
Wine: This is the Wine recognition data consisting of 178 instances having 13 features resulting from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. It has three classes class 1 59, class 2 71, class 3 48. 
Liver disorder: This is the Liver disorder data consisting of 345 instances having 6 features each. The data has two categories as the first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the liver disorder data file constitutes the record of a single gender individual, it appears that drinks>5 is some selector on this database. 
Vehicle: To classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. The number of attributes is 18 and number of classes is 4number of samples as examples is 946. 100 examples are being kept by Strathclyde for validation. So StatLog partners will receive 846 examples. 
Cancer: The Wisconsin Breast Cancer data set, it consists of 698 sample points. Each pattern has 9 features are corresponding to clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. There are two categories of the data: are malignant and benign. These two classes are known and linearly separable. 
Performance Measures

In order to evaluate the performance of all the optimization clustering algorithms with datasets .It is used a measure quantify the performance of a classification model. It taken as the class labels of each point are known. The measures are Precision, Recall, Accuracy and FMeasure. 
Precision

• P is called the proportion of the predicted positive cases .The fraction of a cluster that consists of objects of a specified class. Precision of cluster i with respect to class j is, where mij is the number of points which belong to cluster i and class j both, and mi is the total number of points in cluster i that were correct, are calculated using the equation (7). 
From these shows the performance of the precision for genetic with kmeans and AMOSA is shown in figure1and results are calculated and compared both the algorithm in table1 by which it shows the AMOSA is best. 

Recall

It is also called as sensitivity.The extent of which a cluster contains all objects of a specified class. The recall of cluster i with respect to class j is where mj is the number of objects in class j equation (8), 

From these shows the performance of the Recall for genetic with kmeans and AMOSA is shown in figure 2 and results are calculated and compared both the algorithm in table 2 by which it shows the AMOSA is best. 
FMeasure

The FMeasure computes some average of the information retrieval precision and recall metrics it can calculate the recall and precision of that cluster for each given class. A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class. The Fmeasure of cluster i with respect to class j is equation (9) 

Fmeasure (FM) is a measure of the quality of a solution given the true clustering. For Fmeasure, the optimum score i 1, with higher scores being “better”. 
From these shows the performance of the FMeasure for genetic with kmeans and AMOSA is shown in figure 3 and results are calculated and compared both the algorithm in table 3 by which it shows the AMOSA is best. 
Accuracy

It is the degree of closeness of measurements of a quantity to that quality actual (true) value.it is porporation of true results both true positives and true negative) in the population parameter test in equation (10). 
From these shows the performance of the Accuracy for genetic with kmeans and AMOSA is shown in figure 4 and results are calculated and compared both the algorithm in table 4 by which it shows the AMOSA is best. 

CONCLUSION

Thus using muti objective function it satified the automatic generation of number of clusters kvalue .It compare with the two modules shows genetic with kmeans is used as the global search and AMOSA used as the local search, when these are combined these improve the performance of the algorithm. It has generates the best solution based on Pareto optimal solution.Then it generates the automatic number of clusters kvalue using kmeans. By these it overcomes the drawback of kmeans algorithm and usage of single objective function because it uses the multiobjective optimization which uses Pareto optimal front as the best solution. Hence it satisfied by generation of new algorithm for automatic clustering in unlabeled data sets. 
But in genetic algorithm there is some drawback due to population diversity and convergence for improving these techniques can able to use other popular optimization techniques. 
ACKNOWLEDGMENT

I wish to express my heartfelt regards and sincere thanks to my Project Guide Mr. G Komarasamy, Assistant Professor (Sr. Grade), Dept of CSE who had galvanized throughout the execution of my project. I thankful to my family members and friends for their encouragement and the best support to me in all needs of my project completion. 
Tables at a glance





Table 1 
Table 2 
Table 3 
Table 4 

Figures at a glance





Figure 1 
Figure 2 
Figure 3 
Figure 4 

References

 S .Bandyopadhyay& U .Maulik, “Nonparametric genetic clustering: comparison of validity indices”, IEEE Transactions on Systems, Man andcybernetics, Part C 31 (1), vol. 31, no. 1, pp. 120–125, 2001.
 S .Bandyopadhyay, S. Saha , U .Maulik& K. Deb, “A simulated annealing based multiobjective optimization algorithm (AMOSA)”, IEEETransactions on Evolutionary Computation 12, pp. 269–283, 2008.
 Chun Sheng Li, “Cluster Center Initialization Method for Kmeans Algorithm Over Data Sets with Two Clusters” , proceedings of InternationalConference on Advances in Engineering Procedia Engineering 24 ,pp. 324 – 328, 2011.
 D.N.A., Asuncion, UCI Machine Learning Repository, http://www.ics.uci.edu/ mlearn/MLRepository.html, 2007.
 J.Handl&J.Knowles, “An evolutionary approach to multi objective clustering”, IEEE Transactions on Evolutionary Computation ,vol.11,no.1,pp.5676,2007.
 U.Maulik&S.Bandyopadhyay, “Genetic algorithmbased clustering technique”, Pattern Recognition 2, pp.1455–1465, 1999.
 S. Saha& S. Bandyopadhyay, “Application of a new symmetry based cluster validity index for satellite image segmentation”, IEEE Geosciencesand Remote Sensing Letters 5, pp.166–170, 2008.
 S. Saha&S.Bandyopadhyay, “A generalized automatic clustering algorithm in a multi objective framework”, Applied soft computing 13, pp.89108, 2013.
 Jiawei Han &MichelineKamber,“Data Mining: Concepts and Techniques” Second Edition by Elsevier Inc 2006.
 R.H. Eduardo, F.F.E. Nelson, “A genetic algorithm for cluster analysis, Intelligent Data Analysis”, 15–25, 2003.
 H.C. Chou, M.C. Su, E. Lai, “A new cluster validity measure and its application to image compression”, Pattern Analysis and Applications 7,205–220, July 2004.
 H. C Martin Law, P. Alexander Topchy& K .Anil. Jain,” Multiobjective Data Clustering”, IEEE Computer Society Conference on ComputerVision and Pattern Recognition, pp.17, 2004.
 UjjwalMaulik&, SanghamitraBandyopadhyay, “Genetic algorithmbased clustering technique”, Pattern Recognition 33 pp.14551465, 2000.
 A.M Fahim, A.M Salem , F.A Torkey& M.A Ramadan, “An efficient enhanced kmeans clustering algorithm”, Science A 7(10),pp.16261633,2006.
 JoydeepGhosh& Alexander Liu,“The kmeans Algorithm”, Bell Labs technical report, 1982.
 Paul S. Bradley, Usama M. Fayyad, “Refining Initial Points for KMeans Clustering”, 15th International Conference on Machine Learning(ICML98).
 KhaledAlsabti, Sanjay Ranka and Vineet Singh “An Efficient KMeans Clustering Algorithm”.ITL Hitachi America, Ltd.
 M.J.A Berry, G. Linoff, “Data Mining Techniques: For Marketing, Sales, and Customer Support.”, John Wiley & Sons, Berlin, 1997.
 U .Maulik, S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices.” IEEE Trans. Pattern Anal. Mach.Intel. (PAMI) 24 (12), pp.1650–1654, 2002.
 H Margaret. Dunham, Data Mining Introductory and Advanced Concepts, Pearson Education, 2006.
