ISSN ONLINE (2319-8753) PRINT (2347-6710)
J.K. Jothi Kalpana^{1} and K. Venkatalakshmi^{2}

Visit for more related articles at International Journal of Innovative Research in Science, Engineering and Technology
Educational data mining is concerned with predicting school failure at different levels, such as primary, secondary and higher education. This paper analyzes students' performance across different categories of measurement and categorizes the academic performance of college students in Villupuram district using three families of clustering methods: centroid-based, distribution-based and density-based clustering. A cluster is a group whose members lie at small distances from one another. The multilevel optimization of student performance is formulated using clustering. In centroid-based clustering, clusters are represented by a central vector. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem. The clustering model most closely related to statistics is based on distribution models. The experiments attempt to improve accuracy by using a Gaussian mixture model: the data set is modeled with a fixed number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit the data set better. Density-based clustering is related to linkage-based clustering: varying the range parameter ε produces a hierarchical result related to that of linkage clustering. Clustering can be applied to a wide range of classification tasks and applications. The k-means algorithm categorizes the large data set. This analysis uses a genetically improved particle swarm optimization (GAIPSO) algorithm to model the students' level. The GAIPSO algorithm searches the solution space to find the optimal result, and the k-means algorithm is used for refinement.
Keywords 
Centroid based, Distribution based, Density based Cluster, K-means algorithm, Gaussian distribution. 
INTRODUCTION 
In recent years there has been increasing research interest in using data mining in education. This emerging field, called Educational Data Mining (EDM), is concerned with developing methods that extract knowledge from data that come from educational contexts. The data can be collected from historical and operational data residing in the databases of educational institutes. The student data can be academic. It can also be collected from e-learning systems, which hold a large amount of information used by most institutes. 
Earlier research in Educational Data Mining has predicted school failure at different educational levels, such as primary, secondary and higher level [2]. This way of discovering knowledge in databases is called Educational Data Mining (EDM). Its analyses use many approaches and techniques, such as decision trees, rule induction, neural networks, k-nearest neighbor and Naïve Bayesian classifiers. 
The main objective of higher education institutes is to provide quality education to their students and to improve the quality of managerial decisions. One way to achieve the highest level of quality in a higher education system is to discover knowledge from educational data in order to study the main attributes that may affect students' performance [5]. The discovered knowledge can be used to offer helpful and constructive recommendations to academic planners in higher education institutes: to enhance their decision-making process, to improve students' academic performance and trim down the failure rate, to better understand students' behavior, to assist instructors, to improve teaching, and many other benefits [2]. 
Improved educational data mining uses many techniques, such as centroid-based, distribution-based and density-based clustering. A cluster is a group whose members lie at small distances from one another. Using these techniques, many kinds of knowledge can be discovered, for example with K-means and Gaussian models. 
This paper investigates the improved educational domain of data mining by analyzing graduate students' data collected from the College of Engineering and Technology, Villupuram. The data cover a five-year period [2008-2013]. The paper shows what kind of data can be collected, how the data can be preprocessed, how data mining methods can be applied to the data, and finally how we can benefit from the discovered knowledge. Many kinds of knowledge can be discovered from the data. In this work we investigate the most common methods, which are centroid-based, distribution-based and density-based clustering. The Matlab software is used for applying the methods to the engineering students' data set. 
Through this discovered knowledge, we aim to provide the college management with helpful and constructive recommendations to overcome the problem of low grades among graduate students and to improve students' academic performance. 
Official approval was obtained from the College of Engineering and Technology, Villupuram, to access the related databases for the sole purpose of analysis and knowledge discovery. To achieve this result, all data, including individual persons' data, are extracted from the database before applying the data mining methods. 
The rest of this paper is organized as follows: Section 2 presents related works in educational data mining. Section 3 describes the data set and the preparation and processing methods performed. Section 4 reports our experiments on applying data mining techniques to the educational data. Finally, we conclude this paper with a summary and an outlook for future work in the last section. 
ASSOCIATED WORKS 
Although the use of data mining in higher education is a recent research field, there are many works in this area, because of its potential for educational institutes. Educational Data Mining (EDM) is a promising area of research and has specific requirements not present in other domains. Thus, work should be oriented towards the improved educational domain of data mining. The existing system predicts the academic failure of students at different levels, such as primary, secondary and higher. These levels are analyzed in four stages: gathering students' data, preprocessing, data mining, and interpretation [2]. For example, in the existing system the numerical values of the scores obtained by students in each subject were changed to categorical values in the following way: 
Educational data mining is used to analyze students' learning behavior. The goal of one such study is to show how data mining can be used in higher education to improve students' performance. The author used students' data from a database course and collected all available data, including students' academic records and course records. Improved data mining (IDM) techniques were then applied to discover many kinds of knowledge, using centroid-based, distribution-based and density-based clustering. The students were also clustered into groups using centroids, and similarities in the data were detected. Finally, the study shows how we can benefit from the discovered knowledge to improve student performance. 
Data mining techniques, particularly classification, help to improve the quality of the higher educational system by evaluating student data to study the main attributes that may affect student performance in courses. The classification rules extracted from the different data are studied and evaluated. They allow the final grade of a student in a course under study to be predicted. 
Another study uses classification as a data mining technique to evaluate students' performance, using the K-Means method for classification. The goal of that study is to extract knowledge that describes students' performance in the end-of-semester examination. The authors used students' data from the students' previous database, including attendance, class test (Prior Class Assessment Test and Class Assessment Test), seminar and assignment marks [2]. This study helps in identifying, at an earlier stage, dropouts and students who need special attention, and allows the teacher to provide appropriate advising. 
Another mining analysis is based on students' failed courses and identifies students' failure patterns. The goal of that study is to identify hidden relationships between the failed courses and to suggest relevant causes of failure in order to improve the performance of low-capacity students. The extracted density rules reveal some hidden patterns in students' failed courses, which could serve as a foundation for academic planners in making academic decisions and as an aid in restructuring and modifying the curriculum, with a view to improving students' performance and reducing the failure rate [5]. 
The K-Means clustering algorithm has also been used as a data mining technique to predict students' learning activities from a students' database including class quizzes, mid-term and final exams, and assignments. This correlated information is conveyed to the class teacher before the final exam is conducted. The study helps teachers to identify the details of the targeted students' performance, to reduce the failing ratio by taking appropriate steps at the right time, and to improve the performance of students. 
THE ENGINEERING STUDENTS PREPROCESSING 
The CGPA (Cumulative Grade Point Average) attribute in the data set contains a large number of continuous values. For efficient later processing, and to simplify the description and understanding of the data mining results, we converted this attribute into a categorical one. For example, we grouped all GPAs into five categorical segments: Excellent, Very good, Good, Average and Poor. 
After applying the preprocessing and preparation methods, we analyze the data visually to determine the distribution of values, specifically the grades of students. Figure 1 depicts the distribution of graduate students in the period from 2008 to 2013 according to their grades. It is apparent from the figure that Average students represent about 54% of the data set. 
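The grouping described above can be sketched in a few lines of Python. The cut points and the 10-point CGPA scale are assumptions made only for illustration, since the paper does not state the boundaries it used; `CUT_POINTS` and `cgpa_to_grade` are hypothetical names.

```python
from bisect import bisect_right

# Hypothetical cut points on an assumed 10-point CGPA scale.
# The paper does not give the boundaries between the five segments.
CUT_POINTS = [5.0, 6.0, 7.5, 9.0]
LABELS = ["Poor", "Average", "Good", "Very good", "Excellent"]

def cgpa_to_grade(cgpa: float) -> str:
    """Map a continuous CGPA onto one of five ordered categories."""
    # bisect_right counts how many cut points are <= cgpa,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(CUT_POINTS, cgpa)]

grades = [cgpa_to_grade(x) for x in [4.2, 5.0, 6.8, 8.1, 9.4]]
```

A value that falls exactly on a cut point is placed in the higher category here; the opposite convention would use `bisect_left`.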
APPLICATION OF DATA MINING TECHNIQUES TO ENGINEERING STUDENTS DATASET: RESULTS AND DISCUSSION 
Before applying the data mining techniques to the data set, there should be a methodology that governs our work. Figure 2 depicts the work methodology used in this paper, which is based on the proposed framework. The methodology starts from the problem definition and then the preprocessing, which are discussed in the introduction and in the data set and preprocessing sections. We then come to the data mining methods, which are centroid-based, distribution-based and density-based clustering. Finally, the knowledge representation step processes the streaming of data and handles large data sets. 
In this section, we describe the results of applying the data mining techniques to the data of our case study for each of the four data mining tasks (centroid-based, distributed, density-based and knowledge representation), and how we can benefit from the discovered knowledge. 
become clear when examples of K-Means, K-Harmonic Means and EM are given in later sections. 
We write the center-based algorithm, which minimizes the value of the performance function over 
B. Distributed Data Clustering 
Data clustering is the task of partitioning a multivariate data set into groups, maximizing intragroup similarity and intergroup dissimilarity. In a distributed environment, it is usually required that data objects are not transmitted between sites, for efficiency and security reasons. One approach to clustering exploits the local maxima of a density estimate (DE) to search for connected regions that are populated by similar data objects [8]. A scheme for distributed clustering based on DE has been proposed. Every participating site computes a DE based on its local data only. Then, every site applies information-theoretic regular multidimensional sampling to generate a finite, discrete and approximate representation of the DE, consisting of its values at a finite number of equidistantly spaced locations. The samples computed by all sites are transmitted and summed (by location) outside the originating site, e.g., at a distinguished helper site. The resulting list of samples, which is an approximate representation of the true global DE, is transmitted to each participating site. 
Every site executes a density-based clustering algorithm to cluster its local data with respect to the global DE, the values of which can be computed from the samples by means of a sampling series. Notice that a DE is not a band-limited function; therefore sampling produces aliasing errors, which increase as the number of samples decreases. 
We propose to implement the approach by a society of agents. For example, in a real scenario all participating agents belong to different competing organizations, which agree to cooperate in order to achieve some common goal without disclosing the contents of their data banks to each other. Each agent will negotiate with other agents to evaluate the advantages and risks that derive from participating in the distributed mining task. In particular, considerable security risks arise from the potential ability of the other agents to carry out inference attacks on density estimates. The resulting disclosure of sensitive information could be exploited as a competitive advantage by the organizations that own the malicious agents. Other aspects that an agent has to evaluate in order to decide autonomously whether it should participate include, but are not limited to, a probabilistic model of the trustworthiness of participating agents, the relation between trustworthiness and the topology of participating agents, and the probability of incurring coalition attacks. 
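A minimal Python sketch of this scheme follows, under several simplifying assumptions that are not taken from the paper: the data are one-dimensional, the DE is an unnormalised Gaussian kernel estimate, the grid and bandwidth are agreed on in advance, and each site looks up the global DE at the nearest grid location instead of reconstructing it with a sampling series. All function names are hypothetical.

```python
import math

def local_density_samples(points, grid, bandwidth):
    """One site's step: sample its local (unnormalised) kernel DE on the shared grid."""
    return [sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
            for g in grid]

def sum_samples(per_site_samples):
    """Helper-site step: add the sites' samples location by location."""
    return [sum(values) for values in zip(*per_site_samples)]

def climb_to_mode(x, grid, density):
    """Label x with the grid index of the local maximum reached by hill climbing."""
    i = min(range(len(grid)), key=lambda j: abs(grid[j] - x))  # nearest grid location
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(grid)]
        best = max(neighbours, key=lambda j: density[j])
        if density[best] <= density[i]:
            return i  # no higher neighbour: i is a local maximum
        i = best

grid = [k * 0.5 for k in range(21)]  # shared locations 0.0, 0.5, ..., 10.0
site_a = [1.0, 1.2, 0.8, 8.9]        # raw data never leaves its site
site_b = [9.0, 9.3, 1.1]

global_de = sum_samples([local_density_samples(s, grid, 0.5) for s in (site_a, site_b)])
labels_a = [climb_to_mode(x, grid, global_de) for x in site_a]
labels_b = [climb_to_mode(x, grid, global_de) for x in site_b]
```

Only the sampled density values are exchanged, so points from both sites that sit under the same density peak receive the same label even though neither site sees the other's objects.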
C. Density Based Cluster 
Distributed Clustering assumes that the objects to be clustered reside on different sites. Instead of transmitting all objects to a central site (also denoted as server) where we can apply standard clustering algorithms to analyze the data, the data are clustered independently on the different local sites (also denoted as clients). In a subsequent step, the central site tries to establish a global clustering based on the local models, i.e. the representatives [4]. This is a very difficult step as there might exist dependencies between objects located on different sites which are not taken into consideration by the creation of the local models. In contrast to a central clustering of the complete dataset, the central clustering of the local models can be carried out much faster. 
Distributed Clustering is carried out on two different levels, i.e. the local level and the global level. On the local level, all sites carry out a clustering independently of each other. After the clustering has been completed, a local model is determined which should reflect an optimum trade-off between complexity and accuracy. Our proposed local models consist of a set of representatives for each locally found cluster. Each representative is a concrete object from the objects stored on the local site. Furthermore, we augment each representative with a suitable ε-range value. Thus, a representative is a good approximation for all objects residing on the corresponding local site which are contained in the ε-range around this representative. Next, the local model is transferred to a central site, where the local models are merged in order to form a global model. 
The global model is created by analyzing the local representatives. This analysis is similar to a new clustering of the representatives with suitable global clustering parameters. A global cluster identifier is assigned to each local representative. The resulting global clustering is sent to all local sites. If a local object belongs to the ε-neighborhood of a global representative, the cluster identifier of this representative is assigned to the local object. Thus, each site has the same information as if its data had been clustered on a global site together with the data of all the other sites. 
To sum up, distributed clustering consists of four different steps: 
• Local clustering 
• Determination of a local model 
• Determination of a global model, which is based on all local models 
• Updating of all local models 
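The four steps can be sketched as follows. This is a simplified illustration, not the algorithm of [4]: local clustering is replaced by plain ε-linkage grouping, each cluster contributes a single arbitrarily chosen representative, and its ε-range is assumed to be the larger of ε and the distance to its farthest member. All names are hypothetical.

```python
import math

def eps_clusters(points, eps):
    """Group points so that any two points within eps of each other share a cluster."""
    clusters = []
    for p in points:
        linked = [c for c in clusters if any(math.dist(p, q) <= eps for q in c)]
        merged = [p] + [q for c in linked for q in c]
        clusters = [c for c in clusters if all(c is not l for l in linked)] + [merged]
    return clusters

def local_model(points, eps):
    """Steps 1-2: cluster locally, keep one representative per cluster plus its range."""
    model = []
    for cluster in eps_clusters(points, eps):
        rep = cluster[0]  # a concrete local object (arbitrary choice, an assumption)
        eps_range = max(eps, max(math.dist(rep, q) for q in cluster))
        model.append((rep, eps_range))
    return model

def global_model(local_models, eps_global):
    """Step 3: re-cluster all representatives and attach a global cluster identifier."""
    reps = [entry for model in local_models for entry in model]
    groups = eps_clusters([rep for rep, _ in reps], eps_global)
    cluster_id = {rep: gid for gid, group in enumerate(groups) for rep in group}
    return [(rep, rng, cluster_id[rep]) for rep, rng in reps]

def relabel(points, model):
    """Step 4: a local object inherits the id of a representative whose range covers it."""
    labels = []
    for p in points:
        ids = [gid for rep, rng, gid in model if math.dist(p, rep) <= rng]
        labels.append(ids[0] if ids else None)  # None marks an uncovered object
    return labels

site_1 = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
site_2 = [(0.8, 0.2), (5.2, 5.1)]
models = [local_model(s, eps=1.0) for s in (site_1, site_2)]
merged = global_model(models, eps_global=1.5)
labels_1 = relabel(site_1, merged)
labels_2 = relabel(site_2, merged)
```

Only the representatives and their ranges travel to the central site, which is what makes the central clustering cheaper than clustering the complete data set.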
1) K-means Algorithm 
A partition clustering algorithm splits the data points into k partitions, where each partition represents a cluster. The partitioning is done based on a certain objective function. One such criterion function is the squared-error criterion, which is to be minimized and is computed as shown by the formula: 
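In its standard form the squared-error criterion can be written as below. The notation is an assumption introduced here: $C_j$ denotes the $j$-th cluster, $m_j$ its centroid, and $E$ the total squared error.

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^{2},
\qquad
m_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
```

Each point contributes its squared Euclidean distance to the centroid of the cluster it is assigned to, so a smaller $E$ means more compact clusters.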
2) Pseudo code for K-means algorithm 
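A minimal Python sketch of the standard K-means iteration (often called Lloyd's algorithm) is given here as an illustration. It is not the authors' original pseudo code; the function names and the optional `init` parameter are assumptions.

```python
import math
import random

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # update step: mean of the assigned points
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:        # keep the old centroid if a cluster becomes empty
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

def squared_error(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2, init=[points[0], points[3]])
error = squared_error(points, centroids, labels)
```

The result depends on the initial centroids. The example fixes them explicitly so that the run is reproducible; when `init` is omitted, the sketch samples k points at random.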
3) Gaussian Processes 
Gaussian processes (GPs) are a modeling mechanism with origins in spatial statistics, particularly kriging [Journel and Huijbregts, 1992]. In contrast to global approximation techniques such as least-squares fitting, GPs are local approximation techniques, akin to nearest neighbor procedures [7]. In contrast to function approximation techniques that place a prior on the form of the function, GP modeling techniques place a prior on the covariance structures underlying the data. 
The basic idea in GPs is to model a given data set as a realization of a stochastic process. Formally, a GP is a set of random variables, any finite subset of which has a (multivariate) normal distribution [6]. For our purposes, we can think of these variables as spatially distributed (scalar) response variables t_i, one for each 2D location x_i = [x_i1, x_i2] where we have collected a data sample. In our vector field analysis application, t_i denotes the modeled response, i.e., the value of de Boor's function at x_i. Given a data set D = {x_i, t_i}, i = 1 ... n, and a new data point x_{n+1}, a GP can be used to model the posterior P(t_{n+1} | D, x_{n+1}), which is also a Gaussian. This is essentially what many Bayesian modeling techniques do (e.g., least-squares approximation with normally distributed noise), but it is the specifics of how the posterior is modeled that make GPs distinct as a class of modeling techniques. 
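As an illustration, the mean and variance of the Gaussian posterior P(t_{n+1} | D, x_{n+1}) can be computed with the textbook zero-mean GP formulas, mean = k*^T K^{-1} t and variance = k(x_{n+1}, x_{n+1}) - k*^T K^{-1} k*. The sketch below assumes a squared-exponential covariance with unit length scale and a small noise term; the text does not specify either, and the function names and sample values are hypothetical.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential covariance between two 2D locations (an assumed kernel)."""
    return math.exp(-0.5 * math.dist(a, b) ** 2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(X, t, x_new, noise=1e-6):
    """Mean and variance of the posterior at x_new for a zero-mean GP."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(x, x_new) for x in X]
    alpha = solve(K, t)       # K^{-1} t
    v = solve(K, k_star)      # K^{-1} k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # sampled 2D locations (made-up values)
t = [1.0, 2.0, 0.5]                        # responses at those locations (made-up values)
mean_seen, var_seen = gp_posterior(X, t, (1.0, 0.0))
mean_far, var_far = gp_posterior(X, t, (10.0, 10.0))
```

At a sampled location the posterior mean reproduces the observed response with almost no variance, while far from all samples it falls back to the prior (mean 0, variance 1). This is the local-approximation behaviour described above.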
CONCLUSION AND FUTURE WORK 
In this paper, we presented a case study in educational data mining. It showed how data mining can be used in higher education, particularly to improve graduate students' performance. We used students' data collected from the College of Engineering and Technology in Villupuram. The data cover a five-year period [2008-2013]. We applied data mining techniques to discover knowledge, based on clustering methods such as centroid-based, distribution-based and density-based clustering. We also clustered the students into groups using the K-Means clustering algorithm. Finally, the distance-based approach and the density-based approach were used. Each of these tasks can be used to improve the performance of graduate students. 
Our future work includes applying data mining techniques to an expanded data set with more distinctive attributes to get more accurate results. 
References 
[1] Bin Zhang, Meichu Hsu, George Forman, (2000) ‘Accurate Recasting of Parameter Estimation Algorithms Using Sufficient Statistics for Efficient Parallel Speedup – Demonstrated for Center Based Data Clustering Algorithms’.
[2] Carlos Marquez-Vera, Cristobal Romero Morales and Sebastian Ventura Soto, (2013) ‘Predicting School Failure and Dropout by Using Mining Techniques’, Vol. 8, No. 1.
[3] Chunfei Zhang, Zhiyi Fang, (2013) ‘An Improved K-means Clustering Algorithm’, Journal of Information and Computational Science 10: pp. 193-199.
[4] Huan Wang, Yanwei Yu, Qin Wang and Yadong Wan, (2012) ‘A Density-Based Clustering Structure Mining Algorithm for Data Streams’.
[5] Mohammed M. Abu Tair, Alaa M. El-Halees, (2012) ‘Mining Educational Data to Improve Students’ Performance: A Case Study’, International Journal of Information and Communication Technology Research, Vol. 2 No. 2, pp. 140-146.
[6] Naren Ramakrishnan, Chris Bailey-Kellogg, Sathish Tadepalli, and Varun Pandey, ‘Gaussian Processes for Active Data Mining of Spatial Aggregates’.
[7] Neha Aggarwal, Faridabadl, and Kiriti Aggarwal, (2012) ‘A Mid Point based K-mean Clustering Algorithm for Data Mining’, International Journal on Computer Science and Engineering, Vol. 4 No. Global Journal of Computer Science and Technology, vol. 10, no. 06, pp. 1174-1180.
[8] Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, and Jorg Sander, (1998) ‘A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases’, International Conference on Data Engineering.