Keywords 
Clustering, uncertain data, PM Algorithm, density estimation, Mixture Models. 
INTRODUCTION 
Recent years have seen a surge of interest in methods for managing and mining uncertain data [1], [2], [3]. As noted
in [1], [4], [5], uncertain data arise in many applications due to limitations of the underlying equipment (e.g., unreliable
sensors or sensor networks), use of imputation, interpolation or extrapolation techniques (to estimate, e.g., the position
of moving objects), partial or uncertain responses in surveys, etc. 
Clustering uncertain data has been well recognized as an important issue [6]. Generally, an uncertain data object can
be represented by a probability distribution [7]. The problem of clustering uncertain objects according to their
probability distributions happens in many scenarios. 
For example, in marketing research, users are asked to evaluate digital cameras by scoring on various aspects, such
as image quality, battery performance, shotting performance, and user friendliness. Each camera may be scored by
many users. Thus, the user satisfaction to a camera can be modelled as an uncertain object on the user score space.
There are often a good number of cameras under a user study. A frequent analysis task is to cluster the digital cameras
under study according to user satisfaction data. 
One challenge in this clustering task is that we need to consider the similarity between cameras not only in terms of
their score values, but also their score distributions. One camera receiving high scores is different from one receiving
low scores. At the same time, two cameras, though with the same mean score, are substantially different if their score
variances are very different. As another example, a weather station monitors weather conditions including various
measurements like temperature, precipitation amount, humidity, wind speed, and direction. The daily weather record 
varies from day to day, which can be modelled as an uncertain object represented by a distribution over the space
formed by several measurements. Can we group the weather conditions during the last month for stations in North
America? Essentially, we need to cluster the uncertain objects according to their distributions. 
In applications that require interaction with the physical world, such as locationbased services [8] and sensor
monitoring [9], data uncertainty is an inherent property due to measurement inaccuracy, sampling discrepancy,
outdated data sources, or other errors. Although much research effort has been directed towards the management of
uncertain data in databases, few researchers have addressed the issue of mining uncertain data. We note that with
uncertainty, data values are no longer atomic. To apply traditional data mining techniques, uncertain data has to be
summarized into atomic values. Unfortunately, discrepancy in the summarized recorded values and the actual values
could seriously affect the quality of the mining results. Figure 1 illustrates this problem when a clustering algorithm is
applied to moving objects with location uncertainty. If we solely rely on the recorded values, many objects could
possibly be put into wrong clusters. Even worse, each member of a cluster would change the cluster centroids, thus
resulting in more errors. 
In recent work on uncertain data mining, probability theory has often been adopted as a formal framework for
representing data uncertainty. Typically, an object is represented as a probability density function (pdf) over the
attribute space, rather than as a single point as usually assumed when uncertainty is neglected. Mining techniques that
have been proposed for such data include clustering algorithms [10], density estimation techniques [11], outlier
detection [12], support vector classification [13], decision trees [5], etc. 
Data is often associated with uncertainty because of measurement inaccuracy, sampling discrepancy, outdated data
sources, or other errors. Data uncertainty can be categorized into two types, namely existential uncertainty and value
uncertainty. In the first type it is uncertain whether the object or data tuple exists or not. For example, a tuple in a
relational database could be associated with a probability value that indicates the confidence of its presence. In value
uncertainty, a data item is modelled as a closed region which bounds its possible values, together with a probability
density function of its value. This model can be used to quantify the imprecision of location and sensor data in a
constantlyevolving environment. 
This paper is organized as follows. In Section 2, provides Related work of the Density based clustering and Kmeans++
clustering. Section 3 presents the Collective Neighbor clustering algorithm. Section 4 illustrates the
experiments and presents the results with some discussion. Finally, Conclusion our work in Section 5. 
II. RELATED WORK 
Clustering is a fundamental data mining task. Clustering certain data has been studied for years in data mining,
machine learning, pattern recognition, bioinformatics, and some other fields [14], [15]. However, there is only
preliminary research on clustering uncertain data. 
A. Clustering Based on numerical Distances 
Ngai et al. [16] proposed the UKmeans method which extends the kmeans method. The UKmeans method
measures the distance between an uncertain object and the cluster center (which is a certain point) by the expected
distance. Recently, Lee et al. [17] showed that the UKmeans method can be reduced to the kmeans method on certain
data points. UKmeans basically follows the wellknown Kmeans algorithm except that it uses expected distance when
determining which cluster an object should be assigned to. The second algorithm uses the idea of minmax distance
pruning in UKmeans with the objective of reducing the number of expected distance calculations. UKmeans starts by
randomly selecting k points as cluster representatives. To calculate the integral, the distance of each sample to assigned
cluster, and then approximate the integral by finding the sum of the distances, weighted by the corresponding
probability density of the sample points. For accuracy, thousands of samples are needed. Expected distance calculation
is thus a computationally expensive operation. 
B. Clustering Based on sharing Similarity 
The clustering distributions have appeared in the area of information retrieval when clustering documents [5]. The
major difference of this work is that we do not assume any knowledge on the types of distributions of uncertain objects.
When clustering documents, each document is modelled as a multinomial distribution in the language model. For
example, Xu and Croft [18] discussed a kmeans clustering method with KL divergence as the similarity measurement
between multinomial distributions of documents. Assuming multinomial distributions, KL divergence can be computed
using the number of occurrences of terms in documents. Dhillon et al. [19] used KL divergence to measure similarity
between words to cluster words in documents in order to reduce the number of features in document classification.
They developed a kmeans like clustering algorithm and showed that the algorithm monotonically decreases the
objective function as shown in [9], and minimizes the intracluster JensenShannon divergence while maximizing intercluster
JensenShannon divergence. As their application is on text data, each word is a discrete random variable in the
space of documents. Therefore, it is corresponding to the discrete case in our problem. The kmeans like iterative
relocation clustering algorithms based on Bregman divergences which is a general case of KL divergence. They
summarized a generalized iterative relocation clustering framework for various similarity measures from the previous
work from an information theoretical viewpoint. They showed that finding the optimal clustering is equivalent to
minimizing the loss function in Bregman information corresponding to the selected Bregman divergence used as the
underlying similarity measure. In terms of efficiency, their algorithms have linear complexity in each iteration with
respect to the number of objects. However, they did not provide methods for efficiently evaluating Bregman divergence
nor calculating the mean of a set of distributions in a cluster. For uncertain objects in our problem which can have
arbitrary discrete or continuous distributions, it is essential to solve the two problems in order to scale on large data
sets, as we can see in our experiments. 
III. PROPOSED ALGORITHM 
The proposed system is based on Maximum probability estimation from Uncertain Data clustering algorithm. We
first describe an uncertain data model in which data uncertainty is represented by belief functions; this model
encompasses probabilistic data, interval valued data and fuzzy data as special cases. The proposed system introduce an
extension of the PM (Probability–maximization) algorithm, called the evidential PM (P2M) algorithm, allowing us to
estimate parameters in parametric statistical models based on uncertain data. The proposed System shown in Fig. 1. 
A. PM algorithm 
The PM algorithm is a generally applicable mechanism for computing maximum probability estimates (MPEs) from
incomplete data, in situations where maximum probability estimation would be straightforward. The complete EMAlgorithm
Learning is listed in Algorithm 1. 
The PM algorithm approaches the problem of maximizing the observeddata log likelihood log L(θ; A) by
proceeding iteratively with the completedata log probability log L(θ; x) = log px(x; θ). In each iteration of the
algorithm involves two steps called the Probability step (Pstep) and the maximization step (Mstep). 
The Estep requires the calculation of, 

Where θ(q) denotes the current fit of θ at iteration q and Pθ(q) [·A] denotes probability with respect to the conditional
distribution of X given A, using the parameter vector θ(q). 
The Mstep then consists in maximizing Q (θ; θ(q)) with respect to θ over the parameter space Ө, i.e., finding θ(q+1)
such that Q(θ(q+1); θ(q)) ≥ Q(θ; θ(q)) for all θ Є Ө. The E and Msteps are iterated until the difference L(θ(q+1); A)–L(θ(q);
A) becomes smaller than some arbitrarily small amount. 

where L(θ(q); pl) is discrete case, and (2) in the continuous case. At iteration q, the following function is thus
computed: 
in the discrete case, or in the continuous case. 
The Mstep is unchanged and requires the maximization of Q(θ; θ(q)) with respect to θ. The E2M algorithm
alternately repeats the E and Msteps above until the increase of observeddata likelihood becomes smaller than some
threshold. 
IV. ALGORITHM 1: PSEUDO CODE 
Input: Require: X, L, A, pl, θ and q 
Output: The cluster output. 
Require: A: Available cluster space, R: Resource (Data points), i (Iteration), S: Similarities. 
Step 1: Initialize Q(θ), Q(X), px, Q(Ө), and Q(T) randomly 
Step 2: repeat 
Step 3: Update Q(X) is an arbitrary distribution on L. 
Step 4: Update Q(θ) using eq.(1) 
Step 5: Update px probability mass function using eq. (2) 
Step 6: Calculate the value of log likelihood based on the probability densities using eq. (3) 
Step 7: Create the clusters in discrete case by assigning each probability to a cluster such that the belonging
probabilities are maximized using eq. (4) 
Step 8: until convergence 
Step 9: return Q(X) and px(x) 
Step 10: End 
V. SIMULATION RESULTS 
The extensive experiments on both synthetic and real data sets to evaluate the effectiveness of PM algorithm as a
similarity measure for clustering uncertain data and the efficiency of the techniques for evaluating PM divergences.
The experiments were conducted on a computer with an Intel Core 2 Duo P8700 2.53 GHz CPU and 2 GB main
memory running windows xp (sp2). All programs ran inmemory. 
The data sets in both continuous and discrete domains. In the continuous case, an uncertain object is a sample
drawn from a continuous distribution. In the discrete case, a data set is generated by converting a data set in the
continuous case. We discretized the continuous domain by partitioning it into a grid. Every dimension is equally
divided into two parts. So, a ddimensional space is divided into 2d cells of equal size. We use the central points of
cells as values in the discrete domain. The probability of an object in a cell is the sum of the probabilities of all its
sample points in this cell. 
The above algorithm was applied to the data shown in Table 1. This dataset is composed of n = 6 observations, one
of which (for i = 4) is uncertain and depends on a coefficient. In that special case it is assumed that pl4(0) + pl4(1) = 1,
i.e., the corresponding mass function mi is Bayesian. 
The results are shown in Table 2 and 3. The algorithm was stopped when the relative increase of the likelihood
between two iterations was less than 106. Starting from the initial value θ(0) = 0:3, this condition was met after 5
iterations. The final MLE is θ’ = 0:6. This is the value of θ that minimizes the conflict between the uncertain data
given in Table 1. 
VI. CONCLUSION 
In this paper, the PM algorithm was motivated by our interest in uncertain data clustering of both discrete and
continuous phase. We explore clustering uncertain data based on the similarity between their distributions. We
advocate using the Probability–maximization algorithm as the probability similarity measurement, and systematically
define the PM divergence between objects in both the continuous and discrete cases. We integrated PM divergence into
the partitioning and densitybased clustering methods to demonstrate the effectiveness of clustering using PM
divergence. 
Our proposed method then seeks the value of the unidentified parameter that maximizes a generalized probability
criterion, which can be interpreted as an angle of agreement between the parametric model and the uncertain data. This is achieved using the evidential PM algorithm, which is a simple extension of the classical PM algorithm with proved
convergence properties. 
Tables at a glance 


Figures at a glance 




Figure 1 
Figure 2 
Figure 3 
Figure 4 


References 
 S. Abiteboul, P.C. Kanellakis, and G. Grahne, Ã¢ÂÂOn the Representation and Querying of Sets of Possible Worlds,Ã¢ÂÂ Proc. ACM SIGMODIntÃ¢ÂÂl Conf. Management of Data (SIGMOD), 1987.
 M.R. Ackermann, J. BloÃÂ¨mer, and C. Sohler, Ã¢ÂÂClustering for Metric and NonMetric Distance Measures,Ã¢ÂÂ Proc. Ann. ACMSIAM Symp.Discrete Algorithms (SODA), 2008.
 M. Ankerst, M.M. Breunig, H.P. Kriegel, and J. Sander, Ã¢ÂÂOptics: Ordering Points to Identify the Clustering Structure,Ã¢ÂÂ Proc. ACMSIGMOD IntÃ¢ÂÂl Conf. Management of Data (SIGMOD), 1999.
 A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh, Ã¢ÂÂClustering with Bregman Divergences,Ã¢ÂÂ J. Machine Learning Research, vol. 6, pp.17051749, 2005.
 D.M. Blei, A.Y. Ng, and M.I. Jordan, Ã¢ÂÂLatent Dirichlet Allocation,Ã¢ÂÂ J. Machine Learning Research, vol. 3, pp. 9931022, 2003.
 H.P. Kriegel and M. Pfeifle, Ã¢ÂÂDensityBased Clustering of Uncertain Data,Ã¢ÂÂ Proc. ACM SIGKDD IntÃ¢ÂÂl Conf. Knowledge Discovery inData Mining (KDD), 2005.
 R. Cheng, D.V. Kalashnikov, and S. Prabhakar, Ã¢ÂÂEvaluating Probabilistic Queries over Imprecise Data,Ã¢ÂÂ Proc. ACM SIGMOD IntÃ¢ÂÂl Conf.Management of Data (SIGMOD), 2003.
 Wolfson, O., Sistla, P., Chamberlain, S. and Yesha, Y.: Updating and Querying Databases that Track Mobile Units. Distributed andParallel Databases, 7(3), 1999.
 Cheng, R., Kalashnikov, D., and Prabhakar, S.: Querying Imprecise Data in Moving Object Environments. IEEE TKDE, 16(9) (2004)11121127.
 H.P. Kriegel and M. Pfeifle, Ã¢ÂÂDensitybased clustering of uncertain data,Ã¢ÂÂ in Proceedings of the eleventh ACM SIGKDD internationalconference on Knowledge discovery in data mining. Chicago, Illinois, USA: ACM, 2005, pp. 672 Ã¢ÂÂ 677.
 C. C. Aggarwal, Ã¢ÂÂOn density based transforms for uncertain data mining,Ã¢ÂÂ in IEEE 23rd International Conference on Data Engineering(ICDE 2007), Istanbul, 2007, pp. 866Ã¢ÂÂ875.
 C. C. Aggarwal and P. S. Yu, Ã¢ÂÂOutlier detection with uncertain data,Ã¢ÂÂ in Proceedings of the SIAM International Conference on DataMining (SDM 2008), Atlanta, Georgia, USA, 2008, pp. 483Ã¢ÂÂ493.
 J. Bi and T. Zhang, Ã¢ÂÂSupport vector classification with input data uncertainty,Ã¢ÂÂ in Advances in Neural Information Processing Systems 17,L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 161Ã¢ÂÂ168.
 J. Han and M. Kamber, Data Mining: Concepts and Techniques. Elsevier, 2000.
 L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
 W.K. Ngai, B. Kao, C.K. Chui, R. Cheng, M. Chau, and K.Y. Yip, Ã¢ÂÂEfficient Clustering of Uncertain Data,Ã¢ÂÂ Proc. Sixth IntÃ¢ÂÂl Conf. DataMining (ICDM), 2006.
 S.D. Lee, B. Kao, and R. Cheng, Ã¢ÂÂReducing UkMeans to kMeans,Ã¢ÂÂ Proc. IEEE IntÃ¢ÂÂl Conf. Data Mining Workshops (ICDM), 2007.
 J. Xu and W.B. Croft, Ã¢ÂÂClusterBased Language Models for Distributed Retrieval,Ã¢ÂÂ Proc. 22nd Ann. IntÃ¢ÂÂl ACM SIGIR Conf. Researchand Development in Information Retrieval (SIGIR), 1999.
 I.S. Dhillon, S. Mallela, and R. Kumar, Ã¢ÂÂA Divisive InformationTheoretic Feature Clustering Algorithm for Text Classification,Ã¢ÂÂ J.Machine Learning Research, vol. 3, pp.12651287, 2003.
