Detection and Prevention of Leaks in Anonymized Datasets

S; eep Varma Nadimpalli; Valli Kumari Vatsavayi

Detection and Prevention of Leaks in Anonymized Datasets

Sandeep Varma Nadimpalli and Valli Kumari Vatsavayi
Department of Computer Science and Systems Engineering, Andhra University Visakhapatnam, Andhra Pradesh, India -530 003

Corresponding Author: Sandeep Varma Nadimpalli, E-mail: snvarma9@gmail.com

Related article at Pubmed, Scholar Google

Visit for more related articles at Journal of Global Research in Computer Sciences

Abstract

With a wide spread of modern technology, person specificdata dissemination has beenincreasing rapidly,leading to a global concern for preserving privacy of an individual. Several principles like k-anonymity, l-diversity etc., have been proposed to protect the person specific information during data publishing. However, the presence of dependencies in an anonymized dataset may identify the individual due to the hypothetical nature of the adversary/attacker. This paper shows how the presence of these dependencies among Quasi-Identifiers (QI), Sensitive (S) attributes and also between QI and S attributes can lead to the potential identification of an individual using Bayesian Networks. A solution Break-Merge (BM) was proposed on the fly to reduce the attacker?s inferring nature on the sensitive data. Experimentations show the efficacy of theproposed approaches.

Keywords

Preserving Data Publishing, Data Dependencies, Knowledge Breach Attack, Verification, Knowledge Breach Probability, Bayesian Networks.

INTRODUCTION

Rapid growth in hardware technology in memory and storage management increased the storage of high volumes of data. Organizations both public and private are making their data available electronically to enable data access services the World Wide Web. This may lead to disclosing of information to the private/external parties who may use the data for survey or mining purpose. These organizations data may contain person specific data which may be extremely sensitive. Also, this sensitive data may contain possible dependencies that can open the inference channels for the attacker to extract sensitive information easily. Hence the urge and insight of the researchers has begun towards the privacy issues and their struggle for inventing new principles and frameworks for strong privacy protection came into existence.

In USA, when public voterÃÂ¢Ãâ¬ÃÅ¸s registration list is combined with the health insurance information records the medical record of the governor of Massachusetts has been potentially identified [1]. This problem was termed as linking attack in the literature [6]. Consider the microdata holding the information of the census information shown in Table I. The sensitive attributes in the dataset are Government, Marital-status and Salary. These attributes are considered to be private by the individual [1].

The attribute Name is termed as explicit identifier because one can easily identify the exact tuple by knowing the name of the individual.

For instance, it is clear from Table I that Alice works in a private organization. Age, Gender, Zipcode attributes are called as quasi-identifiers(QI) because when they are combined with an external dataset there might be a possible potential leakage of the identity of an individual. To prevent this generalization technique (replacing more specific value to less specific value or to replace the value with * termed to be suppression) has been widely used to partition the QI group in the microdata.

k-anonymization is a popular generalization technique used to protect privacy of individuals from the data records. In this technique, the identifying attributes such as SSN, Name in the original data table are removed and then the individuals are formed into groups having size of k or more [1][3][5]. The 5- anonymized version of the census data is shown in Table II. However, when k is large the adversary can infer the sensitive value information with high level of probability[5]. It is shown in [6] that k-anonymity does not provide privacy. They define the anonymized table in such a way that for each quasiidentifier group, at most 1/l sensitive values must be present.

Motivation

Consider the sample anonymized adult dataset shown in Table II which satisfies 5-anonymity. Form Table II it is clear that the probability to identify an individual is at most 1/5. If an adversary, say Alice knows that Jessica is a female and belongs to the first anonymity-group then Alice can infer that the salary of Jessica is „≤ 50KÃÂ¢Ãâ¬ÃÅ¸ with probability of 4/5. If Alice has additional background knowledge about Jessica that she works in a „State-govÃÂ¢Ãâ¬ÃÅ¸ he can easily identify the marital status and the salary of Jessica with a probability of 1, i.e., 100% inference. The inference “Sate-gov→Never-married, ≤ 50K” exists even after the table is anonymized. This inference is termed as quasi- identifier to sensitive attributes association. It is also clear that there is a dependency among quasi, sensitive and quasi to sensitive attributes. In general it is hard to find out how and exactly what attributes are dependent on each other and to what extent just by looking into a large dataset after anonymization.

This being a prime motto we showed how these dependencies between quasi-identifier, sensitive attributes and on both can be discovered by constructing a belief network [41][42] . We term this dependency attack as Knowledge Breach attack (KBA). This attack is modeled using Bayesian Network to identify potential inferences in an anonymized dataset.

RELATED WORK

Several privacy principle techniques were present in the literature. We divide them categorically and present in this section.

Anonymization Operations

Many of the privacy preserving principles adopt generalization as a basic operation to anonymize the microdata. Three flavors of generalization operations are suppression[7],singledimension generalization[8][9][10][11][12] and multiple domain generalization [13][14][15]. In suppression technique the QI values in each QI-group are replaced with stars („*ÃÂ¢Ãâ¬ÃÅ¸) where as in single-dimension technique, disjoint sub-domains of QI are formed in such a way that each QI value in the microdata is mapped to the other sub domain that contains the corresponding values. For example, Table II shows a single dimension generalization satisfying 5-anonymity. To be more specific, the domain Gender is divided into two sub domains “M” and “F”. Multiple domain generalization is an extension of single dimension generalization. Here, the QI values are mapped to the overlapping sub-domains.

Off-the-shelf softwareÃÂ¢Ãâ¬ÃÅ¸s like SAS [24], SPSS [25]and STATA [26]employ suppression and single dimensional generalization techniques. Here, the advantage is the patterns can be easily generated with the help of those tools. Multi-dimensional approach suffers from query analysis, say for instance classification. To this reason, off-the-shelf-softwareÃÂ¢Ãâ¬ÃÅ¸s donÃÂ¢Ãâ¬ÃÅ¸t use multi-dimensional technique for statistical analysis instead they adopt suppression so that it can be treated as a missing value and can be processed easily by these statistical softwareÃÂ¢Ãâ¬ÃÅ¸s. Also μ-Args [20] and datafly[22] use suppression and single-domain generalization techniques.

1)Generalization based Techniques: Statstical community used randomization methods to protecting the privacy of an individual [16]. Also, they added noise to the data before the data release to preserve privacy in fraud detection. However, these methods failed in providing effificient anonymity solution and as a result it lead to data integrity failures. Sweeney proposed k-anonymity to protect privacy of the traget individuals in the microdata. The linking of external voters-list with medical data revealed the identity of the personnel [1]. To prevent this linking attack, generalization based techniques are developed [1] [8] [17] [18][19].

k-anonymity techniques fails when the adversary has potential knowledge on the sensitive attributes. To aid this technique ldiversitywas developed to protect against the inferences on the sensitive values [6]. Later (α, k)-anonymity a combined version of k-anonymity and l-diversity was proposed by Wong et al [11]. It protects both identification and sensitive information by reducing the homogeneity attack. The parameter α defines the maximum percentage of any sensitive value within any Qid-block. Another rigid form of l-diversity is m-variance technique which divides the group such that the group must have exactly m-sensitive values [21]. (c, k) safety [23] assumes a stronger background knowledge. If the attacker know k pieces of knowledge (c,k) safety guarantees the inference of sensitive values adhering to c confidence. Techniques like t-closeness [2] and (k,e)-anonymity [27] can deal with only with numerical attributes while other principles can handle categorical data also. Apart from the above principles Xia and Tao define personalized privacy where the individual is give an option to define his/her own degree of privacy level [28].

2) Permutation based Techniques: In these techniques data perturbation is absent. They publish the QI values directly Anatomy [29], k-permutation [27], bucketization [23] and ambiguity [30] techniques fall under this category. These methods help in achieving better utility than generalization based methods. Permutation based methods restrict the association of a single sensitive attribute with the respondentÃÂ¢Ãâ¬ÃÅ¸s quasi-identifier. However, this can disclose the information when the adversarial knowledge increases potentially [13].

3) Query based Privacy: In query based privacy different view of the dataset is projected. There could be a chance to reveal the private information when different views are merged [31] [32] [33] [34]. [32] [33] provide methods for better privacy and utility. They treat the sensitive values as independent in the released data. However, these techniques areNP-hard [34] [35]. Dwork proposed the concept of differential privacy [36] [37] and [38] extended the work of Dwork.

To summarize, the anonymization technique like k-anonymity fails in discovering the associations (background knowledge) among the QI and sensitive attributes. Even l-diversity cannot handle these associations when the sensitive attributes in the microdata increase. To prevent these associations Weijia and Shangteng proposed a Q-S association hiding algorithm by using association and disassociation rules and then generalize the sensitive values [39]. For determining the association rules, they construct 1-itemsets using inverted file data structure. However, in practical scenarios the sensitive attributes must not be generalized from utility perspective while extracting useful patterns from the dataset.

In this paper, Bayesian network based model was proposed for detecting dependencies [45] in an anonymized dataset. Also, a solutionis proposed for publishing the dataset such that an adversary cannot re-identify the individual [46].

ARCHITECTURE

The architecture of the proposed method is shown in Figure1. Initially a taxonomy tree and the dataset are fed to the anonymizer for anonymizing the data. After the dataset is anonymized, Bayesian networkis constructed for quasiidentifiers, sensitive attributes individually and a Bayesian net on the whole dataset to detect the dependencies in the anonymized data. Whenfound,Break-Merge technique is used to reduce the plausible inferences in the anonymized dataset. The proposed methods for detecting the dependencies and preventing the dependencies are explained in the subsequent sections.

DEPENDENCIES DETECTION

Researchers collect non-aggregate data from various organizations. However, the adversary with his/her hypothetical nature may have access to various external datasets like voters/medical list for mapping the individuals so that he/she can identify the individuals potentially. Various types of attacks and their remedies were discussed in related work. However, none of those methods focused on how the adversary could potentially identify an individual in an anonymized data. The motivating example clearly shows a dependency existence in an anonymized dataset.

Typically, dependencies among attributes express how well the attribute values are dependent on each other. One example is the SSN number where in at most one individual is associated with one unique SSN number. To detect these kinds of dependencies Bayesian network is employed [41]. In essence, a Bayes net is a directed acyclic graph (DAG) whose edges represent statistical dependencies. However, there may be conditional dependencies among nodes. Each attributes in the dataset is considered as node in the Bayes net (Definition 1). The ICS [44] search algorithm is adopted to preserve the dependency between variables into causal relationship among the nodes/attributes.

Definition 1: (Bayesian Network):

in this distribution is expressed in conditional probability tables CPT (see Table 3).

When we look at the conditional probability Table III for age and zipcode the probability of revealing the age=30-50 is 100% when the adversary knows the zipcode to be 13000- 23000. Further, if the adversary knows the zipcode and age values, the probability of finding whether he belongs to M or F group is 50% in some cases and 100% in some cases.

Dependencies among Sensitive attributes

As seen in the first case, dependencies among the quasiidentifiers were identified. In this section,the dependencies among sensitive group are identified. From Figure 3 it can be observed that the Government is independent and the remaining attributes Marital-status is dependent on Government and Salary attribute is dependent on both Government and marital status.

The conditional probabilities of the corresponding dependencies are shown in Table VI and Table VII. For instance {GovernmentMarital-status→ Salary} the risk for determining the salary is quite high (0.5<PSalary<1) when both Government and marital-status are known to the adversary. The probability for determining Marital-status when Government is known is considerably low. This signifies that the dependency {Government→Marital-status} will not hold in sensitive attributes.

Dependencies among Quasi-Identifiers and Sensitive Attributes

The first two cases show how the dependencies among QIÃÂ¢Ãâ¬ÃÅ¸s and Sensitive attribute. Now a Bayesian network for the entire dataset is show in Figure 4. In general the adversary can easily guess the quasi-identifiers since they are grouped and generalized and could be easily identified but the problem arises whenapart from QI, if the adversary has potential background knowledge on sensitive attributes, he checks what dependencies can exist among the attributes by constructing a belief network. From Figure 4, the Government attributes is dependent on the remaining attributes.

The CPT table for the dependency {Age, Gender, Zipcode, Marital-Status, Salary→Government}as given in Table VIII. It is clear that when an adversary knows the QI group, theMarital-status as never-married and earns a salary <50k he/she can conclude that an individual works in state-gov with likelihood of 100%. So the dependency {Age, Gender, Zipcode, Marital-Status, Salary→Government} hold good for Table II.

Algorithm

Algorithm 1 detects the dependencies in an anonymized dataset. The original dataset D as shown in the Table I is anonymized to DS* as shown in Table II. Initially, the algorithm assumes that there are no dependencies in the dataset. A Bayesian network is constructed for DS* using ICS search algorithm [44].

From the Bayesian net a set of dependent attributes(N) and independent attributes ( ) is found. Now the conditional probability tables CPT for each dependent attribute are calculated. If for each value of N there exist distinct values of independent attribute with respect to N with probability less than risk level α (regarding to existing values in the database instances) add the dependency rule →Ni to DEP (steps 2-7). Now examine each edge linking from each independent attributes in . If more than two independent attributes exist for the dependent attributes repeat steps (2-7). If the criterion is satisfied for each and every edge, then add all such dependency rules to DEP set (steps 8-19).

BREAK-MERGE (BM)

The pervious section showed how dependencies exist in an anonymized dataset. The presences of these dependencies may identify an individual potentially.This may in turn violate the privacy. In order to remove these associations and reduce the attackersÃÂ¢Ãâ¬ÃÅ¸ guessing nature of the individual Break-Merge technique is proposed where the anonymized table is separated into Quasi identifier table QIT (Does not hold any sensitive information) and sensitive table ST. Initially QIT is formed in such a way that each group is assigned a Group_Id in a new separate column (Table IX).

The STÃÂ¢Ãâ¬ÃÅ¸s (Tables X, XI, XII) hold marital-status, salary and Government along with the count values of their respective QI-groups are given.The Sensitive tables are represented asset of (Gid, SA, Count) where Gid is the group id of the corresponding QIDT, SA is the sensitive attribute value and count is the number of times the sensitive value is present in the corresponding Gid groups respectively. For example, whenthe Government ST is considered, it signifies that the value sate-gov is associated with two tuples in the first QIgroup, Federal-gov is associated with one tuple in the first group and one tuple in the second QI-group, Private is associated with one tuple in the first group and three tuples in the second group and so on. In this fashion all the sensitive tables are constructed.

With the proposed approach an adversary cannot infer the association between the quasi-identifiers and sensitive attribute because the QIT will not represent any of the information related to the sensitive values and the sensitive values information must be obtained from ST which further increases the probability for the adversary to identify the sensitive value of an individual and hence preserving privacy.

For instance, let us consider that the adversary knows the age of Jessica is 31 and zipcode is 13000 (Tuple id 4 as shown in the figure 1). Since the values in the QIT are generalized and no sensitive information (considering that the adversary wants to know the salary of Jessica) is available.

The only information the adversary can guess is the group id i.e., 1, since all the females fall in the first group. With the help of the group id when he looks for ST, he figures out that out of 5 records, 4 females were drawing the salary ≤50K and 1 female person is drawing a salary >50K. From this it is clear that the probability that the salary of Jessica is either 4/5 or 1/5 and hence the adversary must randomly guess from the ST but not the exact tuple.

Further the adversary has some prior background knowledge about Jessica that she works in a state-gov. If the adversary wants to know the marital-status and salary of Jessica, The probability to find the marital status of Jessica is 1/2 and probability to find the salary is 1/2 i.e, 50%. Let us extend this much further, even if the adversary knows that Jessica works in a state-gov company the probability that Jessica is nevermarried and her salary is ≤50K is 2/5 * 4/5 = 8/25 i.e, 32% likelihood or the probability that Jessica is divorced and her salary is >50 will be 1/5 * 1/5 =1/25 i.e., 4% likelihood. From this it is very clear that even if the adversary has sufficient background knowledge on Quasi-Identifier and one of the sensitive attribute, the likelihood for the adversary to infer other sensitive values will be reduced considerably. For simplicity Table XIII presents the probabilities in terms of likelihood for the adversary to guess JessicaÃÂ¢Ãâ¬ÃÅ¸s record if the sensitive attribute ‘Government’ is known (here knowing that Jessica is working in sate-Government is the background knowledge).

The probability breach is defined with the following background knowledge. This is termed as knowledge breach probability (KBP)

Case 1: If the adversary knows only the knowledge on QI group of intentional individual (here Jessica) then the knowledge breach probability is defined as follows.

Definition 3: Given QIT and STÃÂ¢Ãâ¬ÃÅ¸s with n attributes, the group id GIDl, the sensitive attribute SA, the knowledge breach probability of an individual when the adversary knows the target individual group id as follows

adversary knows only the QI group is equal to the product of knowledge breach probability of a tuple when the adversary knows only sensitive value and the knowledge breach probability of the tuple when the adversary knows both QI group and Sensitive value of the tuple. Formally,

as the input to the algorithm as shown in the Figure 2. Initially it is assumed that QIT and ST are empty. It is also assumed that the QIT will contain only quasi identifiers tuples and sensitive table contains only sensitive value data. No other data will be present in both the tables. The algorithm has 2 phases. The first phase is to assign group id to the anonymized data set (line 2-7).

Initially each tuple in the QI group iscompared with the next tuple in the group and if they match a common group id is assigned to both the tuples. This process is repeated until all thetuples in the anonymized data that forms QI partitions will have their corresponding group-ids. Once the anonymized table ADT* is produced the second phase begins. In this phase the anonymized data is divided into quasi-identifier table (QIT) and sensitive tables (STÃÂ¢Ãâ¬ÃÅ¸s).

For each tuple in the tuples are inserted into QIT which contains only the QI group associated with its corresponding group id having the form (QIDSetj, i) (step 10) and when coming to the sensitive tables, since l-diversity principle is applied on the sensitive attributes, the sensitive values are much diversified keeping l-to be large so as to decrease the adversaries inferring attack range. For each QI group the corresponding sensitive values count is calculated and then inserted into sensitive table in the form of (QID, sensitive value (v), count) (step 11-14).

Finally when the QIT, STÃÂ¢Ãâ¬ÃÅ¸s are formed and if the adversary wants to know about any particular individual he/she can reconstruct a tuple by merging QI and sensitive tables using simple natural join. With the proposed algorithm as shown in Algorithm 2 by breaking the anonymized table into QI and sensitive tables the probability breach of the adversary will decrease drastically.

COMPLEXITY

EXPERIMENTATION

Experimentations are conducted in two phases. In the first phase, the scalability is measured for constructing the Bayesian networks. In the second phase experiments were conducted for Break-Merge technique for scalability performances on different real world and synthetic datasets. A comparison analysis is also performed with [39]

Experimentation Setup

1) Phase I

Experiments were conducted on Adult dataset available at UCI machine Learning Repository (UCI). The dataset consists of 14 attributes and 48,842 tuples. The final dataset consists of 30,162 tuples after removing the missing values “?”. Out of 14 attributes age, Gender and zipcode were treated as quasiidentifiers and Government, marital status and salary were treated as sensitive attributes.

Weka tool was used to construct the Bayesian net [45]. Experimentswere conducted on both single sensitive attributes and multiple sensitive attributes. The dataset is replicated such that each equivalence size is 1000. For the construction of Bayes network it took less than 1.2 seconds and nearly 1.7 seconds for single and multiple sensitive attributes respectively (Figure 5 and Figure 6) for a dataset with 1,00,000 records.

2) Phase II

The experimental setup for break-merge technique is same as in phase I. The Break-Merge algorithm was implemented in Java 1.7 using Netbeans 7.0 IDE. A comparison study was made with [39].

Weijia et alconstruct 1-itemset to determine the presence of associations between quasi-identifier and sensitive attributes. They construct the rules until they reach the certain threshold. Once the rules were obtained the corresponding sensitive attributes were generalized accordingly.

However, our approach does not generalize the sensitive attributes instead break the table in to QIT and STÃÂ¢Ãâ¬ÃÅ¸s. This increases the utility of the dataset for deriving useful patterns. Figure 7 shows that if the probability of the attacker increases above 50% the sensitive attributes are generalized to a high level and there after remains in that high level in Q-S association but our approach do not generalizing the sensitive attributes the information loss with respect to the sensitive attributes will be zero.

Different performance measures for breaking the tables on real time adult dataset and synthetic dataset that are generated from the adult dataset were done. It took less than 5 seconds to break the dataset that contains 1,00,000 records. The remaining datasets disease and salary also took less than 5 seconds as shown in Figure 9 and Figure 10 respectively.

7. CONCLUSIONS

In this paper, solution for protecting the inferences of sensitive data in an anonymized dataset is discussed. The publisher releases the anonymized data without any verification of vulnerable nature of the adversary on the anonymized dataset.

An approach to show how dependencies still prevail in an anonymized dataset using belief networks was discussed.

For this purpose plausible risk levels were defined to show the risk levels in an anonymized dataset. For releasing the data when dependencies exists an approach Break-Merge has been proposed. This approach publishes the data by reducing the attackers inferring nature drastically. This work is of its first kind in the literature addressing the verification techniques for an anonymized datasets. The experimental evaluations show that approaches are practically feasible and scalable.

References

L. Sweeny, k-Anonymity. A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge based systems, 10(5):557-570, 2002.
N. Li, T. Li and S. Venkatasubramanian, t-closeness. Privacy beyond k-anonymity and l-diversity. In Proc.of ICDE, pages 106-115, 2007.
P. Samarati. Protecting Respondents identities in microdata release, TKDE, 13(6):1010-1027, 2001.
P. Samarati and L. Sweeney. Generalizing Data to Provide Anonymity When Disclosing Information. In Proc. of PODS, 1998.
Meyerson and R. Williams. On the complexity of Optimalk-anonymity. In Proc. of PODS, 2004.
Machanavajjhala, J. Gehrke and D. Kifer: l-diversity. Privacy beyond k- anonymity. In Proc. Of ICDE, 2006.
N. R. Adam and J. C. Wortmann. Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4):515-556, 1989.
R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. of ICDE, pages 217-228, 2005.
C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In Proc. of ICDE, pages 205-216, 2005.
V. Iyengar: Transforming data to satisfy privacy constraints. In Proc. of SIGKDD, pages 279-288, 2002.
R. C.-W.Wong, J. Li, A. W.-C. Fu, and K. Wang. (α,k)- anonymity: an enhanced k-anonymity model for privacy preserving data publishing.In SIGKDD, pages754-759, 2006.
J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A.W.-C. Fu. Utility-based anonymization using local recoding. In SIGKDD, pages 785-790, 2006.
G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis: Fast data anonymization with low information loss. In VLDB, pages 758-769, 2007.
K. LeFevre, D. J. DeWitt, and R. Ramakrishnan: Mondrian multidimensional k-anonymity. In ICDE, 2006.
K. LeFevre, D. J. DeWitt, and R. Ramakrishnan: Workload-aware anonymization. In Proc.of KDD, pages 277-286, 2006.
J. Kim:Method for limiting disclosure of microdata based on random noise and transformation. In SurveyResearch Methods of the American Statistical Association, pages 328-387, 2001.
Charu C. Aggarwal: On k-Anonymity and the Curse of Dimensionality. In Proc. of VLDB, 2005.
L. Kristen, D. David and R. Ramakrishnan:Mondrian multidimensional k-Anonymity. In Proc. of ICDE, 2005.
P. Samarati, L. Sweeney: Generalizing data to provide anonymity when disclosing Information. In Proc. of PODS, 1998.
Hundepool and L. Willenborg: μ-argus: Software for statistical disclosure control. In International Seminaron StatisticalConfidentiality, 1996.
X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. InProc. of SIGMOD, pages 689-700, 2007.
L. Sweeney. Datafly: A system for providing anonymity in medical data. In Proc. of DBSec, pages 356-381, 1997.
J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Y.Halpern: Worst-case background knowledge for privacy preserving data publishing. In Proc. of ICDE, pages 126-135, 2007.
SAS Institute. SAS/STAT 9.2 UserÃÂ¢Ãâ¬ÃÅ¸s Guide. SAS Publishing, 1 edition, 2008.
SPSS Inc. SPSS 16.0 Base UserÃÂ¢Ãâ¬ÃÅ¸s Guide. SPSS Inc., 2 edition, 2007.
Stata Corporation. Stata UserÃÂ¢Ãâ¬ÃÅ¸s Guide Release 8.0.Stata Press, 1 edition, 2003.
Q. Zhang, N. Koudas, D. Srivastava, and T. Yu:Aggregate query answering on anonymized tables. In Proc. of ICDE, pages 116-125, 2007.
X. Xiao and Y. Tao. Personalized Privacy. In Proc. of SIGMOD, 2006.
X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In Proc. of VLDB, pages 139-150, 2006.
Hui W Wendy: Ambiguity: Hide the Presence of Individuals and Their Privacy with Low Information Loss, In Proc. of CSI, India, 2008.
N. Dalvi, G. Miklau, D.Suciu: Asymptotic conditional probabilities for conjunctive query. In Proc. of ICDT, pages 289-305, 2005.
Deustch, Y. Papakonstantinou: Privacy in database publishing. In ICDT, pages 230-245, 2005.
Machanavajjhala, J. Gehrke: On the efficiency of checking perfect privacy. InProcs. of PODS, pages 163- 172, 2006.
Barak, K. Chaudhuri, C. Dwork, S.Kale, F.McSherry, K.Talwar,. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proc. of SIGACT-SIGMOD-SIGART, Symposium on Principles of Database Systems, pages 273-282, 2007.
Dwork. Differential privacy. In Proc. of ICALP, pages 1- 12, 2006.
Dwork, F. McSherry, K. Nissim, A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of TCC, pages 265-284, 2006.
K. Nissim, S. Raskhodnikova, A. Smith. Smooth sensitivity and sampling in private data analysis. In Proc. of STOC, pages 75-84, 2007.
Y.Weijia and H. Shangten. k-Anonymization without Q-S Associations. In Proc. of WAMI, pages 753-764, 2007.
UCI Repository of Machine Learning databases, http://www.ics.uci.edu/∼mlearn/MLRepository.html.
R. E. Neapolitan, Learning Bayesian Networks, Pearson Education, 2004.
Pearl, Probabilistic Reasoning in intelligent Systems: Networks of Plausible Inference, Morgan KaufmannPublishers, San Franciso, CA, USA, 1988.
R. R. Bouckaert, Bayesian Network Classifiers in Wekafor Version 3-5-5, The University of Waikato, 2007.
T. Verma and J. Pearl.An algorithm for deciding if a set of observed independencies has a causalexplanation. In Proc. of Uncertainty in Artificial Intelligence, pages 323-330, 1992
N. SandeepVarma, V. ValliKumari. BM (Break-Merge): An Elegant Approach for Privacy Preserving Data Publishing, In Proc. of PASSAT, IEEE, pages 1202-1207, 2011.
N. SandeepVarma, V. ValliKumari. Detecting Dependencies in an anonymized dataset, In Proc. Of ICACCI, ACM, pages 82-89, 2012.