Security and privacy methods are used to protect data values. Private data values are secured with confidentiality and integrity methods, privacy models hide individual identities in public data releases, and sensitive attributes are protected using anonymity methods. Discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category, and antidiscrimination acts are designed to prevent discrimination on the basis of a number of attributes in various settings. Public data collections are used to train association/classification rules in view of making automated decisions, so data mining can be both a source of discrimination and a means for discovering discrimination. Discrimination is divided into two types: direct and indirect. Direct discrimination occurs when decisions are made based on sensitive attributes; indirect discrimination occurs when decisions are made based on nonsensitive attributes that are strongly correlated with biased sensitive ones. Discrimination discovery and prevention address these anti-discrimination requirements, and direct and indirect discrimination prevention can be applied individually or simultaneously. The data values are cleaned to remove direct and/or indirect discriminatory decision rules, and data transformation techniques prepare the data values for discrimination prevention. Rule protection and rule generalization algorithms, together with the direct and indirect discrimination prevention algorithm, are used to prevent discrimination. In this work, the discrimination prevention model is integrated with a differential privacy scheme to achieve high privacy. Dynamic policy selection based discrimination prevention is adopted to generalize the system for all regions. The data transformation technique is improved to increase the utility rate, and the discrimination removal process is improved with rule hiding techniques.
                
  
Keywords

Discrimination, differential privacy, policy selection, rule protection, rule generalization
  
I. INTRODUCTION
  
Data mining and knowledge discovery in databases are two new research areas that investigate the automatic extraction of previously unknown patterns from large collections of data. Recent developments in data collection, data dissemination and related technologies have inaugurated a new era of research in which existing data mining algorithms should be reconsidered from a different point of view, that of privacy preservation. It is well documented that this boundless explosion of new information through the Internet and other media has reached a point where threats against privacy are common on a daily basis and deserve serious thinking.
  
Privacy preserving data mining is a novel research direction in data mining and statistical databases, where data mining algorithms are analyzed for the side effects they incur on data privacy. The main consideration in privacy preserving data mining is twofold. First, sensitive raw data like identifiers, gender, religion, addresses and the like should be changed or removed from the original database, so that the recipient of the data is not able to compromise another person's privacy. Second, sensitive knowledge which can be mined from a database by using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy. The main objective in privacy preserving data mining is to develop algorithms for changing the original data in some way, so that the private data and private knowledge remain private even after the mining process. The problem that arises when confidential information can be derived from released data by unauthorized users is commonly called the “database inference” problem.
  
II. RELATED WORK
  
Despite the wide deployment of information systems based on data mining technology in decision making, the issue of antidiscrimination in data mining did not receive much attention until 2008 [9]. Some proposals are oriented to the discovery and measurement of discrimination; others deal with the prevention of discrimination.
  
The discovery of discriminatory decisions was first proposed by Pedreschi et al. [5]. The approach is based on mining classification rules (the inductive part) and reasoning on them (the deductive part) on the basis of quantitative measures of discrimination that formalize legal definitions of discrimination. For instance, the US Equal Pay Act states that: “a selection rate for any race, gender, or specific group which is less than four-fifths of the rate for the group with the highest rate will generally be regarded as evidence of adverse impact.” This approach has been extended to encompass statistical significance of the extracted patterns of discrimination in [3] and to reason about affirmative action and favoritism [4]. Moreover, it has been implemented as an Oracle-based tool in [6]. Current discrimination discovery methods consider each rule individually for measuring discrimination, without considering other rules or the relation between them. However, in this paper we also take into account the relation between rules for discrimination discovery, based on the existence or nonexistence of discriminatory attributes.
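To make the four-fifths rule concrete, the following sketch (with invented numbers; the function name and data are ours, not taken from the cited tool) computes per-group selection rates and flags groups whose rate falls below four-fifths of the highest rate:

def adverse_impact(selected, applicants, threshold=0.8):
    # Selection rate per group, and the ratio of each rate to the highest.
    rates = {g: selected[g] / applicants[g] for g in applicants}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items() if r / best < threshold}

# Invented numbers: 48/100 of group A selected vs. 24/80 of group B.
flagged = adverse_impact(selected={"A": 48, "B": 24},
                         applicants={"A": 100, "B": 80})
print(flagged)  # {'B': 0.625}: group B's rate is below four-fifths of A's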
  
Discrimination prevention, the other major antidiscrimination aim in data mining, consists of inducing patterns that do not lead to discriminatory decisions even if the original training data sets are biased. Three approaches are conceivable:
  
A. Preprocessing
  
Transform the original data in such a way that the discriminatory biases contained in the original data are completely removed, so that no unfair decision rule can be mined from the transformed data, and then apply any of the standard data mining algorithms. The preprocessing approaches of data transformation and hierarchy-based generalization can be adapted from the privacy preservation literature. Along this line, [7], [8] perform a controlled distortion of the training data from which a classifier is learned by making minimally intrusive modifications leading to an unbiased data set, as sketched below. The preprocessing approach is useful for applications in which a data set should be published and/or in which data mining needs to be performed also by external parties.
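As a rough illustration of such a controlled distortion (a simplification in the spirit of [7], [8], not their exact method; all names are ours), one can flip the class labels of a minimal number of protected-group records so that the group's positive-decision rate matches that of the remaining records:

def balance_labels(records, protected_attr, protected_val, class_attr,
                   positive="yes", negative="no"):
    # Match the protected group's positive-decision rate to the rest of
    # the data by flipping the labels of as few records as possible.
    prot = [r for r in records if r[protected_attr] == protected_val]
    rest = [r for r in records if r[protected_attr] != protected_val]
    target = round(len(prot) * sum(r[class_attr] == positive for r in rest) / len(rest))
    need = target - sum(r[class_attr] == positive for r in prot)
    for r in prot:
        if need <= 0:
            break  # already at (or above) the target rate
        if r[class_attr] == negative:
            r[class_attr] = positive
            need -= 1
    return records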
  
B. In-processing
  
Change the data mining algorithms in such a way that the resulting models do not contain unfair decision rules. For example, an alternative approach to cleaning the discrimination from the original data set is proposed in [2], whereby the nondiscriminatory constraint is embedded into a decision tree learner by changing its splitting criterion and pruning strategy through a novel leaf relabeling approach. However, in-processing discrimination prevention methods must rely on new special-purpose data mining algorithms; standard data mining algorithms cannot be used.
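As a sketch of the in-processing idea (a simplification, not the exact criterion of [2]; the scoring function is our own), a splitting criterion can reward separation of the class while penalizing separation of the sensitive attribute:

import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(column, target):
    # Information gain of splitting on `column` with respect to `target`.
    n = len(target)
    gain = entropy(target)
    for v in set(column):
        subset = [t for c, t in zip(column, target) if c == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def fair_split_score(column, labels, sensitive):
    # Reward class separation, penalize sensitive-attribute separation.
    return info_gain(column, labels) - info_gain(column, sensitive)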
  
C. Post-processing
  
Modify the resulting data mining models, instead of cleaning the original data set or changing the data mining algorithms. For example, in [3], a confidence-altering approach is proposed for classification rules inferred by the CPAR algorithm. The post-processing approach does not allow the data set to be released: only the modified data mining models can be released (knowledge publishing), hence data mining can be performed by the data owner only. One might think of a straightforward preprocessing approach consisting of just removing the discriminatory attributes from the data set.
  
Although this would solve the direct discrimination problem, it would cause much information loss and, in general, it would not solve indirect discrimination. As stated in [9], there may be other attributes (e.g., Zip) that are highly correlated with the sensitive ones (e.g., Race) and allow inferring discriminatory rules. Hence, there are two important challenges regarding discrimination prevention: one challenge is to consider both direct and indirect discrimination instead of only direct discrimination; the other challenge is to find a good tradeoff between discrimination removal and the quality of the resulting training data sets and data mining models.
  
Although some methods have already been proposed for each of the above-mentioned approaches (preprocessing, in-processing, post-processing), discrimination prevention remains a largely unexplored research avenue. In this paper, we concentrate on discrimination prevention based on preprocessing, because the preprocessing approach seems the most flexible one: it does not require changing the standard data mining algorithms, unlike the in-processing approach, and it allows data releasing (rather than just knowledge publishing), unlike the post-processing approach.
  
III. DISCRIMINATION PREVENTION SCHEMES
  
In sociology, discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category. It involves denying members of one group opportunities that are available to other groups. There is a list of antidiscrimination acts, which are laws designed to prevent discrimination on the basis of a number of attributes (e.g., race, religion, gender, nationality, disability, marital status, and age) in various settings (e.g., employment, access to public services, credit and finance, etc.).
  
Services in the information society allow for the automatic and routine collection of large amounts of data. Those data are often used to train association/classification rules in view of making automated decisions, like loan granting/denial, insurance premium computation, personnel selection, etc. At first sight, automating decisions may give a sense of fairness: classification rules do not guide themselves by personal preferences. However, at a closer look, one realizes that classification rules are actually learned by the system (e.g., loan acceptance) from the training data. If the training data are inherently biased for or against a particular community (e.g., black people), the learned model may show discriminatorily prejudiced behavior. In other words, the system may infer that just being black is a legitimate reason for loan rejection. Discovering such potential biases and removing them from the training data without harming their decision-making utility is therefore highly complex. One must prevent data mining from becoming itself a source of discrimination, due to data mining tasks generating discriminatory models from biased data sets as part of automated decision making. In [9], it is demonstrated that data mining can be both a source of discrimination and a means for discovering discrimination.
  
Discrimination can be either direct or indirect (also called systematic) [1]. Direct discrimination consists of rules or procedures that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Indirect discrimination consists of rules or procedures that, while not explicitly mentioning discriminatory attributes, intentionally or unintentionally could generate discriminatory decisions. Redlining by financial institutions (refusing to grant mortgages or insurance in urban areas they consider deteriorating) is an archetypal example of indirect discrimination, although certainly not the only one. With a slight abuse of language for the sake of compactness, in this paper indirect discrimination will also be referred to as redlining, and rules causing indirect discrimination will be called redlining rules [9]. Indirect discrimination could happen because of the availability of some background knowledge (rules), for example, that a certain zip code corresponds to a deteriorating area or an area with mostly black population. The background knowledge might be accessible from publicly available data (e.g., census data) or might be obtained from the original data set itself because of the existence of nondiscriminatory attributes that are highly correlated with the sensitive ones in the original data set. Discrimination prevention methods based on preprocessing published so far [7], [8] present some limitations, which we next highlight:
  
• They attempt to find discrimination in the original data only for one discriminatory item and based on a single measure. This approach cannot ensure that the transformed data set is really discrimination free, because it is known that discriminatory behaviors can often be hidden behind several discriminatory items, and even behind combinations of them.
  
• They only consider direct discrimination.
  
• They do not provide any measure to evaluate how much discrimination has been removed and how much information loss has occurred.
  
IV. DISCRIMINATION PREVENTION ISSUES
  
Automated data acquisition and data mining techniques such as classification rule mining are used to make automated decisions. Discrimination is divided into two types: direct and indirect. Direct discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination occurs when decisions are made based on nonsensitive attributes that are strongly correlated with biased sensitive ones. Discrimination discovery and prevention address anti-discrimination requirements. Direct and indirect discrimination prevention can be applied individually or both at the same time. The data values are cleaned to remove direct and/or indirect discriminatory decision rules. Data transformation techniques are applied to prepare the data values for discrimination prevention. The rule protection and rule generalization algorithms and the direct and indirect discrimination prevention algorithm are used to prevent discrimination. The following drawbacks are identified in the existing system:
  
• Static discrimination policy based scheme

• Limited utility ratio

• Low privacy assurance

• Privacy association is not analyzed
  
V. DIRECT AND INDIRECT DISCRIMINATION PREVENTION ALGORITHM
  
Algorithm 1 details the proposed data transformation method for simultaneous direct and indirect discrimination prevention. The algorithm starts with the redlining rules. From each redlining rule (r : X → C), more than one indirect α-discriminatory rule (r′ : A, B → C) might be generated, for two reasons: 1) there are different ways to group the items in X into a context item set B and a nondiscriminatory item set D correlated to some discriminatory item set A; and 2) there can be more than one item in DIs. Hence, as shown in Algorithm 1 (Step 5), given a redlining rule r, proper data transformation should be conducted for all indirect α-discriminatory rules r′ : (A ⊆ DIs), (B ⊆ X) → C ensuing from r.
  
Algorithm 1. Direct and Indirect Discrimination Prevention
  
  
If some rules can be extracted from DB as both direct and indirect α-discriminatory rules, there is overlap between MR and RR; in that case, data transformation is performed until both the direct and the indirect rule protection requirements are satisfied (Steps 13-18). This is possible because the same data transformation method (Method 2, consisting of changing the class item) can provide both DRP and IRP. However, if there is no overlap between MR and RR, data transformation is performed according to Method 2 for IRP until the indirect discrimination prevention requirement is satisfied (Steps 19-23) for each indirect α-discriminatory rule ensuing from each redlining rule in RR; this can be done without any negative impact on direct discrimination prevention. Then, for each direct α-discriminatory rule r′ ∈ MR\RR (that is, a rule only directly extracted from DB), data transformation for satisfying the direct discrimination prevention requirement is performed (Steps 26-33), based on Method 2 for DRP; this can be done without any negative impact on indirect discrimination prevention. Performing rule protection or generalization for each rule in MR has no adverse effect on the protection of other rules (i.e., rule protection at Step i + x to make r′ protective cannot turn a rule r made protective at Step i into a discriminatory one), for two reasons: the kind of data transformation for each rule is the same (change the discriminatory item set or the class item of records), and there are no two α-discriminatory rules r and r′ in MR such that r = r′.
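The following is a minimal sketch of the class-item change underlying Method 2 (heavily simplified; the published method selects the records to modify using impact-minimization criteria omitted here, and all names are ours): the class item of records covered by the rule premise is perturbed until the rule's confidence drops below the α-derived threshold.

def confidence(records, premise, class_attr, class_val):
    # conf(premise -> class_val) over the records covered by the premise.
    covered = [r for r in records if all(r[a] == v for a, v in premise.items())]
    if not covered:
        return 0.0
    return sum(r[class_attr] == class_val for r in covered) / len(covered)

def protect_rule(records, premise, class_attr, bad_class, alt_class, threshold):
    # Flip the class item of covered records one at a time until the rule
    # premise -> bad_class is no longer alpha-discriminatory (its
    # confidence falls below the alpha-derived threshold).
    for r in records:
        if confidence(records, premise, class_attr, bad_class) < threshold:
            break
        if all(r[a] == v for a, v in premise.items()) and r[class_attr] == bad_class:
            r[class_attr] = alt_class
    return records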
  
VI. PROPOSED WORK
  
The proposed discrimination prevention model is integrated with a differential privacy scheme to achieve high privacy. Dynamic policy selection based discrimination prevention is adopted to generalize the system for all regions. The data transformation technique is improved to increase the utility rate. The discrimination removal process is improved with rule hiding techniques that hide sensitive rules.
  
The discrimination prevention system is designed to protect the decisions that are derived from the rule mining process. The system is divided into five major modules: data cleaning, privacy preservation, rule mining, rule hiding and discrimination prevention.
  
6.1 Differential Privacy to Data
  
A. Formal Definition
  
K gives ε-differential privacy if, for all databases DB and DB′ differing in a single element, and for all S ⊆ Range(K):

Pr[K(DB) ∈ S] ≤ e^ε · Pr[K(DB′) ∈ S]
  
B. How to Achieve Differential Privacy
  
f : DB → R^d

K(f, DB) = f(DB) + [Noise]^d

E.g., Count(P, DB) = number of rows in DB with property P
  
C. How does it work?
  
Δf = max over neighboring DB, DB-Me of |f(DB) − f(DB-Me)|

Theorem: To achieve ε-differential privacy, use scaled symmetric noise Lap(R) with R = Δf/ε.
  
D. Example
  
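For instance, a differentially private count can be released by adding Laplace noise with scale R = Δf/ε; since a count query has sensitivity Δf = 1, the scale is 1/ε. The sketch below is ours (a standard construction, with illustrative data):

import random

def laplace_noise(scale: float) -> float:
    # Lap(scale) sampled as the difference of two i.i.d. exponentials.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(db, has_property, epsilon: float) -> float:
    # A count query has sensitivity Δf = 1 (adding or removing one row
    # changes the count by at most 1), so the noise scale is R = 1/ε.
    true_count = sum(1 for row in db if has_property(row))
    return true_count + laplace_noise(scale=1.0 / epsilon)

# Illustrative data: count rows with income > 50000 under ε = 0.5.
db = [{"income": 60000}, {"income": 42000}, {"income": 75000}]
print(noisy_count(db, lambda r: r["income"] > 50000, epsilon=0.5))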
  
VII. OVERALL FUNCTIONALITIES OF PROPOSED MODEL
  
A. Data Cleaning Process
  
Data population and missing-value assignment operations are carried out in the data cleaning process. Textual data values are transferred into the Oracle database. Incomplete transactions are updated with alternate values. An aggregation-based data substitution method is used for the data assignment process.
  
B. Privacy Preservation
  
Privacy preservation is applied to protect sensitive attributes. The differential privacy technique is applied to sensitive attributes: noise is added to the sensitive attribute values. A data transformation process then prepares the data for the rule mining process.
  
C. Rule Mining
  
The rule mining process is performed to extract frequent patterns. Candidate sets are prepared using attribute names and values. Support and confidence values are estimated over the item sets, and frequent patterns are identified using minimum support and confidence thresholds, as sketched below.
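The standard support and confidence computations can be sketched as follows (illustrative transactions; the helper names are ours):

def support(transactions, itemset):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # conf(A -> B) = supp(A ∪ B) / supp(A)
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Illustrative transactions encoded as attribute=value items.
ts = [frozenset({"city=NY", "loan=no"}),
      frozenset({"city=NY", "loan=no"}),
      frozenset({"city=LA", "loan=yes"})]
a, c = frozenset({"city=NY"}), frozenset({"loan=no"})
print(support(ts, a | c), confidence(ts, a, c))  # 0.666..., 1.0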
  
D. Rule Hiding
  
The rule hiding method is applied to protect sensitive rules. Rules derived from sensitive attributes are not released directly; instead, rules are embedded within the nearest rule intervals.
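One simple way to realize "rules derived from sensitive attributes are not released directly" (our illustration only; the interval-embedding step is described only abstractly here) is to filter out, before release, any rule whose antecedent mentions a sensitive attribute:

SENSITIVE = {"race", "gender", "religion"}  # illustrative attribute names

def releasable(rules):
    # Keep only rules whose antecedent mentions no sensitive attribute.
    # Each rule is (antecedent_items, consequent), items like 'attr=value'.
    return [r for r in rules
            if not any(item.split("=")[0] in SENSITIVE for item in r[0])]

rules = [({"race=black", "city=NY"}, "loan=no"),
         ({"city=NY"}, "loan=no")]
print(releasable(rules))  # only the second rule survives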
  
E. Discrimination Prevention
  
The discrimination prevention process is designed to protect decisions. The rule generalization and rule protection algorithms are enhanced for the dynamic policy model, and the direct and indirect discrimination prevention algorithm is also tuned for the dynamic policy scheme. Discrimination is prevented with reference to sensitive and nonsensitive attributes.
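What dynamic policy selection could look like in code (entirely illustrative: the region names, protected-attribute lists, and α values below are hypothetical, since the policy model is described only at a high level):

# Hypothetical per-region discrimination policies: which attributes are
# treated as protected and which alpha threshold applies. Values invented.
POLICIES = {
    "EU": {"protected": {"race", "religion", "gender", "age"}, "alpha": 1.2},
    "US": {"protected": {"race", "gender", "age"}, "alpha": 1.25},
}
DEFAULT = {"protected": {"race", "gender"}, "alpha": 1.25}

def select_policy(region: str) -> dict:
    # Pick the discrimination policy for a region, falling back to a default.
    return POLICIES.get(region, DEFAULT)

policy = select_policy("EU")
print(policy["protected"], policy["alpha"])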
  
VIII. CONCLUSION
  
Data mining techniques are applied to extract hidden knowledge from databases. Discriminatory decisions are discovered and prevented with reference to the attributes involved. The direct and indirect discrimination prevention scheme is used to protect the decision rules. The discrimination prevention scheme is enhanced with a dynamic policy selection model and differential privacy mechanisms. The system increases the data utility rate, and the policy selection based discrimination prevention model can be applied to all regions. The privacy preservation rate is improved by the system, and rule privacy is optimized with the rule generalization mechanism.
  