

A Genetic Programming Approach for Record Deduplication

L. Chitra Devi1, S.M. Hansa1, Dr. G.N.K. Suresh Babu2
  1. Research Scholars, Dept. of Computer Applications, GKM College of Engineering and Technology, Chennai, Tamil Nadu, India
  2. Professor and Head, Dept. of Computer Applications, GKM College of Engineering and Technology, Chennai, Tamil Nadu, India


Abstract

In this article, we discuss how genetic programming (GP) can be used for record deduplication. Systems that rely on the integrity of their data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicate, quasi-replica, or near-duplicate entries in their repositories. Because of this, private and government organizations have invested considerable effort in developing effective methods for removing replicas from large data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process the data. In this work, we extend the results of a GP-based approach we proposed for record deduplication by performing a comprehensive set of experiments regarding its parameter setup. Our experiments show that some parameter choices can improve the results by up to 30%. The obtained results can thus be used as guidelines for setting up the parameters of our GP-based approach to record deduplication.

Keywords

 
Genetic Programming, DBMS, Duplication, Optimisation
 

INTRODUCTION

 
Organizations typically become aware of factual disparities or inconsistencies while integrating data from diverse sources to build a data warehouse. Such problems belong to the category called data heterogeneity. As the database grows, the problem intensifies, given the large amount of computational resources required to examine and remove duplicate records. Duplicates can arise in numerous scenarios, for instance when a large database is updated by an external source and registry numbers are not available or are in error.

File systems often contain superfluous copies of information: identical files or sub-file regions, perhaps stored on a single host, on a shared storage cluster, or backed up to secondary storage. Deduplicating storage systems take advantage of this redundancy to decrease the space needed to store the file systems. Deduplication can work at either the sub-file or the whole-file level, and data deduplication policies can be classified according to the basic data units they handle. In this context, two main data deduplication strategies can be defined: file-level deduplication, in which only a single copy of each file is stored (two or more files are considered identical if they have the same hash value; this is a very popular feature built into many products), and block-level deduplication, which breaks files into blocks and stores only a single copy of each block (the system may use either fixed-sized blocks or variable-sized chunks).

The architecture of a deduplication solution follows one of two basic approaches. In the target-based approach, deduplication is managed by the target data storage device or service, while the client is unaware of any deduplication that might occur. Source-based deduplication is performed on the data at the client side before it is transferred; in particular, the client software communicates with the backup server to check for the presence of files or blocks. Two well-known source deduplication methods, source local chunk-level deduplication and source global chunk-level deduplication, have been proposed in the past to address the above-mentioned problem by removing redundant data chunks before transferring them to the remote backup destination. Studies show that, due to out-of-memory fingerprint accesses to massive backed-up data, chunk-level deduplication has an inherent latency and suffers a throughput problem that affects backup performance. In source global chunk-level deduplication, the overhead of massive disk accesses slows down the deduplication process, which results in an increased backup window. In source local chunk-level deduplication, the overhead is reduced by searching for duplicate chunks only at the same client. This reduced overhead, however, limits the compression ratio, which also increases the backup window owing to the increased data transmission cost. Therefore, there is a pressing need to balance deduplication efficiency against deduplication overhead in order to maintain a shorter backup window than existing solutions.

Several other methods with varying efficiency and accuracy have been proposed for deduplication, including methods based on genetic algorithms, semantic methods, and cloud services. The problems mentioned above have been addressed by deduplication methods modeled using genetic algorithms.
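To make the two strategies above concrete, the following is a minimal sketch (not taken from any of the cited systems) of file-level and fixed-size block-level deduplication based on SHA-256 fingerprints; the helper names and the 4 KB block size are illustrative assumptions.

```python
import hashlib
from pathlib import Path

def file_level_dedup(paths):
    """Keep one representative per identical file (same SHA-256 digest)."""
    store = {}                          # digest -> first path seen with that content
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        store.setdefault(digest, path)
    return store

def block_level_dedup(paths, block_size=4096):
    """Store only one copy of each fixed-size block across all files (illustrative)."""
    blocks = {}                         # digest -> block bytes
    recipes = {}                        # path -> list of block digests for that file
    for path in paths:
        data = Path(path).read_bytes()
        recipe = []
        for i in range(0, len(data), block_size):
            chunk = data[i:i + block_size]
            digest = hashlib.sha256(chunk).hexdigest()
            blocks.setdefault(digest, chunk)   # duplicate blocks are stored once
            recipe.append(digest)
        recipes[path] = recipe
    return blocks, recipes
```

In the block-level variant, the per-file list of block digests acts as a recipe from which the original file can be reconstructed, while each distinct block is stored only once.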
This research investigates optimization techniques that offer better performance than these existing methods.
 

RELATED WORK

 
Record deduplication is a research topic that has attracted a lot of attention in the database field and related areas. As already mentioned, the existence of replicas in data repositories can lead to inconsistencies that may severely affect several types of services, causing damage to companies and governmental institutions. In an attempt to resolve these inconsistencies, some works have proposed similarity functions capable of combining information contained in the data repositories in order to identify whether or not a pair of records constitutes a replica. Elmagarmid et al. [2007] classify these approaches into two categories, exemplified next: ad-hoc methods, which usually depend on some knowledge of a particular domain or on specific distance metrics (for example, for strings of characters), and training-based methods, which depend on some type of training, supervised or semi-supervised, for the identification of replicas, such as probabilistic and machine learning approaches. In the following, we briefly describe some relevant work that illustrates these two categories of methods.
 
A.Ad-hoc Methods
 
Chaudhuri et al. [2003] proposed a linkage algorithm that receives a record from an archive or data repository as input and searches a reference archive for another record that "matches" the former, according to some predefined similarity function. The matched records are selected according to a user-defined minimum similarity threshold, allowing more than one candidate record to be returned as a possible answer. In this case, the user is responsible for choosing the duplicated record most similar to the original one.
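The snippet below is a small illustrative sketch of this kind of threshold-based candidate selection, not the actual algorithm of Chaudhuri et al.; the similarity function is assumed to be supplied by the caller, and every reference record whose score reaches the user-defined minimum threshold is returned, best first, so that the user can pick the most similar one.

```python
def candidate_matches(record, reference_table, similarity, min_threshold=0.8):
    """Return all reference records scoring at least min_threshold, best first."""
    scored = [(similarity(record, ref), ref) for ref in reference_table]
    scored = [(score, ref) for score, ref in scored if score >= min_threshold]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```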
 
An information retrieval style method is used in WHIRL [Cohen 2000], a database management system that supports similarity joins between relations that include textual attributes. In this system, textual attributes are represented according to the vector space model [Salton 1989] and term weights are calculated using the well-known TF-IDF scheme in order to determine when tuples from two relations should be joined based on the similarity of their attribute values. Carvalho and da Silva [2003] also use the vector space model for computing similarity among fields from different data sources and evaluate four distinct strategies to assign weights and combine the similarity scores of each field. As a result of their experiments, they found that using evidence extracted from individual attributes improves the result of the replica identification task.
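As a rough illustration of the vector space model with TF-IDF weights applied to textual attribute values (in the spirit of WHIRL-style similarity joins, but not the WHIRL implementation itself), consider the following self-contained sketch; the sample strings are made up.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF vectors for a small collection of attribute values."""
    docs = [Counter(text.lower().split()) for text in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)          # document frequency
    idf = {term: math.log(n / df[term]) for term in df}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# e.g., deciding whether two textual attribute values should be joined
vecs = tfidf_vectors(["acm digital library", "the acm digital library", "ieee xplore"])
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Attribute values with many shared, discriminative terms receive a high cosine score and are therefore good candidates for a similarity join.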
 
B. Training-based Methods
 
Since they are more related to the work presented in this article, we cover training-based methods in more detail in this section. Newcombe et al. [1959] were the first to address the record deduplication problem as a Bayesian inference problem and proposed the first method to handle replicas automatically. However, their method was considered empirical [Elmagarmid et al. 2007], since it lacked a deeper statistical ground. After Newcombe et al.'s work, Fellegi and Sunter [1969] proposed an elaborate probabilistic method to deal with the problem of evidence combination. Their method requires the definition of two thresholds for replica identification. If the similarity value between two records is above the positive threshold, the records are considered replicas; if it is below the negative threshold, the records are considered non-replicas; and if the similarity value lies between these two thresholds, the records are classified as "possible replicas", requiring manual classification by a specialist. Fellegi and Sunter's method does not require explicit training examples but, on the other hand, it does require the data administrator to explicitly define the two aforementioned thresholds, a non-trivial task since this definition is repository-dependent. This method dominated the field for more than two decades, until new deduplication methods were developed by the statistical and machine learning communities. Febrl [Christen 2008] is one of the best known software packages that implement this method.

Pure machine learning based methods, on the other hand, need training examples whose characteristics are similar to those of the duplicated data, so that they can generalize their solutions (deduplication functions) to the entire repository. The main problem with this type of method is the cost of creating such training data, which in some cases may be infeasible. Bilenko and colleagues [Bilenko et al. 2003; Bilenko and Mooney 2003] exploit a machine learning technique to improve both the similarity functions used to compare record fields and the way pieces of evidence are combined to identify replicas. In their work, extracted evidence is encoded as feature vectors, which are then used to train an SVM (Support Vector Machine [Joachims 2002]) classifier to better combine them for the replica identification task, performed by a system called MARLIN (Multiply Adaptive Record Linkage with INduction). This system also uses several blocking strategies to improve its effectiveness when clustering similar records.

Active Atlas [Tejada et al. 2001] is an object identification system that aims at learning mapping rules for identifying similar objects (records) from distinct data sources. The process involves two steps. First, a candidate mapping generator proposes a set of possible mappings between the two sets of objects by comparing their attribute values and computing similarity scores for the proposed mappings. Then, a mapping rule learner determines which of the proposed mappings are correct by learning the appropriate mapping rules for that specific application domain; this learning step is executed using a decision tree. What mainly distinguishes the method behind this system from other machine learning based methods is that it tries to reduce the training effort by relying on user-provided information to select the most relevant training examples.

In [de Carvalho et al. 2006], the authors follow a genetic programming approach to improve the results of the Fellegi and Sunter method. They apply this machine learning technique to balance the weight vectors produced by the Fellegi and Sunter method in order to generate a better evidence combination than the simple linear summation used by that probabilistic model. Following their previous work, de Carvalho et al. [2008] proposed a new GP-based method that finds the best evidence combination for generating a deduplication function within a generic framework that is independent of any other technique. Since record deduplication is a very costly task, even for small repositories, the proposed method tries to find the best evidence combination that also maximizes performance, using only a subset of the repository as training data.
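The two-threshold decision rule of the Fellegi and Sunter method described earlier in this section can be summarized by the following sketch; the threshold values and the similarity score are illustrative placeholders, not values recommended for any specific repository.

```python
def classify_pair(similarity, positive_threshold=0.85, negative_threshold=0.40):
    """Fellegi-Sunter style decision: replica, non-replica, or manual review."""
    if similarity >= positive_threshold:
        return "replica"
    if similarity <= negative_threshold:
        return "non-replica"
    return "possible replica"      # sent to a specialist for manual classification

# e.g., classify_pair(0.9) -> 'replica'; classify_pair(0.6) -> 'possible replica'
```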
 
C.GP based Record DeDuplication Method
 
In this section we provide an overview of the GP-based record deduplication method used in this work and first introduced in [de Carvalho et al. 2008]. We start by presenting a brief introduction to the genetic programming technique followed by an explanation of how the record deduplication problem was modeled using this technique. We conclude with a more detailed description of the main steps of the method.
 

GENETIC PROGRAMMING OVERVIEW

 
Genetic Programming (GP) is one of the best known and most widely used evolutionary computing techniques and can be seen as an adaptive heuristic whose basic ideas are inspired by the natural selection process. It is a direct evolution of programs or algorithms used for inductive (supervised) learning, initially applied to optimization problems. One of the main characteristics of evolutionary techniques is their capability to deal with problems with multiple objectives, normally modeled as environmental constraints during the evolutionary process. These techniques are also known for their capability to search for solutions in large, possibly unbounded, search spaces in which the optimal solution may be unknown, producing solutions close to the optimum. What mainly differentiates GP from other evolutionary techniques (such as genetic algorithms and evolutionary systems) is that both the concepts and the solution are represented as a computer program, and the data are manipulated only by these programs. Computer programs have the flexibility needed to deal with a multitude of problems. Moreover, the program structures being evolved have no size limitations and can vary dynamically during the process according to the requirements of the problem.

The most commonly used GP representations for solutions of a specific problem are trees and graphs. Besides choosing a representation for the solution of the problem, it is also necessary to define a set of terminals and functions to perform the task of solving the problem. Terminals are inputs, constants, or nodes with no input, which form the leaves of the trees, while the set of functions comprises operators, declarations, and basic or user-defined functions used to manipulate the values of the terminals [Koza 1992]. The leaf nodes are located at the ends of the branches, while functions are located in the internal nodes. The search space is the space of all possible programs that can be constructed with the functions and terminals defined for the problem domain.

A GP algorithm evolves a population of tree-represented individuals, i.e., a set of possible solutions to the problem. In each generation of the evolutionary process, individuals are evaluated according to a specific quality metric that measures how well they solve the problem, calculated by a fitness function. The fitness value is used as a criterion to select the best individuals, which will transmit their characteristics to future generations through operations such as crossover and mutation. In this work, we use the ranking method [Koza 1992] as the selection method. This method chooses the k best individuals, based on their fitness values, to be included in the next generation and to participate in the genetic operations. While the number of new individuals is smaller than the desired population size n (a parameter to be set), two individuals are picked from those selected as described above and have their "genetic material" exchanged in the crossover operation, with a certain probability (another parameter to be set), to generate a new individual that combines the genetic material of its "parents". More specifically, the crossover operation is implemented by randomly choosing one node in each of the two trees (the two individuals) and exchanging the subtrees below them. The role of the crossover operation is to combine good solutions and move towards the most promising regions of the search space.
Finally, also with a certain probability (yet another parameter), the mutation operation is applied to introduce new individuals (i.e., solutions) into the population, increasing its diversity. This is useful, for example, to prevent the evolutionary process from getting trapped in local minima during the search for a good solution. In this work, the mutation of an individual (i.e., a tree) is performed by first randomly selecting one of its nodes and then replacing that node (and its corresponding subtree) with a new randomly generated subtree, without exceeding a maximum tree depth d. The whole process is repeated until a target fitness value f or a maximum number of generations g is reached. At the end of the process, the best individual, i.e., the one with the best fitness value, which usually belongs to the last generation of the evolutionary process, is chosen as the final solution to the problem at hand.
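The sketch below summarizes the evolutionary loop just described (ranking selection of the k best individuals, crossover and mutation applied with the given probabilities, and stopping on a target fitness or after a maximum number of generations); the tree-manipulation helpers crossover and mutate are assumed to be provided elsewhere and are named only for illustration.

```python
import random

def evolve(population, fitness, k, n, crossover_prob, mutation_prob,
           target_fitness, max_generations, crossover, mutate):
    """Generic GP loop with ranking selection (assumes k >= 2 and n >= k)."""
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best = ranked[0]
        if fitness(best) >= target_fitness:
            return best                            # target fitness f reached
        parents = ranked[:k]                       # ranking selection: keep the k best
        next_gen = list(parents)
        while len(next_gen) < n:                   # refill up to the population size n
            p1, p2 = random.sample(parents, 2)
            child = crossover(p1, p2) if random.random() < crossover_prob else p1
            if random.random() < mutation_prob:
                child = mutate(child)              # replace a random subtree
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)            # best individual found overall
```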
 

GP APPLIED TO THE RECORD DEDUPLICATION PROBLEM

 
In this work, as in [de Carvalho et al. 2008], we use a tree-based representation for the individuals of the GP process, which, in our case, represent possible record deduplication functions. More specifically, to perform the record deduplication, we use evidence combination functions in which each evidence E is a pair <attribute, similarity function> that represents the application of a specific similarity function to the values of a given attribute of the data repository. For example, to deduplicate a table from a relational database with attributes name, surname, age, and address, using the Jaro-Winkler (JW) [Winkler 1999] similarity function, we would have the following pieces of evidence: E1<name, JW>, E2<surname, JW>, E3<age, JW>, and E4<address, JW>.
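As a purely hypothetical illustration of how such evidence can be represented and combined, the sketch below pairs each attribute with a similarity function and evaluates one candidate combination as an arithmetic expression over the evidence values; the jaro_winkler function here is only a crude placeholder and should be replaced by a real Jaro-Winkler implementation.

```python
def jaro_winkler(a, b):
    """Placeholder similarity (token overlap); swap in a real Jaro-Winkler function."""
    a, b = a.lower().split(), b.lower().split()
    return len(set(a) & set(b)) / max(len(a), len(b), 1)

# Each evidence E is a pair <attribute, similarity function>
EVIDENCE = [("name", jaro_winkler), ("surname", jaro_winkler),
            ("age", jaro_winkler), ("address", jaro_winkler)]

def evidence_scores(rec1, rec2):
    """Apply each similarity function to the corresponding attribute values."""
    return [sim(rec1[attr], rec2[attr]) for attr, sim in EVIDENCE]

def candidate_function(e):
    """One GP individual, e.g. the combination tree ((E1 + E2) * E4) + E3."""
    return ((e[0] + e[1]) * e[3]) + e[2]

r1 = {"name": "John", "surname": "Smith", "age": "32", "address": "12 Main St"}
r2 = {"name": "Jon",  "surname": "Smith", "age": "32", "address": "12 Main Street"}
print(candidate_function(evidence_scores(r1, r2)))
```

A GP individual corresponds to one such combination tree; the evolutionary process searches for the tree that best separates replicas from non-replicas on the training data.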
 
Experiments performed in [de Carvalho et al. 2008] show that the GP-based record deduplication method is able to adapt the suggested deduplication functions to changes in the replica identification threshold used to classify pairs of records. Therefore, the user does not need to worry about choosing the "best" value for this threshold for a given data repository, since the suggested deduplication functions adapt automatically, maintaining the effectiveness of the solutions despite changes in the threshold.
 

CONCLUSIONS AND FUTURE ENHANCEMENTS

 
Duplicate information stored from different sources requires additional space to hold the replicas, and experience has shown that data loss is a common risk for the modern enterprise. Different types of approaches have been proposed to address these problems. SQL-based data deduplication, chunk-based clustering architectures, and genetic approaches are all laborious in their respective fields of implementation because of their heavy dependence on the file organization or file system. Moreover, all of them have poor time complexity, which is a matter of great concern. In contrast, the model proposed here is expected to perform better on a large relational database with a large number of tables (the most anticipated scenario) than on a tiny database. The domain identifier, key identifier, row id, and column id play key roles in this model, while binary search and the table index position are its main optimisation factors. Hence, the length of the data stored in a particular field of a particular row and column is not a concern with respect to wasted space; rather, such cases demonstrate the practicality of our model. Since the algorithm is based on a row-column approach, it can initially be implemented only in an RDBMS, although our goal is to extend the concept so that it covers all branches of DBMS.

Identifying and handling replicas is important to guarantee the quality of the information made available by modern data storage services. Services that rely on the integrity of the data to offer high-quality results, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicate, quasi-replica, or near-duplicate entries in their repositories. For this reason, there has been a huge effort in developing effective and efficient methods for removing replicas from large data repositories. This article is relevant because deduplication is a time- and resource-consuming task; any gain, even a very small one, may represent a relevant saving in time and computational resources.
 

References

  1. Baeza-Yates, R. A. and Ribeiro-Neto, B. A. (1999). Modern Information Retrieval. ACM Press/Addison-Wesley.

  2. Banzhaf, W., Nordin, P., Keller, R. E., and Francone, F. D. (1998). Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers.

  3. Bell, R. and Dravis, F. (2006). Is your data dirty? And does that matter? Accenture White Paper - http://www.accenture.com.

  4. Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. (2007). "Duplicate record detection: A survey". IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1-16.

  5. Biggar, H. (2007). "Experiencing data de-duplication: Improving efficiency and reducing capacity requirements". White Paper, The Enterprise Strategy Group.

  6. Bilenko, M. and Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA, pp. 39-48.

  7. Carvalho, J. C. P. and da Silva, A. S. (2003). Finding similar identities among objects from multiple web sources. In Proceedings of the ACM International Workshop on Web Information and Data Management, New Orleans, USA, pp. 90-93.

  8. Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, USA, pp. 313-324.

  9. Geer, D. (2008). Reducing the storage burden via data deduplication. IEEE Computer 41 (12): 15-17.

  10. Gu, L. and Baxter, R. (2006). Decision models for record linkage. Selected Papers from the Australasian Data Mining Conference, vol. 3755, pp. 146-160.

  11. Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, Norwell, USA.

  12. Vrable, M., Savage, S., and Voelker, G. M. (2009). Cumulus: Filesystem backup to the cloud. ACM Transactions on Storage. DOI: 10.1145/1629080.1629084.

  13. Winkler, W. E. (2001). Record linkage software and methods for merging administrative lists. The Pennsylvania State University.

  14. Zhu, B., Li, K., and Patterson, H. (2008). Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08), USENIX Association, Berkeley, USA.