ISSN ONLINE(2320-9801) PRINT (2320-9798)

Detailed Investigation on Strategies Developed for Effective Discovery of Matching Dependencies

R.Santhya1, S.Latha1, Prof.S.Balamurugan1, S.Charanyaa2
  1. Department of IT, Kalaignar Karunanidhi Institute of Technology, Coimbatore, TamilNadu, India
  2. Senior Software Engineer Mainframe Technologies Former, Larsen & Tubro (L&T) Infotech, Chennai, TamilNadu, India

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

This paper surveys the methods reported in the literature for the efficient discovery of matching dependencies. Matching dependencies (MDs) have recently been proposed for specifying matching rules for object identification. Like conditional functional dependencies, MDs can be applied to data quality tasks such as detecting violations of integrity constraints. The survey focuses on the problem of discovering similarity constraints for matching dependencies from a given database instance, and it is intended to encourage further research in the area of information mining.

Keywords

Data Anonymization, Matching Dependencies (MDs), Object, Similarity Constraints, Information Mining.

INTRODUCTION

The need to publish sensitive data has grown rapidly in recent years. At the same time, the database community has paid increasing attention to data quality because of the huge amount of "dirty" data originating from different sources. These data often contain duplicates, inconsistencies and conflicts caused by human and machine errors. Apart from the cost of handling such a volume of data, manually detecting and removing dirty data is impractical, because manual cleaning can itself introduce new inconsistencies. Data dependencies, which have long been used in relational database design to set up integrity constraints, are therefore used to find inconsistencies in the data. Protecting the privacy of individuals while preserving the utility of social network data is likewise a challenging and interesting research topic. In this paper we investigate the attacks by matching dependencies, the solutions proposed in the literature, and the efficiency of those solutions.

EFFICIENT DISCOVERY OF SIMILARITY CONSTRAINTS FOR MATCHING DEPENDENCIES

Matching dependencies provide matching rules for object identification. MDs work much like conditional functional dependencies and can be applied to tasks such as detecting violations of integrity constraints. Data quality has become a central concern in the database community because of the large amount of "dirty" data drawn from various sources. Such data contain duplicates, conflicts and inconsistencies caused by human and machine errors. Given the volume of data, detecting and removing dirty data by hand is impractical, and manual cleaning can itself introduce new inconsistencies. Data dependencies, which are used in relational database design to define integrity constraints, are therefore used to find inconsistencies in the given data.
A database instance can yield a very large number of matching dependencies, depending on how the similarity thresholds on the attributes are set. The traditional FD is a special case of an MD. Not every threshold setting yields a useful MD.
MDs with high confidence are used to achieve high detection accuracy, while MDs with high support are used when the user wants high recall in object identification. A useful MD therefore needs both support and confidence.
Contribution: The authors first introduce support and confidence measures for evaluating the utility of MDs. They then develop exact algorithms that find similarity threshold settings on the attributes; these algorithms traverse all the data during computation. Finally, they propose an approximate solution that uses only part of the data.
Traditional data dependencies were introduced for schema design and have been revisited repeatedly for new applications such as privacy preservation. Conditional functional dependencies (CFDs) extend FDs for data cleaning. The main idea of CFDs is to make an FD valid only for the tuples that satisfy certain conditions. FD discovery aims at a canonical cover of all FDs. Because the discovery problem is inherently hard, a series of strategies are used to improve efficiency. The problem of discovering similarity thresholds does not arise in FD discovery: once the attributes X and Y of a dependency are determined, the equality constraint on each attribute is implied. Existing FD discovery techniques therefore do not address the efficient determination of similarity threshold values.
There are different ways to measure dependencies. One measure for an FD that holds only approximately on a relation instance is the minimum number of tuples that must be removed from the instance for the FD to hold. Most such measures are defined on the equality function used by FDs, so they are not applicable to MDs with similarity metrics.
Matching dependencies: Traditional FDs identify dependency relationships with the equality operator =, but equality cannot capture matches between text values in real-world applications. MDs are therefore based on matching quality. An MD declares similarity constraints with a matching operator, denoted by the tilde symbol (~) and defined by a similarity metric such as edit distance for text values.
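The idea of replacing equality with a similarity operator can be sketched as follows. This is only an illustrative sketch, not the algorithm of the surveyed paper: the function names, the sample rows and the distance limits (`lhs_max`, `rhs_max`) are assumptions made here, and edit distance is used because the text names it as an example metric.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar(a: str, b: str, max_dist: int) -> bool:
    """Hypothetical matching operator: two values match if they are close enough."""
    return edit_distance(a, b) <= max_dist


def md_violations(rows, lhs, rhs, lhs_max, rhs_max):
    """Return index pairs whose lhs values match while their rhs values do not."""
    bad = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        if similar(t1[lhs], t2[lhs], lhs_max) and not similar(t1[rhs], t2[rhs], rhs_max):
            bad.append((i, j))
    return bad


# Illustrative data only.
rows = [
    {"name": "Jon Smith", "city": "Chennai"},
    {"name": "John Smith", "city": "Chenai"},
    {"name": "John Smith", "city": "Coimbatore"},
]
violations = md_violations(rows, "name", "city", lhs_max=1, rhs_max=1)
print(violations)
```

With both limits set to 0 the check degenerates to an ordinary FD test on equality, which is the sense in which an FD is a special case of an MD.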
Measures: Support and confidence are adopted to evaluate matching dependencies, because the measures defined for FDs are not suitable. The similarity constraints consider the matching quality of all pairs of tuples t1 and t2 in R. To evaluate the measures of MDs, the pairwise tuple matching is precomputed offline and the results are stored for reuse.
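A minimal sketch of how the two measures could be computed from precomputed pairwise similarities is given below. The definitions are an assumption made here by analogy with association rules: support is the fraction of all tuple pairs that match on both sides, and confidence is the fraction of LHS-matching pairs that also match on the RHS. The names `pair_sims`, `x_min` and `y_min` are hypothetical.

```python
def support_confidence(pair_sims, x_min, y_min):
    """Compute (support, confidence) of an MD from precomputed pair similarities.

    pair_sims holds one (similarity on X, similarity on Y) entry per tuple pair,
    computed once offline and reused for every threshold setting.
    """
    total = len(pair_sims)
    lhs = sum(1 for sx, _ in pair_sims if sx >= x_min)
    both = sum(1 for sx, sy in pair_sims if sx >= x_min and sy >= y_min)
    support = both / total if total else 0.0
    confidence = both / lhs if lhs else 0.0
    return support, confidence


# Illustrative similarities for five tuple pairs.
pair_sims = [(1.0, 1.0), (0.9, 0.8), (0.9, 0.2), (0.3, 0.9), (0.1, 0.1)]
support, confidence = support_confidence(pair_sims, x_min=0.8, y_min=0.7)
print(support, confidence)
```

Because the similarities are stored once, trying another threshold setting only requires another pass over `pair_sims`, not a recomputation of the string comparisons.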
Determining similarity thresholds for MDs from a statistical distribution is a new problem and differs from FD discovery. For an FD it is enough to fix X and Y in the dependency, whereas the MD syntax also requires a similarity threshold on each attribute. The statistical distribution is computed from the relation, and the similarity thresholds are then discovered so that the required support and confidence are satisfied.
"To discover an MD with minimum support and confidence, two preliminary questions must be answered by the application: what is Y, and what are the matching quality requirements?"
Pruning strategies: The original algorithm must traverse the entire statistical distribution and every candidate threshold pattern in Ct. Given the support and confidence requirements, the pruning strategies identify relationships between similarity thresholds and use them to avoid checking candidates that cannot qualify.
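One way such pruning can work is sketched below. It is not the algorithm of the surveyed paper; it only assumes the support and confidence definitions over tuple pairs (fraction of all pairs matching on both sides, and fraction of LHS-matching pairs that also match on the RHS) and relies on two monotonicity facts: raising the RHS threshold can only lower both measures, and raising the LHS threshold can only lower support. All names are hypothetical.

```python
def count_pairs(pair_sims, x_min, y_min):
    """Count pairs matching on the LHS, and pairs matching on both sides."""
    lhs = sum(1 for sx, _ in pair_sims if sx >= x_min)
    both = sum(1 for sx, sy in pair_sims if sx >= x_min and sy >= y_min)
    return lhs, both


def find_thresholds(pair_sims, candidates, min_support, min_confidence):
    """Return the accepted (x_min, y_min) settings and the number of checks made."""
    total = len(pair_sims)
    if total == 0:
        return [], 0
    accepted, checked = [], 0
    for x_min in sorted(candidates):
        support_at_loosest_y = None
        for y_min in sorted(candidates):
            lhs, both = count_pairs(pair_sims, x_min, y_min)
            checked += 1
            support = both / total
            confidence = both / lhs if lhs else 0.0
            if support_at_loosest_y is None:
                support_at_loosest_y = support
            if support < min_support or confidence < min_confidence:
                break  # a stricter y_min can only lower both measures
            accepted.append((x_min, y_min))
        if support_at_loosest_y < min_support:
            break  # a stricter x_min can only lower support further
    return accepted, checked


pair_sims = [(1.0, 1.0), (0.9, 0.8), (0.9, 0.2), (0.3, 0.9), (0.1, 0.1)]
candidates = [0.5, 0.7, 0.9, 1.0]
accepted, checked = find_thresholds(pair_sims, candidates, 0.3, 0.6)
print(accepted, checked)
```

On this toy input the search performs 10 checks instead of the 16 an exhaustive scan of the candidate grid would need. Confidence is not monotone in the LHS threshold, so the outer loop is cut short only when support already fails at the loosest RHS threshold.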
The experimental results show that the pruning and approximation techniques improve the efficiency of MD discovery. A larger statistical distribution increases the time cost, but EPS reduces it by pruning candidates.
Two different MDs can cover many different dependencies, which leads to the problem of generating a set of MDs. As future work, further applications of MDs are expected, along with the discovery of similarity constraints for novel dependencies such as metric inclusion dependencies, conditional inclusion dependencies and multivalued dependencies.

ON GENERATING NEAR-OPTIMAL TABLEAU FOR CONDITIONAL FUNCTIONAL DEPENDENCIES

In this paper the authors characterize a good pattern tableau in terms of support, confidence and parsimony. They show that generating an optimal tableau for a given FD is NP-complete, but that it can be approximated in polynomial time with a greedy algorithm.
In the tableau generation problem, the input is a relation instance and an FD that need not hold exactly on the data. The FD may already be known to hold over some patterns, which could be supplied as input; such patterns are ignored here, and the tableau is assumed to be empty initially. The goal is to find a parsimonious set of patterns, which gives a meaningful tableau.
The greedy algorithm starts from a set P of candidate patterns. Candidates that fail the local confidence requirement are eliminated from P, and patterns are then chosen until they cover at least N tuples of R, so that the resulting tableau satisfies the global support threshold. The result is a small tableau that meets both the global support and the local confidence requirements. The algorithm computes the support and confidence of every candidate pattern.
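The greedy selection described above can be sketched as a set-cover style loop. This is a simplified illustration rather than the authors' implementation: the wildcard symbol, the way local confidence is computed (the fraction of covered tuples that can be kept so that the FD holds among them) and all names and data are assumptions made here.

```python
from collections import Counter, defaultdict

WILDCARD = "_"  # hypothetical marker for "any value" in a pattern


def covers(pattern, row):
    """A pattern covers a row if every non-wildcard attribute agrees."""
    return all(v == WILDCARD or row[attr] == v for attr, v in pattern.items())


def local_confidence(rows, lhs, rhs):
    """Fraction of rows kept when each LHS group keeps only its majority RHS value."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows) if rows else 0.0


def greedy_tableau(rows, lhs, rhs, candidates, min_conf, min_support):
    """Pick confident patterns greedily until the global support target is met."""
    covered_by = {}
    for i, p in enumerate(candidates):
        idx = {k for k, r in enumerate(rows) if covers(p, r)}
        if idx and local_confidence([rows[k] for k in idx], lhs, rhs) >= min_conf:
            covered_by[i] = idx
    needed = min_support * len(rows)
    tableau, covered = [], set()
    while len(covered) < needed and covered_by:
        best = max(covered_by, key=lambda i: len(covered_by[i] - covered))
        gain = covered_by.pop(best) - covered
        if not gain:
            break  # no remaining pattern adds coverage
        tableau.append(candidates[best])
        covered |= gain
    reached = len(covered) / len(rows) if rows else 0.0
    return tableau, reached


rows = [
    {"country": "UK", "area_code": "131", "city": "Edinburgh"},
    {"country": "UK", "area_code": "131", "city": "Edinburgh"},
    {"country": "UK", "area_code": "20", "city": "London"},
    {"country": "US", "area_code": "212", "city": "NYC"},
    {"country": "US", "area_code": "212", "city": "Boston"},
    {"country": "US", "area_code": "212", "city": "NYC"},
]
candidates = [
    {"country": WILDCARD, "area_code": WILDCARD},
    {"country": "UK", "area_code": "131"},
    {"country": "UK", "area_code": WILDCARD},
    {"country": "US", "area_code": WILDCARD},
]
tableau, reached = greedy_tableau(
    rows, ["country", "area_code"], "city", candidates, min_conf=0.9, min_support=0.5
)
print(tableau, reached)
```

Here the all-wildcard and US patterns are dropped for low local confidence, and the single UK pattern already covers half of the tuples, so the tableau contains one row.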
Poor data quality and undocumented semantics are common problems in real-world data. Tools such as CFDs have been introduced to address them: a CFD captures the semantics of the data and identifies problems in it. The paper realizes the potential of CFDs by defining good pattern tableaux in terms of support, confidence and parsimony, by studying the complexity of automatically generating optimal tableaux, and by providing an approximation algorithm.

TEXT JOIN IN AN RDBMS FOR WEB DATA INTEGRATION

In this paper the authors describe the data integration challenges raised by web services. Databases lack global identifiers, so the same entity may be represented by different text in different sources. Strings that refer to the same entity must therefore be matched during data integration.
The paper uses the cosine similarity metric to match strings across web sources. Integrating data from heterogeneous web sources is important for many applications. Such data consist of text strings obtained from the web, which raises both semantic and performance-related challenges. To address the integration problem, one needs to match multiple textual descriptions of the same entity despite:
- Erroneous information
- Abbreviated, incomplete or missing information
- Differences in information formatting
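A minimal sketch of cosine-similarity string matching is given below. It is an in-memory illustration, not the RDBMS-based method of the surveyed paper: strings are represented as bags of character 3-grams with plain term-frequency weights, which is a simplification chosen here, and the function names and sample strings are hypothetical.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Bag of lowercase character q-grams; short strings count as one token."""
    s = text.lower()
    if len(s) < q:
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def cosine_similarity(a, b, q=3):
    """Cosine of the angle between the q-gram frequency vectors of a and b."""
    va, vb = qgrams(a, q), qgrams(b, q)
    dot = sum(w * vb[g] for g, w in va.items())
    norm = math.sqrt(sum(w * w for w in va.values())) * math.sqrt(
        sum(w * w for w in vb.values())
    )
    return dot / norm if norm else 0.0


def text_join(left, right, threshold):
    """Naive nested-loop join keeping pairs whose similarity reaches the threshold."""
    return [
        (a, b) for a in left for b in right if cosine_similarity(a, b) >= threshold
    ]


left = ["Microsoft Corp", "Oracle Corp"]
right = ["Microsoft Corporation", "Oracle Corporation"]
matches = text_join(left, right, threshold=0.6)
print(matches)
```

The nested loop compares every pair and is only practical for small inputs; it is meant to show the similarity predicate, not an efficient join.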

COMPRESSION - BASED EVALUATION OF PARTIAL DETERMINATIONS

In this paper the authors address the problem of evaluating partial determinations and propose a compression-based method for it. Partial determinations can be viewed as generalizations of both FDs and association rules. The method extends the support and confidence measures used for evaluation.
Partial determinations are generalizations of functional dependencies and are written X->dY, where d is a number. The attribute set X is referred to as the LHS, and Y is referred to as the RHS. The term partial determination is used for both X->dY and pdx->dY.
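The text above does not say how the number d is obtained. The sketch below assumes one possible reading: d is the probability that two tuples which agree on X also agree on Y. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def determination_factor(rows, lhs, rhs):
    """Share of tuple pairs agreeing on lhs that also agree on rhs (assumed reading of d)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    agree_x = agree_xy = 0
    for counts in groups.values():
        n = sum(counts.values())
        agree_x += n * (n - 1) // 2  # pairs with equal lhs values
        agree_xy += sum(c * (c - 1) // 2 for c in counts.values())  # ...and equal rhs
    return agree_xy / agree_x if agree_x else 1.0


# Illustrative data: department mostly, but not always, determines floor.
rows = [
    {"dept": "A", "floor": 1},
    {"dept": "A", "floor": 1},
    {"dept": "A", "floor": 2},
    {"dept": "B", "floor": 3},
    {"dept": "B", "floor": 3},
]
d = determination_factor(rows, ["dept"], "floor")
print(d)
```

Under this reading, d = 1 corresponds to an exact functional dependency, and smaller values indicate that X determines Y only partially.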
Future work includes extending the approach with other strategies, such as genetic algorithms and combinations of search algorithms. The new compression-based measure is used to evaluate partial determinations and also to guide the search for them. Partial determinations are a useful form of knowledge for KDD because they are more expressive. The compression-based measure is an MDL-based function, which avoids overfitting the data.

CONCLUSION AND FUTURE WORK

This paper surveyed the methods reported in the literature for the efficient discovery of matching dependencies. Matching dependencies (MDs) have recently been proposed for specifying matching rules for object identification. Like conditional functional dependencies, MDs can be applied to data quality tasks such as detecting violations of integrity constraints. The survey focused on the problem of discovering similarity constraints for matching dependencies from a given database instance, and it is intended to encourage further research in the area of information mining.

References