Detecting errors in distributed data today involves substantial complexity and data shipment across multiple systems. Whether the data is centralized or distributed, data shipment cannot be reduced in a short period, and delays occur when checking constraints over large databases. Our proposed model reduces data shipment cost using MapReduce (query processing and hashing techniques) and data partitioning algorithms.
Keywords
MapReduce algorithms, Incremental Algorithm, Horizontal and Vertical Partition, Distributed Data, Elastic Compute Cloud (EC2).
INTRODUCTION
Real-life data is often dirty. To clean the data, efficient algorithms for detecting errors have to be in place. Errors in the data are typically detected as violations of constraints (data quality rules), such as functional dependencies (FDs), denial constraints, and conditional functional dependencies (CFDs). When the data is in a centralized database, it is known that two SQL queries suffice to detect its violations of a set of CFDs. It is increasingly common, however, to find data partitioned vertically or horizontally and distributed across different sites, as highlighted by the recent interest in SaaS and cloud computing, MapReduce, and columnar DBMSs. In such distributed settings it is much harder to detect errors in the data.
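To make the centralized case concrete, the following is a minimal sketch of SQL-based CFD violation detection, covering only the multi-tuple part of the check. It uses an in-memory SQLite database with a hypothetical cust relation and an illustrative CFD ([country = 'UK'], zip -> city): UK tuples that agree on zip must agree on city. The schema, data, and query are assumptions for illustration, not the implementation discussed in this paper.

import sqlite3

# Hypothetical relation and CFD: for tuples with country = 'UK', zip -> city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (name TEXT, country TEXT, zip TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO cust VALUES (?, ?, ?, ?)",
    [("Ann", "UK", "EH4", "Edinburgh"),
     ("Bob", "UK", "EH4", "London"),     # violates the CFD: same zip, different city
     ("Carl", "US", "EH4", "Chicago")],  # not covered: condition country = 'UK' fails
)

# Group the covered tuples by the LHS (zip) and flag groups with more than one city.
violations = conn.execute(
    """SELECT zip FROM cust
       WHERE country = 'UK'
       GROUP BY zip
       HAVING COUNT(DISTINCT city) > 1"""
).fetchall()
print(violations)  # [('EH4',)]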
In the existing model, there are no conditional functional dependencies and no error detection for distributed data. SQL queries are used to detect violations of CFDs, but they are applicable only to a centralized database and cannot be used for a distributed database. Various problems occur in violation detection and in the indexing techniques.
Heuristic Algorithm
These algorithms usually find a solution close to the best one, and they find it fast and easily. Sometimes these algorithms are accurate, that is, they actually find the best solution, but the algorithm is still called heuristic until this best solution is proven to be the best. The method used by a heuristic algorithm is one of the known methods, such as greediness, but in order to be easy and fast the algorithm ignores or even suppresses some of the problem's demands.
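As an illustration, here is a minimal sketch of a classic greedy heuristic, for the set-cover problem; it is in the same spirit as the heuristic used later for sharing indices among CFDs, but the universe and subsets are made-up example data.

# Greedy set-cover heuristic: repeatedly pick the subset covering the most
# still-uncovered elements. Fast, but not guaranteed to be optimal.
def greedy_set_cover(universe, subsets):
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not (uncovered & best):
            break  # remaining elements cannot be covered by any subset
        cover.append(best)
        uncovered -= best
    return cover

# Hypothetical example data.
universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(universe, subsets))  # [{1, 2, 3}, {4, 5}]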
Hashing Algorithm (MD5)
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32-digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications and is also commonly used to verify data integrity.
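For example, Python's standard hashlib module computes the 128-bit digest and its 32-digit hexadecimal form; the input bytes below are arbitrary sample data.

import hashlib

# Compute the 128-bit MD5 digest of some bytes; hexdigest() gives the
# 32-digit hexadecimal representation.
digest = hashlib.md5(b"some tuple contents").hexdigest()
print(len(digest), digest)  # 32 hex digits = 128 bits

# Integrity check: recompute the digest and compare with the stored one.
assert hashlib.md5(b"some tuple contents").hexdigest() == digest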
MD5 was designed by Ron Rivest in 1991 to replace an earlier hash function, MD4. The source code in RFC 1321 contains a "by attribution" RSA license.
In 1996 a flaw was found in the design of MD5. While it was not deemed a fatal weakness at the time, cryptographers began recommending the use of other algorithms, such as SHA-1, which has since been found to be vulnerable as well. In 2004 it was shown that MD5 is not collision resistant; as such, MD5 is not suitable for applications like SSL certificates or digital signatures that rely on this property for digital security. Also in 2004, more serious flaws were discovered in MD5, making further use of the algorithm for security purposes questionable; specifically, a group of researchers described how to create a pair of files that share the same MD5 checksum. Further advances were made in breaking MD5 in 2005, 2006, and 2007. In December 2008, a group of researchers used this technique to fake SSL certificate validity. The CMU Software Engineering Institute now says that MD5 "should be considered cryptographically broken and unsuitable for further use", and most U.S. government applications now require the SHA-2 family of hash functions.
MapReduce Algorithm
MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004.
The framework is divided into two parts:
• Map, a function that parcels out work to different nodes in the distributed cluster.
• Reduce, another function that collates the work and resolves the results into a single value.
The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
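The following is a minimal single-process sketch of the map and reduce phases for word counting, written to illustrate the programming model rather than any particular framework's API; the documents are toy data.

from collections import defaultdict

# Map phase: emit (word, 1) for every word in a document.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# Reduce phase: collate all counts for one word into a single value.
def reduce_fn(word, counts):
    return (word, sum(counts))

# Driver: the "shuffle" groups map output by key, then each group is reduced.
documents = ["to be or not to be", "to see or not to see"]
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)
results = [reduce_fn(word, counts) for word, counts in groups.items()]
print(sorted(results))  # [('be', 2), ('not', 2), ('or', 2), ('see', 2), ('to', 4)]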
Incremental Algorithm
An incremental algorithm updates the solution to a problem after an incremental change is made to its input. In the application of an incremental algorithm, the initial run is conducted by an algorithm that performs the desired computation from scratch, and the incremental algorithm is used in the subsequent runs (i) using information from earlier computations and (ii) to reflect the update on the network while avoiding re-computation as much as possible. The computation of betweenness centrality depends on the number of shortest paths in a network and the intermediate nodes on these paths. A network update such as an edge insertion or edge cost decrease might result in the creation of new shortest paths in the network. However, a considerable portion of the older paths might remain intact, especially in the unaffected parts of the network. Therefore, accurate maintenance of the number of shortest paths and the predecessors on the shortest paths suffices for accurately updating betweenness values in the case of dynamic network updates. This is the key observation made in the design of the incremental betweenness centrality algorithm.
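As a much simpler illustration of the incremental idea (unrelated to betweenness itself), the sketch below maintains a running aggregate under insertions instead of recomputing from scratch on every update; the values are arbitrary.

# Incremental computation: maintain the mean of a dataset under insertions
# without rescanning all values each time.
class IncrementalMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def insert(self, x):
        # Update the previous result in O(1) instead of an O(n) recomputation.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

m = IncrementalMean()
for x in [4.0, 8.0, 6.0]:
    print(m.insert(x))  # 4.0, 6.0, 6.0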
Mobile Ad Hoc Networks (MANETs) consist of a collection of mobile nodes that are not bound to any infrastructure. Nodes in a MANET can communicate with each other and can move anywhere without restriction. This unrestricted mobility and the easy deployment characteristics of MANETs make them very popular and highly suitable for emergencies, natural disasters, and military operations.
Nodes in a MANET have limited battery power, and these batteries cannot be replaced or recharged in complex scenarios. To prolong or maximize the network lifetime, these batteries should be used efficiently. The energy consumption of each node varies according to its communication state: transmitting, receiving, listening, or sleeping. Both researchers and industry are working on mechanisms to prolong the lifetime of a node's battery, but routing algorithms play an important role in energy efficiency because the routing algorithm decides which node is to be selected for communication.
The main purpose of an energy-efficient algorithm is to maximize the network lifetime. Such algorithms are concerned not only with minimizing the total energy consumption of a route but also with maximizing the lifetime of each node in the network, so as to increase the network lifetime. Energy-efficient algorithms can be based on two metrics: (i) minimizing total transmission energy and (ii) maximizing network lifetime. The first metric focuses on the total transmission energy used to send packets from source to destination, typically by selecting routes with a larger number of hops. The second metric focuses on the residual battery energy of the entire network or the individual battery energy of a node [1].
LITERATURE SURVEY
Algorithms for computing provably near-optimal (in terms of the number of messages) local constraints are presented in [1]; experimental results with real-life network traffic data sets demonstrate that the technique can reduce message communication overhead by as much as 70% compared to existing data distribution-agnostic approaches. An incremental algorithm over vertical partitions is proposed to reduce data shipment, and it is verified experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that the incremental algorithms substantially outperform their batch counterparts [2].
Computing an optimal global evaluation plan is shown to be NP-hard. An implementation of the algorithms is presented, along with experiments that illustrate their potential not only for the optimization of exploratory queries, but also for the multi-query optimization of large batches of standard queries [3], [4].
The problem of finding efficient complete local tests is addressed for an important class of constraints that are very common in practice: constraints expressible as conjunctive queries with negated subgoals. For constraints where the predicates for the remote relations do not occur more than once, complete local tests are presented under insertions and deletions to the local relations. These tests can be expressed as safe, nonrecursive Datalog queries against the local relations. The results also apply to other constraints with negation that are not conjunctive [5].
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day [6].
Conditional functional dependencies (CFDs), a class of integrity constraints for relational databases, are introduced and their applications in data cleaning are studied. In contrast to traditional functional dependencies (FDs), which were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For the static analysis of CFDs, two problems are investigated: the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD.
It is shown that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, an inference system analogous to Armstrong's axioms for FDs is provided, and the implication problem is shown to be coNP-complete for CFDs, in contrast to the linear-time complexity of their traditional counterpart. An algorithm for computing a minimal cover of a set of CFDs is also presented.
Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. Techniques are developed for detecting CFD violations in SQL, as well as novel techniques for checking multiple constraints with a single query. Incremental methods are also provided for checking CFDs in response to changes to the database, and the effectiveness of these CFD-based methods for inconsistency detection is verified experimentally. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality [7].
A top-down join enumeration algorithm is presented that is optimal with respect to the join graph, with performance results demonstrating that a combination of optimal enumeration with search strategies such as branch-and-bound yields an algorithm significantly faster than those previously described in the literature. Although the algorithm enumerates the search space top-down, it does not rely on transformations and thus retains much of the architecture of traditional dynamic programming. As such, this work provides a migration path for existing bottom-up optimizers to exploit top-down search without drastically changing to the transformational paradigm [8].
PROPOSED METHOD
Our proposed system aims to reduce data shipment. Prior approaches employ, e.g., counters, pointers, and tags in base relations; while these could be incorporated into our solution, they do not yield bounded or optimal incremental detection algorithms. There has also been a host of work on query processing and multi-query optimization for distributed data; the former typically aims to generate distributed query plans to reduce data shipment or response time and to support error detection.
A. Description of the Proposed Method:
The steps involved in our approach are as follows:
Step 1: Data fragmentation
We consider relations D of schema R that are partitioned into fragments, either vertically or horizontally. In some applications one wants to partition D into (D1, . . . , Dn) horizontally, and in other cases the partition may be vertical. This process is mainly intended to reduce the communication cost.
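Below is a minimal sketch of both partitioning styles for a relation represented as a list of dictionaries; the schema and the two-site split are illustrative assumptions.

# Relation D over schema R = (id, zip, city), as a list of dicts (toy data).
D = [
    {"id": 1, "zip": "EH4", "city": "Edinburgh"},
    {"id": 2, "zip": "EH4", "city": "London"},
    {"id": 3, "zip": "SW1", "city": "London"},
]

# Horizontal partitioning: split the tuples across sites (here by id parity).
D1 = [t for t in D if t["id"] % 2 == 1]
D2 = [t for t in D if t["id"] % 2 == 0]

# Vertical partitioning: split the attributes, keeping the key at both sites
# so tuples can be re-associated.
V1 = [{"id": t["id"], "zip": t["zip"]} for t in D]
V2 = [{"id": t["id"], "city": t["city"]} for t in D]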
Step 2: CFD violation detection
We give an algorithm for detecting violations of CFDs over vertical and horizontal partitions. Leveraging the index structures, an incremental algorithm is used to detect violations in vertical partitions. At first a single update for a single CFD is considered; the algorithm is then extended to multiple CFDs and batch updates.
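The sketch below conveys the incremental flavor of the check for a single CFD under single-tuple insertions: a hash index from the CFD's left-hand side to the observed right-hand-side values lets each update be validated without rescanning the data. The CFD and index layout are simplifying assumptions, not the exact algorithm of this paper.

# Incremental check of one CFD ([country = 'UK'], zip -> city) on insertions.
# index maps the LHS value (zip) to the set of RHS values (city) seen so far.
index = {}

def insert_and_check(tuple_):
    if tuple_["country"] != "UK":
        return []                      # condition fails: tuple not covered
    cities = index.setdefault(tuple_["zip"], set())
    violations = []
    if cities and tuple_["city"] not in cities:
        violations.append(tuple_)      # same zip, different city: violation
    cities.add(tuple_["city"])
    return violations

print(insert_and_check({"country": "UK", "zip": "EH4", "city": "Edinburgh"}))  # []
print(insert_and_check({"country": "UK", "zip": "EH4", "city": "London"}))     # one violation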
Step 3: Partition optimization
To reduce data shipment for error detection in vertical partitions, the idea is to identify and maximally share indices among CFDs, so that when multiple CFDs demand the shipment of the same tuples, only a single copy of the data is shipped. The problem of building optimal indices is NP-complete, but an efficient heuristic algorithm is provided. An incremental detection algorithm is also provided for horizontal partitions; this algorithm is likewise optimal, as for its vertical counterpart. A tuple may be large, so to reduce its shipping cost a natural idea is to encode the whole tuple and send the encoding of the tuple instead of the tuple itself.
Figure 1 shows the overall architecture of the distributed data. MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. We use MD5 in our implementation to further reduce the communication cost, by sending a 128-bit MD5 code instead of an entire tuple.
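A minimal sketch of this encoding step: serialize the tuple deterministically, ship only its 16-byte MD5 digest, and let the receiving site compare digests to test tuple equality. The serialization format is an assumption for illustration.

import hashlib
import json

# Ship a 16-byte MD5 code instead of the (possibly large) tuple itself.
def encode_tuple(t):
    # Deterministic serialization so equal tuples hash to equal digests.
    payload = json.dumps(t, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).digest()   # 16 bytes = 128 bits

local = {"zip": "EH4", "city": "Edinburgh", "payload": "x" * 10_000}
remote = {"zip": "EH4", "city": "Edinburgh", "payload": "x" * 10_000}

# Sites exchange 16-byte codes; equal codes indicate (almost surely) equal tuples.
print(encode_tuple(local) == encode_tuple(remote))  # True
print(len(encode_tuple(local)))                     # 16 bytes shipped, not ~10 KB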
CONCLUSION
We have studied incremental CFD violation detection for distributed data, from complexity to algorithms. We have shown that the problem is NP-complete but is bounded. We have also developed optimal incremental violation detection algorithms for data partitioned vertically or horizontally, as well as optimization methods. Our experimental results have verified that these yield a promising solution to catching errors in distributed data. There is naturally much more to be done. First, we are currently experimenting with real-life datasets from different applications, to find out when incremental detection is most effective. Second, we intend to extend our algorithms to data that is partitioned both vertically and horizontally. Third, we plan to develop MapReduce algorithms for incremental violation detection. Fourth, we aim to extend our approach to support constraints defined in terms of similarity predicates (e.g., matching dependencies for record matching) beyond equality comparison, for which hash-based indices may not work and more robust indexing techniques need to be explored. Fifth, Hadoop MapReduce is also used to compress the data in the cloud.
Figures at a glance
Figure 1
Figure 2
References
- S. Agrawal, S. Deb, K. V. M. Naidu, and R. Rastogi, "Efficient detection of distributed constraint violations," in Proc. ICDE, Istanbul, Turkey, 2007.
- J. Bailey, G. Dong, M. Mohania, and X. S. Wang, "Incremental view maintenance by base relation tagging in distributed databases," Distrib. Parall. Databases, vol. 6, no. 3, pp. 287-309, Jul. 1998.
- L. F. Mackert and G. M. Lohman, "R* optimizer validation and performance evaluation for distributed queries," in Proc. VLDB, Kyoto, Japan, 1986.
- A. Kementsietsidis, F. Neven, D. Craen, and S. Vansummeren, "Scalable multi-query optimization for exploratory queries over federated scientific databases," in Proc. VLDB, Auckland, New Zealand, 2008.
- N. Huyn, "Maintaining global integrity constraints in distributed databases," Constraints, vol. 2, no. 3/4, pp. 377-399, 1997.
- J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. OSDI, 2004.
- W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, "Conditional functional dependencies for capturing data inconsistencies," ACM Trans. Database Syst., vol. 33, no. 2, Article 6, Jun. 2008.
- D. DeHaan and F. W. Tompa, "Optimal top-down join enumeration," in Proc. ACM SIGMOD, New York, NY, USA, 2007.