Detecting errors in distributed data today involves substantial complexity and data shipment across multiple systems. Whether the data is centralized or distributed, data shipment cannot be reduced in a short period, and delays occur when checking constraints over large databases. Our proposed model reduces data shipment cost using MapReduce (query processing and hashing techniques) and data partitioning algorithms.
Keywords
MapReduce algorithms, Incremental Algorithm, Horizontal and Vertical Partition, Distributed Data, Elastic Compute Cloud (EC2).
INTRODUCTION
Real-life data is often dirty. To clean the data, efficient algorithms for detecting errors have to be in place. Errors in the data are typically detected as violations of constraints (data quality rules), such as functional dependencies (FDs), denial constraints, and conditional functional dependencies (CFDs). When the data is in a centralized database, it is known that two SQL queries suffice to detect its violations of a set of CFDs. It is increasingly common, however, to find data partitioned vertically or horizontally and distributed across different sites, as highlighted by the recent interest in SaaS and cloud computing, MapReduce, and columnar DBMSs. In such distributed settings it is much harder to detect errors in the data.
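To make the centralized case concrete, the following is a minimal sketch of SQL-based CFD violation detection, covering only the multi-tuple part of the check. It uses an in-memory SQLite database with a hypothetical cust relation and an illustrative CFD ([country = 'UK'], zip -> city): UK tuples that agree on zip must agree on city. The schema, data, and query are assumptions for illustration, not the implementation discussed in this paper.

import sqlite3

# Hypothetical relation and CFD: for tuples with country = 'UK', zip -> city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (name TEXT, country TEXT, zip TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO cust VALUES (?, ?, ?, ?)",
    [("Ann", "UK", "EH4", "Edinburgh"),
     ("Bob", "UK", "EH4", "London"),     # violates the CFD: same zip, different city
     ("Carl", "US", "EH4", "Chicago")],  # not covered: condition country = 'UK' fails
)

# Group the covered tuples by the LHS (zip) and flag groups with more than one city.
violations = conn.execute(
    """SELECT zip FROM cust
       WHERE country = 'UK'
       GROUP BY zip
       HAVING COUNT(DISTINCT city) > 1"""
).fetchall()
print(violations)  # [('EH4',)]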
In the existing model, there are no conditional functional dependencies and no error detection for distributed data. SQL queries are used to detect violations of CFDs, but they are applicable only to a centralized database and cannot be used for a distributed database. Various problems occur in violation detection and in the indexing techniques.
Heuristic Algorithm
These algorithms usually find a solution close to the best one, and they find it fast and easily. Sometimes these algorithms are accurate, that is, they actually find the best solution, but the algorithm is still called heuristic until this best solution is proven to be the best. The method used by a heuristic algorithm is one of the known methods, such as greediness, but in order to be easy and fast the algorithm ignores or even suppresses some of the problem's demands.
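As an illustration, here is a minimal sketch of a classic greedy heuristic, for the set-cover problem; it is in the same spirit as the heuristic used later for sharing indices among CFDs, but the universe and subsets are made-up example data.

# Greedy set-cover heuristic: repeatedly pick the subset covering the most
# still-uncovered elements. Fast, but not guaranteed to be optimal.
def greedy_set_cover(universe, subsets):
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not (uncovered & best):
            break  # remaining elements cannot be covered by any subset
        cover.append(best)
        uncovered -= best
    return cover

# Hypothetical example data.
universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(universe, subsets))  # [{1, 2, 3}, {4, 5}]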
Hashing Algorithm (MD5)
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32-digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications and is also commonly used to verify data integrity.
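For example, Python's standard hashlib module computes the 128-bit digest and its 32-digit hexadecimal form; the input bytes below are arbitrary sample data.

import hashlib

# Compute the 128-bit MD5 digest of some bytes; hexdigest() gives the
# 32-digit hexadecimal representation.
digest = hashlib.md5(b"some tuple contents").hexdigest()
print(len(digest), digest)  # 32 hex digits = 128 bits

# Integrity check: recompute the digest and compare with the stored one.
assert hashlib.md5(b"some tuple contents").hexdigest() == digest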
MD5 was designed by Ron Rivest in 1991 to replace an earlier hash function, MD4. The source code in RFC 1321 contains a "by attribution" RSA license.
In 1996 a flaw was found in the design of MD5. While it was not deemed a fatal weakness at the time, cryptographers began recommending the use of other algorithms, such as SHA-1, which has since been found to be vulnerable as well. In 2004 it was shown that MD5 is not collision resistant; as such, MD5 is not suitable for applications like SSL certificates or digital signatures that rely on this property for digital security. Also in 2004, more serious flaws were discovered in MD5, making further use of the algorithm for security purposes questionable; specifically, a group of researchers described how to create a pair of files that share the same MD5 checksum. Further advances were made in breaking MD5 in 2005, 2006, and 2007. In December 2008, a group of researchers used this technique to fake SSL certificate validity. The CMU Software Engineering Institute now says that MD5 "should be considered cryptographically broken and unsuitable for further use", and most U.S. government applications now require the SHA-2 family of hash functions.
MapReduce Algorithm
MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004.
The framework is divided into two parts:
• Map, a function that parcels out work to different nodes in the distributed cluster.
• Reduce, another function that collates the work and resolves the results into a single value.
The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
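The following is a minimal single-process sketch of the map and reduce phases for word counting, written to illustrate the programming model rather than any particular framework's API; the documents are toy data.

from collections import defaultdict

# Map phase: emit (word, 1) for every word in a document.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# Reduce phase: collate all counts for one word into a single value.
def reduce_fn(word, counts):
    return (word, sum(counts))

# Driver: the "shuffle" groups map output by key, then each group is reduced.
documents = ["to be or not to be", "to see or not to see"]
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)
results = [reduce_fn(word, counts) for word, counts in groups.items()]
print(sorted(results))  # [('be', 2), ('not', 2), ('or', 2), ('see', 2), ('to', 4)]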
Incremental Algorithm
An incremental algorithm updates the solution to a problem after an incremental change is made to its input. In the application of an incremental algorithm, the initial run is conducted by an algorithm that performs the desired computation from scratch, and the incremental algorithm is used in the subsequent runs (i) using information from earlier computations and (ii) to reflect the update on the network while avoiding re-computation as much as possible. The computation of betweenness centrality depends on the number of shortest paths in a network and the intermediate nodes on these paths. A network update such as an edge insertion or edge cost decrease might result in the creation of new shortest paths in the network. However, a considerable portion of the older paths might remain intact, especially in the unaffected parts of the network. Therefore, accurate maintenance of the number of shortest paths and the predecessors on the shortest paths suffices for accurately updating betweenness values in the case of dynamic network updates. This is the key observation made in the design of the incremental betweenness centrality algorithm.
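As a much simpler illustration of the incremental idea (unrelated to betweenness itself), the sketch below maintains a running aggregate under insertions instead of recomputing from scratch on every update; the values are arbitrary.

# Incremental computation: maintain the mean of a dataset under insertions
# without rescanning all values each time.
class IncrementalMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def insert(self, x):
        # Update the previous result in O(1) instead of an O(n) recomputation.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

m = IncrementalMean()
for x in [4.0, 8.0, 6.0]:
    print(m.insert(x))  # 4.0, 6.0, 6.0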
Mobile Ad Hoc Networks (MANETs) consist of a collection of mobile nodes that are not bound to any infrastructure. Nodes in a MANET can communicate with each other and can move anywhere without restriction. This unrestricted mobility and the easy deployment characteristics of MANETs make them very popular and highly suitable for emergencies, natural disasters, and military operations.
Nodes in a MANET have limited battery power, and these batteries cannot be replaced or recharged in complex scenarios. To prolong or maximize the network lifetime, these batteries should be used efficiently. The energy consumption of each node varies according to its communication state: transmitting, receiving, listening, or sleeping. Both researchers and industry are working on mechanisms to prolong the lifetime of a node's battery, but routing algorithms play an important role in energy efficiency because the routing algorithm decides which node is to be selected for communication.
The main purpose of an energy-efficient algorithm is to maximize the network lifetime. Such algorithms are concerned not only with minimizing the total energy consumption of a route but also with maximizing the lifetime of each node in the network, so as to increase the network lifetime. Energy-efficient algorithms can be based on two metrics: (i) minimizing total transmission energy and (ii) maximizing network lifetime. The first metric focuses on the total transmission energy used to send packets from source to destination, typically by selecting routes with a larger number of hops. The second metric focuses on the residual battery energy of the entire network or the individual battery energy of a node [1].
LITERATURE SURVEY
Algorithms for computing provably near-optimal (in terms of the number of messages) local constraints are presented in [1]; experimental results with real-life network traffic data sets demonstrate that the technique can reduce message communication overhead by as much as 70% compared to existing data distribution-agnostic approaches. An incremental algorithm over vertical partitions is proposed to reduce data shipment, and it is verified experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that the incremental algorithms substantially outperform their batch counterparts [2].
Computing an optimal global evaluation plan is shown to be NP-hard. An implementation of the algorithms is presented, along with experiments that illustrate their potential not only for the optimization of exploratory queries, but also for the multi-query optimization of large batches of standard queries [3], [4].
The problem of finding efficient complete local tests is addressed for an important class of constraints that are very common in practice: constraints expressible as conjunctive queries with negated subgoals. For constraints where the predicates for the remote relations do not occur more than once, complete local tests are presented under insertions and deletions to the local relations. These tests can be expressed as safe, nonrecursive Datalog queries against the local relations. The results also apply to other constraints with negation that are not conjunctive [5].
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day [6].
Conditional functional dependencies (CFDs), a class of integrity constraints for relational databases, are introduced and their applications in data cleaning are studied. In contrast to traditional functional dependencies (FDs), which were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For the static analysis of CFDs, two problems are investigated: the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD.
It is shown that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, an inference system analogous to Armstrong's axioms for FDs is provided, and the implication problem is shown to be coNP-complete for CFDs, in contrast to the linear-time complexity of their traditional counterpart. An algorithm for computing a minimal cover of a set of CFDs is also presented.
Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. Techniques are developed for detecting CFD violations in SQL, as well as novel techniques for checking multiple constraints with a single query. Incremental methods are also provided for checking CFDs in response to changes to the database, and the effectiveness of these CFD-based methods for inconsistency detection is verified experimentally. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality [7].
A top-down join enumeration algorithm is presented that is optimal with respect to the join graph, with performance results demonstrating that a combination of optimal enumeration with search strategies such as branch-and-bound yields an algorithm significantly faster than those previously described in the literature. Although the algorithm enumerates the search space top-down, it does not rely on transformations and thus retains much of the architecture of traditional dynamic programming. As such, this work provides a migration path for existing bottom-up optimizers to exploit top-down search without drastically changing to the transformational paradigm [8].
PROPOSED METHOD
Our proposed system aims to reduce data shipment. Prior approaches employ, e.g., counters, pointers, and tags in base relations; while these could be incorporated into our solution, they do not yield bounded or optimal incremental detection algorithms. There has also been a host of work on query processing and multi-query optimization for distributed data; the former typically aims to generate distributed query plans to reduce data shipment or response time and to support error detection.
A. Description of the Proposed Method:
The steps involved in our approach are as follows:
Step 1: Data fragmentation
We consider relations D of schema R that are partitioned into fragments, either vertically or horizontally. In some applications one wants to partition D into (D1, . . . , Dn) horizontally, and in other cases the partition may be vertical. This process is mainly intended to reduce the communication cost.
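Below is a minimal sketch of both partitioning styles for a relation represented as a list of dictionaries; the schema and the two-site split are illustrative assumptions.

# Relation D over schema R = (id, zip, city), as a list of dicts (toy data).
D = [
    {"id": 1, "zip": "EH4", "city": "Edinburgh"},
    {"id": 2, "zip": "EH4", "city": "London"},
    {"id": 3, "zip": "SW1", "city": "London"},
]

# Horizontal partitioning: split the tuples across sites (here by id parity).
D1 = [t for t in D if t["id"] % 2 == 1]
D2 = [t for t in D if t["id"] % 2 == 0]

# Vertical partitioning: split the attributes, keeping the key at both sites
# so tuples can be re-associated.
V1 = [{"id": t["id"], "zip": t["zip"]} for t in D]
V2 = [{"id": t["id"], "city": t["city"]} for t in D]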
Step 2: CFD violation detection
We give an algorithm for detecting violations of CFDs over vertical and horizontal partitions. Leveraging the index structures, an incremental algorithm is used to detect violations in vertical partitions. At first a single update for a single CFD is considered; the algorithm is then extended to multiple CFDs and batch updates.
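The sketch below conveys the incremental flavor of the check for a single CFD under single-tuple insertions: a hash index from the CFD's left-hand side to the observed right-hand-side values lets each update be validated without rescanning the data. The CFD and index layout are simplifying assumptions, not the exact algorithm of this paper.

# Incremental check of one CFD ([country = 'UK'], zip -> city) on insertions.
# index maps the LHS value (zip) to the set of RHS values (city) seen so far.
index = {}

def insert_and_check(tuple_):
    if tuple_["country"] != "UK":
        return []                      # condition fails: tuple not covered
    cities = index.setdefault(tuple_["zip"], set())
    violations = []
    if cities and tuple_["city"] not in cities:
        violations.append(tuple_)      # same zip, different city: violation
    cities.add(tuple_["city"])
    return violations

print(insert_and_check({"country": "UK", "zip": "EH4", "city": "Edinburgh"}))  # []
print(insert_and_check({"country": "UK", "zip": "EH4", "city": "London"}))     # one violation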
Step 3: Partition optimization
To reduce data shipment for error detection in vertical partitions, the idea is to identify and maximally share indices among CFDs, so that when multiple CFDs demand the shipment of the same tuples, only a single copy of the data is shipped. The problem of building optimal indices is NP-complete, but an efficient heuristic algorithm is provided. An incremental detection algorithm is also provided for horizontal partitions; this algorithm is likewise optimal, as for its vertical counterpart. A tuple may be large, so to reduce its shipping cost a natural idea is to encode the whole tuple and send the encoding of the tuple instead of the tuple itself.
Figure 1 shows the overall architecture of the distributed data. MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. We use MD5 in our implementation to further reduce the communication cost, by sending a 128-bit MD5 code instead of an entire tuple.
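A minimal sketch of this encoding step: serialize the tuple deterministically, ship only its 16-byte MD5 digest, and let the receiving site compare digests to test tuple equality. The serialization format is an assumption for illustration.

import hashlib
import json

# Ship a 16-byte MD5 code instead of the (possibly large) tuple itself.
def encode_tuple(t):
    # Deterministic serialization so equal tuples hash to equal digests.
    payload = json.dumps(t, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).digest()   # 16 bytes = 128 bits

local = {"zip": "EH4", "city": "Edinburgh", "payload": "x" * 10_000}
remote = {"zip": "EH4", "city": "Edinburgh", "payload": "x" * 10_000}

# Sites exchange 16-byte codes; equal codes indicate (almost surely) equal tuples.
print(encode_tuple(local) == encode_tuple(remote))  # True
print(len(encode_tuple(local)))                     # 16 bytes shipped, not ~10 KB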
CONCLUSION
We have studied incremental CFD violation detection for distributed data, from complexity to algorithms. We have shown that the problem is NP-complete but is bounded. We have also developed optimal incremental violation detection algorithms for data partitioned vertically or horizontally, as well as optimization methods. Our experimental results have verified that these yield a promising solution to catching errors in distributed data. There is naturally much more to be done. First, we are currently experimenting with real-life datasets from different applications, to find out when incremental detection is most effective. Second, we intend to extend our algorithms to data that is partitioned both vertically and horizontally. Third, we plan to develop MapReduce algorithms for incremental violation detection. Fourth, we aim to extend our approach to support constraints defined in terms of similarity predicates (e.g., matching dependencies for record matching) beyond equality comparison, for which hash-based indices may not work and more robust indexing techniques need to be explored. Fifth, Hadoop MapReduce is also used to compress the data in the cloud.
Figures at a glance
Figure 1
Figure 2
References
- S. Agrawal, S. Deb, K. V. M. Naidu, and R. Rastogi, "Efficient detection of distributed constraint violations," in Proc. ICDE, Istanbul, Turkey, 2007.
- J. Bailey, G. Dong, M. Mohania, and X. S. Wang, "Incremental view maintenance by base relation tagging in distributed databases," Distrib. Parall. Databases, vol. 6, no. 3, pp. 287-309, Jul. 1998.
- L. F. Mackert and G. M. Lohman, "R* optimizer validation and performance evaluation for distributed queries," in Proc. VLDB, Kyoto, Japan, 1986.
- A. Kementsietsidis, F. Neven, D. Craen, and S. Vansummeren, "Scalable multi-query optimization for exploratory queries over federated scientific databases," in Proc. VLDB, Auckland, New Zealand, 2008.
- N. Huyn, "Maintaining global integrity constraints in distributed databases," Constraints, vol. 2, no. 3/4, pp. 377-399, 1997.
- J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. OSDI, 2004.
- W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, "Conditional functional dependencies for capturing data inconsistencies," ACM Trans. Database Syst., vol. 33, no. 2, Article 6, Jun. 2008.
- D. DeHaan and F. W. Tompa, "Optimal top-down join enumeration," in Proc. ACM SIGMOD, New York, NY, USA, 2007.