| Keywords | 
        
            | Measurement, Ranking, Relationship, Generalized flow | 
        
            | INTRODUCTION | 
        
            | Wikipedia is habitually a better option for a user to achieve knowledge of a single object than typical       search engines. A user also might long to discover a relationship among two objects. | 
        
            | A relationship is an association between two or more people that may range in duration from brief to       enduring. Word Association is a common word game involving an exchange of words that are associated together. This       association may be based on regular business interactions, or some other type of social commitment. Interpersonal       relationships are formed in the context of social. Interpersonal relationships are dynamic systems that change       continuously during their existence. | 
        
            | A innovative method for measuring a relationship on Wikipedia by reflecting all the three concepts:       distance, connectivity, and co citation. Here relationships are measured rather than the similarities. Assignment of the       gain to each edge is important for measuring a relationship using a generalized maximum flow. It is established through       experiments that the gain function is adequate to measure relationships appropriately. The strength of the relationship       between a source object and each of its destinations objects, and rank the destination objects by the strength. The       connection and production of other words in response to a given word is done spontaneously as a game, creative       technique, or in a psychiatric evaluation. | 
        
            | The link analysis is a data-analysis technique used to evaluate relationships (connections) between       nodes. Relationships may be identified among various types of nodes. Data gathering and processing requires access to       data and has several inherent issues, including information overload and data errors. Once data is collected, it must be       transformed into a format that can be effectively used by both human and computer analysers. The user can recognize       an explicit relationship between two objects easily by reading the pages for the two objects in Wikipedia. By contrast, it is tricky for the user to determine an implicit relationship and elucidatory objects without investigating a number of       pages and links. So, it is an appealing problem to measure and explain the strength of an implicit relationship between       two objects in Wikipedia. | 
        
            | A Wikipedia is a type of content management system, it differs from a blog or most other such       systems in that the content is created without any defined owner or leader, and Wikipedia have little implicit structure,       allowing structure to emerge according to the needs of the users. The encyclopaedia project Wikipedia is the most       famous wiki on the public web, but there are many sites running many different kinds of wiki software. Wiki promotes       meaningful topic associations between different pages by making page link creation almost intuitively easy and       showing whether an intended target page exists or not. Wiki is not a carefully crafted site for casual visitors. Instead, it       seeks to involve the visitor in an ongoing process of creation and collaboration that constantly changes the Web site       landscape. | 
        
            | The nature of mining processes creates a potential negative impact on the environment both during       the mining operations and for years after the mine is closed. This impact has led to most of the world's nations adopting       regulations to moderate the negative effects of mining operations. Safety has long been a concern as well, and modern       practices have improved safety in mines significantly. The data mining task is the automatic or semi-automatic analysis       of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster       analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using       database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and       may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data       mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction       results by a decision support system. Neither the data collection, data preparation, nor the result interpretation and       reporting are part of the data mining step, but do belong to the overall knowledge discovery process as additional steps. | 
        
            | Data mining uses information from past data to analyse the outcome of a particular problem or       situation that may arise. Data mining works to analyse data stored in data warehouses that are used to store that data       that is being analysed. That particular data may come from all parts of business, from the production to the       management. Managers also use data mining to decide upon marketing strategies for their product. They can use data to       compare and contrast among competitors. Data mining interprets its data into real time analysis that can be used to       increase sales, promote new product, or delete product that is not value-added to the company. | 
        
            | Text analysis involves information retrieval, lexical analysis to study word frequency distributions,       pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association       analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis,       via application of natural language processing (NLP) and analytical methods. A typical application is to scan a set of       documents written in a natural language and either model the document set for predictive classification purposes or       populate a database or search index with the information extracted. | 
        
            | In Wikipedia, the knowledge of an object is gathered in a single page updated constantly by a number       of volunteers. Wikipedia also covers objects in a number of categories, such as people, science, geography, politic, and       history. Therefore, searching Wikipedia is usually a better choice for a user to obtain knowledge of a single object than       typical search engines. A user also might desire to discover a relationship between two objects. Classic keyword search       engines can neither measure nor explain the strength of a relationship. The main issue for measuring relationships       arises from the fact that two kinds of relationships exist:“explicit relationships” and “implicit relationships.” | 
        
            | This paper proceeds as follows. Section II describes the related work and the global operation of       system. The different leveraged metrics and components of the system are presented in Section III. Section IV presents       components. Section V explains the conclusion and future works. | 
        
            | RELATED WORK | 
        
            | To rank the relationships between two objects in Wikipedia, measure the strength of the link among two pages       that constitute the relationship. Existing system measures the relationship based on distance, connectivity and co       citation. | 
        
            | Dataset | 
        
            | Dataset in Wikipedia includes objects that have relationships between each other. The existing method implies how       to measure the strength of relationships among pages. | 
        
            | Loading a dataset | 
        
            | A data set is a collection of data, it lists values for each of the variables, such as height and weight of       an object. The query used to generate a particular data set from the selected connection or flat file profile. Multiple data       set definitions can be created for the same profile in order to generate different data set instances. To improve       classification accuracy, insignificant parameters and patient data could be deleted from the data set. | 
        
            | The schema of a Data Set can be defined programmatically, created by the Fill or Fill Schema       methods of a Data Adapter, or loaded from an XML document. To load Data Set schema information from an XML       document, use either the Read Xml Schema or the Infer Xml Schema method of the Data Set can be used. | 
        
            | Read Xml Schema allows one to load or infer Data Set schema information from the document       containing XML Schema definition language (XSD) schema, or an XML document with inline XML Schema. Infer       Xml Schema allows to infer the schema from the XML document while ignoring certain XML namespaces that is       specified. | 
        
            | Data Preprocessing | 
        
            | Data pre-processing is an important step in the data mining process. Data-gathering methods are often       loosely controlled, resulting in out-of-range values. Analyzing data that has not been carefully screened for such       problems can produce misleading results. Thus, the representation and quality of data is first and foremost before       running an analysis. | 
        
            | If there is much irrelevant and redundant information present or noisy and unreliable data, then       knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take       considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature       extraction and selection, etc. | 
        
            | The product of data pre-processing is the final training set. The lacking of attribute values, lacking       certain attributes of interest, or containing only aggregate data, reduce the volume but producing the same or similar       analytical results. | 
        
            | Generalized Maximum Flow-based Method | 
        
            | The generalized maximum flow problem is identical to the classical maximum flow problem except       that every edge e has a gain γ(e)>0; the value of a flow sent along edge e is multiplied by γ(e)>0. Let f(e)≥ 0 be the       flow f on edge e, and μ(e) ≥ 0 be the capacity of edge e. The capacity constraint f(e)≤ μ(e) must hold for every edge e.       The goal of the problem is to send a flow emanating from the source vertex s into the destination vertex t to the greatest       extent possible, subject to the capacity constraints. Let generalized network G=(V,E,s,t μ, γ) be information network       (V.E) with the source s?V , the destination t?V , the capacity μ, and the       gain γ. Fig. 4 depicts an example of a generalized maximum       flow on a generalized network. | 
        
            | Here proposing a new method for measuring the strength of a relationship using the generalized       maximum flow. The value of flow f is defined as the total amount of f arriving at destination t. To measure the strength       of a relationship from object s to object t, the value of a generalized maximum flow emanating from s as the source into       t as the destination is used; a larger value signifies a stronger relationship. The vertices in the paths composing the       generalized maximum flow as the objects constituting the relationship are regarded. | 
        
            | Distance | 
        
            | In the methods based on distance, a shorter path represents a stronger relationship. For this method,       γ(e)<1 for every edge e; then a flow considerably decreases along a long path. A short path usually contributes to the       generalized maximum flow by a greater amount than a long path does. Therefore, a shorter path means a stronger       relationship in the method. | 
        
            | Connectivity | 
        
            | In methods based on connectivity, a strong relationship is represented by many vertex disjoint paths       from the source to the destination. The number of vertex disjoint paths can be computed by solving a classical       maximum flow problem. The generalized maximum flow problem is a natural extension of the classical maximum flow       problem. Therefore, it also can be used to estimate the connectivity. | 
        
            | Co citation | 
        
            | A flow emanates from the source into the destination, and therefore the flow seldom uses an edge       whose direction is opposite that from the source to the destination. On the other hand, Use of both directions is required       to estimate the co citation of two objects. Consider the relationship between two objects s and t in the network       presented in Fig. 5a. Object u is co cited by s and t. | 
        
            | This co-citation is represented by two edges (s,u) and (t,u). However, unable to send a flow from s to t along the two       edges, unless reverse the direction of the edge (t,u) to (u.t). Therefore, construct a doubled network by adding to every       original edge in G a reversed edge whose direction is opposite to the original one. For example, Fig. 5b depicts the       doubled network for the network presented in Fig. 5a. | 
        
            | METHODS TO MEASURE THE RELATIONSHIP | 
        
            | Gain Function | 
        
            | In order to determine the gain function, first consider what kinds of explicit relationships are       important in constituting an implicit relationship. In a generalized max-flow problem, a path composed of edges with       large gain scan contributes to the value of a flow. To realize such a gain assignment, construct groups of objects in       Wikipedia. | 
        
            | Categories cannot be used as groups directly because the category structure of Wikipedia is too       fractionalized. Mutation can result in several different types of change in sequences, Mutations in genes can either have       no effect, alter the product of a gene, or prevent the gene from functioning properly or completely. | 
        
            | Cycle-Free Effective Conductance (CFEC) | 
        
            | The cycle-free escape probability from s to t is the probability that a random walk originating at s will       reach t without visiting any node more than once. The transition from one state to another does not depend on the       previous state. The transition could be to the same state also. In network, proximity is the infinite number of attempts       that is made to reach from starting node to end node. | 
        
            | CFEC proximity allows to readily compute proximity graphs, which are small portions of the       network that are aimed at capturing a related proximity value. It is extension of connection graph which is capable of       presenting compact relationship between objects of a network. It can deduce relationship between more than two       endpoints, the flexibility to handle edge direction, and the fact that they are obtained by solving an intuitively tunable       optimization problem. | 
        
            | RANKING | 
        
            | A ranking is a relationship between a set of items such that, for any two items, the first is either       'ranked higher than', 'ranked lower than' or 'ranked equal to' the second. By reducing detailed measures to a sequence of       ordinal numbers, rankings make it possible to evaluate complex information according to certain criteria. t is not always possible to assign rankings uniquely. A common shorthand way to distinguish these ranking strategies is by the       ranking numbers. | 
        
            | In competition ranking, items that compare equal receive the same ranking number, and then a gap is       left in the ranking numbers. The number of ranking numbers that are left out in this gap is one less than the number of       items that compared equal. The assignment of distinct ordinal numbers to items that compare equal can be done at       random, or arbitrarily, but it is generally preferable to use a system that is arbitrary but consistent, as this gives stable       results if the ranking is done multiple times. Query-independent methods attempt to measure the estimated importance       of a page, independent of any consideration of how well it matches the specific query. | 
        
            |  | 
        
            | In above example, for the source and the destination objects, select famous person known by the participants creating       the rankings by their subjects. First select 10 famous Japanese and American politicians as source objects from       Japanese Wikipedia, in order to enables the participants to investigate relationships among the persons on Wikipedia       and create appropriate rankings. As the destination objects for each source, select four famous persons related to the       source. Here only four destinations for each source is selected, because preliminarily observed that participants sometimes wavered in their judgments for five or more destinations. For each of the 40 obtained pairs of a source and a       destination, the strength of the relationship from the source to the destination using this method is computed. | 
        
            | CONCLUSION AND FUTURE WORKS | 
        
            | Thus this method can measure the strength of a relationship between two objects on Wikipedia and       rank them. Furthermore, this method does not underestimate objects having high degrees. This paper plans to form a       relation tree, a directed tree with a unique node corresponding to the most recent common ancestor of all the entities at       the leaves of the tree. | 
        
            |  | 
        
            | Figures at a glance | 
        
            | 
                
                    
                        |  |  |  |  
                        | Figure 1 | Figure 4 | Figure 5 |  | 
        
            |  | 
        
            | References | 
        
            | 
                Y.  Koren, S.C. North, and C. Volinsky, “Measuring and Extracting Proximity in  Networks,” Proc. 12th ACM SIGKDD Int’l Conf.Knowledge Discovery and Data  Mining, pp. 245-255, 2006.
 M. Ito, K. Nakayama, T. Hara, and S. Nishio,  “Association Thesaurus Construction Methods Based on Link Co-OccurrenceAnalysis  for Wikipedia,” Proc. 17th ACM Conf. Information and Knowledge Management  (CIKM), pp. 817-826, 2008.
 K. Nakayama, T. Hara, and S. Nishio,  “Wikipedia Mining for an Association Web Thesaurus Construction,” Proc. Eighth  Int’l Conf. Web Information Systems Eng. (WISE), pp. 322-334, 2007.
 J. Gracia and E. Mena, “Web-Based Measure of  Semantic Relatedness,” Proc. Ninth Int’l Conf. Web Information Systems Eng.  (WISE), pp. 136-150, 2008.
 R.K. Ahuja, T.L. Magnanti, and J.B. Orlin,  Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.
 K.D. Wayne, “Generalized Maximum Flow  Algorithm,” PhD dissertation, Cornell Univ., New York, Jan. 1999.
 R.L. Cilibrasi and P.M.B. Vita´nyi, “The Google  Similarity Distance,” IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp.  370-383, Mar. 2007.
 G. Kasneci, F.M. Suchanek, G. Ifrim, M.  Ramanath, and G. Weikum, “Naga: Searching and Ranking Knowledge,” Proc. IEEE  24th Int’l Conf. Data Eng. (ICDE), pp. 953-962, 2008.
 F.M. Suchanek, G. Kasneci, and G. Weikum,  “Yago: A Core of Semantic Knowledge,” Proc. 16th Int’l Conf. World wide Web  Conf. (WWW), pp. 697-706, 2007.
 “The Erdo¨ s Number Project,”  http://www.oakland.edu/enp/,2012.
 L. Katz, “A New Status Index Derived from  Sociometric Analysis,” Psychometrika, vol. 18, no. 1, pp. 39-43, 1953.
 S. Wasserman and K. Faust, Social Network  Analysis: Methods and Application (Structural Analysis in the Social Sciences).  Cambridge Univ. Press, 1994.
 C. Faloutsos, K.S. Mccurley, and A. Tomkins,  “Fast Discovery of Connection Subgraphs,” Proc. 10th ACM SIGKDD Int’l Conf.  Knowledge Discovery and Data Mining, pp. 118-127, 2004.
 P.G. Doyle and J.L. Snell, Random Walks and  Electric Networks, vol. 22. Math. Assoc. Am., 1984.
 M. Nakatani, A. Jatowt, and K. Tanaka,  “Easiest-First Search: Towards Comprehension-Based Web Search,” Proc. 18th ACM  Conf Information and Knowledge Management (CIKM), pp. 2057- 2060, 2009.
 L. Finkelstein, E. Gabrilovich, Y. Matias, E.  Rivlin, Z. Solan, G.Wolfman, and E. Ruppin, The WordSimilarity-353 Test  Collection,2002.
 E. Agirre, E. Alfonseca, K. Hall, J.  Kravalova, M. Pas¸ca, and A. Soroa, “A Study on Similarity and Relatedness  Using Distributional and Wordnet-Based Approaches,” Proc. 10th Human Language  Technologies: Ann. Conf. North Am. Chapter of the Assoc. Computational  Linguistics (NAACL-HLT), pp. 19-27, 2009.
 W. Xi, E.A. Fox, W. Fan, B. Zhang, Z. Chen, J.  Yan, and D. Zhuang, “Simfusion: Measuring Similarity Using Unified Relationship  Matrix,” Proc. 28th Ann. Int’l ACM SIGIR Conf. Research and Development in  Information Retrieval, pp. 130-137, 2005.
 [19]  D. Fogaras and B. Ra´cz, “Practical Algorithms and Lower Bounds for Similarity  Search in Massive Graphs,” IEEE Trans. Knowledge Data Eng., vol. 19, no. 5, pp.  585-598, May 2007.
 “Country Ranks 2009,”  http://www.photius.com/rankings/index.html, 2012.
 |