Comparisons of different things plays crucial role in human decision making process. Even though, decision making is common in our daily life but requires proper knowledge and skills to know what should get compare and what will be alternatives so as to make good decision. In this paper, we review the background and stateof- the-art of comparable entity data mining based on comparative questions.We first introduce the general background of entity mining from comparative questions and review related phases, such as information extraction and comparator ranking. With each phase, we provide a related background, discuss the technical challenges, and review current research on the techniques used in that phase. This survey is concluded with a discussion of latest experimental results from research articles.
                
  
    | Keywords | 
  
  
    | comparable entity mining; comparators; inductive extraction pattern; bootstrapping algorithm | 
  
  
    | INTRODUCTION | 
  
  
    | It is human nature to try to compare things. Laptops, mobiles, iPhones, cars etc. are compared on number of features.
      In decision-making process, comparing alternative options is one of the necessary steps that we carry out in day-to-day
      life. However it requires skillful and high knowledge expertise person. In today’s era everyone is using World Wide
      Web (WWW) and it is obvious to compare things online. For e.g. for online laptop shopping user must have detailed
      knowledge of its specifications such as processor, memory, storage, graphics, display, etc. In such case, it becomes
      difficult for a person with insufficient knowledge to make a good decision to finalize best laptop according to his/her
      need and also making comparison of alternative options available in market. Comparative question and its comparators
      are two main components of decision making process. | 
  
  
    | Comparative questions: A question with purpose of comparing two or more entities which are explicitly mentioned
      in the question archived by online users. | 
  
  
    | Comparator: Target entities in a comparative question which are to be compared are comparative entities or called
      as comparators. | 
  
  
    | In the following example Q1 & Q2 are not comparative questions whereas Q3 is comparative question in which
      “BMW” and “Skoda” are comparators. | 
  
  
    | Q1. “Which one is better?” | 
  
  
    | Q2. “Is BMW the best car?” | 
  
  
    | Q3. “Which car is better carBMW or Skoda?” | 
  
  
    | The outcomes of these comparative questions will be very useful in helping user’s exploration i.e. recommending
      variousalternatives choices by suggesting comparable entities on the basis of other previous online user’s requests. | 
  
  
    | The procedure of discovering related items for an entity is similar to recommender system, which recommends items
      for users. Recommender systems mainly rely on similarities between items and/or their statistical correlations in user
      log data. In literature we can found many research articles focusing on comparator mining [1], [2], [3],[4]. In our paper,
      we tried to make inside in to their proposed techniques with their pros and cons. | 
  
  
    | The rest of paper is organized as follows. In Section II a short literature survey is given, Section III gives a very brief
      review of information extraction. Latest experimental results are presented in section IV with conclusion in section V. | 
  
  
    | RELATED WORK | 
  
  
    | The work on comparator mining is related tothe research on entity and relation extraction ininformation extraction.
      Specifically, the most relevant work is by Jindal and Liu [1], [15]on mining comparative sentences and relations. Their methods applied class sequential rules (CSR) and label sequential rules (LSR) learned from annotated corpora to
      identify comparative sentences and extract comparative relations respectively in the news and review
      domains.Bootstrapping methods have been shown to be very effective in previous information extraction research [6],
      [8]. Bootstrapping technique is used to extract entities with a specific relation. | 
  
  
    | INFORMATION EXTRACTION | 
  
  
    | IInformation Extraction (IE) deals with locating specific pieces of data in natural-language documents, thereby
      mining structured and meaningful information from unstructured and/or a semi-structured one is called as Information
      Extraction [5]. One type of IE, named entity recognition, involves identifying references to particular kinds of objects
      such as names of people, companies, and locations. There are mainly three methods used for information extraction as
      [6], [7], [8] given below, | 
  
  
    | 1. Rule based Extraction: | 
  
  
    | One approach of IE is to automatically learn pattern-based extraction rules for identifying each type of entity or
      relation. For example, the system developed by Rapier in [9]. Patterns are expressed in an enhanced regular-expression
      language; and a bottom-up relational rule learner is used to induce rules from a corpus of labeled training examples.
      Inductive Logic Programming (ILP) [10] has also been used to learn logical rules for identifying phrases to be extracted
      from a document [11], [12]. | 
  
  
    | 2. Pattern based extraction: | 
  
  
    | Pattern based approaches build on annotated text fragments (the patterns), where words/phrases are labeled with
      linguistic information, e.g. POS-tag, word lemma, or syntactic information. Those patterns are matched against
      linguistically annotated text to detect relationships [13]. | 
  
  
    | 3. Supervised Learning: | 
  
  
    | Supervised learning is the machine learning task of inferring a function from labeled training data. The training data
      consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object
      (typically a vector) and a desired output value (also called the supervisory signal). However, supervised training of
      accurate entity and relation extractors is costly, requiring a substantial number of labeled training examples for each
      type of entity and relation to be extracted. Because of this, many researchers have explored semi-supervised learning
      methods that use only a small number of labeled examples of the predicate to be extracted, along with a large volume
      of unlabeled text [14]. All the above information extraction methods can be used for comparator methods as in [2], [3],
      [4] by Li Shasha, Jindal and Liu in [1],[15]. | 
  
  
    | A. Design Considerations: | 
  
  
    | Supervised comparative mining method was proposed by Jindal and Liu [1], [15] which is a baseline for
      comparison. It focuses mainly on two rules mentioned as Class Sequential Rule (CSR) & Label Sequential Rule (LSR)
      as descried below. | 
  
  
    | a. Class Sequential Rule (CSR): | 
  
  
    | It is a classification rule which maps a sequence pattern S (s1, s2 . . . sn) (a class C. C is either comparative or
      noncompetitive). Every CSR is associated with two parameter support and confidence. | 
  
  
    | b. Label Sequential Rule (LSR): | 
  
  
    | It maps an input sequence pattern S (s1, s2 . . . si . . . sn) to a labeled sequence S (s1, s2 . . . li . . . sn) by replacing token
      si in the input sequence with a designated label (li) and this token is referred as the anchor. | 
  
  
    | Jindal and Liu [1] method have been proved effective in their experimental setups. However, it has the some
      drawbacks as given below, | 
  
  
    |  The performance of Jindal and Liu’s method depends mainly on a set of comparative sentence indicative
      keywords [3]. | 
  
  
    |  Users can express comparative sentences or questions in many different ways. To have high recall, a large
      annotated training corpus is necessary. This is an expensive process | 
  
  
    |  CSRs and LSRs introduced by Jindal and Liu in [15] mostly a combination of POS tags and keywords. It is a
      surprise that their rules achieved high precision but low recall. | 
  
  
    | B. A Weakly Supervised Method for Comparator Mining: | 
  
  
    | To resolve the conflict in extracting comparative questions and its comparator with high precision as well as with
      high recall a Weakly Supervised Bootstrapping method is introduced by Li Shasha in [2]. | 
  
  
    | 1. Indicative Extraction Patterns Mining: | 
  
  
    | Indicative Extraction Pattern (IEP) is a sequential pattern which can be used for identification of comparative
      questions along with comparator extraction with high reliability. A question is classified as a comparative question if it
      matches an IEP and the token sequences corresponding to the comparator slots in the IEP are extracted as comparators.
      If a question matchesmultiple IEPs, the longest IEP is used. Therefore, instead of manually creating a list of indicative
      keywords, we create a set of IEPs automatically, referred as weakly supervised method which is iterative as shown in
      Fig. 1. The two key steps in this algorithm are pattern generation and pattern evaluation. | 
  
  
    | 2. Pattern Generation: | 
  
  
    | The weakly supervised IEP mining is highly based on two key assumptions as [3], [16], [17] | 
  
  
    |  If a sequential pattern can be used to extract many reliable comparator pairs, it is very likely to be an IEP. | 
  
  
    |  If a comparator pair can be extracted by an IEP, the pair is reliable. | 
  
  
    | Based on these key assumptions, bootstrapping algorithm designed as shown in Fig. 1. | 
  
  
    | To generate sequential patterns, Li Shasha in [1],[3] used surface text mining method introduced in [2]. In this
      method, comparators in the question are replaced by symbol $Cs in any given comparative question and its comparator
      pair. The symbol #start is attached to the beginning of the each sentence and the symbol #end at the end of sentence. Li
      Shasha in [3] used some heuristic rules and phrase chuncking for diversity reduction of sequence data and mine
      potential patterns. Following three kinds of sequential patterns can be generated from sequences of questions as: | 
  
  
    | a. Lexical Patterns: | 
  
  
    | These patterns indicate sequential patterns consisting of only words and symbols ($C, #start, and #end). | 
  
  
    | b. Generalized Patterns: | 
  
  
    | A lexical pattern is too specific for matching. So lexical patterns are generalized by replacing one or more words
      their POS tags. | 
  
  
    | c. Specialized Patterns: | 
  
  
    | Pattern specialization is done by adding POS tags to all comparator slots. For example, from the lexical pattern '<$C
      or $C>' and the question 'Paris or London?', '<$C=NN or $C=NN?>' will become specialized pattern. | 
  
  
    | Note that in this method, lexical patterns are used to generate generalized patterns and the combined set of
      generalized patterns and lexical patterns are used to generate specialized patterns [1],[3]. | 
  
  
    | 3. Pattern evaluation: | 
  
  
    | Bootstrapping gives very few reliable comparator pairs in its early stage. Hence for discovering more reliable pair’s,
      pattern evolution operation is performed. In this case, the value might be underestimated which could affect the
      effectiveness of on distinguishing IEPs from non-reliable patterns. This problem is mitigated by a look-ahead
      procedure. The next step is to rank possible comparators for a user’s input [1], [3]. | 
  
  
    | C. Comparator Extraction: | 
  
  
    | By employing IEPs, it is easy to identify comparative questions and collect comparator pairs from available data.
      For given question and an IEP, comparator extraction process is described in [1], [2], [3], [4] as follows: | 
  
  
    | 1. Generate sequence for the comparative question: | 
  
  
    | If the IEP is a pattern without generalization, then tokenize the questions and the list of resulted tokens is the sequence.
      Otherwise, phrase chuncking is needed. The sequence is a list of resulted chunks. | 
  
  
    | 2. Check whether sequence of the question matches with the given pattern: | 
  
  
    | If IEP is a specialized pattern, the POS tag sequence of extracted comparators should follow the constraints
      specified by the pattern. | 
  
  
    | However, a result of [3] shows about 67 % comparative questions can match to multiple patterns, and from 11 %
      comparative questions, we can extract different comparator pairs. Li Shasha in [3], [4] examined three different
      strategies to solve the issue of comparator extraction. | 
  
  
    | D. Comparator Ranking: | 
  
  
    | The comparability and graph based methods are examined rank possible comparators for user’s input [1], [3], [18]
      which are described below, | 
  
  
    | 1. Comparability-Based Ranking Method: | 
  
  
    | Frequent comparison of entity with particular entity would make comparator more interesting. Based on this
      intuition, a simple ranking function Rfreq(c;e) ranks comparators on the basis of number of times that a comparator c is
      get compared to the user’s input e in online comparative question archive Q. | 
  
  
    |  eq. (1) | 
  
  
    | Where (Qc, e) is a set of questions from which comparator c and user input e can be extracted as a comparator pair.
      This method also known as frequency based Method. The another ranking function is Rrel by combining reliability
      scores estimated in comparator mining phase | 
  
  
    |  eq. (2) | 
  
  
    | Where p q, c, e means the pattern that is selected to extract comparator pair of c and e from question [3]. | 
  
  
    | 2. Graph Based Ranking Method: | 
  
  
    | Frequency is consider as efficient parameter for comparator ranking but the frequency-based ranking method [3] can
      suffer when an user input occurs rarely in collection of questions; for example,suppose all possible comparators to the
      input are compared only once in questions. In this case, this method may fail to results correct ranking result. Hence in
      addition to it representing ability should also be considered. We regard a comparator representative if it is frequently
      used as a baseline while making comparison of interested entity. | 
  
  
    | Graph based page rank method is one of the solutions to get ability. A comparator can be considered as valuable
      comparator in ranking if it is compared to too many other important comparators including the input entity. Based on
      this idea, Page Rank algorithm is examined to rank comparators for a given input entity, which combine frequency and
      represent ability [3]. | 
  
  
    | EXPERIMENTAL RESULTS | 
  
  
    | In this section the experimental results of different research papers are compared and discuss. | 
  
  
    | A. Comparative Question Identification and Comparator Extraction: | 
  
  
    | The latest experimental results on comparative question identification and comparator extraction for the data of
      60M questions mined from Yahoo! Answers’ question title field can be found in [4]. The experimental results of Li Shasha [3] compared with Jindal and Liu’s [1] methods are shown in Table I. In the Table I, column with Identification
      only shows the performances in comparative question identification, column with Extraction only shows the
      performances of comparator extraction when only comparative questions are used as input, and last column with All
      shows the end-to-end performances when question identification results were used in comparator extraction. | 
  
  
    | In terms of precision, the method described in [1] is competitive to method used in [3] in comparative question
      identification. However, the recall is significantly lower in [1] than [3]. In the end-to-end experiments, weakly
      supervised method of [3] performs significantly better than method of [1]. F1-measure of [1] in All is about 30 % and
      32 % worse than the scores of Identification only and Extraction only respectively, our method only shows small
      amount of performance decrease (approximately 7-8 %). | 
  
  
    | B. Ranking Results of Comparability-BasedvsGraph-Based Ranking Methods: | 
  
  
    | The proposed algorithm is consists of three main steps. Ranking results of comparability and graph based ranking
      methods of [3] are shown in Table II. For some queries whose comparator’s frequency differs significantly, such as
      “iphone” and “BMW 328i” the ranking results of two methods do not make many differences. That’s because
      frequency plays the main role in the ranking process for these queries in graph-base ranking methods. However, for
      queries whose comparators share similar frequency, such as “BMW 328i”,“Nokia N75” and “Nikon D200” the
      differences between two methods are obvious. These experimental results show both graph based and page rank
      methods are effective for both comparative question identification and comparator extraction. | 
  
  
    | CONCLUSION | 
  
  
    | This paper surveys various research articles and there experimental results that are currently available and discusses
      the pros and cons for each of them. A thorough comparison between different methods based on experimental results
      for 60M questions. After surveys it is found that a novel weakly supervised method described in [1], [2], [3],[4]
      identifies comparative questions and extracts comparator pairs simultaneously with high precision and high recall. The
      results of [2], [3], [4] can be used for a commerce search or product recommendation system from user comparison
      interest. For example, automatic suggestion of comparable entities can assist users in their comparison activities and
      will help for better purchase decisions. | 
  
  
    |  | 
  
    | Tables at a glance | 
  
    | 
 
    |  |  |  
    | Table 1 | Table 2 |  | 
  
    |  | 
  
    | Figures at a glance | 
  
    | 
 
    |  |  
    | Figure 1 |  | 
  
  
    |  | 
  
    | References | 
  
    | NitinJindal and Bing Liu,âÃâ¬ÃËIdentifying  Comparative Sentences in Text DocumentsâÃâ¬Ãâ¢,Proceedings of the 29th annual  international ACM SIGIR conference on Research and development in information  retrieval, pp. 244-251, 2006.
 Li Shasha, Chin-Yew Lin, Young-In Song, and  ZhoujunLi,âÃâ¬ÃËComparable Entity Mining from Comparative QuestionsâÃâ¬Ãâ¢, Proceedings of  the 48thAnnual Meeting of the Association for Computational  Linguistics (ACLâÃâ¬Ãâ¢10), 2010.
 Li Shasha, Chin-Yew Lin, Young-In Song, and  ZhoujunLi,âÃâ¬ÃËComparable Entity Mining from Comparative QuestionsâÃâ¬Ãâ¢, Knowledge and  Data Engineering, IEEE Transactions on 25, no.7, pp. 1498-1509, 2013.
 LiShasha, Chin-Yew Lin, Young-In Song, and  ZhoujunLi,âÃâ¬ÃËComparative Entity MiningâÃâ¬Ãâ¢, U.S. Patent no. 8, 484, 201, July 2013.
 Califf M. Elaine and Raymond J.  Mooney,âÃâ¬ÃËRelational Learning of Pattern Match Rules for  InformationExtractionâÃâ¬Ãâ¢,Proceedings of the 16thNational Conference on Artificial  Intelligence and the 11th Innovative Applications of Artificial Intelligence  Conference (AAAI âÃâ¬Ãâ¢99/IAAI âÃâ¬Ãâ¢99), pp. 328-334, 1999.
 Mooney J. Raymond and RazvanBunescu, âÃâ¬ÃËMining  Knowledge from Text Using Information ExtractionâÃâ¬Ãâ¢, ACM SIGKDD explorations  newsletter 7.1, pp. 3-10, 2005.
 Cardie Claire, âÃâ¬ÃËEmpirical Methods in Information  ExtractionâÃâ¬Ãâ¢,Artificial Intelligence Magazine, vol. 18, pp. 65-79, 1997.
 Riloff Ellen and Rosie Jones, âÃâ¬ÃËLearning  Dictionaries for Information Extraction by Multi-Level BootstrappingâÃâ¬Ãâ¢,  Proceedings of the 16th National Conference on Artificial Intelligence and the  11th Innovative Applications of Artificial Intelligence Conference (AAAI  âÃâ¬Ãâ¢99/IAAI âÃâ¬Ãâ¢99), pp. 474- 479, 1999.
 Califf M. Elaine and Raymond J. Mooney,  âÃâ¬ÃËBottom-up Relational Learning of Pattern Matching Rules for Information  ExtractionâÃâ¬Ãâ¢, Journal of Machine Learning Research, vol 4, pp. 177âÃâ¬Ãâ210, 2003.
 Mooney J. Raymond and Loriene Roy,  âÃâ¬ÃËContent-based Book Recommending using Learning for Text CategorizationâÃâ¬Ãâ¢,  Proceedings of the 5thACM Conference on Digital Libraries, pp.  195âÃâ¬Ãâ204, 2000.
 FreitagDayne, âÃâ¬ÃËToward General-purpose Learning  for Information ExtractionâÃâ¬Ãâ¢, Proceedings of the 36th Annual Meeting of the  Association for Computational Linguistics and COLING-98 (ACL/COLING-98),  pp.404âÃâ¬Ãâ408, 1998.
 StephenSoderland, âÃâ¬ÃËLearning Information  Extraction Rules for Semi-Structured and Free TextâÃâ¬Ãâ¢, Machine Learning, vol. 34,  nos. 1-3, pp. 233- 272, 1999.
 Chang Chia-Hui and Shao-Chen Lui, âÃâ¬ÃËIEPAD:  Information Extraction Based on Pattern DiscoveryâÃâ¬Ãâ¢, Proceedings of the 10th  international conference on World Wide Web (WWWâÃâ¬Ã⢠01), 2001.
 Carlson Andrew, Justin Betteridge, Richard C.  Wang, Estevam R. HruschkaJr, and Tom M. Mitchell, âÃâ¬ÃËCoupled Semi-supervised  Learning for Information ExtractionâÃâ¬Ãâ¢, Proceedings of the 3rd ACM international  conference on Web search and data mining, pp. 101-110, 2010.
 Nitin Jindal and Bing Liu, âÃâ¬ÃËMining Comparative  Sentences and RelationsâÃâ¬Ãâ¢, Proceedings of the 21st National Conference on  Artificial Intelligence (AAAI âÃâ¬Ãâ¢06), vol. 22, pp. 1331-1336. 2006.
 RiloffEllen, âÃâ¬ÃËAutomatically Generating  Extraction Patterns from Untagged TextâÃâ¬Ãâ¢, Proceedings of the 13th National  Conference on Artificial Intelligence, pp. 1044-1049, 1996.
 RadevDragomir, Weiguo Fan, Hong Qi, Harris Wu,  and AmardeepGrewal, âÃâ¬ÃËProbabilistic Question Answering on the WebâÃâ¬Ãâ¢, Journal of  theAmerican Society for Information Science and Technology, 56, no. 6, pp.  571-583, 2005.
 HaveliwalaH.Taher,  âÃâ¬ÃËTopic-Sensitive PagerankâÃâ¬Ãâ¢, Proceedings of the 11th International Conference on  World Wide Web (WWWâÃâ¬Ã⢠02), pp. 517-526, 2002.
 |