Keywords
|
World Wide Web, search results, annotating search results |
INTRODUCTION
|
The Internet is widely used as a vehicle for sharing information across the globe. People from all walks of life access Internet resources through search engines, which provide a web-based interface for information search. Search engines return huge amounts of data, presented in encoded form within web pages, although the data itself comes from underlying databases. These search results from web databases can be reused in applications such as price comparison and data collection. For example, when the query “LAP TOP PRICES” was submitted to Google, it returned about 246,000,000 results. Some of the results obtained are shown in Figure 1. |
As Figure 1 shows, the search results are associated with many different web pages. Each result has a different URL, and the results come from many underlying databases on the web. To be used in real-world applications, the search results must be made machine processable, and annotations make this possible. For instance, the prices that different companies offer for a product can be compared: a comparison web site can derive its data from the pages returned by search engines and so help users make well-informed decisions. Search engines now use many processing techniques to present results in a meaningful way, but annotating those results previously required much manual effort. Recently, Lu et al. [1] presented several ways of annotating search results and developed a mechanism that annotates them automatically, eliminating the manual labeling of web pages. Their solution contains three phases, illustrated in Figure 2. |
The first phase is the alignment phase, in which data units are organized into groups, each corresponding to a different concept. The second phase is the annotation phase, in which annotators label the groups automatically. The third phase is the annotation wrapper generation phase, in which an annotation rule is generated for each identified concept. The annotation wrapper is the collection of the rules for all aligned groups, and it makes later annotation of results from the same source easier. A clustering-based shifting technique is used to perform the alignment. |
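The idea of an annotation wrapper as a collection of per-group rules can be sketched as follows. This is a minimal illustration only: the rule format used in [1] is richer, and the names `build_wrapper` and `apply_wrapper`, as well as the "position to label" rule, are hypothetical simplifications introduced here.

```python
def build_wrapper(group_labels):
    """Build one rule per aligned group: group position -> label.

    Assumption: a rule is just (position, label); groups that received
    no label (None) get no rule.
    """
    return {index: label
            for index, label in enumerate(group_labels)
            if label is not None}


def apply_wrapper(wrapper, record):
    """Label each data unit of a record using the rule for its position."""
    return [(wrapper.get(index), unit) for index, unit in enumerate(record)]


# Labels found for three aligned groups; the third group stayed unlabeled.
wrapper = build_wrapper(["Model", "Price", None])
annotated = apply_wrapper(wrapper, ["HP Pavilion", "$549", "8 GB"])
```

Once built, such a wrapper can be applied to further records without consulting the annotators again.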
In this paper we implement a few of the annotators presented by Lu et al. [1]. We built a prototype application that automatically annotates search results. The search results are obtained through Google search and annotated using the mechanism proposed in [1]. The empirical results show that the prototype is useful and can be applied in real-world settings. The remainder of the paper is structured as follows. Section II reviews prior work on the annotation of search results. Section III presents the proposed approach to automatic annotation of web search results. Section IV presents experimental results, and Section V concludes the paper. |
RELATED WORK
|
Extracting information from the web and annotating search results for further processing have been studied for some years, because annotated search results have important practical uses. Many early systems rely on manual annotation. For instance, in [2] and [3] human users mark the annotations. These systems achieve high accuracy, but because they depend on manual effort they do not scale and thus cannot be used in real-world applications [4], [5]. Spatial locality and presentation styles are used for annotation in [6]; however, this approach is domain dependent. Ontologies are used in [7], where documents are labeled based on certain heuristics. Many prior works focus on wrapper construction; however, those wrappers can only extract data and cannot annotate it. Other research focuses on automatically assigning labels to search results [8], [9], [10]. |
In [10] data units are annotated with their closest labels, but the method does not work well for web databases. Query interfaces are used in [9], and ontologies are constructed in a domain-dependent way. DeLa [8] was the first to use HTML tags to align data units, which it does using heuristics. Labeling and attribute extraction are done simultaneously in [11]; in this approach the label sets are predefined, so it is not dynamic. HTML tag paths are a frequently used feature [12]. Visual features are also used in [13] for aligning data; however, this was successful only for text nodes. In [14] a record is split into segments for data alignment and annotation. |
Recently, Lu et al. [1] proposed an approach for the automatic annotation of search results. First, their approach considers and handles the various kinds of relationships among data units, whereas existing works such as [8] and [6] consider only some of them. Second, Lu et al. use several features together, in addition to ontologies, to align data; a clustering-based shifting algorithm performs the alignment. The work in [1] is similar to that in [8] in that both use HTML tags for processing, but the approaches differ in how they annotate search results: in [1] an annotation wrapper is constructed that describes the rules for assigning labels to search results. Crawling the deep web is one application of such annotations. ViNTs [15] is used to obtain records from search results. The work of Lu et al. [1] builds on their earlier paper [16]. |
PROPOSED SYSTEM FOR ANNOTATING SEARCH RESULTS
|
In this paper we take the concepts for annotating search results from [1], to which the reader is referred for background. In this section we provide the implementation details of our application and the algorithm for automatic annotation of search results. As in [1], our approach has three phases. The phases and their functionality are shown schematically in Figure 3. |
As Figure 3 shows, the web documents that make up the search results (taken from Google) are given as input to the system. In the first phase, alignment, the data in the search results are divided into groups. Annotation takes place in the second phase, and the third phase generates annotation wrappers that produce the final annotated web pages. Two kinds of annotators are applied in the prototype application: the Table Annotator (TA) and the Query-Based Annotator (QA). |
Table Annotator
|
Many search engines present search results in tabular format, which lets users understand the data at a glance. The table annotator first identifies the column headers in the table and then processes the data items. For each data item, the column header with the maximum vertical overlap is identified, and the header text is used as the label. |
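The labeling step of the table annotator can be sketched as follows. This is a minimal sketch, assuming each header and data item is described by its text and its left and right pixel coordinates on the rendered page; the `Box` type and the function names are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    text: str
    left: int   # x coordinate of the left edge (assumed to be in pixels)
    right: int  # x coordinate of the right edge


def overlap(a, b):
    """Width shared by the horizontal extents of two boxes (0 if disjoint)."""
    return max(0, min(a.right, b.right) - max(a.left, b.left))


def annotate_table(headers, records):
    """Label every data item with the header it overlaps most, if any."""
    labeled = []
    for record in records:
        row = []
        for unit in record:
            best = max(headers, key=lambda header: overlap(header, unit))
            label = best.text if overlap(best, unit) > 0 else None
            row.append((label, unit.text))
        labeled.append(row)
    return labeled


headers = [Box("Model", 0, 100), Box("Price", 110, 180)]
records = [[Box("Inspiron 15", 5, 95), Box("$499", 120, 160)]]
labeled = annotate_table(headers, records)
```

A data item that overlaps no header is left unlabeled (`None`) rather than being forced into the nearest column.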
Query-Based Annotator
|
This annotator relies on the idea that the search results of a query are related to that query. The name of the search field in which the query terms were entered is used as the label. When a query contains terms that pertain to a specific attribute, the returned records satisfy the query, so those terms are likely to appear in the data units of that attribute. Search results do not contain all the attributes present in the underlying database, and the query-based annotator is useful in this situation. |
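One plausible reading of this annotator is sketched below: for each search field, the aligned group in which the query terms occur most often receives the field name as its label. The function name, the dictionary form of the query, and the whitespace tokenization are assumptions made for illustration.

```python
def query_based_annotate(query, groups):
    """Assign search-field names to aligned groups.

    query:  dict mapping a search field name to the text typed into it
    groups: list of aligned groups, each a list of data-unit strings
    Returns one label (or None) per group.
    """
    labels = [None] * len(groups)
    if not groups:
        return labels
    for field, value in query.items():
        terms = value.lower().split()
        # Total occurrences of the query terms in each group.
        counts = [
            sum(unit.lower().split().count(term)
                for unit in group for term in terms)
            for group in groups
        ]
        best = max(range(len(groups)), key=counts.__getitem__)
        if counts[best] > 0:
            labels[best] = field
    return labels


groups = [["Dell Inspiron 15", "Dell Latitude 14"], ["$499", "$899"]]
labels = query_based_annotate({"Brand": "dell"}, groups)
```

Groups in which no query term occurs keep the label `None`, which leaves them for another annotator such as the table annotator.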
Data Alignment Algorithm
|
The data alignment algorithm [1] assumes that the attributes appear in the same order in every record. This assumption generally holds for search results presented in tabular format. Figure 4 shows the algorithm. It takes the search results as input and generates the groups (clusters) that result from the alignment process. These groups are used for further processing, as illustrated in Figure 3. After alignment, the application automatically produces the annotations, which are visualized in the prototype application. |
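The effect of the ordering assumption can be illustrated with a deliberately simplified sketch that groups data units purely by their position in each record. It is not the algorithm of Figure 4: the clustering step that [1] uses to handle records with missing attributes is omitted, and the function name is hypothetical.

```python
from itertools import zip_longest


def align_by_position(records):
    """Group the k-th data unit of every record into the k-th group.

    Assumption: attributes appear in the same order in all records and
    attributes can only be missing at the end of a record.
    """
    groups = []
    for column in zip_longest(*records):
        # zip_longest pads short records with None; drop the padding.
        groups.append([unit for unit in column if unit is not None])
    return groups


records = [
    ["Dell Inspiron", "$499", "4 GB"],
    ["HP Pavilion", "$549"],
]
groups = align_by_position(records)
```

If a record lacks an attribute in the middle rather than at the end, positional grouping places its later units in the wrong groups, which is the kind of case a similarity-based alignment has to handle.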
Prototype Application
|
The prototype application is built in Java using the MyEclipse IDE. Its interface provides a search facility, and the search results are used as input to the algorithm. The application has a GUI that makes it intuitive to use and visualizes the results of annotation. It can also save results for later retrieval and revision; a local MySQL database is used as the back end for storing them. The application can summarize results in graphical form; the graphs are presented in the next section. |
EXPERIMENTAL RESULTS
|
We ran experiments on data from various domains using two annotators only: the table annotator and the query-based annotator. Both annotators are supported by the prototype application, which is extensible so that more annotators can be supported in the future. The performance of data alignment and annotation is presented in Table 1. |
As Table 1 shows, more than 90% precision and recall were recorded for both data alignment and annotation. The table also shows the performance of annotation with the wrapper. The results are presented in the following graphs. |
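The text does not spell out how precision and recall were computed. A common definition, assumed here, compares the labels produced by the application with a manually prepared reference: precision is the fraction of produced labels that are correct, and recall is the fraction of reference labels that were produced. The function name and the sample data below are illustrative only and are not experimental results.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for two collections of (unit, label) pairs."""
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Made-up example data, not taken from Table 1.
predicted = {("$499", "Price"), ("Dell", "Brand"), ("4 GB", "Price")}
expected = {("$499", "Price"), ("Dell", "Brand"),
            ("4 GB", "Memory"), ("15 in", "Screen")}
precision, recall = precision_recall(predicted, expected)
```

In this made-up example two of the three produced labels are correct and two of the four reference labels are recovered.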
As shown in Figure 5, Figure 6, and Figure 7, the prototype application is capable of producing annotations automatically from Google search results. The performance of the application is encouraging, and it can be used in real-world applications. |
CONCLUSION
|
In this paper we focused on the problem of annotating search results. The results returned by search engines come from web databases and can be processed further for use in applications such as content comparison and data extraction. We built a prototype application that lets users enter a query, which is then submitted programmatically to Google. The results returned by Google are processed in three phases, as shown in Figure 3: the alignment phase, the annotation phase, and the wrapper generation phase. After these phases complete, the application visualizes the annotated documents. HTML tags are used to process the pages while annotating them. The annotated results can then be used in real-world applications. The empirical results show that our application is effective. |
Tables at a glance
|
Table 1 |
Figures at a glance
|
Figure 1 | Figure 2 | Figure 3 | Figure 4 | Figure 5 | Figure 6 | Figure 7 | Figure 8 |
References
|
- Y. Lu, H. He, H. Zhao, W. Meng, and C. Yu, “Annotating Search Results from Web Databases,” IEEE Trans. Knowledge and Data Eng., vol. 25, no. 3, pp. 1-14, 2013.
- N. Kushmerick, D. Weld, and R. Doorenbos, “Wrapper Induction for Information Extraction,” Proc. Int’l Joint Conf. Artificial Intelligence (IJCAI), 1997.
- L. Liu, C. Pu, and W. Han, “XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources,” Proc. IEEE 16th Int’l Conf. Data Eng. (ICDE), 2001.
- Z. Wu et al., “Towards Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine,” Proc. IEEE/WIC Int’l Conf. Web Intelligence (WI ’03), 2003.
- W. Meng, C. Yu, and K. Liu, “Building Efficient and Effective Metasearch Engines,” ACM Computing Surveys, vol. 34, no. 1, pp. 48-89, 2002.
- S. Mukherjee, I.V. Ramakrishnan, and A. Singh, “Bootstrapping Semantic Annotation for Content-Rich HTML Documents,” Proc. IEEE Int’l Conf. Data Eng. (ICDE), 2005.
- D. Embley, D. Campbell, Y. Jiang, S. Liddle, D. Lonsdale, Y. Ng, and R. Smith, “Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages,” Data and Knowledge Eng., vol. 31, no. 3, pp. 227-251, 1999.
- J. Wang and F.H. Lochovsky, “Data Extraction and Label Assignment for Web Databases,” Proc. 12th Int’l Conf. World Wide Web (WWW), 2003.
- W. Su, J. Wang, and F.H. Lochovsky, “ODE: Ontology-Assisted Data Extraction,” ACM Trans. Database Systems, vol. 34, no. 2, article 12, June 2009.
- L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo, “Automatic Annotation of Data Extracted from Large Web Sites,” Proc. Sixth Int’l Workshop the Web and Databases (WebDB), 2003.
- J. Zhu, Z. Nie, J. Wen, B. Zhang, and W.-Y. Ma, “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, 2006.
- Y. Zhai and B. Liu, “Web Data Extraction Based on Partial Tree Alignment,” Proc. 14th Int’l Conf. World Wide Web (WWW ’05), 2005.
- W. Liu, X. Meng, and W. Meng, “ViDE: A Vision-Based Approach for Deep Web Data Extraction,” IEEE Trans. Knowledge and Data Eng., vol. 22, no. 3, pp. 447-460, Mar. 2010.
- H. Elmeleegy, J. Madhavan, and A. Halevy, “Harvesting Relational Tables from Lists on the Web,” Proc. Very Large Databases (VLDB) Conf., 2009.
- H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu, “Fully Automatic Wrapper Generation for Search Engines,” Proc. Int’l Conf. World Wide Web (WWW), 2005.
- Y. Lu, H. He, H. Zhao, W. Meng, and C. Yu, “Annotating Structured Data of the Deep Web,” Proc. IEEE 23rd Int’l Conf. Data Eng. (ICDE), 2007.
|