ISSN ONLINE(2320-9801) PRINT (2320-9798)
Tomasz Boinski, Adam Brzeski
Faculty of Electronics, Telecommunication and Informatics, Gdansk University of Technology, Poland
Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering
The Polish language differs from English in many ways. It has more complicated conjugation and declension, which makes automatic fact extraction from texts difficult. In this paper we outline the basic differences between the two languages and present an algorithm for extracting facts from Polish Wikipedia articles. The algorithm is based on 7 proposed fact schemes that are searched for in the analyzed text. The analysis includes morphosyntactic tagging, named entity extraction and relation identification. The results obtained for an example Wikipedia text are presented. We identify free word order as the main difficulty in the analysis of Polish texts. At the same time, the conducted experiment confirmed satisfactory performance of the tagging and analysis tools for the Polish language.
Keywords
natural language processing, text analysis, knowledge extraction, unstructured information, tagging, named-entity recognition
INTRODUCTION
The Internet contains a vast amount of knowledge. It is estimated that there are currently over 3.3 billion web pages [1]. Most of those pages are documents written in natural language, so information (or fact) extraction from such documents has interested researchers since the beginning of the Internet era.
What, then, is information extraction? It is the process of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this involves processing human-language texts by means of natural language processing (NLP).
The aim of this paper is to take a step towards fully automatic fact extraction from Polish texts. Many researchers focus on English, the most widely used language, for which many tools are therefore available. Unfortunately those solutions, even the most mature ones, do not perform well when applied to other natural languages. The nature of the Polish language makes it hard to apply the same rules that work for English. In this paper we focus on some basic ideas and on the problems that arose during our preliminary tests.
The structure of this paper is as follows. The next section presents related work from the literature. Then the differences between Polish and English are discussed. Next, the proposed method of extracting facts is described. Finally, we present the obtained results and the conclusions.
RELATED WORK
Automated fact extraction plays an increasingly important role in web page processing, especially when the information (or even knowledge) is contained in unstructured natural language texts.
For many years researchers and companies have tried to tackle this problem using different approaches. The basic approach involved the creation of patterns. In most cases those patterns were created before the text was analyzed and then applied to search for matching facts [2]. Such an approach required suitable complex patterns to be prepared before the analysis and required supervision of the extraction process. Some approaches tried to eliminate this problem by providing means for automatic or semi-automatic pattern learning [3]. Finally, modern approaches provide ontology-based solutions that eliminate the need for pattern creation and recognition [4], [5], [6], [7].
THE PROBLEM OF THE POLISH LANGUAGE
The Polish language differs from English in many ways. The most important differences are:
1. Polish has free word order. Unfortunately, the dominance of analytic languages such as English or Chinese means that research focuses primarily on languages with fixed word order, while languages with free word order are less explored [8]. Tools like Świgra [9] or TaKIPI [10] solve this problem to some extent.
2. More complicated regular conjugation – Polish has more conjugation templates, whereas English has fewer templates with a far greater number of exceptions. Furthermore, the Polish language is further complicated by inflection [11].
3. Complicated declension – modern English, as with conjugation, has very simple declension compared to Old English or Polish; however, it has many more exceptions.
4. Combining declension with free word order makes sentences in Polish much more ambiguous than in English.
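A small sketch can illustrate why rules tuned for fixed word order break down on Polish. The two sentences below, "Pies goni kota" and "Kota goni pies", both mean "the dog chases the cat"; only the case endings identify the subject. The toy lexicon and the function names are assumptions made for this illustration and do not come from any of the cited tools.

```python
# Toy case lexicon (hypothetical, hand-written for this example only):
# surface form -> (lemma, grammatical case)
LEXICON = {
    "pies": ("pies", "nom"),  # dog, nominative
    "psa": ("pies", "acc"),   # dog, accusative
    "kot": ("kot", "nom"),    # cat, nominative
    "kota": ("kot", "acc"),   # cat, accusative
}


def by_position(tokens):
    """English-style heuristic: first word is the subject, last is the object."""
    return tokens[0], tokens[-1]


def by_case(tokens):
    """Case-based heuristic: nominative marks the subject, accusative the object."""
    subject = obj = None
    for tok in tokens:
        lemma, case = LEXICON.get(tok.lower(), (None, None))
        if case == "nom":
            subject = lemma
        elif case == "acc":
            obj = lemma
    return subject, obj


s1 = "Pies goni kota".split()
s2 = "Kota goni pies".split()

print(by_position(s1), by_position(s2))  # the positional reading flips
print(by_case(s1), by_case(s2))          # the case-based reading is stable
```

The positional heuristic swaps subject and object between the two sentences, while the case-based one does not. The case-based reading in turn fails whenever the nominative and accusative forms of a word coincide, which is one source of the ambiguity mentioned above.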
Until recently, the Polish language also lacked proper tools for automating common tasks such as tagging, lemmatization or named entity look-up. The situation changed with the development of Morfeusz [9], [12], which performs morphological analysis of Polish sentences. Morfeusz became the base for Świgra, TaKIPI and, recently, Pantera [13]. All those tools are efficient taggers of the Polish language. Another useful application is Spejd [14], [15], [16], a tool for partial parsing and rule-based morphosyntactic disambiguation. Nerf [17] in turn allows extraction of named entities. Most of those tools require extensive corpora, especially for named entity extraction or coreference analysis. Such a common corpus was developed in recent years: the National Corpus of Polish [18], [19]. Together these tools allow complex analysis of Polish text, laying the foundations for analysis and extraction of knowledge contained in natural language documents.
EXTRACTING FACTS FROM WIKIPEDIA |
In our research we focused on extracting knowledge from Wikipedia articles. The main body of a Wikipedia article is rather loosely formatted, with arbitrarily chosen sections and text blocks. The content of each page is also natural language text without a formal structure. We attempted to extract facts in the form of <subject, predicate, object> triples.
The tests were performed using the Multiservice web site [20]. The general idea is to tag texts using Pantera and extract named entities using Nerf. Nouns (subst) and named entities (ne) of type [17] persName, placeName, orgName and geogName are then mapped to subjects and objects, while named entities of type date are mapped to objects. Pseudo-participles (praet), participles (ppas) and prepositions (prep) are always treated as predicates. Adjectives (adj) were detected but ignored. The tags were taken from the IPI PAN corpus tagset [21].
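The role assignment described above can be sketched as a simple lookup. The tag and named entity type names are those listed in the text (subst is the noun tag in the IPI PAN tagset); the function name and the way a token is represented are assumptions for illustration, not the actual implementation.

```python
# Named entity types that may serve as subject or object
NE_SUBJECT_OR_OBJECT = {"persName", "placeName", "orgName", "geogName"}
# Morphosyntactic tags that are always treated as predicates
PREDICATE_TAGS = {"praet", "ppas", "prep"}


def candidate_roles(tag, ne_type=None):
    """Return the set of triple roles a token may fill.

    tag     -- morphosyntactic tag of the token (e.g. "subst", "ppas")
    ne_type -- named entity type if the token belongs to one, else None
    """
    if ne_type in NE_SUBJECT_OR_OBJECT:
        return {"subject", "object"}
    if ne_type == "date":
        return {"object"}
    if tag == "subst":
        return {"subject", "object"}
    if tag in PREDICATE_TAGS:
        return {"predicate"}
    return set()  # adjectives and all other tags are ignored


print(candidate_roles("ppas"))
print(candidate_roles("subst", ne_type="date"))
```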
Two additional relations were introduced: isA and of. The isA relation introduces subsumption and can take named entities and nouns as subjects and nouns as objects. The of relation means that the subject is related to the object by some action, e.g. a boss is a chief of the company, but the company is not subsumed by the boss. This relation can take nouns as subjects and nouns and named entities as objects.
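The argument restrictions of the two relations can be written down as a small table. This is a minimal sketch: the kind labels "ne" (named entity) and "subst" (a subst-tagged word) and the function name are assumptions introduced for the example.

```python
# Allowed argument kinds for each additional relation
RELATION_SIGNATURES = {
    "isA": {"subject": {"ne", "subst"}, "object": {"subst"}},
    "of": {"subject": {"subst"}, "object": {"subst", "ne"}},
}


def is_valid(relation, subject_kind, object_kind):
    """Check whether a triple respects the relation's argument restrictions."""
    signature = RELATION_SIGNATURES[relation]
    return (subject_kind in signature["subject"]
            and object_kind in signature["object"])


# A named entity may be the subject of isA, but not its object
print(is_valid("isA", "ne", "subst"))
print(is_valid("isA", "subst", "ne"))
```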
The assignment of nouns and named entities to subjects and objects in triples depends on the morphosyntactic context in which they are used. Currently we recognize the following schemes (square brackets “[” and “]” mean optional occurrence, a pipe “|” means alternative):
1. ppas date [prep ne] |
2. ppas prep placeName |
3. prep date [prep] (subst | [praet [subst [subst] [ne]]]) |
4. prep subst ne subst |
5. subst prep ne subst [adj] [prep subst ne] |
6. subst subst ([adj] | [prep subst [ne]]) |
7. subst ne |
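One possible way to search for these schemes is to translate each of them into a regular expression over the sequence of token labels. The sketch below is an illustration under several assumptions rather than the implementation used in our tests: every token carries a single label (its named entity type if it belongs to a named entity, otherwise its tag), a multi-word named entity is collapsed into one token, and the generic label ne stands for persName, placeName, orgName or geogName but not date. All function names are hypothetical.

```python
import re

SCHEMES = [
    "ppas date [prep ne]",
    "ppas prep placeName",
    "prep date [prep] (subst | [praet [subst [subst] [ne]]])",
    "prep subst ne subst",
    "subst prep ne subst [adj] [prep subst ne]",
    "subst subst ([adj] | [prep subst [ne]])",
    "subst ne",
]

# Assumption: "ne" in a scheme matches any of these entity types
NE_ALTERNATIVES = "(?:persName|placeName|orgName|geogName) "


def compile_scheme(scheme):
    """Translate the bracket notation into a regex over 'label ' tokens."""
    parts = []
    for tok in re.findall(r"[\[\]()|]|\w+", scheme):
        if tok in "[(":
            parts.append("(?:")          # open a group
        elif tok == "]":
            parts.append(")?")           # close an optional group
        elif tok == ")":
            parts.append(")")            # close a plain group
        elif tok == "|":
            parts.append("|")            # alternative
        elif tok == "ne":
            parts.append(NE_ALTERNATIVES)
        else:
            parts.append(re.escape(tok) + " ")
    return re.compile("".join(parts))


COMPILED = [compile_scheme(s) for s in SCHEMES]


def match_schemes(labels):
    """Return (scheme number, start, end) for every scheme matching at a position."""
    hits = []
    for start in range(len(labels)):
        text = "".join(label + " " for label in labels[start:])
        for number, pattern in enumerate(COMPILED, start=1):
            m = pattern.match(text)
            if m and m.group(0):
                length = m.group(0).count(" ")  # one space per matched token
                hits.append((number, start, start + length))
    return hits


# Assumed labels for "urodzony 4 czerwca 1952 w Obornikach Śląskich"
print(match_schemes(["ppas", "date", "prep", "placeName"]))
```

Because optional groups are matched greedily, each reported hit is the longest phrase that the scheme accepts at the given position.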
When a phrase matching one of the schemes is found, the words are connected with the main subject. Unfortunately, at present the user has to select one of the nouns or named entities as the subject of the sentence.
THE RESULTS |
Preliminary research yielded satisfactory results. Most of the facts were extracted. For example, for the Polish text “Bronisław Maria Komorowski (urodzony 4 czerwca 1952 w Obornikach Śląskich) – polski polityk, z wykształcenia historyk. Od 6 sierpnia 2010 prezydent Rzeczypospolitej Polskiej.” (“Bronislaw Maria Komorowski (born June 4, 1952 in Oborniki Śląskie) - Polish politician, educated as a historian. Since August 6, 2010 President of the Polish Republic.”) [22] we acquire the following facts:
Declaration: Bronisław Maria Komorowski (declaration of the main subject),
urodzić w Oborniki śląski (born in Oborniki Śląskie),
Bronisław Maria Komorowski urodzić 4 czerwiec 1952 (born June 4, 1952),
Bronisław Maria Komorowski isA polityk (Bronisław Maria Komorowski isA politician),
Bronisław Maria Komorowski isA historyka (Bronisław Maria Komorowski isA educated as historian),
Bronisław Maria Komorowski isA prezydent (Bronisław Maria Komorowski isA president),
prezydent od 6 sierpień 2010 (president since August 6, 2010),
prezydent of rzeczpospolita polski (president of the Polish Republic).
As can be seen, all of the facts were extracted correctly. A closer look, however, reveals some drawbacks of the existing tools. In Polish the base form of a word often differs considerably from its inflected forms. A reader familiar with Polish will therefore find entities like “Oborniki śląski” or “rzeczpospolita polski” understandable but quite odd. The correct forms are “Oborniki Śląskie” and “Rzeczpospolita Polska” respectively. Unfortunately, in order to present the right form, the morphosyntactic tagger would require a database of all named entities and their base forms. Another problem with named entities is abbreviations. “RP” usually stands for “Rzeczpospolita Polska”, yet our current solution treats the two as different entities. The base-form problem also applies to common nouns. The form of the education in the above example is incorrect: instead of “historyka” it should be “historyk”. This in turn is caused by the difficulty of guessing the correct base form (singular or plural, the proper declension of the original, etc.). We plan to address those problems using Słowosieć (the Polish version of WordNet) [23], [24], [25].
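The kind of database mentioned above can be pictured as a lookup from tagger output and abbreviations to a canonical entity name. The sketch below is only a hand-written stand-in with entries taken from the example; it is not the planned Słowosieć-based solution, and the table and function names are hypothetical.

```python
# Hypothetical lookup table: lemmatized tagger output or abbreviation
# (case-folded) -> canonical name of the entity
CANONICAL_FORMS = {
    "rzeczpospolita polski": "Rzeczpospolita Polska",
    "rp": "Rzeczpospolita Polska",
    "oborniki śląski": "Oborniki Śląskie",
}


def canonicalize(entity):
    """Map an extracted entity to its canonical name if one is known."""
    return CANONICAL_FORMS.get(entity.casefold(), entity)


# The abbreviation and the odd lemmatized form now denote the same entity
print(canonicalize("RP") == canonicalize("rzeczpospolita polski"))
```

Entities that are missing from the table are returned unchanged, so such a table would have to cover every named entity of interest, which is exactly the limitation noted above.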
Further problems arise with named entities that consist of more than one named entity. For example, a persName consists of at least one forename and a surname. In our study we decided to treat the most complex form as a single entity. In further studies we plan to extract additional information about each named entity based on its elements.
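Treating the most complex form as a single entity amounts to discarding every named entity span that lies inside another one. A minimal sketch follows, assuming that spans are given as (start, end, type) token offsets with an exclusive end; this representation and the function name are assumptions for illustration and do not describe the output format of Nerf.

```python
def outermost_entities(spans):
    """Keep only the spans that are not contained in another span.

    Each span is a (start, end, ne_type) tuple with an exclusive end offset.
    """
    kept = []
    # Sort by start, longest first, so a container precedes its parts
    for span in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        start, end, _ = span
        if not any(k[0] <= start and end <= k[1] for k in kept):
            kept.append(span)
    return kept


# "Bronisław Maria Komorowski": two forenames and a surname inside one persName
spans = [
    (0, 1, "forename"),
    (1, 2, "forename"),
    (2, 3, "surname"),
    (0, 3, "persName"),
]
print(outermost_entities(spans))
```

The discarded inner spans could be retained separately if additional information about the elements of an entity is to be extracted later.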
CONCLUSIONS |
Much work has been done in the field of fact extraction from natural language texts. Recently at least two major research centers focusing on automated analysis of the Polish language have emerged in Poland. More and more tools are becoming available, leading towards full analysis of the Polish language.
The biggest problem lies in the complexity of the Polish language. The multitude and complexity of conjugation and declension patterns can, however, be handled by proper morphological analyzers. Another issue is free word order, which greatly complicates the construction of templates that can be applied to the text.
With the constant development of supplementary tools and the experience gained from research on the English language, our preliminary research suggests that a viable solution for fact extraction from Polish-language documents should be available soon.
References |
|