15 Information Extraction
Biswanath Dutta
I. Objectives
To study the concepts related to Information Extraction.
II. Learning Outcome
After reading this module:
• The reader will understand the concept of Information Extraction and its significance.
• The reader will understand the key factors that contributed to the growth of Information Extraction (IE).
• The reader will gain knowledge of different IE systems used for extracting information from various domains.
• The reader will understand the generic architecture of an IE system and its types, such as named entity recognition, co-reference resolution, etc.
• The reader will gain knowledge of various applications of IE.
• The reader will learn about the use of free and open source as well as commercial software for IE.
III. Structure
1. Introduction
2. Information Extraction: State of the Art
3. Information Extraction Architecture
4. Information Extraction Types
5. Evaluation of Information Extraction Systems
6. Applications
6.1. Enterprise Applications
6.2. Personal Information Management
6.3. Scientific Applications
6.4. Web Oriented Applications
6.5. Structured Web Searches
7. Tools and Services for Information Extraction
7.1. Free or Open Source Software and Services
7.2. Paid or Commercial Software and Services
8. Summary
9. References
1. Introduction
In the past few years there has been a rapid escalation in the amount of textual information present in digital form in various repositories, both on the web and on intranets. A major portion of this digital content, e.g. government documents, news, corporate reports, social media content, etc., is available in unstructured form, such as free-text documents. Searching for information in this unstructured content is a difficult task, which raises the need for efficient and effective techniques for analyzing it and finding relevant information in structured form. This has led to the growth of information extraction techniques.
The term Information Extraction refers to the automatic extraction of structured information, such as entities, relationships between entities, and attributes describing those entities, from unstructured sources. This facilitates a wide range of queries over abundant unstructured data sources that would otherwise not be possible with ordinary keyword search techniques. The goal of Information Extraction is to identify a set of predefined concepts of a specific domain, separating them from the other, irrelevant information available in that domain.
Information Extraction (IE) and Information Retrieval (IR) are complementary to each other, although the former has not gained as much attention as the latter. IE systems are, in principle, much more complex and knowledge-intensive than IR systems. IR results are based on keyword search over a corpus of textual documents: contents relevant to the query are selected, and a thesaurus can be used to augment the result. Usually, the IR process presents a ranked list of documents against a query in decreasing order of relevance, but this ranked list provides no detailed information about the content of the documents. IE, on the other hand, provides no such ranked list and does not select documents; rather, it extracts important facts about the query to build a relevant, rich depiction of the content, and this can be used to populate databases that provide structured input for mining complex patterns from the document corpus. IE thus provides efficient techniques that enable search and discovery of knowledge in such document collections. In practice, the two interact: IR systems are used in IE systems to pre-filter a large corpus of documents, and IE techniques are employed in IR systems to identify structures for intelligent document indexing.
2. Information Extraction: State of the Art
Information Extraction (IE) had its inception in the late 1980s. The ATRANS system [2] was one of the first attempts to use IE for extracting information from messages in the financial domain regarding the transfer of funds between banks. It was based on simple Natural Language Processing (NLP) techniques. JASPER [3] and SCISOR [4] were IE systems for extracting information on corporate-related issues using robust NLP techniques. Some IE systems in the security domain are described in [5, 6]. The majority of IE systems of this era were mono-lingual and not flexible enough to adapt to new scenarios. Subsequently, efforts were made to develop general-purpose IE systems and frameworks that were modular and capable of easier adaptation to new domains and languages. Some IE systems of this type are as follows:
• The FASTUS system [7], developed at SRI International, was able to process English and Japanese text.
• The SPPC system [8] can process German text.
Later IE systems were based on the integration of shallow and deep linguistic analysis to achieve better results; one implementation of this is LASIE-II [9], which uses finite-state recognizers for detecting lexical patterns related to a particular domain. With the advent of knowledge-engineering-based, modular IE systems in the late 1990s and the beginning of the twenty-first century, the processing of huge amounts of textual data in languages other than English was also integrated more efficiently and robustly. These system designs consist mainly of two components: a general core engine and a domain-specific language component, which may be called the knowledge base. However, manually building a knowledge base is still a very time-consuming and difficult task. This motivated the development of trainable IE systems that use machine-learning techniques to adapt the system to a new domain or task, e.g. Nymble [10]. In spite of all the efforts made in the field of IE, the success achieved in extracting information from languages other than English remains low. The prime factor behind this is the lack of Natural Language Processing components and available linguistic resources for several languages; for example, the lack of whitespace in Chinese complicates word boundary disambiguation [11].
Also, since 1987, IE has had a series of Message Understanding Conferences (MUC). MUC was a competition-based conference series that focused on the following domains:
• MUC-1 (1987) and MUC-2 (1989): messages about naval operations.
• MUC-3 (1991) and MUC-4 (1992): terrorism in Latin America.
• MUC-5 (1993): microelectronics and joint ventures.
• MUC-6 (1995): news articles related to changes in management.
• MUC-7 (1998): reports related to satellite launching.
These MUCs were supported by the U.S. Defense Advanced Research Projects Agency (DARPA), with the goal of automatically identifying probable links to terrorism by scanning newspapers.
3. Information Extraction Architecture
As mentioned in [12], IE systems differ from one another based on the purpose for which they have been designed, but there are certain core components that they all share. There are several dimensions along which the processing order of IE systems can be analyzed. The following steps are generally performed in IE systems:
• Metadata analysis: This step deals with the extraction of metadata components of a document, i.e. data that describes the document, such as its title.
• Tokenization: In this step the text is broken into words or units known as tokens, which are classified into groups according to their characteristics and attributes.
• Morphological analysis: Here morphological information is extracted from tokens, which is similar to performing part-of-speech disambiguation.
• Sentence/Utterance boundary detection: In this step the text is segmented into a sequence of sentences, each having lexical content together with the respective features.
• Common named-entity extraction: Named entities like names of persons, organizations, and locations, expressions of time, and numerical and currency expressions are detected, irrespective of the domain.
• Phrase recognition: Several kinds of phrases, like noun phrases, verb groups, prepositional phrases, acronyms, and abbreviations, are identified in this stage.
• Syntactic analysis: Here all possible interpretations of the sentences obtained from the above steps are computed, based on the lexical sequence of the text or tokens.
In the domain-specific part of the system, the Named Entity Recognition component identifies entities relevant to the domain of discourse. The Patterns component is used to identify text fragments and extract key attributes relevant to a particular event. The Co-reference component detects expressions in the text that refer to the same entity. Finally, in the Information Fusion step, as the name suggests, relevant information scattered over different sentences, or even documents, is fused together. The above-mentioned steps are depicted in figure 1 below.
Fig. 1: Generic architecture of an IE system
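The core, domain-independent steps can be sketched in miniature. The following Python fragment is a minimal illustration only: the regular-expression rules for sentence splitting, tokenization, and candidate named-entity detection are naive assumptions made for demonstration, whereas real IE systems use trained linguistic components for each stage.

```python
import re

# A minimal, illustrative sketch of a few core IE pipeline steps:
# sentence boundary detection, tokenization, and simple named-entity
# extraction. The patterns below are crude heuristics, not a real system.

def split_sentences(text):
    """Segment text into sentences at '.', '!' or '?' boundaries."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    """Break a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def extract_entities(sentence):
    """Detect capitalised word sequences as candidate named entities and
    rupee amounts as currency expressions (both crude heuristics)."""
    entities = []
    for m in re.finditer(r'\b[A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*\b', sentence):
        entities.append(('CANDIDATE_NAME', m.group()))
    for m in re.finditer(r'Rs\.?\s?\d+(?:\s\w+)?', sentence):
        entities.append(('CURRENCY', m.group()))
    return entities

text = "DRTC offers courses at Bangalore. MHRD granted the project Rs 100 crore."
for sent in split_sentences(text):
    print(tokenize(sent))
    print(extract_entities(sent))
```

Each sentence yields its token list and a list of (type, text) entity candidates; a real system would replace each heuristic with a dedicated, trainable component.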
4. Information Extraction Types
• Named Entity Recognition: Also known as entity identification, entity chunking, and entity extraction, this addresses the problem of identifying (detecting) and classifying text into predefined categories of named entities, such as names of persons (e.g., S. R. Ranganathan), organizations (e.g., DRTC), locations (e.g., Bangalore), expressions of time (e.g., April 1962), and numerical and currency expressions (e.g., 5 thousand INR).
• Co-reference Resolution: Co-reference resolution means identifying all the expressions that refer to the same entity in a given text. For example: “I voted for Modi because he was most aligned with my values,” he said. Here {I, my, he (the speaker)} refer to one entity, and {Modi, he (in the quote)} refer to another.
• Relation Extraction: This is the method of identifying and classifying predefined relations between entities in a given text. For example, LocatedIn(DRTC, Bangalore) is a relation between an organization and a location, extracted from the text ‘DRTC offers a course in Library and Information Science at Bangalore’.
• Event Extraction: This is the method of detecting events in a text and deriving comprehensive details and structured relevant information about them. For example, from the text ‘Ministry of Human Resource Development (MHRD) approved the National Digital Library project and granted Rs. 100 crore’, the extracted event details include the partners involved (IIT Kharagpur, MHRD), the capital involved (Rs. 100 crore), and the location (Kharagpur, India).
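A pattern-based relation extractor of the kind sketched in the list above can be illustrated as follows. The single hand-written ‘&lt;Organization&gt; … at &lt;Location&gt;’ pattern is an assumption made for demonstration; practical systems rely on many learned or curated patterns.

```python
import re

# Illustrative sketch of pattern-based relation extraction. The single
# '<Organization> ... at <Location>' pattern below is an assumption made
# for demonstration, not a general-purpose extractor.

def extract_located_in(sentence):
    """Return a LocatedIn(org, location) tuple for sentences of the
    rough form '<Org> ... at <Location>', or None if no match."""
    m = re.match(r'([A-Z][A-Za-z]*)\b.*\bat\s+([A-Z][A-Za-z]+)', sentence)
    if m:
        return ('LocatedIn', m.group(1), m.group(2))
    return None

print(extract_located_in(
    "DRTC offers a course in Library and Information Science at Bangalore"))
# -> ('LocatedIn', 'DRTC', 'Bangalore')
```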
5. Evaluation of Information Extraction Systems
IE systems can be evaluated from the user’s perspective based on two metrics:
Precision: the proportion of the output produced by the system that is correct.
Precision = correct / (correct + incorrect)
Recall: the extent to which the system produces all the relevant output expected of it.
Recall = correct / key
Where,
Correct = the number of correctly filled slots in the system’s response
Incorrect = the number of incorrectly filled slots in the system’s response
Key = the total number of slots expected to be filled, according to a standard answer key
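These formulas can be expressed directly in code. The slot counts in the example run below are hypothetical, chosen only to show the arithmetic.

```python
# Precision and recall computed from slot counts, following the formulas
# above. The counts used in the example run are hypothetical.

def precision(correct, incorrect):
    """Fraction of the slots the system filled that are correct."""
    return correct / (correct + incorrect)

def recall(correct, key):
    """Fraction of the expected (key) slots filled correctly."""
    return correct / key

# Hypothetical system response: 80 slots correct, 20 incorrect,
# evaluated against an answer key expecting 120 filled slots.
print(precision(80, 20))   # -> 0.8
print(recall(80, 120))     # -> 0.666...
```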
6. Applications
6.1 Enterprise Applications
• News Tracking: This is a classical application area of information extraction, and it has gained a lot of attention from researchers in the NLP community. It is the task of automatically tracking a specific type of event in news content. The popular MUC competitions are based on extracting structured entities like names of people and organizations, and relations between them, such as “is-Head-of”. One of its applications is connecting background information on people, locations, and organizations with related content in news articles using hyperlinks1.
• Customer Care: These extraction techniques are used in scenarios where several types of data or facts are collected from customers and information must be extracted efficiently from this unstructured data source. The extraction method identifies the relevant product categories from the details provided by customers, and on this basis personalized services are provided to them. This type of application also involves cleaning data stored in database records, e.g. extracting structured fields like road name, city, and state from a flat string containing an address.
• Classified Ads: Classified ads and other listings, like lists of restaurants, paying guest facilities, apartments on lease, etc., are another vital area where unstructured data, once exposed in structured form, is invaluable for querying. This extraction method deals with extracting information from record-oriented data.
6.2 Personal Information Management
Personal Information Management systems are used to organize personal digital content such as emails, documents, projects, etc. in a structured, interlinked manner, so as to extract desired information efficiently when needed. Thus, for example, if we want information related to an email, such as the name of the sender or the phone numbers mentioned in it, the information extraction system would provide these details from the unstructured raw text.
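As a toy illustration of the kind of extraction a personal information management system performs, the sketch below pulls the sender name and phone numbers out of raw email text. The ‘From:’ header layout and the phone-number pattern are simplified assumptions made for this example.

```python
import re

# Toy sketch: extracting structured fields (sender, phone numbers) from
# raw email text. The header format and phone pattern are simplified
# assumptions; real mail parsers handle many more cases.

def extract_email_fields(raw_email):
    """Return a dict with the sender name and any phone numbers found."""
    fields = {'sender': None, 'phones': []}
    sender = re.search(r'^From:\s*(.+)$', raw_email, re.MULTILINE)
    if sender:
        fields['sender'] = sender.group(1).strip()
    # Digit runs optionally prefixed with '+', allowing spaces and hyphens.
    fields['phones'] = re.findall(r'\+?\d[\d\s-]{8,}\d', raw_email)
    return fields

raw = """From: S. R. Ranganathan
Subject: Meeting

Please call me at +91 80 2212-3456 before Friday."""
print(extract_email_fields(raw))
# -> {'sender': 'S. R. Ranganathan', 'phones': ['+91 80 2212-3456']}
```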
6.3 Scientific Applications
With developments in the field of bio-informatics, the scope of and need for information extraction has also increased, from mere detection of named entities to extraction of biological entities like genes and proteins. A major problem in this domain is extracting protein and gene names from repositories such as PubMed, because these entity names differ from traditional person names. This task has thus broadened extraction techniques.
6.4 Web Oriented Applications
• Citation Databases: Several citation databases have been created by extracting information from a vast range of sources, from conference web pages to the personal home pages of researchers. Some of them are Citeseer [13], Google Scholar2, and Cora [14]. These citation databases require several levels of extraction techniques to obtain the desired results: navigating web pages to find publication records, extracting the records from either HTML pages or PDF files, and then extracting associated metadata like title, authors, venue, and year. Citation databases also support important statistical analyses, such as author-level citation counts.
• Opinion Databases: There are several websites that collect opinions from users on a range of topics like sports, politics, products, books, movies, music, and people. Many of the opinions collected are in free-text form, present in blogs, newsgroup posts, review sites, etc. These reviews are not directly reusable by default, but their value can be enhanced by organizing them along structured fields.
• Community Websites: Community websites extract information about the events associated with a particular community or association. Some implementations of this kind are DBLife3 and Rexa4, which find information about researchers, talks, workshops, conferences, projects, and events relevant to the community.
• Comparison Shopping: With the growth of the web, many merchants have launched commercial websites, but since the web is like a vast ocean, enhancing their visibility to customers and enabling users to compare prices of a particular product across different websites is a challenging task. One implementation of this kind, for comparing book prices across different merchant websites, is IndiaBookStore5.
6.5 Structured Web Searches
This is the most challenging setting for effective information extraction: searching using entities and the relationships among them instead of keywords. Keyword search is effective in providing information about entities that are nouns or noun phrases, but it fails when the search is based on relations between entities. For example, if one needs documents on “artists born in Italy between 1450 and 1600”, then keyword-based search is not effective in retrieving them.
7. Tools and Services for Information Extraction
7.1 Free or Open Source Software and Services
• General Architecture for Text Engineering: General Architecture for Text Engineering6 (GATE) is bundled with a free Information Extraction system and is in use everywhere from large corporations to small startups, and from multi-million research consortia to undergraduate projects.
• Apache OpenNLP: Apache OpenNLP7 is a Java machine learning toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, parsing, and co-reference resolution.
• OpenCalais: OpenCalais8 is an automated information extraction web service from Thomson Reuters (Free limited version).
• Machine Learning for Language Toolkit (Mallet): Mallet9 is a Java-based package for a variety of natural language processing tasks, including information extraction.
• DBpedia Spotlight: DBpedia Spotlight10 is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution.
5 http://www.indiabookstore.net
6 https://www.gate.ac.uk
7 https://opennlp.apache.org/
10 https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
7.2 Paid or Commercial Software and Services
• Web Miner: Web Miner11 is commercial software used for extracting specific information, images and files from websites.
• Semantics3: Semantics312 is an e-commerce product and pricing database that obtains its data through information extraction from thousands of online retailers.
11 https://webminer.avantprime.com/
12 https://www.semantics3.com/
8. Summary
The concept of Information Extraction was introduced. Different IE systems used for extracting information from various domains were described. The generic architecture of an IE system and its types, like named entity recognition and co-reference resolution, were explained. Applications of IE systems in various fields were illustrated. The use of free and open source as well as commercial software for IE was also explained.
9. References
1. Sarawagi, S., Information Extraction. Foundations and Trends in Databases 1(3), 261–377 (2008)
2. Lytinen, S., Gershman, A., ATRANS: automatic processing of money transfer messages. In: Proceedings of the 5th National Conference of the American Association for Artificial Intelligence. IEEE Computer Society Press (1986)
3. Andersen, P., Hayes, P., Huettner, A., Schmandt, L., Nirenburg, I., Weinstein, S., Automatic extraction of facts from press releases to generate news stories. In: Proceedings of the 3rd Conference on Applied Natural Language Processing, ANLC ’92, Trento, pp. 170–177
4. Jacobs, P., Rau, L., SCISOR: extracting information from on-line news. Communication of ACM 33, 88–97 (1990)
5. Lehnert, W., Cardie, C., Fisher, D., McCarthy, J., Riloff, E., Soderland, S., University of Massachusetts: MUC-4 test results and analysis. In: Proceedings of the 4th Message Understanding Conference, Morgan Kaufmann, McLean (1992)
6. Lehnert, W., Cardie, C., Fisher, D., Riloff, E., Williams, R., University of Massachusetts: Description of the CIRCUS system as used for MUC-3. In: Proceedings of the 3rd Message Understanding Conference. Morgan Kaufmann, San Diego (1991)
7. Hobbs, J.R., Appelt, D., Bear, J., Israel, D., Kameyama, M., Stickel, M., Tyson, M., FASTUS: A cascaded finite-state transducer for extracting information from natural language text. In: Roche, E., Schabes, Y. (eds.) Finite State Language Processing. MIT Press, Cambridge (1997)
8. Neumann, G., Piskorski, J., A shallow text processing core engine. Computational Intelligence 18, 451–476 (2002)
9. Humphreys, K., Gaizauskas, R., Huyck, C., Mitchell, B., Cunningham, H., Wilks, Y., University of Sheffield: Description of the LaSIE-II system as used for MUC-7. In: Proceedings of MUC-7, Virginia (1998)
10. Bikel, D., Miller, S., Schwartz, R., Weischedel, R., Nymble: a high-performance learning name-finder. In: Proceedings of the 5th Applied Natural Language Processing Conference, Washington. Association for Computational Linguistics (1997)
11. Gao, J., Wu, A., Li, M., Huang, C.-N., Chinese word segmentation and named entity recognition: a pragmatic approach. Computational Linguistics 31(4), 531–574 (2005)
12. Piskorski, J., Yangarber, R., Information Extraction: Past, Present and Future. In: Multi-source, Multilingual Information Extraction and Summarization, pp. 23–49. Springer (2013)
13. Lawrence, S., Giles, C.L., Bollacker, K., Digital libraries and autonomous citation indexing. IEEE Computer 32, 67–71 (1999)
14. McCallum, A., Nigam, K., Reed, J., Rennie, J., Seymore, K., Cora: A Computer Science Research Paper Search Engine. http://cora.whizbang.com/ (2000)