11 Advanced Course in Information Storage and Retrieval I: Natural Language Processing
Biswanath Dutta
I. Objectives
• To study natural language processing techniques and their role in information storage and retrieval.
II. Learning Outcomes
After going through this module, the students:
• Will know about Natural Language Processing and its relationships with Information Retrieval (IR).
• Will know about the various linguistic phenomena of natural language.
• Will know about the various NLP techniques that are generally practiced in IR.
• Will know about the NLP approaches at the syntactic and semantic levels.
III. Structure
1. Introduction
2. Natural Language Processing in Information Retrieval
3. Natural Language Understanding
4. Natural Language Processing Techniques
5. Natural Language Processing Tasks
5.1 Syntactic analysis
5.1.1 Context-free Grammar
5.1.2 Transformational Grammar
5.1.3 Parsing
5.1.3.1 Top-down parsing
5.1.3.2 Bottom-up parsing
5.1.4 Tokenization
5.1.5 Stemming
5.1.5.1 Stemming Algorithm
5.1.6 Lemmatization
5.1.6.1 Lemmatization vs. Stemming
5.2 Semantic Analysis
5.2.1 Knowledge Base
5.2.2 Knowledge Representation
5.2.2.1 Semantic Networks
5.2.2.2 Frames
6. Summary
7. References
1. Introduction
The goal of an Information Retrieval (IR) system, as we know, is to respond to a user’s request by retrieving documents whose contents match the user’s information need. The standard practice is that, after the documents are retrieved, the user examines them by going through the text and determines whether or not they are relevant. Users typically express their information requirements in natural language, either as a statement or as part of a natural language dialogue. However, as we know from experience, the retrieved documents often do not match the user’s information need. This is because of the ambiguous nature of natural languages (discussed in detail in the succeeding sections).
Natural Language Processing (NLP) is an area of research and application. It studies how a natural language text, entered into a computer system, can be manipulated and transformed into a form suitable for further processing [6]. The goal is to analyze documents intelligently by determining the structure of the sentences and deriving and interpreting their meaning in context. This has led researchers to apply NLP techniques to information retrieval problems in order to produce document representations and queries for efficient retrieval [1].
In this module, we discuss the basics of NLP, the various linguistic phenomena of natural language, and the use of NLP in information retrieval. We also discuss some of the well-established NLP techniques and tasks.
2. Natural Language Processing in Information Retrieval
Natural Language Processing (NLP) is an area of research and application. The focus is to explore how natural language text entered into a computer system can be manipulated and transformed into a form more suitable for further processing [6]. NLP emerged in the 1960s as a sub-field of Artificial Intelligence and Linguistics, with the aim of studying problems in the automatic generation and understanding of natural language [8]. The primary goal of NLP is to process text of any type the same way we, as humans, do, and to extract what is meant at the different levels at which meaning is conveyed in a language [9].
Automatic NLP techniques have long been considered a desirable feature of an information retrieval system, especially a textual information retrieval system. The techniques can be used to produce descriptions of both document content and the user’s query. The aim is to compare these descriptions and retrieve the documents that best suit the user’s information needs [10].
In the following, the tasks of an NLP-based automatic information retrieval system are described [8]; a minimal sketch of this pipeline follows the list.
i. Indexing the collection of documents: the index, which consists of document descriptions, is generated by applying NLP techniques. The documents are described using a set of terms that best represent their content.
ii. Query representation: when a user formulates a query, the system analyses it and attempts to transform it into the same kind of representation used for the document content.
iii. Query processing: the system matches the description of each document with the query and retrieves those documents whose descriptions closely match the query description.
iv. Display of results: the retrieved documents are usually listed in order of relevance, i.e., based on the level of similarity between the document description and the query description.
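To make this pipeline concrete, the following minimal Python sketch indexes two toy documents as bags of terms, represents a query the same way, and ranks the documents by cosine similarity between the descriptions. The documents, the query, and all function names are illustrative assumptions, not part of the module, and the “NLP” here is deliberately crude.

    # Steps (i)-(iv) of the NLP-based retrieval pipeline, in miniature.
    import re
    from collections import Counter
    from math import sqrt

    def describe(text):
        # Very crude linguistic processing: lowercase and count word tokens.
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(d, q):
        # Similarity between a document description and a query description.
        num = sum(d[t] * q[t] for t in set(d) & set(q))
        den = sqrt(sum(v * v for v in d.values())) * sqrt(sum(v * v for v in q.values()))
        return num / den if den else 0.0

    documents = {
        "doc1": "John read the pamphlet in the train.",
        "doc2": "The crane lifted the beam onto the bridge.",
    }
    index = {name: describe(text) for name, text in documents.items()}            # (i) indexing
    query = describe("pamphlet read on a train")                                  # (ii) query representation
    ranking = sorted(index, key=lambda n: cosine(index[n], query), reverse=True)  # (iii) query processing
    for name in ranking:                                                          # (iv) display of results
        print(name, round(cosine(index[name], query), 3))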
3. Natural Language Understanding
Before discussing the NLP techniques, we discuss the features of a natural language, or in other words, the linguistic phenomena that influence the recall and precision of information retrieval. Understanding natural language is very important, as it lies at the core of NLP. Natural language understanding is concerned with the process of comprehending and using language once the words are recognized. The objective is to specify a computational model that matches humans in linguistic tasks such as reading, writing, hearing, and speaking [2].
The two main characteristics of a natural language are:
• Linguistic variation – different words (aka terms) are used to express the same meaning. For instance, words ‘car’, ‘auto’, ‘automobile’, and ‘motorcar’ communicate the same meaning “a motor vehicle with four wheels; usually propelled by an internal combustion engine”.
• Linguistic ambiguity – the same word has more than one meaning or allows more than one interpretation. For example, ‘crane’ can mean ‘a lifting device’ or ‘a large long-necked wading bird’.
The above characteristics of natural language seriously affect the information retrieval process. For instance, linguistic variation can cause the system to remain silent, i.e., retrieve nothing [8], because the search term may not match the term used in the document description, even though a semantically equivalent term is present in the document. On the other hand, linguistic ambiguity adds noise to the retrieved result, because the retrieved document descriptions may contain the same terms as the search query but used with a different connotation.
The effects of these phenomena on information retrieval are further illustrated below. The repercussions can be observed mainly at three different levels: the syntactic, semantic, and pragmatic levels [8].
• At the syntactic level: the focus is on the relationships between words that form larger linguistic units, such as phrases and sentences. Ambiguity arises because a sentence can be associated with more than one syntactic structure. For instance, John read the pamphlet in the train could mean two things: John read the pamphlet that was in the train, or John read the pamphlet while he was traveling by train.
• At the semantic level: the focus is to study the meaning of a word and sentence by studying the meaning of each word in it. An ambiguity arises as a word can have multiple meanings. For instance, John was reading a book in the bank. Here, the word bank may have, at least, two different meanings: a financial institution and a sloping land (especially the slope beside a body of water).
• At the pragmatic level: the focus is to study the language’s relationship to its context. However, we often cannot use a literal and automated interpretation of the terms used. The idea is, “in specific circumstances, the sense of the words in the sentence must be interpreted at a level that includes the context in which the sentence is found [8]”. For instance, John enjoyed the book. This can be interpreted differently: John enjoyed reading the book, or John enjoyed writing the book.
4. Natural Language Processing Techniques
There are two fundamental NLP techniques that are generally practiced in IR. They are:
i. Statistical approach; and
ii. Linguistic approach.
i. Statistical Approach
A statistical approach to natural language processing represents the classical model of information retrieval systems. The statistical approach is relatively simple. Its key focus is the ‘bag of words’ [8]: all words in a document are treated as its index terms, and each term is assigned a weight as a function of its importance, usually determined by its frequency of appearance within the document. Nevertheless, the ‘bag of words’ model is not ideal for processing natural language documents, because it fails to consider other aspects of a natural language, especially the ordering of words, structure, and meaning.
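As an illustration, the short Python sketch below (the sample text and variable names are hypothetical) builds a ‘bag of words’ description in which each term’s weight is simply its relative frequency of appearance in the document; word order, structure, and meaning are discarded.

    from collections import Counter

    text = "to be or not to be"
    bag = Counter(text.lower().split())        # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
    total = sum(bag.values())
    weights = {term: count / total for term, count in bag.items()}
    print(weights)                             # each index term weighted by its frequency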
ii. Linguistic Approach
The linguistic approach is based on a set of techniques and rules that explicitly encode linguistic knowledge [11]. In this approach, the documents are analyzed at different levels, namely, the syntactic, semantic, and pragmatic levels.
5. Natural Language Processing Tasks
In the following, we discuss some of the widely used linguistic processing techniques at the syntactic and semantic levels. Note that most of today’s NLP systems follow a mixed approach, i.e., a combination of techniques from both the statistical and linguistic approaches [8].
5.1 Syntactic analysis
Generally speaking, syntax deals with the structural properties of texts; it is the grammatical arrangement of words in sentences. In syntactic analysis, valid sentences are recognized and their underlying structures are determined [6]. The process involves analyzing and decomposing sentences into parts of speech, with an explanation of the form, function, and syntactical relationship of each part.
The syntactic structure of a sentence is governed by syntactical rules (aka grammar). Generally speaking, a (formal) grammar is a set of rules for rewriting strings, along with a start symbol from which rewriting starts [32]. The grammar is the means of formalizing our knowledge of the language, and hence it generates the legal sentences of the language.
5.1.1 Context-free Grammar
Context-free grammar was developed by Noam Chomsky in the mid-1950s [12]. A grammar is called context-free when its production rules can be applied regardless of the context of a nonterminal; in each production there is only a single nonterminal symbol on the left-hand side. A context-free grammar is shown, with an example, in Table 1. Figure 1 shows a context-free derivation tree for the example sentence John liked the book. The starting point is a nonterminal node (a node that appears only in the interior of the tree structure for the given sentence [6]), expressed by the root. As we can see from Figure 1, a single nonterminal node on the left-hand side can always be replaced by the corresponding right-hand side [13], and this process continues until only terminal nodes remain. A runnable sketch of a similar grammar is given after Figure 1.
Table 1: Context-free grammar
In the above: np = noun phrase; vp = verb phrase; v = verb; det = determiner; final-punc = final punctuation.
Fig.1: Context-free derivation tree
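As an illustration, a grammar of this kind can be written down and tested with a toolkit such as NLTK. The rule set below is a plausible reconstruction of Table 1 (final punctuation omitted), not a copy of it, and it assumes NLTK is installed.

    # A small context-free grammar for "John liked the book", tried with NLTK.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> 'John' | Det N
        VP -> V NP
        Det -> 'the'
        N -> 'book'
        V -> 'liked'
    """)
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("John liked the book".split()):
        tree.pretty_print()    # prints a derivation tree like the one in Figure 1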
5.1.2 Transformational Grammar
Transformational grammar is a generative grammar. It was first introduced by Noam Chomsky. In [14], Chomsky developed the idea that each sentence in a language has two levels of representation: a deep structure and a surface structure. The deep structure represents the core semantic relations of a sentence and is mapped on to the surface structure via transformations. It is to be noted that the context-free grammars (discussed in the previous section) fail to represent subject-verb agreement in all cases [6].
Transformational grammar starts out with context-free rules to build up the basics of the sentence, but then modifies the basic sentence with transformational rules [15]. The tree structure produced by the context-free rules from the basic structure is called the deep structure, and the tree structure produced after applying the transformational rules is called the surface structure. Transformational grammars specify the legal sentences of a language by giving rules; for instance, for the rule s → np vp, a transformational rule may specify that the aux should be replaced by an aux that has a feature giving it the same number as the subject of the sentence [6]. Figure 2 presents a transformational grammar for the example sentence John is sleeping. Figure 2(a) shows the deep structure generated using the context-free rules, and Figure 2(b) shows the surface structure generated using the transformational rules.
Fig. 2 (a): Context-free rules Fig.2 (b): Transformational rules
5.1.3 Parsing
Parsing is the de-linearization of linguistic input. The idea is to use syntax to determine the functions of the words in the input sentence in order to create a data structure that can be used to get at the meaning of the sentence [6]. It is the transformation of linguistic input from a potentially ambiguous phrase into an internal representation. The primary component of parsing is the parser, a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree, or other hierarchical structure – giving a structural representation of the input and checking for correct syntax in the process [33]. The main purpose of parsing is to transform the potentially ambiguous input phrase into an unambiguous form [6].
Parsing can be done in two ways: top-down parsing and bottom-up parsing.
5.1.3.1 Top-down parsing
Top-down parsing is a parsing strategy that starts at the root of the parse tree and grows towards its leaves by using the rewriting rules of a formal grammar (a set of production rules for strings in a formal language). The strategy is to find leftmost derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules [7].
Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of the grammar rules [33]. Figure 3 shows an example of a top-down parse tree for the sentence he drove a car.
Fig. 3: Top-down parsing tree
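As a sketch, NLTK’s recursive-descent parser behaves as described above: it starts from the start symbol, expands rules downwards, and consumes tokens from left to right. The small grammar below is an illustrative assumption, not the one behind Figure 3.

    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Pronoun | Det N
        VP -> V NP
        Pronoun -> 'he'
        Det -> 'a'
        N -> 'car'
        V -> 'drove'
    """)
    parser = nltk.RecursiveDescentParser(grammar)    # a top-down parser
    for tree in parser.parse("he drove a car".split()):
        print(tree)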
5.1.3.2 Bottom-up parsing
Bottom-up parsing is a parsing strategy that starts at the leaves and grows towards the root. In other words, bottom-up parsing involves identifying and processing the lowest-level small details of the text first, before its mid-level structures, leaving the highest-level overall structure to last. A bottom-up parser discovers and processes the tree starting from the bottom-left end and incrementally works its way upwards and rightwards [16]. For instance, as shown in Figure 4, starting with the sentence he drove a car, we match each part in turn with the right-hand side of some rule [1]. We can match the word he with the right-hand side of the rule pronoun → he. Having done this, we replace the matched part with the left-hand side of the rule, giving pronoun drove a car. By repeatedly doing this, we generate the complete tree shown in Figure 4. It can be noted that bottom-up parsers handle a large class of grammars.
Fig. 4: Bottom-up parsing tree
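For comparison, a shift-reduce parser works bottom-up: it shifts tokens onto a stack and reduces them to a rule’s left-hand side whenever the top of the stack matches a right-hand side, working upwards from he to pronoun to np and so on. The sketch below uses NLTK’s ShiftReduceParser with the same illustrative grammar as the top-down sketch.

    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Pronoun | Det N
        VP -> V NP
        Pronoun -> 'he'
        Det -> 'a'
        N -> 'car'
        V -> 'drove'
    """)
    parser = nltk.ShiftReduceParser(grammar)         # a bottom-up parser
    for tree in parser.parse("he drove a car".split()):
        print(tree)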
5.1.4 Tokenization
Tokenization (aka word segmentation) is the process of breaking up the text into words, phrases, symbols, or other meaningful elements [34]. Each of these elements is called a token. The list of tokens becomes input for further processing such as in parsing, text mining, and so forth.
Tokenization occurs at the word level and is done by locating word boundaries [4]. Tokenizers often rely on simple heuristics to locate word boundaries, i.e., the end point of one word and the beginning of the next, which are typically marked by whitespace characters (such as a space or line break) or punctuation characters. Note that punctuation and whitespace may or may not be included in the resulting list of tokens. Tokenization in languages that use inter-word spaces (such as most languages that use the Latin alphabet) is fairly straightforward. However, beyond this simple case there are many edge cases where tokenization is not so simple, for example [6] (a small sketch contrasting a naive and a slightly smarter tokenizer follows the examples):
i. New York-based: a naive tokenizer may break at the space, even though the better break is at the hyphen. Similar issues arise with, for instance, Hewlett-Packard, state-of-the-art, co-education, database, San Francisco, August 31, 2013, and 800.110.2000;
ii. Ancient Greek, which is often written in scriptio continua, with no spaces between words;
iii. Chinese, which is written without spaces marking word boundaries.
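The sketch below (the sample sentence and the regular expression are illustrative) contrasts a naive whitespace tokenizer with a slightly smarter one that keeps hyphenated words and dotted numbers such as 800.110.2000 together.

    import re

    text = "The New York-based firm Hewlett-Packard opened on August 31, 2013; call 800.110.2000."

    naive = text.split()                                    # breaks only at whitespace, keeps punctuation attached
    smarter = re.findall(r"\w+(?:[-.]\w+)*|[^\w\s]", text)  # keeps 'York-based' and '800.110.2000' whole
    print(naive)
    print(smarter)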
Two popular pieces of tokenization software are [34]:
i. U-Tokenizer: an API over HTTP that can cut Chinese and Japanese sentences at word boundaries; it also supports English;
ii. RoboVerdict: an implementation of an algorithm that automatically rates products by tokenizing the texts of various reviews and finding similarities between them; it supports only English.
5.1.5 Stemming
Stemming is the process that chops off the ends of words to reduce them to their stem (base) or root form. The goal is to reduce inflectional forms (and sometimes derivationally related forms) of a word to a common base form. For example, the words automate(s), automatic, and automation all reduce to the stem automat. Similarly, the words fishing, fished, fish, and fisher all reduce to the stem fish. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not itself a valid root. The disadvantage of stemming is that it increases recall while compromising precision. For instance, the Porter stemmer [3] stems the words operate, operating, operates, operation, operative, operatives, and operational to the base oper. Since operate in its various forms is a common verb, we would lose considerable precision on queries with Porter stemming [3]. For instance, for a query consisting of the words operating and system, a sentence containing the words operate and system would match after stemming even though it is not a good match; the same holds for the query operational and research against a sentence containing operates and research.
5.1.5.1 Stemming Algorithm
Stemming programs are commonly called stemmers. Stemmers use language-specific rules. The most common stemming algorithm for English is Porter’s algorithm [17], which has repeatedly been shown to be empirically very effective. Other popular stemmers include the Lovins stemmer [18] and the Paice/Husk stemmer [19]. Figure 5 shows how these three stemmers work [3] for a sample text shown at the beginning of the figure.
Fig.5: Output of three stemmers for a given sample text [3]
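As a small illustration, two of these stemmers are available in NLTK (its Lancaster stemmer implements the Paice/Husk algorithm); the word list is the operate example from the previous section. NLTK itself is an assumption of this sketch, not something prescribed by the module.

    from nltk.stem import PorterStemmer, LancasterStemmer

    words = ["operate", "operating", "operates", "operation",
             "operative", "operatives", "operational"]
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    print([porter.stem(w) for w in words])       # Porter collapses every form to 'oper'
    print([lancaster.stem(w) for w in words])    # the Paice/Husk (Lancaster) stemmer, typically more aggressive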
5.1.6 Lemmatization
Lemmatization, in linguistics, is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item [35, 36]. In computational linguistics, it is the algorithmic process of determining the lemma of a given word. In many languages, words appear in several inflected forms. For example, in English, the verb to walk may appear as walk, walked, walks, or walking; here, the base form walk is the lemma. The combination of the base form with the part of speech is often called the lexeme of the word [20]. The lemmatization process involves two tasks: understanding the context and determining the part of speech of a word in a sentence. These two tasks make lemmatization a complex process that is difficult to implement, especially when building a lemmatizer for a new language.
5.1.6.1 Lemmatization vs. Stemming
Lemmatization is closely related to stemming. However, the two differ in flavor, as discussed below (a small sketch contrasting them follows the list).
i. The fundamental difference is that a stemmer operates on a single word without its context and, as a result, cannot distinguish between words that have different meanings depending on their part of speech [3]. Unlike stemming, lemmatization can, in principle, select the appropriate lemma depending on the context. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, i.e., the lemma. For example, the word meeting can be either the base form of a noun or a form of the verb to meet, depending on the context, as in our last meeting or we are meeting again tomorrow.
ii. Stemmers are easier to implement and run faster than lemmatizers. For applications that do not need very high accuracy, a stemmer is easy to implement and use. For instance, the word better has good as its lemma; stemming misses this link, as finding it requires a dictionary lookup.
iii. Stemmers use language-specific rules and require less knowledge than a lemmatizer. Particular domains may also require special stemming rules [3]. Lemmatizers need a complete vocabulary and morphological analysis to correctly lemmatize words.
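The contrast can be seen in a short sketch using NLTK’s Porter stemmer and its WordNet-based lemmatizer (assuming NLTK and its wordnet data are installed); note how the lemmatizer needs the part of speech, i.e., the context, to pick the right lemma, which is exactly what a stemmer lacks.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)       # lexicon used by the lemmatizer
    nltk.download("omw-1.4", quiet=True)       # needed by newer NLTK versions

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    print(stemmer.stem("meeting"))                      # 'meet', regardless of context
    print(lemmatizer.lemmatize("meeting", pos="n"))     # 'meeting' (noun reading)
    print(lemmatizer.lemmatize("meeting", pos="v"))     # 'meet' (verb reading)
    print(lemmatizer.lemmatize("better", pos="a"))      # 'good', found via dictionary lookup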
5.2 Semantic Analysis
Semantic analysis makes up the most complex phase of language processing [23]. In this phase, semantic information is added to the derivation trees generated in the syntactic analysis phase. As discussed above, syntactic analysis determines whether the input text string is a sentence of the given natural language and, if it is, the result of the analysis contains a description of the sentence’s syntactic structure, for example in the form of a derivation tree. In semantic analysis, based on knowledge about the structure of words and sentences, the meaning of words, phrases, sentences, and texts is stipulated, and subsequently also their purpose and consequences. The idea is that syntactic analysis must use semantic knowledge to eliminate ambiguities that cannot be resolved by structural considerations alone [6]. In the following, we discuss two basic concepts, knowledge base and knowledge representation, that are commonly used in the semantic analysis phase.
5.2.1 Knowledge Base
A knowledge base consists of various kinds of knowledge; for instance, it can hold general knowledge about the world or knowledge about a specific domain. For more than a decade now, many researchers have been working on building such knowledge bases, which encompass huge amounts of data. Knowledge bases help NLP systems process and interpret the meanings of natural language statements, or in other words, help eliminate semantic ambiguities. Nevertheless, most of today’s knowledge bases cover only specific domains and are created by relatively small groups of knowledge engineers [37]. They are also very cost-intensive to keep up to date as their domains change. Some of the popular knowledge bases, such as OpenCyc, WordNet, Freebase, and DBpedia, are discussed below.
OpenCyc (http://www.cyc.com/platform/opencyc): OpenCyc is the world’s largest and most complete general knowledge base and commonsense reasoning engine. It includes hundreds of thousands of terms, along with millions of assertions relating the terms to each other. The first version of OpenCyc was released in spring 2002 and contained only 6,000 concepts and 60,000 facts. The latest version of OpenCyc 4.0, released in June 2012, contains 239,000 concepts and 2,093,000 facts. The knowledge base is released under the Apache License.
WordNet (http://wordnet.princeton.edu/): WordNet is a large lexical database of English. Words from the major parts of speech, such as nouns, verbs, adjectives, and adverbs, are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet is freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. It superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important differences between WordNet and a thesaurus. First, WordNet interlinks not just word forms – strings of letters – but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.
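As a small illustration, WordNet can be queried through NLTK (assuming NLTK and its wordnet data are installed): the first noun synset of car groups the synonyms mentioned in Section 3, and crane has several distinct noun senses.

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)

    car = wn.synsets("car", pos=wn.NOUN)[0]
    print(car.lemma_names())      # ['car', 'auto', 'automobile', 'machine', 'motorcar']
    print(car.definition())
    print(car.hypernyms())        # conceptual-semantic link: a car is a kind of motor vehicle

    for sense in wn.synsets("crane", pos=wn.NOUN):
        print(sense.name(), "-", sense.definition())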
Freebase (http://www.freebase.com/): Freebase is a large collaborative knowledge base consisting of metadata. The knowledge base is composed mainly by its community members. It is an online collection of structured data harvested from many sources, including individual wiki contributions [24]. Freebase aims to create a global resource which allows people (and machines) to access common information more effectively. It was developed by the American software company Metaweb. Freebase has been running publicly since March 2007. Currently, Freebase consists of 43,885,359 topics, and 2,430,769,963 facts. Its data is available for non-commercial use under a Creative Commons Attribution License.
DBpedia (http://wiki.dbpedia.org/): DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. The English version of the DBpedia knowledge base currently describes 4.0 million things, out of which 3.22 million are classified in a consistent ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films, and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species, and 5,600 diseases. The DBpedia knowledge base is accessible on the Web under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.
5.2.2 Knowledge Representation
Generally speaking, knowledge representation (aka semantic representation) is an area that deals with representing knowledge about the world in general (e.g., that birds can fly, that a key is required to open the locked door), or domain specific knowledge (e.g., an even number higher than 2 cannot be a prime number) in a form that a computer system can utilize to solve complex tasks, such as having a dialog in a natural language. In other words, knowledge representation is an internal representation that is created from natural language statements. The internal representation is not limited to the language of the input text, and can be used for further processing, for instance, in matching users’ queries in information retrieval, in the creation of a database in one or more languages, in any sort of text processing work, i.e., representing the text in a specific format, etc. [6].
Knowledge representation incorporates findings from psychology about how humans solve problems and represent knowledge in order to design formalisms that will make complex systems easier to design and build. The two major purposes of knowledge representation as Kemp [22] stated are:
i. to help people understand the system they are working with, and
ii. to enable the system to process the representation.
The greatest challenge in knowledge representation is, as stated in [23], “not to gather the knowledge, but to represent and structure it in a suitable way, to search in it efficiently, and to use it to infer further knowledge. These goals in their essence correspond to the task of constructing artificial intelligence, which is without any doubt one of the biggest and most interesting topics of modern science”. Some of the examples of knowledge representation formalisms include semantic networks, frames, conceptual dependency, case grammar, ontologies. In the following, we discuss semantic networks and frames.
5.2.2.1 Semantic Networks
A semantic network is a network that represents semantic relations between language units. It is a directed or undirected graph consisting of vertices (nodes) and edges (arcs) [25]. Both nodes and arcs can have labels. Usually, nodes represent language units, such as words, phrases, word senses, sentences, and documents, while edges represent relationships between them, such as co-occurrence, collocation, syntactic dependency, or lexical similarity. The networks are called semantic because they express the meaning of the text, which is achieved through the encoding of explicit relationships between the language units.
Semantic networks have been proposed as representations because they enable the storage of language units and the relations, which allow for a variety of inference and reasoning processes. Semantic networks are primarily useful tools for those who need to form a conceptual schema of the domain.
There are two fundamental properties that we see in the semantic networks. They are inheritance hierarchy and intersecting search. Inheritance hierarchy refers to the taxonomy of concepts denoted by common nouns. The hierarchies are called inheritance because nodes inherit properties from those above them in the hierarchy. Intersecting search refers to a method of finding a connection between two concepts [26].
Figure 6 presents a simple semantic network consisting of nodes interconnected by semantic relations. It presents a series of facts embodied within the network, for instance, Penguin is a bird, Bird is a vertebrate, and Bird has part wings. As stated above, the facts embedded in the network allow us to infer new facts. For instance, from the first two facts, i.e., Penguin is a bird and Bird is a vertebrate, it can be inferred that Penguin is a vertebrate.
Fig. 6: A simple semantic network
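A minimal sketch of how such a network might be encoded and queried (all names are illustrative): the facts are stored as labelled edges, and a small routine follows is-a links upwards, which is the inheritance-based inference described above.

    facts = [
        ("penguin", "is-a", "bird"),
        ("bird", "is-a", "vertebrate"),
        ("bird", "has-part", "wings"),
    ]

    def is_a(node, target):
        # Follow 'is-a' edges upward to see whether node inherits from target.
        parents = [o for (s, r, o) in facts if s == node and r == "is-a"]
        return target in parents or any(is_a(p, target) for p in parents)

    print(is_a("penguin", "vertebrate"))   # True, inferred rather than stated explicitly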
5.2.2.2 Frames
The frame is one of the most widely used schemes for knowledge representation. It was first proposed by Marvin Minsky in his 1974 article ‘A framework for representing knowledge’ as a basis for understanding visual perception, natural language communication, and other complex behaviors [6]. A frame is an artificial intelligence data structure used to divide knowledge into substructures by representing stereotyped situations. Frames provide a structure within which new data are interpreted in terms of concepts acquired through previous experiences [27]. This organization of knowledge facilitates expectation-driven reasoning, i.e., looking for things that are expected according to the context [6]. The representational mechanism that makes this kind of reasoning possible is the slot: each piece of information about a particular frame is held in a slot.
A frame consists of information such as how to use the frame, what to expect next, what frames are likely to be used in particular circumstances, and what to do when these expectations are not met. For example, in a student frame, the different slots may be name, sex, age, date_of_birth, home, course_name, skill, and so on. Each of these slots can have a specific value. Table 2 presents the values of the different slots for a particular student, John.
Table 2: Student frame
Name of slot | Value of slot
name | John
sex | Male
age | IF-NEEDED: Subtract(current, BIRTHDATE)
date_of_birth | 05.12.1993
home | Janpath Nagar, New Delhi
course_name | MSLIS
skill | Computer
Sometimes frames are linked in such a way that the value of one slot points towards the new frame to be considered. For example, the value MSLIS in the slot course_name may point towards another frame, like Courses. Frames are easy to implement. Note that because the frames are structurally based, it is possible to generate a semantic network given a set of frames, even though it lacks explicit arcs [38].
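A minimal sketch of the student frame as a Python data structure (illustrative only): the age slot holds an IF-NEEDED procedure that computes its value on demand from date_of_birth, and course_name points to another frame, as described above.

    from datetime import date

    courses_frame = {"name": "Courses", "members": ["MSLIS"]}

    student_frame = {
        "name": "John",
        "sex": "Male",
        "date_of_birth": date(1993, 12, 5),          # 05.12.1993, read here as day.month.year
        "home": "Janpath Nagar, New Delhi",
        "course_name": courses_frame,                # slot value pointing to another frame
        "skill": "Computer",
        "age": lambda slots: date.today().year - slots["date_of_birth"].year,   # IF-NEEDED procedure
    }

    def get(frame, slot):
        value = frame[slot]
        return value(frame) if callable(value) else value   # run IF-NEEDED procedures on demand

    print(get(student_frame, "age"))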
6. Summary
We know that natural language is the most practical means for users to interact with an information retrieval system. Users feel comfortable constructing queries in natural language. However, it is often the case that the system fails to meet the user’s information need. We often find that most of the retrieved documents are irrelevant to the user’s requirement, and the user has to spend a lot of time filtering out the relevant documents before actually using them. This is because the system retrieves many irrelevant documents and very few documents that actually meet the user’s information need, mainly due to the ambiguous nature of natural language. Common phenomena of natural language include homography, complementary polysemy, metonymy, and metaphor [28, 30, 31]. In this module, we discussed the use of natural language processing techniques in information retrieval. We also discussed some of the important natural language processing tasks, mainly the tasks carried out at the syntactic and semantic levels.
7. References
1. Goker, Ayse and Davies, John (eds.). Information retrieval: searching in the 21st century. UK: Wiley, 2009.
2. Robin (2010). Natural language understanding. http://language.worldofcomputing.net/understanding/natural-language-understanding.html
3. Manning, Christopher D., Raghavan, Prabhakar and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
4. What is Tokenization? http://language.worldofcomputing.net/category/tokenization
5. Huang, C., Simon, P., Hsieh, S. and Prevot, L. (2007). Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 69-72.
6. Chowdhury, G. G. Introduction to modern information retrieval. London: Library Association Publishing, 1999.
7. Aho, A. V., Sethi, R. and Ullman, J. Compilers: principles, techniques, and tools. Boston, USA: Addison-Wesley Longman Pub. Co., 1986.
8. Vallez, Mari and Pedraza-Jimenez, Rafael (2007). Natural language processing in textual information retrieval and related topics. Hypertext.net, n. 5.
9. Liddy, Elizabeth D. Natural language processing for information retrieval and knowledge discovery. https://www.ideals.illinois.edu/bitstream/handle/2142/26000/Liddy_Natural.pdf?sequence=2
10. Allan, J. (2000). NLP for IR – Natural Language Processing for Information Retrieval. NAACL/ANLP language technology joint conference, Washington, USA. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8740
11. Sanderson, M. (2000). Retrieving with good sense. Information Retrieval, vol. 2, pp. 49-69.
12. Hopcroft, John E. and Ullman, Jeffrey D. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.
13. Context-free grammar. http://en.wikipedia.org/wiki/Context-free_grammar
14. Chomsky, Noam. Syntactic Structures. Mouton & Co., 1957.
15. Charniak, E. and McDermott, D. Introduction to artificial intelligence. Vol. 1. London: Pitman, 1981.
16. Aho, Alfred, Lam, Monica, Sethi, Ravi and Ullman, Jeffrey. Compilers: Principles, Techniques, and Tools. 2nd ed. Prentice Hall, 2006.
17. Porter, Martin F. (1980). An algorithm for suffix stripping. Program, 14 (3), pp. 130-137.
18. Lovins, Julie Beth (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11 (1), pp. 22-31.
19. Paice, Chris D. (1990). Another stemmer. SIGIR Forum, 24 (3), pp. 56-61.
20. Lemmatization. http://en.wikipedia.org/wiki/Lemmatisation
21. Syntactic and Semantic Analysis and Knowledge Representation. http://www.fi.muni.cz/research/nlp/analysis.xhtml.en
22. Kemp, D. Computer-based knowledge retrieval. London: Aslib, 1988.
23. Syntactic and Semantic Analysis and Knowledge Representation. http://www.fi.muni.cz/research/nlp/analysis.xhtml.en
24. Markoff, John (2007). Start-Up Aims for Database to Automate Web Searching. The New York Times.
25. Sowa, J. F. (1987). Semantic Networks. http://www.jfsowa.com/pubs/semnet.htm
26. Garnham, A. Artificial intelligence: an introduction. London: Routledge & Kegan Paul, 1988.
27. Salton, G. Automatic text processing: the transformation, analysis and retrieval of information by computer. MA: Addison-Wesley, 1989.
28. Freihat, A. A., Giunchiglia, F. and Dutta, B. (2013). Regular Polysemy in WordNet and Pattern based Approach. International Journal on Advances in Intelligent Systems, vol. 6, no. 3 & 4, pp. 199-212.
29. Freihat, A. A., Giunchiglia, F. and Dutta, B. (2013). Solving Specialization Polysemy in WordNet. International Journal of Computational Linguistics and Applications (IJCLA), vol. 4, no. 1, pp. 29-52.
30. Freihat, A. A., Giunchiglia, F. and Dutta, B. (2013). Solving Specialization Polysemy in WordNet. In 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), March 24-30, 2013, Samos, Greece.
31. Freihat, A. A., Giunchiglia, F. and Dutta, B. (2013). Approaching Regular Polysemy in WordNet. In Proceedings of the 5th International Conference on Information, Process, and Knowledge Management (eKNOW), February 24 – March 1, 2013, Nice, France, pp. 63-69. ISBN: 9781612082547. Available at: http://www.thinkmind.org/index.php?view=article&articleid=eknow_2013_4_30_60145
32. Formal Grammar. https://en.wikipedia.org/wiki/Formal_grammar
33. Parsing. https://en.wikipedia.org/wiki/Parsing
34. Tokenization. https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)
35. Lemmatize. http://www.collinsdictionary.com/dictionary/english/lemmatize
36. Lemmatization. https://en.wikipedia.org/wiki/Lemmatisation#cite_note-1
37. DBpedia. http://wiki.dbpedia.org/about
38. Frame. https://en.wikipedia.org/wiki/Frame_(artificial_intelligence)