14 Web Based Information Retrieval

Devika P Madalli

I. Objectives:

To study the use of Information Retrieval systems in a Web based environment.

II. Learning Outcome

After going through this module it is expected that:

a) The reader will be able to understand the features of web browsers.

b) The reader will gain knowledge of Hyper Text Markup Language (HTML) and various tags.

c) The reader will be familiar with the different types of information resources on the web: primary, secondary and tertiary.

d) The reader will be able to distinguish between classical IR and Web Based IR.

e) The reader will gain the knowledge of advanced search engines and applications.

f) The reader will understand the concept of semantic search in semantic web and search engine technology.

III. Structure

1. World Wide Web

1.1. Document Model

1.2. Naming

2. Types of Information and resources

2.1 Primary Sources

2.2 Secondary Sources

2.3 Tertiary Sources

3. User Interaction and Search

4. Difference between classic IR and WBIR

5. Search engines

5.1 Types of Search Engines

5.2 Features of Search Engine

5.3 Advanced Search Engines and Applications

6. Web directories

7. Ontology

7.1 Semantic Search

8. Summary

9. References

1. World Wide Web

The World Wide Web (WWW) is a huge system based on client-server architecture with millions of distributed servers worldwide. Here, every server maintains a collection of documents and each document is stored as a file (although documents can also be generated on request). A server accepts requests for providing a document and then sends it to the client. Requests for storing new documents can also be made to servers.

The WWW started as a project at CERN, the European Particle Physics Laboratory in Geneva, to provide access to shared documents using a simple hypertext system for its large and geographically dispersed group of researchers. A document in this context can be anything that can be displayed on a user’s computer terminal, such as personal notes, figures, drawings, reports, blueprints, and so on. By connecting documents to each other, it becomes easy to integrate them from various places into a new document without the need for centralized changes. The only thing required was to establish a document providing connections to other relevant documents.

The simplest way to refer to a document is through a reference known as a Uniform Resource Locator (URL), which specifies the location of the document.

A client, or user, interacts with Web servers through a special application called a browser. A browser is responsible for properly displaying a document. It also accepts input from the user, usually by allowing the user to select a reference to another document, which it then fetches and displays.
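The request and response cycle described above can be sketched with Python's standard library. This is only an illustration under stated assumptions: the `DOCUMENTS` collection, the `DocumentHandler` class and the `fetch_once` helper are hypothetical names, and the server runs on the local machine on a port chosen by the operating system.

```python
import http.client
import socketserver
import threading
from http.server import BaseHTTPRequestHandler

# Hypothetical document collection held by the server (path -> content).
DOCUMENTS = {
    "/index.html": b"<html><body><h1>Hello World</h1></body></html>",
}


class DocumentHandler(BaseHTTPRequestHandler):
    """Answers a GET request with the stored document, or 404 if none exists."""

    def do_GET(self):
        body = DOCUMENTS.get(self.path)
        if body is None:
            self.send_error(404, "Document not found")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once(path):
    """Start a local server, request one document as a client, then stop."""
    # Port 0 asks the operating system for any free port.
    server = socketserver.TCPServer(("127.0.0.1", 0), DocumentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", path)
        response = connection.getresponse()
        status, body = response.status, response.read()
        connection.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_once("/index.html")` plays the role of the browser asking for a document, while a path that is not in `DOCUMENTS` yields the familiar 404 status.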

1.1 Document Model

In the Web environment, all information is represented in the form of documents, and a document can be expressed in many ways. Some documents are as simple as an ASCII text file, while others consist of a collection of scripts that are executed automatically when the document is downloaded into a web browser.

A document can contain references to other documents in the form of hyperlinks. When a document is opened in a web browser, its hyperlinks to other documents are visible to the user, who can follow a link by clicking on it.

Hyper Text Markup Language, or simply HTML, is the basic building block of Web documents. It is a mark-up language that provides keywords (tags) for structuring a document into sections. For example, each HTML document is divided into a heading section and a main body. HTML also provides tags to distinguish headers, tables, lists and forms, and makes it possible to insert images or animations in a document. Besides these structural elements, HTML provides keywords that instruct the browser how to render the document. A simple HTML document is given below:

<HTML> <!-- Start of HTML document -->

<BODY> <!-- Start of the main body -->

<H1>Hello World</H1> <!-- Basic text to be displayed -->

<P> <!-- Start of new paragraph -->

<SCRIPT type="text/javascript"> // Identify scripting language

document.writeln("<H1>Hello World</H1>"); // Write a line of text

</SCRIPT> <!-- End of scripting section -->

</P> <!-- End of paragraph section -->

</BODY> <!-- End of main body -->

</HTML> <!-- End of HTML section -->
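To illustrate how a program such as a browser or a crawler reads the tags and hyperlinks of a document, the following minimal sketch uses Python's standard `html.parser` module. The `OutlineParser` class, the `outline` function and the `SAMPLE` document (including its `example.org` link) are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects the opening tags and the hyperlink targets of a document."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case.
        self.tags.append(tag)
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outline(html_text):
    """Return (list of opening tags, list of hyperlink targets)."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.tags, parser.links


# Hypothetical document with a heading section, a body and one hyperlink.
SAMPLE = """<html>
<head><title>Hello</title></head>
<body>
<h1>Hello World</h1>
<p>See the <a href="http://example.org/next.html">next document</a>.</p>
</body>
</html>"""
```

Here `outline(SAMPLE)` returns the structural tags in document order together with the single link target, which is the information a browser needs to render the page and to follow the reference.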

1.2 Naming

The WWW uses a naming scheme to identify documents, known as Uniform Resource Identifiers or simply URIs (Berners-Lee et al., 1998). URIs come in two forms: the Uniform Resource Locator (URL), which identifies a document by including information on its location, and the Uniform Resource Name (URN), which acts as a true identifier by providing a location-independent reference to a document.
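The difference between the two forms can be seen by splitting a URI into its parts. The sketch below uses `urllib.parse.urlsplit` from the Python standard library; the `describe_uri` helper and the ISBN-style URN are illustrative assumptions, while the URL is taken from the reference list of this module.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Classify a URI as a URN or a URL and return its main components."""
    parts = urlsplit(uri)
    if parts.scheme == "urn":
        # A URN has the shape urn:<namespace>:<name> and carries no location.
        namespace, _, name = parts.path.partition(":")
        return {"kind": "URN", "namespace": namespace, "name": name}
    # A URL names the access scheme, the server and the path on that server.
    return {
        "kind": "URL",
        "scheme": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
    }


url_info = describe_uri("http://www.webology.org/2004/v1n1/a3.html")
urn_info = describe_uri("urn:isbn:0451450523")  # hypothetical example name
```

The URL result exposes where the document lives (host and path), whereas the URN result contains only a namespace and a name.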

2. Types of Information and Resources

An information source is anything that may provide information or knowledge about something. Different types of problems require different information sources, which may be categorised into primary, secondary and tertiary sources.

2.1 Primary sources

Primary sources are original information sources. They are from the time period involved and have not been passed through interpretation or evaluation. They are usually the first formal appearance of information in physical, print or electronic format. They present original thinking, share new information or report a discovery.

Examples include:

• Artefacts (e.g. coins, all from the time under study);

• Audio recordings (e.g. radio programs);

• Internet communications on email, listservs;

• Interviews (e.g., oral histories, telephone, e-mail);

• Journal articles published in peer-reviewed publications;

• Newspaper articles written at the time;

• Original documents (e.g. birth certificate, will, marriage license, trial transcript);

• Proceedings of meetings, conferences and symposia;

• Records of organizations, government agencies (e.g., annual report, treaty, constitution, government document);

• Survey research (e.g., market surveys, public opinion polls).

2.2 Secondary sources

Secondary sources of information are those which are either compiled from or refer to primary sources of information. They are interpretations and evaluations of primary sources. Secondary sources are not evidence, but rather commentary on and discussion of evidence.

Examples include:

• Bibliographies;

• Biographical works;

• Dictionaries, Encyclopaedias (also considered tertiary);

• Textbooks;

• Monographs;

• Web sites (also considered primary).

2.3 Tertiary sources

Tertiary sources consist of information that is a distillation and collection of primary and secondary sources. Examples include:

• Bibliographies of Bibliographies;

• Directories;

• Guides to Literature

3. User Interaction and Search

Searching for information on the Web can be a frustrating and disappointing experience because of the huge amount of information available there. Web contents differ from the traditional resources available in libraries and online databases: they are heterogeneous, networked and available in multiple media types such as text, hypertext, images, audio and video. Information creation is dynamic and not confined by physical boundaries. The user population of the Web is as heterogeneous as its resources; most users are novices, with a wide variety of subject backgrounds and different levels of computer and web literacy. Interaction with a search engine interface therefore depends largely on the user’s skill in framing a search query that bridges the gap between the keywords under which a piece of content has been indexed and the user’s actual requirement.

In the traditional library environment, a user approaches a reference librarian with an information need, and it is the librarian who analyses the user’s query and provides the relevant information. Librarians thus act as a bridge between the user and the information. In the Web environment, users have to perform the librarian’s task themselves: the search engine simply returns all the documents matching the keywords of the user’s query, and the user then has to find the relevant information in that collection.

4. Difference between classic IR and WBIR

Information retrieval is an important task that people perform to meet the requirements of daily life, and their needs and purposes vary. In the present web environment we are surrounded by various types of digital resources, and searching them has become a routine activity. Satisfying the information needs of users on the web, however, is a difficult and time-consuming task; finding relevant information is like searching for a needle in a haystack. Present web-based search tools have been inspired by classical information retrieval systems, but the nature of the web environment and the diversity of users’ behaviour bring several challenges for searching and retrieval.

In classic IR systems, both the resources and the users were more or less predictable and homogeneous. The digital contents of online and offline databases and Online Public Access Catalogues (OPACs) mainly consisted of data stored in a structured manner, which made the search and retrieval process easier and more predictable. The user group mainly comprised academics, researchers, subject experts and librarians, who were well aware of the keywords to use for finding a particular document. With the emergence of the Web, digital information has flooded in, in a very unstructured, uneven and heterogeneous manner, and Web based IR systems evolved to cope with this situation. These systems provide access to web based digital contents. They use programs to maintain and update a list of the documents added to the web, either at regular intervals or as needed. A web IR system then looks for the searched keywords in this list, or index file, in order to retrieve the original document from the web. The user interface of a web IR system is much more user friendly than that of a traditional IR system, since a growing share of its users are inexperienced. Web IR interface designers therefore have to pay more attention to the information seeking behaviour of such users than was necessary in classical IR systems, whose users were mainly experts in the domain under consideration.
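The index file mentioned above is commonly organised as an inverted index, which maps each keyword to the documents that contain it. The following is a minimal sketch of that idea, not a description of any particular search engine; the function names and the three `example.org` documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every keyword to the set of document URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the URLs of documents that contain every keyword of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection gathered from the web (URL -> page text).
DOCUMENTS = {
    "http://example.org/a": "Classical information retrieval systems",
    "http://example.org/b": "Web based information retrieval",
    "http://example.org/c": "Web directories classify web sites",
}
INDEX = build_index(DOCUMENTS)
```

A query is answered from the index alone; the original documents are only fetched afterwards, using the URLs that `lookup` returns.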

5. Search Engines

Search engines are computer programs that search for the keywords entered by a user and return a list of the documents in which they were found; the term is used especially for services that search the contents of the web. Some search engines look for things other than keywords, and they are not “engines” in the classical sense, just as a computer mouse is not a “mouse” in the everyday sense.

5.1 Types of Search Engines

Search engines can be mainly categorised into four types:

• Crawler-based search engines are useful when we have specific search keywords in mind; if the search topic is a general one, this type of search engine may return many irrelevant documents, e.g. AltaVista.

• Human-powered directories are good when the search topic is general: human-crafted directories guide the user, help to narrow the search and return refined results, e.g. DMOZ.

• Hybrid search engines use a combination of both crawler-based results and directory results, e.g. Google.

• Meta-search engines save time by gathering results from several search engines in a single interface. They are excellent for finding out whether anything is available on the web about a particular topic, e.g. Dogpile.
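The gathering step of a meta-search engine can be sketched as the merging of several ranked result lists into one. The scoring rule used here, reciprocal rank fusion, is swapped in purely as an illustration; the text does not say how Dogpile or any other real meta-search engine combines results. The engine names and result lists are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a URL's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked answers from two underlying search engines.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example"]
merged = merge_results([engine_a, engine_b])
```

A page returned by both engines (`b.example`) rises to the top of the merged list, and duplicates are shown only once.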

5.2 Features of Search Engine

Basic text search facilities such as Boolean search, proximity search, phrase search, truncation, field-specific search and limiting search are provided by almost all search engines.

• Boolean search can be expressed in three different ways: by combining keywords with AND, OR and NOT; by prefixing keywords with the ‘+’ or ‘-’ operators; or by selecting options such as ‘all of the words’, as in Google.

• Proximity search looks for two or more words that occur within a certain number of words of each other, e.g. Google supports ‘AROUND’.

• Field search restricts matching to a particular field, e.g. by putting ‘intitle’ or ‘allintitle’ before the search terms in Google.

• Phrase search looks for an exact phrase in a document, using double quotes.

• Limiting search restricts the results, e.g. by date or file type in Google.

The search interfaces of modern search engines enable users to use the above features without much effort. Many advanced search interfaces also provide enough help information on the search interface itself for users to perform a search.
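The search features listed above can be sketched as simple tests on the words of a document. This is a minimal illustration, not the way any particular search engine implements them. All function names are hypothetical, and `max_gap` is assumed to mean the number of words allowed between the two terms.

```python
import re


def tokens(text):
    """Split text into a list of lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def boolean_match(text, all_of=(), any_of=(), none_of=()):
    """AND over all_of, OR over any_of, NOT over none_of."""
    words = set(tokens(text))
    if any(term.lower() not in words for term in all_of):
        return False
    if any_of and not any(term.lower() in words for term in any_of):
        return False
    return not any(term.lower() in words for term in none_of)


def phrase_match(text, phrase):
    """True if the words of the phrase occur consecutively and in order."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def proximity_match(text, first, second, max_gap):
    """True if the two terms occur with at most max_gap words between them."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == first.lower()]
    pos_b = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(a - b) - 1 <= max_gap for a in pos_a for b in pos_b if a != b
    )


def truncation_match(text, stem):
    """True if any word starts with the given stem (right truncation)."""
    return any(word.startswith(stem.lower()) for word in tokens(text))


TEXT = "Web based information retrieval systems index documents on the web"
```

For example, `phrase_match(TEXT, "information retrieval")` succeeds, while the reversed phrase does not, and `proximity_match(TEXT, "information", "systems", 1)` succeeds because only one word separates the two terms.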

5.3 Advanced Search Engines and Applications

Present day search engines are like encyclopaedias operating on the internet, allowing users to search for and retrieve relevant digital contents. From the user’s perspective, the only requirement is to search for the desired content using an appropriate search engine, because different search engines are meant for different purposes and require different skills to use. Advanced search engines satisfy most user queries efficiently by providing advanced search options.

Some advanced search engines and their applications are as follows:

• General Search: If the user’s requirement is written information, a general search engine such as Google is an efficient choice. With its advanced search options, Google enables users to perform more specific search queries.

• Reverse Image Search: If the user’s requirement is to search for images, an advanced search engine such as TinEye is an efficient choice, because it reads the content of an image and makes that content searchable, whereas a general search engine can look only at file names or user-defined tags.

• Similar Image Search: Advanced search engines such as GazoPa look for similar features in an image, such as texture, colour or structure, but cannot recognise exact copies of a given image.

• Invisible Search: The CompletePlanet search engine searches for content in data stored in databases, which is almost invisible to general search engines because these mainly index resources from websites by following hyperlinks one after another. This hidden part of the web is known as the Deep Web.

• Semantic Search: Semantic search is meant for searching terms in a meaningful manner, i.e. taking into account their exact meaning, context and definition. Search engines based on semantic search algorithms, such as Yummly, are efficient in obtaining relevant results.
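The reason why database content stays invisible to general search engines can be shown with a small sketch of link-following. The page names and the `LINK_GRAPH` structure are hypothetical; the traversal is a plain breadth-first crawl and is not meant to describe any specific crawler.

```python
from collections import deque


def crawl(link_graph, seed):
    """Visit every page reachable from the seed by following hyperlinks."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: page -> pages it links to.
LINK_GRAPH = {
    "home": ["about", "catalogue"],
    "about": ["home"],
    "catalogue": ["search-form"],
    "search-form": [],
    # Produced only in answer to a database query; no page links to it.
    "record-42": ["home"],
}

visited = crawl(LINK_GRAPH, "home")
deep_web = sorted(set(LINK_GRAPH) - set(visited))
```

The crawl reaches the search form but never the record behind it, so `record-42` ends up in `deep_web`: it exists on the server but cannot be indexed by following links alone.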

6. Web Directories

A web directory, also known as a link directory, is a type of directory available on the WWW. Its purpose is to provide links to other web sites and to classify those links. Unlike a search engine, it does not display a list of websites based on the keywords searched by a user; rather, it maintains a list of web sites organised by class and subclass. The classification is based on the complete website, not on individual pages. A website is included in a particular category of a web directory either when the site owner submits it, after which it is reviewed for approval, or when the directory maintainer adds it. Some web directories are as follows:

• Yahoo! Directory is one of the best and oldest directories on the Web.

• The Open Directory, also known as DMOZ (Directory Mozilla), is a human-edited directory.
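The class and subclass organisation of a web directory can be sketched as a tree that the user browses from the top, instead of querying by keyword. The category names, the `example.org` sites and the `browse` function below are hypothetical and are not taken from any real directory.

```python
# Hypothetical directory: each category holds whole sites and subcategories.
DIRECTORY = {
    "Reference": {
        "sites": ["http://example.org/encyclopaedia"],
        "subcategories": {
            "Libraries": {
                "sites": ["http://example.org/national-library"],
                "subcategories": {},
            },
        },
    },
    "Science": {"sites": [], "subcategories": {}},
}


def browse(directory, path):
    """Follow a category path and return (subcategory names, listed sites)."""
    node = {"sites": [], "subcategories": directory}
    for name in path:
        node = node["subcategories"][name]  # KeyError if the category is unknown
    return sorted(node["subcategories"]), node["sites"]
```

Browsing with an empty path lists the top-level classes, and each further element of the path narrows the listing to one subclass, e.g. `browse(DIRECTORY, ["Reference", "Libraries"])`.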

7. Ontology

The concept of ontology originated more than two thousand years ago in philosophy, more specifically in Aristotle’s theory of categories. Its original purpose was to provide a categorization of all existing things in the world. Ontologies have lately been adopted in several other fields, such as Library and Information Science (LIS), Artificial Intelligence (AI), and more recently Computer Science (CS), as the main means of describing how classes of objects are correlated, or of categorizing document resources. Many definitions of ontology have been proposed. According to Gruber, an ontology is “an explicit specification of a conceptualization”. Later, Studer et al. extended this and defined an ontology as “a formal, explicit specification of a shared conceptualisation”. Studer’s definition adds the ideas that the conceptualization is shared and that the relations among the concepts are formal. The explicit, formal representation of a shared conceptualization involves a perspective on a specific reality, and is constituted in the conceptual structure of a knowledge base. The ultimate objective of an ontology is to share the knowledge it represents. An ontology defines the terms of a given knowledge area and the formal relations between them. The main features of ontologies are:

i. Ontologies provide a shared understanding of domains;

ii. Ontologies are useful for representing domain knowledge and for facilitating its sharing between humans and automated agents;

iii. Ontology is useful for the organization and navigation of websites;

iv. Ontology is useful for improving the accuracy of Web searches. Web searches can exploit the generalization and/ or specialization of information.
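The use of generalization and specialization in searching can be sketched with a small class hierarchy. This is a minimal illustration rather than a full ontology language: the `SUBCLASS_OF` table reuses the source categories of Section 2 as hypothetical classes, and the function names are assumptions.

```python
# Hypothetical "is a kind of" relations: class -> its more general class.
SUBCLASS_OF = {
    "journal article": "primary source",
    "conference paper": "primary source",
    "textbook": "secondary source",
    "primary source": "information source",
    "secondary source": "information source",
}


def ancestors(term):
    """Generalization: walk upwards from a class to ever broader classes."""
    chain = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        chain.append(term)
    return chain


def descendants(term):
    """Specialization: collect every class that is narrower than the term."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        children = sorted(c for c, p in SUBCLASS_OF.items() if p == current)
        found.extend(children)
        frontier.extend(children)
    return sorted(found)


def expand_query(term):
    """Search for the term itself and for all of its specializations."""
    return [term] + descendants(term)
```

A search for "primary source" expanded in this way also matches documents indexed under "journal article" or "conference paper", which a plain keyword match would miss.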

7.1 Semantic search

An enormous amount of information is produced every day to cater to human needs. Users have many different information needs, and most of them depend on the web to fulfil those needs. The web is a big semi-structured database that provides a vast amount of information; the indexable web was estimated at more than 11.5 billion pages (Gulli and Signorini, 2005). The rapid growth of the web has made it an easy way to access information, but the rapid increase in information also confronts users with new challenges, and locating precise information has become a challenging task. Moreover, most search engines give priority to indexing a huge number of files or web pages so that a search can return the maximum number of hits, and the evaluation of this enormous number of links against several parameters is a strictly statistical procedure. Semantic sense and relevance measures on the search space are still grey areas of research and raise many open questions. Semantic search techniques are intended to address these issues.

Semantic search is a combined effort of semantic web and search engine technologies, designed to handle complex queries, automatic clustering and the management of large numbers of web documents. From the Web point of view, Semantic Web data is addressed by URLs and retrieved in the form of Semantic Web documents, so the Semantic Web is an extensive collection of static or dynamic semantic web documents (SWDs) distributed over the Web. Meta-information about each ontology also supports efficient searching and matching: it allows ontologies stored in an ontology registry to be retrieved, and it provides a compact representation for the efficient search and reuse of related ontologies. Swoogle and Watson are examples of semantic search engines.
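Semantic Web data is usually expressed as statements of the form subject, predicate, object, and searching it amounts to matching patterns against such statements. The sketch below assumes this triple model and uses plain Python tuples instead of a real RDF library; every URI, prefix and function name in it is hypothetical, and it does not describe how Swoogle or Watson work internally.

```python
# Each statement is a (subject, predicate, object) triple; all names are made up.
TRIPLES = [
    ("http://example.org/doc/1", "rdf:type", "ex:JournalArticle"),
    ("http://example.org/doc/1", "dc:creator", "ex:AuthorA"),
    ("http://example.org/doc/2", "rdf:type", "ex:Textbook"),
    ("http://example.org/doc/2", "dc:creator", "ex:AuthorA"),
    ("http://example.org/doc/3", "dc:creator", "ex:AuthorB"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


def documents_by(creator):
    """Find every document whose creator is the given resource."""
    return sorted(s for s, _, _ in match(TRIPLES, predicate="dc:creator", obj=creator))
```

Unlike a keyword search, the query `documents_by("ex:AuthorA")` matches on the meaning of the statement (who created what), not on the words that happen to appear in the document text.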

8. Summary

Web based information retrieval (WBIR) was introduced, and a distinction was made between classic IR and WBIR. Features of web browsers and search engines were described. Hyper Text Markup Language (HTML) and the use of various tags were discussed. Advanced search engines and their applications were discussed. Advanced concepts of semantic search in semantic web and search engine technology were also covered in this module.

9. References

Peiling Wang et al. (1998) An exploratory study of user searching of the World Wide Web: a holistic approach. Proceedings of the 61st Annual Meeting of the American Society for Information Science, October 25–29, Pittsburgh, PA, pp. 389–399.
Mansourian, Y. (2004). “Similarities and differences between Web search procedure and searching in the pre-web information retrieval systems”. Webology, 1(1), Article 3. Available at: http://www.webology.org/2004/v1n1/a3.html.
Croft, W. B., Metzler, D., & Strohman, T. (2010). Search engines: Information retrieval in practice (p. 88). Reading: Addison-Wesley.
A Comparison of Search Engines For Finding Resources by Yuanlei Zhang, April 28, 2004, http://www.yuanlei.com/studies/articles/is567-searchengine/page2.htm
http://www.guidingtech.com/16116/google-search-little-known-around-operator/ Accessed on Jun. 20, 2015.
http://www.webology.org/2010/v7n1/a76.html. Accessed on Jun. 20, 2015.
http://bcs.org/upload/pdf/ewic_tl06_s2paper3.pdf. Accessed on Jun. 22, 2015.
Aristotle’s Categories, 2007. http://plato.stanford.edu/entries/aristotle-categories/
Dini, Luca (2004). NLP technologies and the semantic web: risks, opportunities and challenges. Intelligenza Artificiale 1(1), pp. 67-71.
Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), pp. 199–220.
Studer, R., Benjamins, V. R. and Fensel, D. (1998). Knowledge engineering: principles and methods. http://www.das.ufsc.br/~gb/pg-ia/KnowledgeEngineering-PrinciplesAndMethods.pdf
Dutta, B., Chatterjee, U. and Madalli, Devika P. (2013). From Application Ontology to Core Ontology. In the Proceedings of International Conference on Knowledge Modelling and Knowledge Management (ICKM 2013), Bangalore, India. ISBN: 978-93-5137-765-8.
Semantic Web Made Easy. http://www.w3.org/RDF/Metalog/docs/sw-easy
Antoniou, Grigoris and Harmelen, Frank van. A semantic web primer. London: MIT Press, 2004.
“Web directory”. Dictionary. Accessed on Jul. 30, 2015.
Wendy Boswell. “What is a Web Directory”. About.com. http://websearch.about.com/od/enginesanddirectories/a/subdirectory.htm. Accessed on Jul. 30, 2015.
Basu, A. and Paul, M. (2015). A Case study on Semantic Web Search Engines. In: Proceedings of Libraries in Next Era (LiNE), India. ISBN- 978-81-930849-0-8.
Gulli, A., & Signorini, A. (2005, May). The indexable web is more than 11.5 billion pages. In Special interest tracks and posters of the 14th international conference on World Wide Web (pp. 902-903). ACM.
W. Roush, “Search beyond Google,” Technology Review, 2004.