27 The Hypertext Transfer Protocol (HTTP)

Prof. Bhushan Trivedi

 

Introduction

 

We have already seen two application layer protocols, DNS and FTP. In this module, we will look at the most widely used application layer protocol, HTTP. Our browsers use this protocol to communicate with web servers and fetch web pages. We will see how the protocol works, how the client and server communicate with each other, and some of the unique features of HTTP.

 

The Hypertext Transfer Protocol

 

HTTP is one of the most used protocols of the Internet. Unlike FTP, which we saw in the previous module, HTTP does not remember the status of the client and thus is a stateless protocol. The reason for HTTP being stateless is simple: it wants to serve as many users as it can. If it were stateful like FTP, it would have to remember the state of every connected user, and that would severely hamper its ability to respond to a very large number of users. HTTP works like FTP in one respect: the client and server communicate using request and response commands, much like the FTP control connection. Unlike FTP, there is only one TCP connection in HTTP, carrying both the data and the control traffic. The HTTP client asks for web pages that the HTTP server holds, and the server responds with the page asked for. There are two types of pages: static pages, which are stored as files at the web server, and dynamic pages, which are constructed on the fly when the request arrives.

 

Uniform Resource Locator (URL)

 

A web page, and the objects that form part of it, are identified by a text-based identifier known as a URL. A URL combines two things into one: a web server name and a path within a virtual directory. Let us try to understand this with an example.

 

URL: www.oup.com/default.html

 

In the above URL, www.oup.com is the name of the web server. default.html is the name of a file that lies in the default directory under which the web server is installed, known as the virtual directory. The virtual directory is mapped to some physical directory, so the file is searched for in the real directory after that mapping is applied, and the file is then returned.
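The split described above can be sketched with Python's standard library. This is only an illustration of the two parts of a URL, not part of the protocol itself; note that urlsplit needs a scheme to recognize the host, so http:// is prefixed to the module's example.

```python
from urllib.parse import urlsplit

# Split the module's example URL into its two parts.
parts = urlsplit("http://www.oup.com/default.html")

server = parts.netloc   # the web server name
path = parts.path       # the file looked up inside the virtual directory

print(server, path)
```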

 

The HTTP client and server communicate using quite simple, text-based commands with some typical parameters. Such communication is open to being read by anybody, who can then understand what is going on. Most Internet protocols are designed in a similar fashion.

 

Some authors call such protocols request-response protocols. To provide more information, HTTP uses typical headers with specific values. The client query and the server response follow a typical structure, which is the theme of the next section.

 

HTTP Request and Responses

 

Let us learn the structure of HTTP requests and responses. Figure 29.1 describes the normal process. For every URL a browser requests, it asks the web server to collect the objects belonging to that web page, one by one, and deliver them to the client.

 

The process is carried out by the browser sending the HTTP request and the web server providing the suitable response. Figure 29.2 describes this process.

 

 

 

Response in HTTP

 

An HTTP request and response have a typical structure. Figure 29.3 depicts them. The request begins with a request line, followed by a few headers indicating the nature of the request. These two, the request line and the header-value pairs, constitute the header part. Once the header part is over, a blank line separates the header from the body. The body may or may not be present, depending on the type of command issued in the request line. The response begins with a status line, followed by header-value pairs that indicate the nature of the response, and a blank line separating the optional body from the header part. The request and status lines also follow a typical structure. Let us elaborate on that part.

 

Request and Status lines

 

A request line carries the request from the browser to the server. The status line carries the response from the server. The formats of the request and status lines are depicted in figure 29.4. The request line begins with a method name. The method name is followed by the URL, which consists of the web server name, a virtual directory name, and a filename. The final part is the HTTP version. The status line begins with the HTTP version number, followed by a status code, which indicates whether the request was responded to positively or not. The final part is the string representation of the status code.
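Because both lines are plain text with space-separated parts, their three-part structure can be pulled apart with simple string splitting. The two example lines below follow the layout just described; they are illustrative samples, not taken from a live exchange.

```python
# The request line: method, URL, HTTP version.
request_line = "GET http://www.oup.com/default.html HTTP/1.1"
method, url, version = request_line.split(" ")

# The status line: HTTP version, status code, and its string representation.
# The reason phrase may itself contain spaces, hence the maxsplit of 2.
status_line = "HTTP/1.1 200 OK"
resp_version, status_code, reason = status_line.split(" ", 2)

print(method, version, status_code, reason)
```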

 

Examples of request and response

 

Figure 29.5 showcases a typical request. The request line has the method GET, the URL is http://www.oup.co.in/category.php?cat_id=43, and the HTTP version is HTTP/1.1. The first part of the URL, http://www.oup.co.in, indicates the web server. The second part, category.php?cat_id=43, indicates the PHP file with an option called a query string. The final part indicates HTTP version 1.1.

 

 

The next line starts another part, containing a few headers and their values. For example, Host is a header with the value www.oup.co.in. The list of headers is followed by a blank line and an empty body. This is because the GET method asks for a web page to be downloaded; such a request does not need a body.

 

 

Closely observe figure 29.6. It describes the response from the server. The status line begins with the HTTP version of the server, which is 1.0. The client runs a later version, but this makes the client fall back to the older version. 200 is the status code, which indicates what the string at the end spells out: OK. The header values carry additional information from the server. For example, the server introduces itself as running Apache with a typical version, and states that the information being delivered expired on 19th Nov 1981 (a date in the past, used to stop the page from being cached).

 

Headers and Values

 

We skipped the discussion of headers in the previous section. Let us discuss each one used in the above example. Host indicates which host the request is meant for. User-Agent is the name of the browser sending the request. The Accept header indicates preferences: when it comes from a browser, it is the browser's preference, and when it comes from a server, it is the server's preference. For example, look at these three lines from the browser.

 

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

 

Accept-Language: en-us,en;q=0.5

 

Accept-Encoding: gzip,deflate

 

This tells the server that the browser can accept text in the form of html or xhtml+xml; otherwise, plain xml will do with a little less priority (a higher value of q indicates higher priority; when no q value is specified, it is taken as 1, the highest priority). The xml option has a q value of 0.9, compared to the earlier two options with the default q value of 1. The line also indicates that anything else (indicated by */*) is acceptable with a q value of 0.8. The Accept-Language header indicates that the language en-us (US English) is preferred, but any other English will do with priority 0.5. So if the web page is available in multiple languages, the browser says it prefers the US English version, will take any other English version otherwise, but accepts no page in any other language. The Accept-Encoding header indicates the encoding types allowed; the encoding type describes the format of the content of the page. Thus this line says: I am able to accept the page if the content is in gzip or deflate format, and not otherwise.
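The q-value ranking described above can be sketched as a small parser. This is a simplified illustration of how a server might order the browser's preferences, not a full implementation of the header grammar.

```python
def by_preference(header_value: str) -> list:
    """Parse an Accept-style header into (option, q) pairs, most preferred first."""
    options = []
    for part in header_value.split(","):
        fields = [f.strip() for f in part.split(";")]
        q = 1.0                        # no q parameter means the highest priority, 1
        for field in fields[1:]:
            if field.startswith("q="):
                q = float(field[2:])
        options.append((fields[0], q))
    # sorted() is stable, so options with equal q keep their written order
    return sorted(options, key=lambda opt: -opt[1])

ranked = by_preference("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
print(ranked)
```

Running this on the module's example places text/html and xhtml+xml first (implicit q of 1), then xml at 0.9, and */* last at 0.8.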

 

The Date header indicates the date, and Server indicates the server name. Any header beginning with X- is a user-defined header. Expires indicates that the page is valid only until the date mentioned. Cache-Control indicates whether the page should be cached. As this page is generated on the fly by a PHP file, which reads from the database and populates the names of the books, the output should not be cached, and the Cache-Control header indicates so. The Content-Type header indicates which format the content is in. Objects like audio files, images, and video files can come in hundreds of formats across operating systems and machines. While the browser may be ready to accept any one of several formats, the server must indicate which format is actually used, so the browser can open an appropriate application for that content. When a PHP file generates the data on the fly and the server sends it to the client, the server has no idea when the output will end, so it cannot supply the length of the output in advance. The server, however, can indicate the end of the content by closing the connection. So whenever the client sees the connection close, it understands that the page content is over.

 

That indicator is provided by the header Connection: close. For all types of dynamic pages, this header is a must.

 

There is one more header shown in the client request, indicating a cookie. We will study cookies later, so we will hold off discussion of this header until we introduce cookies.

 

Cookie: PHPSESSID=c3a1082780a2c6621126fa3bd8ad11f1

 

Other methods

 

GET is not the only method; HTTP uses a few others. One more method is called HEAD. The difference between GET and HEAD is that HEAD returns only the headers, while GET returns both the headers and the body. Many times we need to fill in forms available online. Once we fill in the form and press the submit button, the HTTP client sends a POST command. This command always has a body. PUT is another command, used to upload an object to a website. Similarly, DELETE is used to remove an object from a specified location on a website. One more method, known as CONNECT, is used to set up a tunnel through an intermediary, for example to carry HTTPS traffic through a proxy.
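The methods differ mainly in the first word of the request line and in whether a body follows the headers. The helper below is hypothetical, written only to show that difference; real browsers add many more headers.

```python
def build_request(method: str, path: str, host: str, body: str = "") -> str:
    """Assemble a minimal raw HTTP/1.1 request as a single text string."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:                                   # POST and PUT carry a body
        lines.append(f"Content-Length: {len(body.encode())}")
    # A blank line always separates the header part from the (possibly empty) body.
    return "\r\n".join(lines) + "\r\n\r\n" + body

head_req = build_request("HEAD", "/default.html", "www.oup.com")
post_req = build_request("POST", "/form.php", "www.oup.com", body="name=abc")
print(head_req)
print(post_req)
```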

 

The persistent connection

 

When HTTP was introduced, as HTTP 1.0, it was designed so that for fetching every object of a page, a new TCP connection was established and torn down once the object was transferred. The idea was not to worry about the object length: the sender did not need to mention the length, and the receiver simply kept receiving until the connection was closed. This was especially useful for dynamic pages, which are generated on the fly, where the sender has little idea of their length. However, this simple design introduces the big overhead of establishing and closing multiple TCP connections. In the second version, HTTP 1.1, this changed. A single TCP connection is allowed to fetch multiple objects one by one. If ever a page's length is unknown to the server, it sends back the header Connection: close, which indicates that the server is unaware of the length of the (dynamically constructed) page and will close the connection as soon as the end arrives. The client, upon receiving this header value, accepts data until the TCP connection is closed from the other end. On all other occasions, the same TCP connection continues to fetch objects one after another, and the client is spared the wasteful operation of closing and establishing connections afresh every time. Such a connection is known as a persistent connection. When TCP is saved from establishing multiple connections, a lot of time and network bandwidth is saved. The difference between persistent and non-persistent connections is explained by figures 29.7 and 29.8. Both figures describe the HTTP client fetching multiple objects from the server. In 29.7, a fresh connection is established before each fetch and terminated thereafter. In 29.8, the same connection is used for fetching all objects.
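A persistent connection can be demonstrated with Python's standard library. In this sketch a throwaway local server stands in for the remote web server (an assumption made so the example runs without network access), and one HTTP/1.1 connection is reused to fetch two objects in turn, as in figure 29.8.

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"            # enables keep-alive responses

    def do_GET(self):
        body = f"object at {self.path}".encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))  # known length: no need to close
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):            # keep the demo output quiet
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
bodies = []
for path in ("/index.html", "/logo.png"):    # two objects, one TCP connection
    conn.request("GET", path)
    response = conn.getresponse()
    bodies.append(response.read().decode())  # drain the body before reusing the connection
conn.close()
server.shutdown()
print(bodies)
```

Because the server supplies Content-Length, the client knows where each object ends and the same connection can carry the next request.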

 


 

 

Multiple TCP connections between the same pair of HTTP client and server are also possible: the client opens multiple TCP connections to fetch multiple objects in parallel. When the CPU has a multi-core architecture or there are multiple processors, such a design saves a lot of time. A related optimization, known as pipelining, lets the client send several requests over the same connection without waiting for each response to arrive; it is quite popular, and most vendors provide browsers with support for it.

 

HTTP, as we mentioned before, is designed to be stateless; that means the server does not remember the status of the client. The advantage of being stateless is apparent: web servers can manage many more clients as compared to FTP.

 

Proxies as intermediaries

 

We have chosen each application layer protocol for its unique set of features. One of the features of HTTP is an explicit provision for an intermediary between a client and a server, known as a proxy. A proxy relays HTTP commands from the client to the server. In this case, two TCP connections run: one from the client to the proxy and one from the proxy to the server. An administrator can set up a proxy server in the network in such a way that all HTTP clients are configured to connect to web servers via this proxy. Explicit configuration information, like the IP address and port number of the proxy server, is provided at the client machine. The client then sends every web page request to the proxy, and the proxy accesses the actual web server on the client's behalf.

 

Such a design serves two purposes primarily. The first is better administrative control: any web access policies of the organization (for example, no mail communication allowed during office hours) can be implemented at a single place, the proxy server, and individual machines do not need to be configured. The second reason is that caching becomes possible, reducing network load. For example, if A and B both access website X, the proxy stores the web page when A accesses it. When B does so a while later, the page is readily available from the proxy's cache and does not need to be fetched from the server all over again. In fact, proxies are also used to enable web access from multiple clients using a single or a handful of valid IP addresses. So if we have 200 users and only five valid IP addresses, proxies combined with NAT can extend web access to all users. Another advantage of such proxies is that they reduce the load on web servers, as many queries are resolved at the proxy level. Most ISPs use proxy servers for this purpose (it reduces their web access bill as well: if a client asks for a page another client downloaded earlier, the ISP can still count it in the user's volume of data and charge for it, but it does not need to download the page itself, so it pays nothing to its higher-level ISP). Not only do proxies reduce the web server load, they also improve users' response time, as a query answered locally is always faster than one answered by the authoritative web server.
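The caching behaviour just described can be sketched in a few lines. The URLs and page contents here are hypothetical stand-ins; the point is only that the second request for the same page never reaches the server.

```python
cache = {}        # the proxy's cache: URL -> stored page
server_hits = []  # records each time the real web server is contacted

def origin_fetch(url: str) -> str:
    """Stand-in for contacting the actual web server."""
    server_hits.append(url)
    return f"page at {url}"

def proxy_get(url: str) -> str:
    """Answer from the cache when possible; fetch and cache otherwise."""
    if url not in cache:
        cache[url] = origin_fetch(url)
    return cache[url]

page_for_a = proxy_get("http://X/index.html")  # A's request reaches the server
page_for_b = proxy_get("http://X/index.html")  # B's request is served from the cache
print(len(server_hits))
```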

 

One may also be curious about similar caching by browsers. All browsers remember web pages that the user has downloaded in the past, and a repeat request fetches the page from the browser's local cache directory. This caching is used for the same reasons we discussed above.

 

The question that may arise is: when all browsers provide caching, why does the proxy do something similar? The short answer is that a browser's cache is limited to the users of that browser on that machine, while the proxy's cache is extended to all users of the network, saving network-wide bandwidth. Then why do browsers cache at all? The reason is that there is no guarantee that a proxy is always present; to speed up later access to the same data, a cache is a great solution. A browser doing it surely improves the user's experience with repeated access to the same web page.

 

HTTP also provides control over the number of intermediaries one can have. There is a header known as Max-Forwards: <value>, where the value indicates the number of proxies allowed in the path. A browser can specify the number of allowed proxies using this header. In the typical case where a user provides Max-Forwards: 0, the HTTP client and server communicate directly and there is no proxy along the path. Our discussion is summarized in figure 29.9. It shows how, when a web page is accessed for the first time, the proxy delivers it to the browser but also caches it. The next time the same or some other browser on the network asks for that page, the proxy delivers it from the cache.

Session variables

 

HTTP is designed in a stateless fashion for a valid reason, but that does not sit well with some operations that demand knowledge about the client. Consider a case where a page is restricted and only available to an administrator or an authenticated user (one who has supplied his credentials successfully). Unfortunately, a stateless protocol cannot remember the status of the user and has no idea whether the user has really supplied his username and password successfully. You might start the website with a login page and only then divert the user to the restricted page, but what if the user supplies the URL of the restricted page directly, bypassing the login page? The solution is to use session variables. These variables are defined and used by the client and server together, but are stored in the server's memory for the duration of the session and removed once the session is over. The username and password are stored in session variables when the login page is successfully negotiated, and contain null values otherwise. The restricted page can be designed using the following logic.

 

if (username == "")                        // no username provided
    jump to login page
else if (username != "administrator")
    display "this page is restricted" message
else
    display the rest of the content of the page

 

 

Thus, if the session variable's value is null, the user has not completed the login procedure and should be redirected to it. Even if he has passed through it successfully but is not an administrator, tell him that he can only access the page if he is an administrator. The moral of the story is that the statelessness of a server is a tricky thing, and one may need to find a way around problems of this type.
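The logic above can be sketched against a server-side session store. The store and the session ids here are hypothetical, for illustration: in practice the entries are created at login and removed when the session ends.

```python
# Server-side session store: session id -> session variables.
sessions = {
    "s1": {"username": "administrator"},
    "s2": {"username": "guest"},
}

def restricted_page(session_id: str) -> str:
    """Mirror the restricted-page logic using the stored session variables."""
    username = sessions.get(session_id, {}).get("username", "")
    if username == "":                       # no username provided
        return "jump to login page"
    if username != "administrator":
        return "this page is restricted"
    return "rest of the content of the page"

print(restricted_page("s1"))
```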

 

Session variables are handy in many instances, for example when a user is using a shopping cart and providing billing-related information that only needs to last for the session. When the user moves from one page to another, such information is available through the session variables. Our discussion should also raise one more query in your mind: what if one needs to retain information across sessions, that is, between multiple HTTP sessions? For example, many websites such as Amazon and Flipkart provide automatic login and 'remember' their users. Another solution, called cookies, is used for this purpose. We will throw some light on cookies in the next section.

 

Cookies

 

Let us take an example to understand the use of cookies. Suppose an admin logs into a website from his own machine and expects the machine to remember his credentials for that website. When he presses a 'remember me on this website' kind of button, the server responds by storing all his credentials in a database and generating a unique key for that data, popularly known as a cookie. The information is passed back to the browser using a typical header, Set-Cookie, with a value carrying that index, the cookie.

 

Set-Cookie: Admin = ABCIDMCIEOMLD1839475jdko

 

The browser stores that information in its own cookies directory. Now, when the user wants to access the same page again, the browser sends a Cookie header as follows.

 

Cookie: Admin = ABCIDMCIEOMLD1839475jdko

 

 

The server looks in the database, gets the context information, and fills in the web page with the respective values before delivering it to the client. So when the client displays that information, the user is happy to see that all the information supplied earlier is available yet again, saving him from entering it repeatedly.
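The browser's side of this round trip can be sketched with Python's standard http.cookies module: store the value from the Set-Cookie header, then replay it in the Cookie header of the next request.

```python
from http.cookies import SimpleCookie

# The value the server sent in its Set-Cookie header (from the example above).
set_cookie_value = "Admin=ABCIDMCIEOMLD1839475jdko"

# Browser side: parse and store the cookie in a jar.
jar = SimpleCookie()
jar.load(set_cookie_value)

# On the next request, send every stored cookie back as name=value pairs.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print("Cookie:", cookie_header)
```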

 

Conditional Download

 

One good thing about HTTP is that it allows downloading a page only if it has been updated since the previous download. The page carries a header indicating the date on which it was last modified. The client, when making a request, can ask for the page to be downloaded only if it was updated after a specified date, the date on which the client's current copy of the page was downloaded. This process is known as conditional download. Look closely at an HTTP request and two possible responses in figure 29.11. The If-Modified-Since header informs the server of the date of the previous download of the page mcaregular.htm from the server www.glsict.org. If the page has not been modified, the server generates the first response, with response code 304. Otherwise, the second response, with the requested page, is provided.

GET http://www.glsict.org/mcaregular.htm HTTP/1.1
Host: www.glsict.org
If-Modified-Since: Thu, 29 Mar 2010 4:35:10 GMT

 

Conditional download: Two responses

Response-1

HTTP/1.1 304 Not Modified

Response-2

HTTP/1.1 200 OK
Date: Mon, 19 Apr 2010 10:50:00 GMT
Server: Apache
Last-Modified: 15 Apr 2010 3:34:05 GMT

 

 
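The server's decision between the two responses can be sketched as a date comparison. The function and its dates are illustrative (written out in the full RFC-style form the standard library parser expects), not the actual server code.

```python
from email.utils import parsedate_to_datetime

def respond(last_modified: str, if_modified_since=None) -> int:
    """Return 304 when the client's copy is still current, 200 otherwise."""
    if if_modified_since is None:            # unconditional request: always send the page
        return 200
    changed = (parsedate_to_datetime(last_modified)
               > parsedate_to_datetime(if_modified_since))
    return 200 if changed else 304

# Page last changed on 15 Apr 2010; a client whose copy is from 29 Mar
# must download it again, while a client with a 19 Apr copy gets 304.
print(respond("Thu, 15 Apr 2010 03:34:05 GMT", "Mon, 29 Mar 2010 04:35:10 GMT"))
```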

 

 

HTTP can also provide a dynamic page, constructed on the fly. Figure 29.12 showcases how such a page is constructed on the fly and delivered to the user.

 

Summary

 

In this module, we looked at HTTP, the most used stateless protocol of the Internet. HTTP is unique in many regards: it uses text-based request-response messages for its operation, with a typical structure for the request from the browser and the response from the web server. That structure uses headers and values to indicate various options or signals between the client and the server. HTTP uses proxies as intermediaries, and session variables and cookies to mitigate the consequences of its stateless design; it also provides dynamic page download and conditional download facilities.

 
