12 Error correction and other jobs of Data Link Layer

Prof. Bhushan Trivedi


 

Introduction

 

CRC, which we discussed in the previous module, is an excellent method to detect an error, but for error correction we need some other method. We have already seen that when the sender sends data which has a high probability of corruption every time, CRC-like solutions are not recommended. When CRC indicates that there is an error, the frame is to be retransmitted, and again we are reliant on the network to deliver the valid frame. On the contrary, it is always better if we can correct the errors in the frame and avoid retransmission, if not in all cases then in most of them. Obviously, any scheme that corrects an error must provide more help to the designer (at least the position where the error has occurred) and thus needs a higher number of redundant bits. The simplest such methods correct a single-bit error. Richard Hamming of Bell Labs devised a method in 1950 for correcting a single-bit error, which we explore in the following. The method is popularly known as Hamming code, in honor of its inventor.

 

The Hamming distance and error correction

 

The Hamming distance between any two binary numbers is the number of bits one needs to change to convert one number into the other. The Hamming distance is a critical value as it indicates the size of error which can convert one valid value into another. Let us try to understand. Suppose we have an AC and we want to send commands from the remote to switch the AC on and off. What should the command values be if a two-bit error is likely?

 

Suppose we take 110 as switch off and 011 as switch on. Will this work? It won't, as a two-bit error converts 110 into 011 (flipping the first and last bits) and vice versa. We need a better code. What about 1100 and 0111? We need to change at least 3 bits to convert one valid code into the other, and thus this is better. For example, if the last two bits are corrupted, 1100 becomes 1111; the AC does not understand it, so the user presses the switch-off button again. Hopefully, it will reach the other end intact and the problem is solved. However, is it the best coding scheme? Let us take another example: 111000 and 000111 as the two codes for the AC. Even when two bits are changed, for example the last two bits of 111000, so that it becomes 111011, we can see that it is very far from 000111 and can only have been 111000. That means despite the two-bit error, we can always judge what number was actually sent and convert the incoming number back to a valid value, without any retransmission.

 

What does the above discussion have to do with Hamming distance? The first code has Hamming distance exactly 2, which means changing two bits may convert a valid code into another valid code. In the second case, a two-bit error always results in an invalid code, so we can be sure that an error has occurred; but the codes are such that we cannot determine what the original code was, as the Hamming distance between valid values was just 3. When we increased the Hamming distance to 6 (5 would be enough), any two-bit error converts a valid code into a code which is still nearer to the original than to any other valid code, and we can always convert the erroneous code back to the original.

 

Thus, a general rule is: if the Hamming distance between any two codes is greater than d, and the maximum error is of d bits, we can always detect that error. If the Hamming distance is greater than 2d, we can even correct it. In the above case, d was 2. So when we had codes with Hamming distance 3, we could successfully catch the error, and when we had Hamming distance > 4, we could also correct it.
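The rule is easy to verify with a few lines of code. The following Python sketch (the language is our choice here, not part of the module) computes the Hamming distance of the AC codes above and the resulting detection and correction limits.

def hamming_distance(a, b):
    # Number of bit positions in which two equal-length codes differ.
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

d = hamming_distance("111000", "000111")   # the AC example: d = 6
max_detectable = d - 1                     # detect d-bit errors when distance > d
max_correctable = (d - 1) // 2             # correct d-bit errors when distance > 2d
print(d, max_detectable, max_correctable)  # prints: 6 5 2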

 

Number of redundant bits needed for one-bit error correction

 

The above method is simple enough, as we have the design of the codes in our hands and we can design them with whatever Hamming distance we want, based on the amount of error expected. In the domain of data communication, we are not that lucky. When we are working at the data link layer and we get a packet of m bits, all m-bit values are equally likely, and we have no way of designing such a code just from the data that we receive. The idea is to provide enough redundant bits to get a Hamming distance of our liking. It is very similar to our discussion of CRC with 4 redundant bits. The idea is, we must add enough redundant bits to increase the Hamming distance to our need. If we need to catch a one-bit error, we must have each valid code at least Hamming distance 2 from all other valid codes. That means, if any one of the bits is changed, the resultant code must be invalid.

 

The data that the data link layer receives from the network layer is of length m; what we need is r redundant bits to make it m + r, such that when all valid codes of length m are converted to codes of length m + r, they are at least Hamming distance 2 away from each other. Thus, if any one of the m + r bits is inverted, the resultant value is invalid.

 

Let us try to guess the minimum value of r, i.e. the minimum number of redundant bits, for a given value of m. To get all values which are a distance of one bit away from a number, we can invert one bit after another. That means if the value is 10101010, the values at distance 1 from it are 00101010 (inverting the first bit), 11101010 (inverting the second bit) and so on, up to 10101011 (inverting the last bit). All these values must be invalid. How many such invalid values are possible for a single number? 8 in the above case, and m + r in the general case. For an (m + r)-bit value, we have a fixed m-bit data value and a total of 2^r values for the r redundant bits, out of which only one value is correct and the rest are invalid. That means for one valid value, we have 2^r - 1 invalid values available. We need m + r invalid values and we have 2^r - 1 of them, so we need m + r <= 2^r - 1. If we have a byte, m is 8, and when we are only assuming an ASCII value, m is 7 (ASCII values for English characters are 7 bits). If we take r as 3, m + r = 7 + 3 = 10 and 2^3 - 1 = 8 - 1 = 7, so m + r <= 2^r - 1 is not satisfied. When we take r as 4, however, m + 4 comes out to be 11 and 2^r - 1 becomes 16 - 1 = 15. Thus r = 4 is the minimum number of redundant bits that one needs for one-bit error correction. Interestingly, even for a byte value (8 bits), m + r becomes 12, which is still less than 15, and thus even for 8-bit data, 4 redundant bits are enough.
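The inequality m + r <= 2^r - 1 can be turned into a small search. The Python sketch below tries increasing values of r until the inequality holds; it confirms that both 7-bit and 8-bit data need 4 redundant bits.

def min_redundant_bits(m):
    # Smallest r satisfying m + r <= 2**r - 1.
    r = 1
    while m + r > 2**r - 1:
        r += 1
    return r

print(min_redundant_bits(7))   # 4, for 7-bit ASCII data
print(min_redundant_bits(8))   # 4, a full byte still fits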

 

Hamming also provided a practical method to correct a one-bit error. In the next section, we describe it in detail.

 

Hamming code calculation

 

The actual solution to correct a one-bit error was provided by Hamming, using exactly four redundant bits for 7-bit data. As mentioned before, the same method can also be used for 8-bit data. It works like this. We have 7 data bits and 4 redundant bits, which make an 11-bit final value that we call the Hamming code. The following process describes how one generates that 11-bit code from the actual 7 data bits.

 

As per Hamming's solution, the 11-bit code has four places whose positions are powers of 2: position 1 (2^0), position 2 (2^1), position 4 (2^2) and position 8 (2^3). At all these places we keep the redundant bits; at all other places, the original data bits are kept. Figure 14.1 indicates the places for redundant and data bits. We take the data bits as 1001000, the ASCII value of H, for showing how the Hamming code generation and error correction process takes place.

 

1. The first thing that we do is insert the data bits at their respective positions. Figure 14.2 indicates that.

 

2. We represent each data bit position as a summation of powers of 2. Figure 14.3 illustrates how to do that.

 

3. Finally, we calculate each redundant bit as a parity bit over all positions whose summation includes its value. For example, R1 is the parity bit for all positions whose summation contains 1, i.e. 3 (which is 2 + 1), 5 (which is 4 + 1), etc. R2 is the parity bit for all positions whose summation contains 2, for example 3 (2 + 1), 6 (4 + 2), etc. You can see that a single data bit may contribute to the calculation of multiple redundant bits; for example, the data bit at position 3 (M1) contributes to both R1 and R2. Figure 14.4 shows that process.

 

4. When the receiver receives the data, it recalculates the redundant bits using the same algorithm; if they are the same, the data is accepted as valid.

 

Figure 14.5 Recalculating the redundant bit values when an error has occurred in the 7th data bit, situated at the 11th position.

The receiver's calculations are shown in figure 14.5. When those values are recalculated, it is found that redundant bits R1, R2 and R4 do not match. Adding their positions, 1 + 2 + 8 yields 11, which is the position of the error. Now we invert that bit and the problem is solved! The value 11, which points to the error, is known as the error syndrome.
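The whole procedure fits in a short program. The following Python sketch encodes the 7 data bits of H, deliberately corrupts position 11, and recovers it from the syndrome. The exact ordering of the data bits is an assumption of this sketch, since the module defines the layout through figures 14.1 to 14.5.

def hamming_encode(data_bits):
    # data_bits: 7 bits; positions 1..11 with parity bits at 1, 2, 4 and 8.
    code = [0] * 12                      # index 0 unused; positions 1..11
    data_pos = [i for i in range(1, 12) if i not in (1, 2, 4, 8)]
    for pos, bit in zip(data_pos, data_bits):
        code[pos] = bit
    for p in (1, 2, 4, 8):               # parity p covers positions whose sum contains p
        for i in range(1, 12):
            if i != p and (i & p):
                code[p] ^= code[i]
    return code[1:]

def syndrome(code11):
    # Recheck each parity over all covered positions; failing checks add up.
    code = [0] + list(code11)
    s = 0
    for p in (1, 2, 4, 8):
        parity = 0
        for i in range(1, 12):
            if i & p:
                parity ^= code[i]
        if parity:
            s += p
    return s                             # 0 means no error detected

word = hamming_encode([1, 0, 0, 1, 0, 0, 0])   # 'H' from the text
word[10] ^= 1                                  # corrupt position 11
pos = syndrome(word)                           # -> 11, the error syndrome
word[pos - 1] ^= 1                             # invert that bit: corrected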

 

 

Hamming code for handling burst error

 

We have seen that data traffic is more likely to encounter burst errors, and thus we need a solution to handle them. With a simple trick, the very same Hamming code can be used to catch and correct burst errors. Closely observe figure 14.6. Data bytes are converted to Hamming code, so every 7-bit entity becomes 11 bits, and the codewords are placed one below another. We accumulate 8 such rows, pick up the first column (of length 8 bits) and send it across; we send 11 such columns in total. A burst error might corrupt a major portion of one of these transmitted units. The receiver rearranges the received units in columnar fashion and thus regenerates the same 11-bit entities which the sender generated. As we have a corrupted column, each affected row now has at most a one-bit error. The Hamming code corrects that error and the problem is solved! Only if the burst error is longer than one transmitted unit, or occurs in multiple units, does the Hamming code fail to correct it. Many researchers have worked on this and many additional solutions have been proposed. We will look at some real-world solutions in the following.
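The trick is plain interleaving, and it is easy to see it work in code. The sketch below, with dummy 11-bit codewords standing in for real Hamming codewords, transmits columns instead of rows; a burst that destroys one whole transmitted unit leaves only one bad bit per codeword.

def interleave(rows):
    # Read the block column by column; each column becomes one transmitted unit.
    return [[row[c] for row in rows] for c in range(len(rows[0]))]

def deinterleave(cols):
    return [[col[r] for col in cols] for r in range(len(cols[0]))]

codewords = [[(r + c) % 2 for c in range(11)] for r in range(8)]  # dummy data
columns = interleave(codewords)            # 11 units of 8 bits each
columns[4] = [b ^ 1 for b in columns[4]]   # a burst wipes out one whole unit
received = deinterleave(columns)
# Every codeword now differs from the original in exactly one position
# (the fifth bit), which a single-error-correcting Hamming code can repair.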

 

Convolution code

 

Convolution code is used in 802.11 as well as in GSM phones and satellite communication. Speech recognition programs and text-to-speech converters also use it. It is designed to withstand heavy errors in transmission and can correct despite the occurrence of many errors. The code has a few other names. NASA built this code and used it in the Voyager space mission; once that mission was over, NASA opened it for others to use freely. That is why this code is also known as the NASA code or the Voyager code.

 

The code is not a block code; that means it does not work on blocks of data but continuously on a string of bits. A continuous bit stream is input to this code and a continuous bit stream is output from it. In the case of Hamming code, the data bits are sent as they are; they do not change their values at the time of conversion. Codes like Hamming code are therefore known as systematic codes. Convolution code does not do so, and that is why it is not systematic: the data bits also change during the process of conversion.

 

Convolution code generates an internal state based on the previous bits, for example the previous 32 bits. That internal state continuously changes as input bits come in and enter it; older bits go out at the same rate to keep the size of the internal state the same.

 

The convolution code also does something different from the earlier codes, known as soft decision encoding. While discussing digital signaling, we saw that when we expect +5V or -5V and we receive +4V, we change it to +5V and proceed further; the idea is to assume the nearest valid value to be correct. Such codes do not differentiate between +4V and +5V, which is known as hard decision encoding. The convolution code differentiates between +5V and +4V by attaching some uncertainty to the second observation. Thus the receiver may even decide that +4V was meant to be -5V in cases where the context suggests so. The context is described by the internal state.
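A small sketch makes the distinction concrete. The +5V/-5V signaling and the 0.8 confidence scale below are illustrative assumptions, not part of any standard.

def hard_decision(volts):
    return 1 if volts >= 0 else 0            # +4V is silently treated as +5V

def soft_decision(volts):
    bit = 1 if volts >= 0 else 0
    confidence = min(abs(volts) / 5.0, 1.0)  # +4V -> (1, 0.8): some doubt remains
    return bit, confidence

print(hard_decision(4.0))    # 1
print(soft_decision(4.0))    # (1, 0.8); a decoder may later overrule this bit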

 

Another property of convolution code is that it is possible to use other codes with it to improve the error handling process. The next code that we will study, the Reed-Solomon code (RS code), is frequently used in conjunction with the convolution code.

Here is a quick rundown of how convolution code works.

 

1. It takes input in terms of bits, one after another.

2. The incoming bits, along with earlier bits, generate the internal state of some fixed number of bits, for example 32 bits. Thus, at any given point of time, the internal state is decided by the values of the previous 32 bits. This number differs across implementations. The length of the internal state is known as the constraint length.

3. When a bit is input, it is added at one end of the state and the bit at the other end is removed; the state thus acts like a queue.

4. The output bits are generated based on both the input bit and the current value of the state.

5. Every input bit generates one or more output bits. The code used in 802.11 produces two output bits for every one bit of input.

6. The processing of the bits depends on the state and the connections between the state bits and the outputs. The output bits are not necessarily equal to the input bit, and thus the code is not systematic.

7. Every bit is input and processed in a linear fashion; that means it is a linear code.

8. Whenever the receiver receives a value, it processes the data in reverse order and tries to guess the actual input, considering the uncertainty attached to the observations and the possibility of specific inputs producing the specific output received. Using all this information makes it a pretty strong error correction mechanism; in fact, convolution code is a very good scheme for handling single-bit errors.

 

To understand the processing of convolution code, let us take the example of the NASA code itself. Figure 14.7 depicts a typical case where a bit with value 1 has just joined the internal state. The NASA code has constraint length 7, which means there are 7 bits in the internal state, as depicted in the figure. Carefully observe that every state bit is connected with two, one or none of the output bits. There are two output bits, each generated by XORing a typical set of state bits. In this case, the input bit is the leftmost bit, which is 1, and the two output bits are 0 and 1 respectively. It is also possible that when an input bit is 1, both output bits are zero; that is the reason this code is not systematic. At this point of time, the internal state is 1111001, written from left to right. The internal state is placed in a register where the processing happens very fast.
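A minimal encoder of this kind is shown below in Python. The tap patterns 0o171 and 0o133 are the generator polynomials commonly quoted for the NASA rate-1/2, constraint-length-7 code; take them as an assumption of this sketch rather than a definition from the module.

G1, G2 = 0o171, 0o133         # which of the 7 state bits feed each output XOR

def convolve(bits):
    state = 0                 # the 7-bit internal state (a shift register)
    out = []
    for b in bits:
        state = ((state << 1) | b) & 0b1111111      # newest bit in, oldest out
        out.append(bin(state & G1).count("1") % 2)  # first output bit
        out.append(bin(state & G2).count("1") % 2)  # second output bit
    return out

print(convolve([1, 0, 1, 1]))  # two output bits per input bit, not systematic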

 

Processing in Convolution code

 

 

Now let us try to understand how the receiver processes it. The receiver looks at the output pattern and tries to predict the input pattern which is most likely to have generated it. The idea is to make the output pattern indicate the actual input and not the received input¹, and thus correct an error. Interestingly, it is possible to have multiple guesses based on the output bits, and the convolution coding picks the guess with the minimum of errors².

 

This style of algorithm, where the input must be guessed from the observed output, is quite good in rather different domains as well. For example, in speech-to-text conversion, a spoken syllable is to be mapped to a bunch of characters; the algorithm generates multiple guesses and chooses the best one based on the context (that is why we need the state) and the input syllable.

 

The algorithm, being linear, has a problem with long constraint lengths. A longer constraint length gives the algorithm a better idea of the context and thus always produces a better guess. The problem is that the number of possible guesses increases exponentially with the constraint length, and it would be difficult for any system to process those guesses in real time. To overcome this limitation, generally two things are done.

1. Some other algorithm is used in conjunction with this algorithm, so that this algorithm gets a small domain to work with.

2. Instead of convolution, some other coding scheme is used which can process bits in such a fashion that the number of guesses produced depends linearly on the constraint length, unlike convolution coding. Thus a constraint length of arbitrary size can be used without compromising the speed significantly.

¹ This is more humanlike processing. When we hear speech and a word does not clearly match an actual word, especially when the speaker belongs to some other community and his pronunciation does not match ours, we look at the context to determine the actual word, which may be quite far from what we actually heard.

² Again, this is humanlike processing. For example, when we hear the word 'bank', we relate it to a financial institution if the context is so, and to a river bank if the context indicates that. In a context where finance is being discussed, the financial-institution interpretation produces fewer errors than the river-bank one.

 

Reed-Solomon Code

 

One good way to deal with burst errors is to think big and choose an entire byte or word as the unit (called a symbol), rather than a single bit. The idea is to consider whether the symbol itself is in error or not; even when only one bit of the symbol is corrupted, the whole symbol is considered corrupted. When a multi-bit error is confined to one symbol (which is the usual case), only one out of many symbols is corrupted, and we can deploy error handling based on a single symbol error (similar to a single bit error). The idea is to find the incorrect (corrupted) symbol using some form of redundant symbols and correct it, which is quite similar to Hamming code. If the RS code is designed with r redundant symbols, it can correct errors in up to r/2 symbols. It can also recover up to r symbols that are removed altogether (erased) from the communication, provided their positions are known.

 

Changing the granularity is quite a clever move, considering modern communication is more likely to experience a burst error than a single bit error. As the data communication speed increases, the bit time decreases, and thus noise of the same duration corrupts a larger number of bits. For example, suppose a noise burst corrupts one bit at 10 Mbps Ethernet speed. The same noise corrupts 10 bits at 100 Mbps and 100 bits at 1 Gbps.

 

Like CRC, the RS code picks up symbols and represents them as coefficients of a polynomial p(x) over a finite field. There is a generator polynomial g(x) which is multiplied with p(x), and the result gives the symbols that the sender sends. The receiver divides the received polynomial by g(x) to obtain the original polynomial and thus the message. An RS code for n-symbol transmission is represented as [n, n - r]. For example, the popular code [255, 223] means 255 total symbols, out of which 223 are message symbols and 255 - 223 = 32 are redundant symbols; each symbol is 8 bits long here. When we have r redundant symbols, r/2 symbol errors can be corrected, so in the above case an error of up to 16 symbols can be corrected. 16 symbols mean 16 * 8 = 128 bits of burst error, which is quite good for systems like Wi-Fi.
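The arithmetic for any [n, n - r] code is straightforward, as this short Python calculation for the [255, 223] code shows.

n, k, symbol_bits = 255, 223, 8
r = n - k                           # 32 redundant symbols
correctable = r // 2                # 16 symbol errors correctable
burst_bits = correctable * symbol_bits
print(r, correctable, burst_bits)   # 32 16 128: a 128-bit burst is repairable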

 

The very idea of increasing the granularity to a byte, which makes RS codes so strong at correcting burst errors, makes them less suited for detecting and correcting isolated single bit errors. This is the reason convolution code is normally used with RS code: RS can correct burst errors while the convolution coding takes care of single bit errors. Together they work very well.

 

Unlike convolution coding, RS codes are designed around hard decision encoding. Later codes, especially Turbo and LDPC, which are popular today, use soft decision encoding.

 

Turbo and LDPC (Low-Density Parity Check) codes

 

LDPC codes are used in 10 Gb Ethernet and the latest versions of Wi-Fi. Turbo codes are used in 3G/4G mobile phones, including LTE. The two compete with each other, as both provide performance quite close to the theoretical limit: while using these codes, it is possible to send data nearly as fast as the Shannon limit suggests for a noisy channel. Earlier codes failed to achieve speeds of that level.

 

The idea of the LDPC code is simple, and quite similar to our Hamming code, where a parity bit is calculated from some specific set of bits. In LDPC too, a typical set of bits is picked up for each parity bit, with the mapping represented by a matrix. The matrix contains multiple rows, and the 1s in each row indicate the positions of the data bits used to calculate that parity bit. Usually, out of all the data bits, only a few are used to calculate each parity bit, so the matrix contains a far larger number of zeros than ones. That is why it is known as a low-density matrix, and the code is named after it.
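A toy parity-check matrix shows the idea. The matrix below is made up for illustration and is far smaller and denser than any real LDPC matrix.

H = [
    [1, 1, 0, 1, 0, 0],   # check 1 covers bits 1, 2 and 4
    [0, 1, 1, 0, 1, 0],   # check 2 covers bits 2, 3 and 5
    [1, 0, 0, 0, 1, 1],   # check 3 covers bits 1, 5 and 6
]

def checks(codeword):
    # Each entry is the mod-2 sum of the bits selected by one row of H.
    return [sum(h * c for h, c in zip(row, codeword)) % 2 for row in H]

valid = [1, 0, 1, 1, 1, 0]
print(checks(valid))       # [0, 0, 0]: all parity checks pass

corrupt = valid[:]
corrupt[4] ^= 1            # flip bit 5
print(checks(corrupt))     # [0, 1, 1]: checks 2 and 3 fail, and bit 5 is
                           # the only bit they share, pointing at the culprit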

 

But the similarity ends here. The LDPC decoder uses multiple parity bits to estimate the values of the input bits. Once the first estimate is obtained, it recalculates the values of the parity bits and normalizes them, in a way correcting the data. This process goes on for a while, correcting as many bits as it can.

 

Turbo codes are also error correcting codes. A turbo code is a mixture of block encoding and convolution encoding, using a three-stage encoding process: two encoders of the convolution type with an interleaver in the middle. The process is based on soft decision encoding, and thus probabilistic values are passed from one stage to another. The receiver decodes the incoming data using a similar three-stage process. Like LDPC codes, the decoding process iterates until a specific condition is met; the probability of each bit (called extrinsic information) is calculated and updated at every iteration. Eventually, when the process stops, it gives the corrected output.

 

These error correction codes are sometimes called FEC, or Forward Error Correction, codes. We have looked at both detection and correction and have also indicated that choosing between them is not easy. Let us try to see how one can decide whether to use detection or correction.

 

Detection Vs Correction

 

In all real-world scenarios, the question "which is better" is always best answered by "it depends", and error correction versus detection is no exception. However, we have already hinted that when the amount of error is high, correction is better, and when the amount of error is low, detection is better. Here is a more detailed list of differences.

 

1. Detection only detects errors; the sender additionally needs to retransmit the frame. That makes detection a much slower process as compared to correction.

 

2. Wired networks which use better UTP cables or FO cables can work with error detection, while wireless networks, which use error-prone frequency bands, are better off with error correction.

 

3. Error correction needs more information and thus needs more redundant bits. For example, 1 parity bit is good enough to detect a one-bit error in a string of any length, while a 7-bit string needs 4 redundant bits to correct a one-bit error. Larger strings demand a larger number of bits.

 

4. Error detection is usually more robust. However, the latest error correcting codes are on the verge of making error correction as robust as detection.

 

Flow Control

 

One critical job of the data link layer is to make sure that a slow receiver is not swamped by a faster sender. The data link layer connects two entities directly, so either one can send while the other receives; in fact, it is generally the case that both of them are sending as well as receiving at the same point in time. To understand the problem of flow control, let us take two different cases: a movie being downloaded from the web to a mobile phone, and students submitting their assignments to a central server at the last moment. In case 1, the web server sends the data to the mobile phone via some routers, and the final router which sends the video to the mobile phone has one critical problem: the capacity of the mobile phone (however smart) is much less than that of a normal router, and it cannot process data as fast as the router can. Thus the router must deploy some kind of flow control and reduce its speed to make sure the mobile is not inundated by the incoming data. In the second case, the central server may be much more powerful than the student computers, but the total traffic all the students generate together may overwhelm it, and there must be a mechanism to ensure the central server is not drowned by this traffic. Both examples exhibit the same problem, arising from two different situations: the receiver receives more traffic than it can handle. In case number one, there is a clear mismatch between the sender's and the receiver's capacity; the sender is much faster than the receiver. In the second case, no single sender is fast, but the receiver is overburdened by the collective data sent by multiple senders. In either case, we need flow control, i.e. slowing down the sender to help the receiver recover from the problem.

 

Flow control is needed both at the data link layer, where two adjacent machines are exchanging traffic, and at the transport layer, where a client is communicating with a server remotely. Ethernet, for example, deploys a technique called the pause frame. Whenever the receiver encounters a faster sender, it starts buffering frames and also sends a pause frame indicating that the sender should stop sending for some time. By that time, the receiver processes the buffer and is ready for the next burst of frames. In the case of wireless communication, generally the stop-and-wait method is deployed: whenever a frame is sent, the receiver sends back an acknowledgement, and only then does the sender send the next frame.
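A bare-bones sketch of stop-and-wait is given below. The channel object is a stand-in invented for this illustration; a real implementation would add timeouts and retransmission.

class DemoChannel:
    def send(self, frame):
        print("sent:", frame)
    def wait_for_ack(self):
        print("ack received")    # a real link would time out and resend here

def stop_and_wait(frames, channel):
    for frame in frames:
        channel.send(frame)      # send exactly one frame...
        channel.wait_for_ack()   # ...and block until it is acknowledged

stop_and_wait(["F1", "F2", "F3"], DemoChannel())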

 

We will study how the transport layer implements flow control when we look at TCP in module 26.

 

Interfacing with network and physical layer

 

One more important job of the data link layer is to receive a packet from the network layer, generate a frame from it, and send it over to the physical layer. The receiver's physical layer passes the frame to its data link layer, which extracts the packet from it and passes it to its network layer. Thus the data link layer has to provide an interface with both the network and the physical layer.

 

Addressing

 

Addressing is another critical job that the data link layer has to do. Every machine in the network, be it wired or wireless, has a unique data link layer address, normally known as a MAC address. The address, interestingly, is attached to the card the machine uses for connecting to the network and not to the machine itself; that means the address is not the identity of the machine but of its network connection. The address may be hard coded or software defined. Ethernet cards, for example, have a MAC address permanently assigned by the manufacturer which cannot be changed. On the contrary, most wireless cards' MAC addresses can be assigned on the fly by the administrator. The MAC address is also called the physical address or link layer address. The idea of link layer addressing is to make sure every node in a given network has one address unique across that network.

 

Interestingly, other layers also provide their own addresses. For example, the network layer identifies every network card by an IP address unique across the entire world, and the transport layer identifies every process by a port number unique within a given machine.

 

Summary

 

The data link layer error correction process was the central theme of this module. We looked at the Hamming distance, which indicates the number of bits one needs to change to convert one correct value into another. We saw the Hamming code method to correct a one-bit error, and the convolution code method, which corrects errors by judging what the input could have been for the output bits received. The Reed-Solomon code considers a symbol as a unit and corrects errors of up to r/2 symbols, where r is the number of redundant symbols. The LDPC and turbo codes are even better than the RS code, as they deploy soft decision encoding and context-based decision making to be more accurate. Finally, we looked at flow control, interfacing with the network and physical layers, and addressing as other important duties of the data link layer.

