11 Manipulating Text in GNU/Linux

Mr. Hardik Joshi


   

Introduction

 

Most computer professionals use databases to store records. Before database systems were developed, data was stored in plain text files, and a single file could hold thousands of records. Today, search engines such as Google and Yahoo make it possible to search the World Wide Web. A search engine looks for the given keywords across web documents and returns the documents that match. The difference between an organization’s data and a search engine’s data is that organizations mostly maintain structured data, whereas a search engine works on unstructured documents or web pages. In both cases the files may be large and numerous. On a GNU/Linux system, filters such as grep and sed make it possible to search many large text files in very little time. This chapter discusses the grep utility and its variants. These utilities help in tasks such as finding a filename in a directory listing, finding the lines of a text document that contain a specific word, or finding a keyword in a program’s source code.

Regular Expression

 

Searching within a text document means looking for text that matches a pattern. A regular expression is a text string (pattern), made up of a sequence of characters, that is matched against the text. For example, “colou?r” is a regular expression that consists of letters and one special character; it matches both “color” and “colour”. A regular expression is made up of atoms and operators. An atom specifies what we are looking for in the text; an operator combines atoms to build more complex expressions. Not every regular expression needs an operator. Table 1 classifies atoms into various types.

Characters such as * (asterisk), ^ (caret), $ (dollar), ? (question mark) and . (dot) have special meanings in regular expressions on Linux and Unix. They can be used to search within a text document, for example to find a word that begins or ends with a particular pattern, or a line that begins or ends with a particular pattern. The use of each of them is explored in the subsequent sections. The combination of atoms and operators works like a Swiss Army knife for a programmer. Table 2 classifies expressions into various types.

 

Utilities available in Linux and Unix support most of the operators used in regular expressions. Regular expressions are supported by many programming languages such as Perl and PHP, and by Linux utilities such as awk, grep and sed. The following sections discuss the grep family of commands, preceded by a discussion of the use of metacharacters in regular expressions. It must be noted that not all operators are supported by the grep utility.
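As a minimal sketch of the “colou?r” example and of the remark that grep does not support every operator, the commands below run the same pattern with and without the -E (extended regular expression) option. The file name words.txt and its three lines are assumptions made up for this illustration.

```shell
# Hypothetical sample text; the file name words.txt is an assumption.
workdir=$(mktemp -d)
printf '%s\n' 'The color red' 'The colour blue' 'A colr typo' > "$workdir/words.txt"

# -E selects extended regular expressions, where ? means
# "zero or one occurrence of the preceding atom".
matches=$(grep -E 'colou?r' "$workdir/words.txt")
echo "$matches"

# Without -E, grep uses basic regular expressions, where ? is an ordinary
# character, so the same pattern matches none of these lines.
# "|| true" only hides the non-zero exit status that grep returns on no match.
basic_count=$(grep -c 'colou?r' "$workdir/words.txt" || true)
echo "$basic_count"

rm -r "$workdir"
```

The first command prints the lines containing “color” and “colour”; the second reports a count of zero because plain grep looks for a literal question mark.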

 

Pattern Searching with grep

 

Grep is one of the most extensively used utilities for searching patterns in files. The name grep comes from the ed editor command g/re/p, which stands for “globally search for a regular expression and print”. Grep is a family of utilities for pattern searching; it includes the commands grep, egrep (extended grep) and fgrep (fixed-string grep). Pattern searching has already been explored to some extent in chapter 3, where the search and replace features of the vim editor were discussed. This section discusses the grep command and related utilities in detail.

 

The grep command can search for patterns in files or in the output of other commands. It saves the effort of opening a file in an editor just to locate a pattern. Many tasks can be automated by combining grep with other commands through pipes. Being a filter, grep can be used on either the left or the right side of a pipe. For example, a user can easily find out whether a friend has logged in or not.

 

Grep is a pattern-searching utility: it simply writes to standard output the lines of a file that contain a match for the pattern. It must be noted that grep provides no option to process only part of a file, it cannot add, delete or modify lines within files, and users cannot restrict a search by criteria such as line numbers.

The general syntax of the grep command is: grep [options] pattern [file ...]

The grep command accepts multiple filenames as input. It treats its first argument as the pattern and the remaining arguments as filenames, so a pattern or regular expression that contains multiple words must be quoted. Grep remains silent and displays no output if no line of the input matches the pattern. Grep can also be combined with other commands through pipes, so that the output of one command becomes the input of grep. The following illustration shows grep used in combination with another command.

The above screen shows that the user demo is currently logged in; it also helps to identify the number of terminals being used by that user.
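On a live system the check described above would typically be who | grep demo. Because the output of who differs from machine to machine, the sketch below substitutes a fixed listing; the function sample_who and every user name, terminal and time in it are invented for illustration.

```shell
# On a real system: who | grep demo
# Hypothetical stand-in for the output of who (all values are invented).
sample_who() {
    printf '%s\n' \
        'root     tty1         2024-01-10 09:12' \
        'demo     pts/0        2024-01-10 09:30' \
        'demo     pts/1        2024-01-10 09:41' \
        'kavita   pts/2        2024-01-10 10:05'
}

# grep on the right side of the pipe keeps only the lines containing "demo".
sample_who | grep demo

# Counting those lines gives the number of terminals in use by demo.
terminals=$(sample_who | grep -c demo)
echo "$terminals"
```

With this sample listing, two lines are printed and the count is 2, one for each terminal on which demo is logged in.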

 

4 Learning grep through examples

 

This section continues the discussion of grep and explores the command in more detail through various illustrations. A sample file, emp.txt, contains the employee database of an organization; each line is one employee record whose fields are separated by the ‘-’ character as a delimiter. The following screen displays the contents of the emp.txt file. The first field is the employee-id, the second is the employee name, the third is the designation, the fourth is the department, the fifth is the salary and the last is the highest qualification of the employee.

Figure: Contents of file emp.txt

 

Case-1: List the managers

 

To display the employees with the designation of Manager, we simply search the file for the pattern MANAGER. It must be noted that the grep command is case sensitive. The following screens illustrate the search for managers in the employee database.

When no line matches the pattern, the grep command remains silent and displays no output on the screen. A user can check the outcome of a grep command using the $? variable. A value of 0 signifies that grep found at least one matching line, a value of 1 signifies that grep did not find any match, and a value greater than 1 signifies an error such as an unreadable file. The following screen illustrates a pattern that is not found in the employee database.

Figure: Grep remains silent when no match exists

 

The $? variable holds the exit status of the last executed command. In the above example, grep remains silent (the pattern manager in lower case does not exist in the file) and does not display any output, so by echoing the contents of $? a user can judge whether grep found the pattern or not. Here $? holds the value 1, which signifies that grep found no match. The $? variable is very helpful in shell programming, and its use will be discussed in detail in further chapters.
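The exit statuses described above can be sketched as follows. The original emp.txt screen is not reproduced here, so the six records below are hypothetical; they only follow the field layout described earlier (id, name, designation, department, salary, qualification, separated by ‘-’).

```shell
# Hypothetical emp.txt; records are invented, only the layout follows the text.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
103-Ravi Patel-ENGINEER-PRODUCTION-40000-BE
104-Neha Desai-CLERK-SALES-21000-BA
105-Sunil Mehta-ENGINEER-PRODUCTION-42000-BE
106-Priya Nair-MANAGER-ACCOUNTS-58000-MCA
EOF

# Upper-case pattern: two lines are printed and the status stays 0.
status_upper=0
grep MANAGER "$workdir/emp.txt" || status_upper=$?

# Lower-case pattern: grep is case sensitive, prints nothing, status is 1.
status_lower=0
grep manager "$workdir/emp.txt" || status_lower=$?

# Missing file: an error, reported with a status greater than 1.
status_error=0
grep MANAGER "$workdir/missing.txt" 2>/dev/null || status_error=$?

echo "found: $status_upper, not found: $status_lower, error: $status_error"
rm -r "$workdir"
```

Interactively, the same check is usually done by typing echo $? immediately after the grep command; the variables are used here only so that all three statuses can be printed together.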

 

Case-2 Count the number of Managers

 

The -c option counts the number of lines that match a pattern, and the -n option displays each matching line prefixed with its line number.

 

Figure: Grep options -c and -n

 

In the above illustration, the grep command with the -c option displays the number of lines containing MANAGER, that is, the number of managers in the employee database, while the -n option displays the matching lines together with their line numbers (1 and 6).
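A minimal sketch of the two options follows. The emp.txt records are hypothetical, arranged so that the managers fall on lines 1 and 6 as in the illustration described above.

```shell
# Hypothetical emp.txt with managers on lines 1 and 6.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
103-Ravi Patel-ENGINEER-PRODUCTION-40000-BE
104-Neha Desai-CLERK-SALES-21000-BA
105-Sunil Mehta-ENGINEER-PRODUCTION-42000-BE
106-Priya Nair-MANAGER-ACCOUNTS-58000-MCA
EOF

# -c prints only the number of matching lines.
count=$(grep -c MANAGER "$workdir/emp.txt")
echo "$count"

# -n prefixes each matching line with its line number and a colon.
numbered=$(grep -n MANAGER "$workdir/emp.txt")
echo "$numbered"

rm -r "$workdir"
```

Note that -c counts lines, not occurrences: a line containing the pattern twice is still counted once.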

 

Case-3 Match complete words

 

The grep command searches for the regular expression anywhere in a line. If a user supplies a sub-string instead of an entire word, grep still displays every line that contains that sub-string. To avoid this, the -w option matches only complete words and discards sub-string matches.

Figure: Match complete words

 

In the above illustration, -w matches the complete word ‘MANAGER’, whereas a sub-string such as ‘MANA’ produces no output when searched with -w.
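The difference between a sub-string match and a whole-word match can be sketched as follows, again with hypothetical emp.txt records.

```shell
# Hypothetical emp.txt; records are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
103-Ravi Patel-ENGINEER-PRODUCTION-40000-BE
104-Neha Desai-CLERK-SALES-21000-BA
105-Sunil Mehta-ENGINEER-PRODUCTION-42000-BE
106-Priya Nair-MANAGER-ACCOUNTS-58000-MCA
EOF

# Without -w the sub-string MANA matches inside MANAGER.
substring_hits=$(grep -c MANA "$workdir/emp.txt")

# With -w, MANA must stand alone as a word, so nothing matches.
# "|| true" only hides the non-zero exit status of the empty result.
word_hits=$(grep -cw MANA "$workdir/emp.txt" || true)

# The complete word MANAGER still matches, because the '-' delimiters
# on either side of it are not word characters.
manager_hits=$(grep -cw MANAGER "$workdir/emp.txt")

echo "$substring_hits $word_hits $manager_hits"
rm -r "$workdir"
```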

 

Case-4 List Managers from multiple files

 

Suppose there are two files, emp.txt and sec.txt, and the user wants to find the pattern ‘MANAGER’ in both of them. The user can then supply both filenames to the grep command.

 

Figure: Grep from multiple files

 

When more than one file is searched, each line of grep output begins with the name of the file in which the match was found, followed by a colon.
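A sketch of searching two files at once is given below. The contents of both emp.txt and sec.txt are hypothetical, since the original screens are not reproduced here.

```shell
# Hypothetical emp.txt and sec.txt; all records are invented.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
106-Priya Nair-MANAGER-ACCOUNTS-58000-MCA
EOF
cat > "$workdir/sec.txt" <<'EOF'
201-Meena Rao-SECRETARY-ADMIN-25000-BA
202-Arun Jain-MANAGER-ADMIN-52000-MBA
EOF

# The subshell changes directory so that plain file names appear in the output.
output=$(cd "$workdir" && grep MANAGER emp.txt sec.txt)
echo "$output"

rm -r "$workdir"
```

Each printed line starts with emp.txt: or sec.txt:, which shows where the match came from.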

 

Case-5 List Managers and Clerks

Multiple patterns can be searched for in files by repeating the -e option, once for each pattern.

 

Figure: Search multiple patterns

 

The above screen illustrates that the patterns ‘MANAGER’ and ‘CLERK’ can be searched for together using the -e option. A user can search for multiple patterns in multiple files as well. The egrep variant of grep provides similar functionality and is explored at the end of the chapter.
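The use of -e can be sketched as follows with hypothetical emp.txt records. The second command uses grep -E with the alternation operator, which is the extended-expression way of expressing the same search.

```shell
# Hypothetical emp.txt; records are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
103-Ravi Patel-ENGINEER-PRODUCTION-40000-BE
104-Neha Desai-CLERK-SALES-21000-BA
105-Sunil Mehta-ENGINEER-PRODUCTION-42000-BE
106-Priya Nair-MANAGER-ACCOUNTS-58000-MCA
EOF

# One -e per pattern: a line is printed if it matches either pattern.
grep -e MANAGER -e CLERK "$workdir/emp.txt"
with_e=$(grep -c -e MANAGER -e CLERK "$workdir/emp.txt")

# The same search written as a single extended regular expression.
with_alternation=$(grep -c -E 'MANAGER|CLERK' "$workdir/emp.txt")

echo "$with_e $with_alternation"
rm -r "$workdir"
```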

 

Case-6 List all employees who are not Managers

Inverse search can be done by providing the -v option of grep command. This option will display the records that do not contain ‘MANAGER’ pattern.

 

Figure: Patterns not containing MANAGER
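An inverse search can be sketched as follows; the emp.txt records are hypothetical.

```shell
# Hypothetical emp.txt; records are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
103-Ravi Patel-ENGINEER-PRODUCTION-40000-BE
104-Neha Desai-CLERK-SALES-21000-BA
105-Sunil Mehta-ENGINEER-PRODUCTION-42000-BE
106-Priya Nair-MANAGER-ACCOUNTS-58000-MCA
EOF

# -v inverts the match: only lines WITHOUT the pattern are printed.
non_managers=$(grep -v MANAGER "$workdir/emp.txt")
echo "$non_managers"

rm -r "$workdir"
```

With this sample data the four clerk and engineer records are printed and the two manager records are left out.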

 

Case-7 List files that contain Managers

 

The -l option lists, from the files supplied to the grep command, only the names of those files that contain the pattern.

Figure: Display files containing a pattern

 

The above screen illustrates two files, emp.txt and sec.txt, being supplied to the grep command to check whether they contain the pattern. The output lists both files, since each of them contains the given pattern.
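The -l option can be sketched as follows. The contents of emp.txt and sec.txt are hypothetical, and a third file, dept.txt, is an assumption added only to show that a file without a match is left out of the listing.

```shell
# Hypothetical files; dept.txt is an extra file with no matching line.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
EOF
cat > "$workdir/sec.txt" <<'EOF'
201-Meena Rao-SECRETARY-ADMIN-25000-BA
202-Arun Jain-MANAGER-ADMIN-52000-MBA
EOF
printf '%s\n' 'SALES' 'ACCOUNTS' > "$workdir/dept.txt"

# -l prints each file name once if the file has at least one matching line.
files=$(cd "$workdir" && grep -l MANAGER emp.txt sec.txt dept.txt)
echo "$files"

rm -r "$workdir"
```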

 

Case-8 List a particular employee

 

Quotes must be used when looking for a pattern that contains multiple words. Suppose the user wants to display the record of an employee named Kavita Majumder. The two words cannot be supplied to grep as separate arguments, because grep will search for Kavita as the pattern and will treat Majumder as a filename. Instead, the pattern must be enclosed in quotes, as shown below.

 

Figure: Multiple patterns to grep command

 

It must be noted that if a pattern itself contains a single quote, the pattern can be supplied to the grep command within double quotes. An example of such a pattern is India’s.
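Both quoting cases can be sketched as follows. The emp.txt records are hypothetical, and the file notes.txt with its two sentences is an assumption made up to hold a pattern containing an apostrophe.

```shell
# Hypothetical emp.txt and notes.txt; all contents are invented.
workdir=$(mktemp -d)
cat > "$workdir/emp.txt" <<'EOF'
101-Amit Shah-MANAGER-SALES-55000-MBA
102-Kavita Majumder-CLERK-ACCOUNTS-22000-BCOM
EOF
cat > "$workdir/notes.txt" <<'EOF'
India's capital is New Delhi
Indian railways connect many cities
EOF

# Single quotes keep the two words together as one pattern.
record=$(grep 'Kavita Majumder' "$workdir/emp.txt")
echo "$record"

# A pattern that itself contains a single quote goes inside double quotes.
possessive=$(grep "India's" "$workdir/notes.txt")
echo "$possessive"

rm -r "$workdir"
```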

5 Using Metacharacters

 

This section provides examples of using various metacharacters with grep. The wild cards used by the shell have already been discussed in earlier chapters. This section explores pattern matching using metacharacters and the formulation of regular expressions. It must be noted that a regular expression should be enclosed in quote marks, because the shell may otherwise interpret the metacharacters itself, with a meaning different from the one discussed here.

Case-1 Using ^

The symbol ^ is known as the caret or circumflex character. It can be used to search for and display the lines that start with a specific pattern. The ^ sign is prefixed to the pattern. For instance, to list only the directories, the output of ls -l can be piped to grep with the pattern “^d”, because each directory entry in a long listing begins with the letter d.
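A minimal sketch of the caret follows. The directory names reports and backup and the empty file emp.txt are assumptions created in a temporary directory so that the listing is the same on any system.

```shell
# Temporary directory with two hypothetical sub-directories and one file.
workdir=$(mktemp -d)
mkdir "$workdir/reports" "$workdir/backup"
touch "$workdir/emp.txt"

# In a long listing, the first character of each line gives the file type;
# '^d' keeps only the lines that begin with d, i.e. the directories.
ls -l "$workdir" | grep '^d'

# Counting those lines gives the number of sub-directories.
dir_count=$(ls -l "$workdir" | grep -c '^d')
echo "$dir_count"

rm -r "$workdir"
```

The pattern is quoted so that the shell passes ^d to grep unchanged.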

