1 Introduction to Categorical Data

Sayantee Jana

epgp books

 

 

1. Introduction

 

Categorical data are among the most common types of data that we come across in everyday life. They are a special case of discrete data. Categorical data have great practical relevance today, with applications in medical science, epidemiology, public health, education, zoology, sociology, the study of physiological and psychological factors, behavioural science, marketing, industrial quality control and many other fields.

 

2.  Distinction between categorical and numerical data

 

Data are of two types: categorical and numerical. When the answer to a question is a number, the data are numerical; when the answer is not a number but instead signifies a category or group, the data are categorical. Let us consider a few examples. If a person is asked how much he or she weighs, the answer will be something like 65, 62, 59.4, 65.7 or 57 kg. The answer is a number, so it is an example of a numerical variable, and data consisting of such a variable are called numerical data. Other examples of numerical data are shoe size and the height of a person. But if a person is asked what his or her favourite colour is, the answer can be pink, yellow, blue, magenta and so on. This answer is not measurable and cannot be expressed as a real number; rather, it is one of several categories of colour, and so the variable is called a categorical variable. Data consisting of such a variable are called categorical data. Other examples of categorical variables are eye colour, hair colour, favourite music genre and favourite cuisine.
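The distinction can be illustrated with a minimal Python sketch. The weights are the ones quoted above; the list of favourite colours is made up for illustration. The sketch only shows that averaging is meaningful for a numerical variable, whereas a categorical variable is summarised by counting how often each category occurs.

```python
from collections import Counter
from statistics import mean

# Numerical variable: the weights (in kg) quoted in the text.
weights_kg = [65, 62, 59.4, 65.7, 57]

# Categorical variable: hypothetical answers to "what is your favourite colour?".
favourite_colour = ["pink", "blue", "yellow", "blue", "magenta"]

# Arithmetic such as averaging is meaningful for a numerical variable.
average_weight = mean(weights_kg)

# For a categorical variable we can only count how often each category occurs.
colour_counts = Counter(favourite_colour)
most_common_colour, frequency = colour_counts.most_common(1)[0]

print(f"Average weight: {average_weight:.2f} kg")
print(f"Most frequent colour: {most_common_colour} ({frequency} answers)")
```

There is no meaningful "average colour" here; the most frequent category (the mode) is the natural summary of the categorical answers.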

 

A categorical variable can also consist of just a ‘yes/no’ or ‘high/low’ answer. Such data are called binary data, since there are only two choices. Binary data are a special kind of categorical data, and probably the most frequently occurring kind. Examples of binary data are the gender of human subjects in a study, the smoking status of a person, and the answer to a TRUE or FALSE question.

 

It should also be noted that although a categorical variable cannot be expressed as a real number, numbers can be used as codes to denote its categories. For example, we can use the numbers ‘0’ and ‘1’ to denote the answers ‘yes’ and ‘no’ to a question. Here ‘0’ for ‘yes’ does not mean the number 0; it is just a code that signifies ‘yes’, in other words a numerical name given to that answer choice. Numeric codes for categorical answer choices are used for convenience and for mathematical modelling, which we will discuss later.
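The idea that a numeric code is only a name can be sketched in Python. The answers below are hypothetical, and the helper `count_yes` is introduced purely for illustration. The sketch codes the same answers in two ways: the number of ‘yes’ answers is the same under both codings, but the arithmetic sum of the codes changes, so the sum is a property of the coding rather than of the data.

```python
# Hypothetical answers to a yes/no question.
answers = ["yes", "no", "no", "yes", "yes"]

# Coding used in the text: 0 stands for "yes", 1 stands for "no".
code_of = {"yes": 0, "no": 1}
coded = [code_of[a] for a in answers]

# An equally valid coding with the two labels swapped.
swapped_code_of = {"yes": 1, "no": 0}
swapped = [swapped_code_of[a] for a in answers]


def count_yes(codes, yes_code):
    """Count the 'yes' answers, given the code that stands for 'yes'."""
    return sum(1 for c in codes if c == yes_code)


# The category counts do not depend on which number names which category.
yes_count = count_yes(coded, code_of["yes"])
yes_count_swapped = count_yes(swapped, swapped_code_of["yes"])

# The sum of the codes does depend on the coding.
sum_coded = sum(coded)      # number of "no" answers under the first coding
sum_swapped = sum(swapped)  # number of "yes" answers under the second coding

print(yes_count, yes_count_swapped, sum_coded, sum_swapped)
```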

 

Formal definition: A categorical variable is one whose possible values form a set of categories or groups.

 

Practical examples¹ of categorical data can be found in medical science (diagnostic stages of breast cancer – normal, benign, probably benign, suspicious, and malignant), social science (political philosophy – liberal, moderate, conservative), biomedical science (outcomes of a medical treatment – successful or unsuccessful), behavioural science (type of mental illness – schizophrenia, depression, neurosis), zoology (alligators’ primary food preference – fish, invertebrate, reptile), education (student responses to an examination question – correct, incorrect), marketing (consumer preference among brands of soap – Dove, Lux, Vivel), and culinary science (how a particular food product tastes – sweet, sour, mildly spicy, hot).

 

3.  Different types of variables

 

To understand the distinction between categorical and numerical variables properly, we need to know the four different types of variables or, in other words, the four scales of measurement. They are:

 

Nominal variable: has categories without any natural order (example: religious affiliation – Christian, Hindu, Buddhist, Jain, Muslim, Jewish, Zoroastrian, Baha'i etc.),

Ordinal variable: has categories with a natural order, but the distance between categories is not defined (example: social class – upper, middle, low),

Interval-scaled variable: has a defined numerical distance between its values (example: blood pressure level),

Ratio-scaled variable: has a defined origin as well as a defined distance between its values (example: height of a person).

The first two form categorical data and the last two come under numerical data.

1  Agresti, A. (2002): Categorical Data Analysis, John Wiley and Sons, Inc.

 

A variable is classified according to how it has been measured. There is no strict way of classifying a variable. The same variable can be classified as nominal, ordinal, interval-scaled or ratio-scaled, depending on how it has been measured and how much information is available to us. For example, consider the variable mode of transportation. If we treat the categories walk, bicycle, public transport and private vehicle as mere modes of transportation without any natural order, then it is a nominal variable. Suppose instead that we rank the modes by annual cost: a person spends the most money on a private vehicle, less on public transport, less still on cycling (for maintenance) and nothing on walking. If we therefore consider private vehicle to be superior to public transport, the latter to be superior to bicycle, and walking to be the least superior, then it is an ordinal variable. If we record the amount of money spent annually on the mode of transportation, then it is a ratio-scaled variable.
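The three readings of the same variable can be sketched in Python. The observed modes, the rank numbers and the annual costs below are all hypothetical values chosen for illustration; only the ordering of the modes by cost comes from the text.

```python
from collections import Counter

# Hypothetical modes of transportation reported by five people.
observed = ["bicycle", "walk", "private vehicle", "public transport", "bicycle"]

# Nominal reading: categories are mere labels, so we can only compare
# them for equality and count them.
counts = Counter(observed)

# Ordinal reading: the cost-based order from the text. The rank numbers
# are assumptions; only their order matters, not their spacing.
rank = {"walk": 0, "bicycle": 1, "public transport": 2, "private vehicle": 3}
sorted_modes = sorted(set(observed), key=rank.__getitem__)
highest = max(observed, key=rank.__getitem__)

# Ratio-scaled reading: hypothetical annual spending per mode, in
# arbitrary currency units. Differences and ratios are now meaningful.
annual_cost = {"walk": 0, "bicycle": 50, "public transport": 400, "private vehicle": 2000}
costs = [annual_cost[m] for m in observed]
cost_ratio = annual_cost["private vehicle"] / annual_cost["public transport"]

print(counts)
print(sorted_modes, highest)
print(costs, cost_ratio)
```

Under the ordinal reading, a statement such as "private vehicle ranks above public transport" is meaningful, but "five times as much" only makes sense once the actual costs are recorded.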

 

It is to be noted that we can always reduce an interval or ratio-scaled variable to an ordinal or nominal variable, but not vice versa. This is because we can always move down the ladder of classification, with a loss of information, but we cannot move up the ladder, since that would require more information about the variable than we have. For example, if we record the ages of patients in a hospital, the variable is ratio-scaled; later on we can classify the patients as ‘young’ and ‘old’ based on their ages, and the variable then becomes ordinal. But if we record just the age groups (young and old) of the patients, we cannot convert the variable into a ratio or interval-scaled variable later on.
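The loss of information when moving down the ladder can be sketched in Python. The ages, the cutoff of 60 years and the function name `age_group` are assumptions made for illustration; the text does not specify where ‘young’ ends and ‘old’ begins.

```python
# Assumed cutoff: patients aged 60 or more are labelled "old".
CUTOFF = 60


def age_group(age, cutoff=CUTOFF):
    """Reduce a ratio-scaled age to one of two ordered categories."""
    return "old" if age >= cutoff else "young"


# Two different hypothetical sets of recorded ages.
ages = [23, 67, 45, 81, 34]
other_ages = [30, 70, 59, 60, 18]

# Moving down the ladder is always possible.
groups = [age_group(a) for a in ages]
other_groups = [age_group(a) for a in other_ages]

# Both sets of ages collapse to the same age groups, so the original ages
# cannot be recovered from the groups alone.
same_groups = groups == other_groups

print(groups)
print(same_groups)
```

Because many different ages map to the same group, the mapping has no inverse, which is why the age groups cannot be turned back into a ratio-scaled variable.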
