You have just seen categorical variables!
What are Categorical Variables?
Remember that univariate data, also known as one-variable data, are observations that are made on the individuals in a population or sample. That data comes in different types, like qualitative, quantitative, categorical, continuous, discrete, and so on. In particular, you will be looking at categorical variables, which are also often called categorical data. Let's first look at the definition.
A variable is called a categorical variable if the collected data falls into categories. In other words, categorical data is data which can be divided into different groups instead of being measured numerically.
Categorical variables are qualitative variables because they deal with qualities, not quantities. So, some examples of categorical data would be hair colour, the type of pets someone has, and favourite foods. On the other hand things like height, weight, and the number of cups of coffee that someone drinks per day would be measured numerically, and so are not categorical data.
To see the various types of data and how they are used you can take a look at One-Variable Data and Data Analysis.
Categorical vs. Quantitative Data
Now you know what categorical data is, but how is that different from quantitative data? It helps to look at the definition first.
Quantitative data is data that is a count of how many things in a data set have a particular quality.
Quantitative data usually answers questions like "how many" or "how much". For example, quantitative data would be collected if you wanted to know how much people spent on buying a cell phone. Quantitative data is often used to compare multiple sets of data. For a more complete discussion of quantitative data and what it is used for, take a look at Quantitative Variables.
Categorical data is qualitative, not quantitative!
Categorical vs. Continuous Data
All right, what about continuous data? Can that be categorical? Let's take a look at the definition of continuous data.
Continuous data is data that is measured on a scale of numbers, where the data could be any number on the scale.
A good example of continuous data is height. For any of the numbers between \(4 \text{ ft}\) and \(5 \text{ ft}\) there could be someone of that height. In general, categorical data is not continuous data.
Types of Categorical Variables
There are two main types of categorical variables, nominal and ordinal.
Ordinal Categorical Variables
A categorical variable is called ordinal if it has an implied order to it.
An example of ordinal categorical data would be the survey at the start of this article. It asked you to rate satisfaction on a scale of \(1\) to \(5\), meaning there is an implied order to your rating. Remember that numerical data is data that involves numbers, which the survey example does have. So it is possible for survey data to be both ordinal and numerical.
Nominal Categorical Variables
A categorical variable is called nominal if the categories are named, i.e. if the data does not have numbers assigned.
Suppose a survey asked you what kind of housing you live in, and the options you could pick from were dorm, house, and apartment. Those are examples of named categories, so that is nominal categorical data. In other words, if it has a named category but isn't numerically ordered, then it is a nominal categorical variable.
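The difference between the two types can be sketched in a few lines of Python. This is only an illustration: the class names, the member names for the \(1\) to \(5\) ratings, and the helper `can_be_ordered` are assumptions made for this example, not part of the survey described in the article.

```python
from enum import Enum, IntEnum


class Housing(Enum):
    """Nominal: named categories with no implied order."""
    DORM = "dorm"
    HOUSE = "house"
    APARTMENT = "apartment"


class Satisfaction(IntEnum):
    """Ordinal: ratings 1 to 5 with an implied order (member names are hypothetical)."""
    VERY_LOW = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5


def can_be_ordered(a, b) -> bool:
    """Return True if the two category values support the < comparison."""
    try:
        a < b
        return True
    except TypeError:
        # Plain Enum members do not define an ordering.
        return False


print(can_be_ordered(Satisfaction.LOW, Satisfaction.HIGH))  # True (ordinal)
print(can_be_ordered(Housing.DORM, Housing.HOUSE))          # False (nominal)
```

`IntEnum` is used here only because it gives the ratings an order; the housing categories deliberately use a plain `Enum`, so asking whether a dorm is "less than" a house is rejected.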
Categorical Variables in Statistics
Before you go on to look at more examples of categorical variables, let's look at some of the advantages and disadvantages of categorical data.
On the advantage side are:
- The results are very straightforward because people only get a few options to choose from.
- Because the options are laid out ahead of time, there are no open-ended questions that need to be analysed. Categorical data is called concrete because of this property.
- Categorical data can be much easier (and less expensive) to analyse than other kinds of data.
On the disadvantage side are:
- In general, you need to get quite a few samples to make sure the survey accurately represents the population. This can be expensive to do.
- Because the categories are laid out at the start of the survey, it isn't very sensitive. For example, if the only two options for hair colour on a survey are brown hair and white hair, people will have trouble deciding which category to put their hair colour in (assuming they have any at all). This can lead to non-responses, and to people making unanticipated choices about what their hair colour is, which skews the data.
- You can't do quantitative analysis on categorical data! Because it isn't numerical data, you can't do arithmetic on it. For example, you can't take a survey satisfaction of \(4\), and add it to a survey satisfaction of \(3\) to get a survey satisfaction of \(7\).
You can see a summary of the advantages and disadvantages of categorical variables in statistics in the following table:
Table 1. Advantages and disadvantages of categorical variables.
Advantages | Disadvantages |
---|---|
Results are straightforward | Large samples needed |
Concrete data | Not very sensitive |
Easier and less expensive to analyse | No quantitative analysis |
Collecting Categorical Data
How do you collect categorical data? This is often done through interviews (either in person or on the phone) or surveys (either online, in the mail, or in person). In either case, the questions asked are not open-ended. They will always ask people to choose between a specific set of options.
Categorical Data Analysis
The collected data then needs to be analysed, so how do you analyse categorical data? Often it is done with proportions or percentages, and the results can be shown in tables or graphs. Two of the most frequent ways to look at categorical data are bar charts and pie charts.
Suppose you were asked to give a survey to decide whether people liked a particular soft drink and got back the following information:
- 14 people liked the soft drink; and
- 50 people did not like it.
First, we should figure out if this is categorical data.
Solution
Yes. You can divide up the answers into two categories, in this case "liked it" and "didn't like it". This would be an example of nominal categorical data.
Now, how could we represent this data? We could do so with a bar or a pie chart.
Like and Didn't Like Bar Chart
Pie chart showing percentage of people who liked or didn't like the soda
Either one gives you a visual comparison of the data. For many more examples of how to construct a chart for categorical data, see Bar Graphs.
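The percentages behind such a chart can be worked out directly from the counts. The following is a minimal sketch using the soft drink numbers above; the function name `percentages` and the text-based bars are assumptions made for illustration, not the charts from the article.

```python
# Counts from the soft drink survey in the example.
counts = {"liked it": 14, "didn't like it": 50}


def percentages(counts):
    """Convert category counts into percentages of the total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


shares = percentages(counts)

# Print a rough text bar chart: one '#' for roughly every 5 percentage points.
for category, share in shares.items():
    bar = "#" * round(share / 5)
    print(f"{category:<15} {counts[category]:>3}  {share:5.1f}%  {bar}")
```

With \(64\) responses in total, the shares come out to \(21.875\%\) who liked the drink and \(78.125\%\) who did not.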
Examples of Categorical Variables
Let's look at some examples of what categorical data can be.
Suppose you are interested in seeing a movie, and you ask a bunch of your friends whether they liked it or not in order to decide whether you want to spend money on it. Of your friends, \(15\) liked the movie and \(50\) didn't like it. What is the variable here, and what kind of variable is it?
Solution
First of all, this is categorical data. It is divided into two categories, "liked" and "didn't like". There is one variable in the data set, namely your friends' opinions of the movie. In fact, this is an example of nominal categorical data.
Let's look at another example.
Going back to the movie example, suppose you asked your friends whether or not they liked a particular movie, and what state they live in. How many variables are there, and what kind are they?
Solution
Just like in the previous example, your friends' opinions of the movie is one variable, and it is categorical. Since you also asked what city your friends live in, there is a second variable here, and it is the name of the state they live in. There are only so many states in the US, so there are a finite number of places they could list as their state. So the state is a second nominal categorical variable you have collected data on.
Let's change what you are asking in your survey a bit.
Now suppose you have asked your friends about how much they are willing to pay to see the movie, and you give them three price ranges: less than $5; between $5 and $10; and more than $10. What kind of data is this?
Solution
This is still categorical data because you laid out the categories your friends can answer in before you asked them to answer your survey. However, this time it is ordinal categorical data, since you can order the categories by price (which is a number).
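One way to see the implied order is to sort some answers by it. In the sketch below the three price ranges come from the example, while the four answers and the helper `sort_by_category_order` are hypothetical and only there for illustration.

```python
# Price ranges from the example, listed from cheapest to most expensive.
PRICE_RANGES = ["less than $5", "between $5 and $10", "more than $10"]
RANK = {label: position for position, label in enumerate(PRICE_RANGES)}


def sort_by_category_order(answers):
    """Sort answers by the implied order of their categories."""
    return sorted(answers, key=RANK.__getitem__)


# Hypothetical answers from four friends (illustrative only).
answers = ["more than $10", "less than $5", "between $5 and $10", "less than $5"]

print(sort_by_category_order(answers))
# Alphabetical sorting ignores the price order and puts "between ..." first.
print(sorted(set(answers)))
```

The order has to be stated explicitly because the labels are still category names; sorting them as plain text does not recover the price order.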
So how do you compare categorical variables anyway?
Correlation Between Categorical Variables
Suppose you asked your friends whether or not they liked a particular movie, and whether they paid less than \(\$5\), between \(\$5\) and \(\$10\), or more than \(\$10\) to see it. Those are two categorical variables, so how can you compare them? Is there any way to see if how much they paid to see the movie influenced how much they liked it?
One thing you can do is look at comparative bar charts of the data, or at a two-way table. You can find more information about those in the article Bar Graphs. The other thing you can do is a more official kind of statistical test, called a chi-square test. This topic can be found in the article Inference for Distributions of Categorical Data.
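To make the two-way table and the chi-square idea concrete, here is a rough sketch. The counts in `observed` are hypothetical (they are not results from the article), and `chi_square_statistic` is a name chosen for this example. It computes the usual statistic, the sum of \((\text{observed} - \text{expected})^2 / \text{expected}\) over all cells, where each expected count is the row total times the column total divided by the overall total.

```python
# Hypothetical two-way table: opinion of the movie by price paid.
# These counts are made up for illustration only.
observed = {
    "liked": {"less than $5": 8, "between $5 and $10": 5, "more than $10": 2},
    "didn't like": {"less than $5": 10, "between $5 and $10": 20, "more than $10": 20},
}


def chi_square_statistic(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, n in cols.items():
            col_totals[col] = col_totals.get(col, 0) + n
    total = sum(row_totals.values())

    statistic = 0.0
    for row, cols in table.items():
        for col, obs in cols.items():
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[row] * col_totals[col] / total
            statistic += (obs - expected) ** 2 / expected

    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


statistic, dof = chi_square_statistic(observed)
print(f"chi-square statistic: {statistic:.3f}, degrees of freedom: {dof}")
```

The sketch stops at the statistic and the degrees of freedom; turning those into a conclusion requires a chi-square table or statistical software, as covered in Inference for Distributions of Categorical Data.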
Categorical Variables - Key takeaways
- A variable is called a categorical variable if the data collected falls into categories.
- Categorical variables are qualitative variables because they deal with qualities, not quantities.
- A categorical variable is called ordinal if it has an implied order to it.
- A categorical variable is called nominal if the categories are named.
- Ways to look at categorical variables include tables and bar charts.