Box plots in R are a good way to visualize how your data is distributed. They are also known as box-and-whisker plots. Each data distribution has certain measures of central tendency – mean, median and mode.
Some distributions are tightly clustered around the median and mean, while others are spread across a wide range of values and contain a number of outliers. Box plots let you examine your data using a five-number summary. These are:
- Median – The middle value of the sorted set – known as Q2
- First quartile – The median of the lower half of the set (roughly the 25th percentile) – known as Q1
- Third quartile – The median of the upper half of the set (roughly the 75th percentile) – known as Q3
- The distance between Q1 and Q3 is known as the interquartile range – IQR.
- Minimum – The smallest data point that is no lower than Q1 - 1.5*IQR – not necessarily the smallest value in the set
- Maximum – The largest data point that is no higher than Q3 + 1.5*IQR – not necessarily the largest value in the set
Any data point lying beyond these minimum and maximum limits is treated as an outlier. The box plot therefore gives you a compact overview of the data distribution.
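The quantities above can be worked out by hand in R. The following is a minimal sketch on a small made-up vector (the numbers are purely illustrative and not from any dataset); it uses fivenum(), which is what base R's box plot statistics are built on, and assumes the default 1.5*IQR rule.

```r
# Illustrative, made-up data with one obviously extreme value
x <- c(2, 4, 5, 7, 8, 9, 10, 12, 30)

five <- fivenum(x)   # smallest value, Q1 (lower hinge), median, Q3 (upper hinge), largest value
five                 # 2 5 8 10 30

q1  <- five[2]
q3  <- five[4]
iqr <- q3 - q1       # 5

lower_limit <- q1 - 1.5 * iqr   # -2.5
upper_limit <- q3 + 1.5 * iqr   # 17.5

# The whiskers stop at the most extreme data points inside the limits
whiskers <- range(x[x >= lower_limit & x <= upper_limit])
whiskers             # 2 12

# Everything outside the limits is flagged as an outlier
outliers <- x[x < lower_limit | x > upper_limit]
outliers             # 30
```

Here the largest value in the set is 30, but the upper whisker ends at 12, which is why the "maximum" of a box plot is not necessarily the largest value.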
Creating Box Plots in R
Box plots can be created using the boxplot() function in R. Let us create our first box plot using R’s built-in airquality dataset.
This is a dataframe with 6 columns and 153 rows, recording weather data like wind speed, temperature, ozone quantity, etc. Let us try making a box plot for the wind speed column of the dataset.
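The command for this first plot is not shown in the text; a call along the following lines would presumably produce it. The title and axis label are additions for readability, not part of the original.

```r
# airquality ships with base R, so no package needs to be loaded
dim(airquality)   # number of rows and columns in the data frame

# Box plot of the wind speed column (Wind is recorded in miles per hour)
boxplot(airquality$Wind, main = "Wind speed", ylab = "Wind (mph)")
```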
- The thick line slicing through the box represents the median of the data set – which is roughly 10.
- The lower half of the box looks larger than the upper half – indicating that the values below the median are more dispersed.
- The upper and lower boundaries of the box represent the Q3 and Q1 points respectively.
- The smaller horizontal lines extending outside the box, known as whiskers, mark the minimum and maximum values as defined above.
- The small circles above the maximum mark here are the outliers.
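The reading of the plot above can be checked numerically. As a sketch, boxplot.stats() returns the same statistics that boxplot() draws by default, so the median, box boundaries, whisker ends and outliers can be inspected directly.

```r
s <- boxplot.stats(airquality$Wind)

# Lower whisker, Q1 (lower hinge), median, Q3 (upper hinge), upper whisker
s$stats

# Compare the two halves of the box
s$stats[3] - s$stats[2]   # height of the lower half (median minus Q1)
s$stats[4] - s$stats[3]   # height of the upper half (Q3 minus median)

# The points drawn as small circles
s$out
```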
Let us try plotting a box plot for another variable in the dataset.
It can be observed that this variable has two outliers above the maximum mark and that the data is more widely dispersed above the median value.
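The text does not say which column was plotted here. Purely as an assumed example, the sketch below uses the Ozone column; the choice of column, the title and the axis label are assumptions and may not match the original plot.

```r
# Assumption: Ozone stands in for the unnamed variable.
# Ozone contains missing values (NA), which boxplot() and
# boxplot.stats() drop automatically.
boxplot(airquality$Ozone, main = "Ozone", ylab = "Ozone (ppb)")

# Values that fall beyond the whiskers and are drawn as circles
boxplot.stats(airquality$Ozone)$out
```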
Building Multiple Box Plots
R also makes it possible to compare the distribution of two variables using multiple box plots.
> boxplot(airquality$Ozone,airquality$Temp, names=c('Ozone','Temperature'),col=c('red','orange'))
The command uses two different colors to distinguish the variables. The labels for the individual plots are supplied through the names argument of the function.
Plotting Variable Relationships with Box Plots
It is also possible to examine how a variable is distributed across the levels of a categorical variable in the dataset. For example, if we wish to look at the distribution of the temperature for every individual month, we only need to include the two variables within the formula part as – Temp ~ Month, setting data to the data frame name.
Temp ~ Month means that the Temp values are grouped by Month, giving one box per month. Let us now execute the command and try building a horizontal plot instead of a vertical one.
> boxplot(Temp ~ Month, data=airquality, horizontal= TRUE, col=c('red','green'))
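To see the numbers behind each box, the same formula can be evaluated without drawing anything. This is a small sketch; by_month is just an illustrative variable name.

```r
# plot = FALSE returns the per-group statistics instead of drawing the plot
by_month <- boxplot(Temp ~ Month, data = airquality, plot = FALSE)

by_month$names        # group labels taken from the Month column
by_month$n            # number of observations in each group
by_month$stats[3, ]   # row 3 of the stats matrix holds each group's median
```

Note that when col lists fewer colors than there are groups, as in the command above, R recycles the colors across the boxes.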
Adding Notches to Box Plots in R
A variation of the box plot is sometimes seen with notches added. A notch is a narrowing of the box around the median; in R it extends to the median plus or minus 1.58*IQR/sqrt(n), which serves as a rough confidence interval for the median.
If the notches of two plots do not overlap, this is strong evidence that the medians of the two distributions differ. If they do overlap, the data does not clearly show a difference between the medians. Notches can be added by setting the notch parameter to TRUE.
Let us make a notched variant of the plot above.
> boxplot(Temp ~ Month, data=airquality, horizontal= TRUE, notch= TRUE, col=c('red','green','orange','blue','purple'))
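The notch limits can also be inspected numerically rather than judged by eye. The sketch below reads them from the conf component returned by boxplot(); notches_overlap is a hypothetical helper written for this illustration, not a base R function.

```r
by_month <- boxplot(Temp ~ Month, data = airquality, plot = FALSE)

# conf has one column per group: row 1 is the lower notch limit, row 2 the upper
notches <- by_month$conf
colnames(notches) <- by_month$names
notches

# Hypothetical helper: TRUE when the notch intervals of groups a and b overlap
notches_overlap <- function(a, b) {
  notches[1, a] <= notches[2, b] && notches[1, b] <= notches[2, a]
}

# Compare the notches for May (month 5) and July (month 7)
notches_overlap("5", "7")
```

A result of FALSE would suggest that the two monthly medians differ, in line with the non-overlap rule described above.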