A scatterplot is one of the simplest graphs in R: it plots two numeric vectors against each other, one point per observation. Scatterplots are useful for seeing how the values of one variable are distributed and how densely they cluster relative to another.
Suppose you wish to plot the heights of children against their ages and see how tall most children are at a given age. A scatterplot is a natural choice for this.
These can also be applied to situations like lung capacity vs hours of exercise, time of the day vs employee logins, week of the month vs daily sales, etc.
We will begin by plotting some scatterplots using R's base graphics and then move on to visualizing similar graphs in 3D.
Creating Scatterplots in R
The simplest scatterplot can be created using the plot(x, y) command, where x and y are vectors. Let us look at an example using one of R's built-in datasets.
The iris dataset in R is a collection of 150 observations across 5 variables concerning the iris flower. These variables indicate the dimensions of flowers such as sepal length/width and petal length/width. Let us try plotting the petal width of the flowers against the petal length observed.
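A minimal sketch of that call, assuming the default plot settings (the variable names x and y are only illustrative):

```r
# Petal length goes on the x-axis, petal width on the y-axis
x <- iris$Petal.Length
y <- iris$Petal.Width

# Base R scatterplot with default symbols and axis labels
plot(x, y)
```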
This gives us our first scatterplot. The graph is easy to read and tells us that some of the petals have a length of 1 to 2 cm, with a petal width of less than a centimeter – around 0.5 cm to be more precise.
Additionally, most of the flowers have longer petals (3-7 cm), and these also tend to be wider, mostly clustered around 1 to 2 cm. We can also clearly see a strong positive, roughly linear relationship between petal length and width.
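One way to check the strength of that relationship numerically is the Pearson correlation coefficient. The following is a small sketch using base R's cor() function; the variable name r is only illustrative.

```r
# Pearson correlation between petal length and petal width
r <- cor(iris$Petal.Length, iris$Petal.Width)

# Print the coefficient rounded to two decimal places
print(round(r, 2))
```

A value close to 1 indicates a strong positive linear association, while a value close to 0 indicates a weak one.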
How do we make these graphs more informative? Observe that the iris dataset has 3 species of flowers. We can have a scatterplot that plots the 3 species with 3 different colors for the same variables we have chosen above.
All you need to do is use the unclass() function to index a vector of colors in the col argument when plotting the graph. The unclass() function takes iris$Species as an argument and returns its underlying integer codes (1, 2 and 3), so each dot is colored according to its species. We also add axis labels, a title and a legend to make the graph easier to read. Let us look at the code snippet.
plot(iris$Petal.Length, iris$Petal.Width, pch=20, col=c("red","green","blue")[unclass(iris$Species)], xlab="Petal Length", ylab="Petal Width", main="Iris length vs width by species")
# Legend labels follow the factor level order: setosa, versicolor, virginica
legend("topleft", c("Setosa","Versicolor","Virginica"), pch=20, col=c("red","green","blue"), title="Species")
The resulting graph is shown below.
This graph carries far more information than the simple one we created above: Setosa is the smallest species in the lot and Virginica is the largest. Also, there are some Versicolor irises that are large enough to look like Virginica ones. All these interpretations are possible thanks to the legend drawn in the top-left corner of the graph.
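To see how the colors get assigned, it can help to inspect the integer codes behind iris$Species. The sketch below uses only base R; the names species_cols and point_cols are illustrative and not part of the original code.

```r
# Factor levels are stored in this order, which determines codes 1, 2, 3
print(levels(iris$Species))

# unclass() exposes the integer code of each observation
codes <- unclass(iris$Species)

# Indexing the color vector by code gives one color per observation
species_cols <- c("red", "green", "blue")
point_cols <- species_cols[codes]

# Cross-tabulate species against assigned color to confirm the mapping
print(table(iris$Species, point_cols))
```

Because the colors are picked by level order, legend labels listed in that same order line up with the colors on the plot.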
Creating 3D Scatterplots
Sometimes, we might also be interested in creating scatterplots that involve 3 variables instead of 2.
While R's base graphics support only 2D scatterplots, a separate package named scatterplot3d provides 3D ones. We will begin by installing this package.
install.packages("scatterplot3d")
package ‘scatterplot3d’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\JournalDev\AppData\Local\Temp\Rtmpc7b8JR\downloaded_packages
Now make sure that you load the installed package to your environment.
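Loading would typically be done with library(), as in this short sketch, which assumes the installation step above succeeded:

```r
# Attach the package so that scatterplot3d() is available in this session
library(scatterplot3d)
```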
So let us create a 3D scatterplot for three variables from the iris dataset.
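A minimal sketch of such a call is given here. The choice of petal length, petal width and sepal length is an assumption, made to match the variables used in the colored example later in this article; the name s3d is illustrative.

```r
library(scatterplot3d)  # assumes the package is already installed

# Assumption: x = petal length, y = petal width, z = sepal length
s3d <- scatterplot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length)
```

Here scatterplot3d() takes the three vectors as its x, y and z coordinates, in that order.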
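Now, let us try plotting a different color for each species as above. This time we index the color vector with the species factor itself rather than with unclass(). Since iris$Species is already a factor, the as.factor() function leaves it unchanged and simply makes the intent explicit; indexing with a factor uses its integer codes, so the colors are again assigned in level order. Let us also add an x-label, y-label, z-label and title to the graph like above.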
spec <- as.factor(iris$Species)
scatterplot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length, pch=20, xlab="Petal Length", ylab="Petal Width", zlab="Sepal Length", main="3D plot", color=c("red","green","blue")[spec])
The resulting 3D plot is as follows:
Scatterplots are simple yet important tools in data analysis. They are well suited to spotting outliers and getting a general sense of how the variables in a dataset relate to each other.