Let’s talk about exploratory graphs in exploratory data analysis (EDA) in R. Do you find that exploring data visually is more fruitful? If so, you should know the plotting options for both one-dimensional and multidimensional data. EDA is at the heart of any analysis in R, so in this article we will explore various plots to draw key insights. Why wait, let’s roll!
Performing EDA with Exploratory Graphs in R
What do the common methods of analysis include? They include understanding the data structure, data types, summaries, and statistical measures of the data. Here, though, we are going to see how exploratory graphs play a major role in the analysis. We will cover both one-dimensional and multi-dimensional data, pairing each graph with a suitable analysis approach for EDA.
You can use the diamonds dataset for this purpose. It’s available with the ggplot2 package. Let’s load the data and inspect it briefly.
#Load the required library and data
library(ggplot2)
View(diamonds)
We have loaded the data, and we can use the str() function to understand its structure.
#Display the structure of data
str(diamonds)
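As a complement to View() and str(), the following is a minimal sketch of a few console-only checks. It assumes ggplot2 is installed; the variable names col_classes and na_counts are illustrative and not part of the original article.

```r
# Assumes ggplot2 is installed; it ships the diamonds data set.
library(ggplot2)

# Dimensions: number of rows (diamonds) and columns (variables)
dim(diamonds)

# First few rows, printed to the console instead of opening a viewer
head(diamonds)

# First class of each column (illustrative helper variable)
col_classes <- sapply(diamonds, function(x) class(x)[1])
print(col_classes)

# Count of missing values per column (illustrative helper variable)
na_counts <- colSums(is.na(diamonds))
print(na_counts)
```

These checks are handy in scripts, where an interactive viewer is not available.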
One-Dimensional analysis using Exploratory Graphs
If you are analysing the data and looking for one-dimensional summaries, you have these options.
Five number summary
The five-number summary reports the minimum, lower quartile, median, upper quartile, and maximum. You can use the fivenum() function in R to compute it.
#Computes five number summary
fivenum(diamonds$price)
326.0 950.0 2401.0 5324.5 18823.0
#Computes summary of data
summary(diamonds$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    326     950    2401    3933    5324   18823
The five-number summary is similar to the output of the summary() function. However, summary() also reports the mean.
In the results (price in $):
- The minimum price of diamonds is 326.
- The first quartile of price is 950.
- The median price is 2401.
- The mean price is 3933.
- The third quartile of price is 5324.
- Finally, the maximum price of diamonds is 18,823.
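The figures above can also be obtained one at a time. The sketch below, which assumes ggplot2 is installed, uses quantile(), IQR(), mean(), and median(); the variable names are illustrative. Note that fivenum() reports hinges, which can differ slightly from the quartiles returned by quantile() and summary().

```r
library(ggplot2)  # provides the diamonds data set

price <- diamonds$price

# Quartiles via quantile(); these match what summary() reports
q <- quantile(price, probs = c(0.25, 0.50, 0.75))
print(q)

# Interquartile range: spread of the middle 50% of prices
iqr_price <- IQR(price)

# A mean well above the median is a quick sign of right skew
mean_price   <- mean(price)
median_price <- median(price)
cat("mean:", round(mean_price), " median:", median_price,
    " IQR:", iqr_price, "\n")
```

Comparing the mean with the median is a quick numeric hint about skewness before any plot is drawn.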
Boxplot exploratory graph
Boxplots are used to plot the five-number summary. Any points that lie beyond the whiskers are treated as outliers, and you have to decide how to handle them. Let’s draw a boxplot to understand the distribution of the data.
#Creates a boxplot
boxplot(diamonds$price, col = 'green', xlab = 'Counts', ylab = 'Price', main = 'Boxplot')
- The third quartile of price is 5324, yet the boxplot shows many points well beyond it, drawn individually above the upper whisker.
- You can also try a log transformation, which compresses the long right tail and makes the distribution more symmetric. Use the code below.
#Log transformed values
boxplot(log(diamonds$price), col = 'green', xlab = 'Counts', ylab = 'log(Price)', main = 'Boxplot')
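To see which points the boxplot treats as outliers, one option is boxplot.stats(), which returns the numbers behind the plot without drawing it. This is a minimal sketch assuming ggplot2 is installed and the default whisker rule; the variable names are illustrative.

```r
library(ggplot2)

# boxplot.stats() returns the numbers behind a boxplot without drawing it
bs <- boxplot.stats(diamonds$price)

# bs$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
upper_whisker <- bs$stats[5]

# bs$out: the points drawn individually beyond the whiskers
outliers <- bs$out

cat("upper whisker ends at:", upper_whisker, "\n")
cat("points beyond the whiskers:", length(outliers), "\n")
cat("share of all diamonds:",
    round(100 * length(outliers) / nrow(diamonds), 1), "%\n")
```

Whether these points should be removed, capped, or kept depends on the analysis; being flagged by the whisker rule does not by itself make a record wrong.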
Histograms are like bar graphs that show the frequency of data. Each bar covers a range of values (a bin), and its height shows how many data points fall in that range. Let’s plot a histogram to visualize the data.
#Creates a histogram
hist(diamonds$price, breaks = 10, col = 'green', xlab = 'Price', main = 'Histogram of Price')
- It’s clear that the lowest price bin, up to about $2000, is by far the largest, holding roughly 25,000 diamonds.
- Roughly 3000-5000 diamonds are priced above $10000.
- The distribution is right-skewed: a single peak at the low end with a long tail toward high prices.
- The long tail suggests possible outliers. Take time to check whether each one is truly an outlier or a legitimate data record.
- Therefore, understand your data first. Spend some time understanding the variables and their contribution to the data.
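The counts read off a histogram can be checked numerically. The sketch below, assuming ggplot2 is installed, calls hist() with plot = FALSE to get the bin edges and counts as a table; the names bins and n_above_10k are illustrative.

```r
library(ggplot2)

# plot = FALSE returns the bin breaks and counts without drawing
h <- hist(diamonds$price, breaks = 10, plot = FALSE)

bins <- data.frame(
  lower = head(h$breaks, -1),  # left edge of each bin
  upper = tail(h$breaks, -1),  # right edge of each bin
  count = h$counts             # number of diamonds in the bin
)
print(bins)

# Number of diamonds priced above $10,000
n_above_10k <- sum(diamonds$price > 10000)
cat("diamonds above $10,000:", n_above_10k, "\n")
```

Keep in mind that breaks = 10 is only a suggestion to hist(), so the bin edges actually used are worth inspecting.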
A barplot is very useful for visualizing categorical variables. Let’s see how we can plot a bar graph of the quality (cut) of the diamonds.
- Bar plots are easy-to-read plots for visualising categorical data.
- As we can see, ideal-cut diamonds form the largest group in the data.
- They are followed by premium, very good, good, and fair.
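The bar plot described above can be drawn in base R roughly as follows. This is a sketch assuming ggplot2 is installed and that the cut column is the quality variable being plotted; the color and labels simply follow the article's other plots.

```r
library(ggplot2)

# Count the diamonds in each cut category
cut_counts <- table(diamonds$cut)
print(cut_counts)

# Draw the bar plot from the counts
barplot(cut_counts, col = 'green', xlab = 'Cut', ylab = 'Count',
        main = 'Barplot of Cut')
```

Printing the table alongside the plot makes it easy to read exact counts rather than estimating bar heights.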
Two-Dimensional Analysis with Exploratory Graphs in R
We have looked at one-dimensional data. Now, we can look at visualizing two-dimensional data with exploratory graphs. You have multiple options for this: multiple boxplots, multiple histograms, and scatter plots, which are also very useful for data with multiple dimensions.
You can use multiple box plots to understand the relationship between a numeric variable and a categorical one. The side-by-side view makes the groups easy to compare.
#Creates multiple boxplots
boxplot(carat ~ cut, data = diamonds, col = "green", main = 'Multiple Boxplots')
- You can observe the weight / size (carat) of the diamonds with respect to their quality (cut).
- The best-quality diamonds mostly weigh under 2.5 carats.
- We can also observe that low-quality diamonds tend to be bigger in weight / size.
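The pattern seen in the side-by-side boxplots can be summarised numerically. This is a small sketch, assuming ggplot2 is installed, that uses tapply() to compute the median and maximum carat per cut; the variable names are illustrative.

```r
library(ggplot2)

# Median and maximum carat for each cut level
carat_median <- tapply(diamonds$carat, diamonds$cut, median)
carat_max    <- tapply(diamonds$carat, diamonds$cut, max)

# One row per statistic, one column per cut level
print(rbind(median = carat_median, max = carat_max))
```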
Multiple histograms help you compare the distribution of a variable across groups. Let’s see how it works.
#Multiple histograms
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(diamonds, cut == 'Fair')$price, col = 'green', xlab = 'Price', ylab = 'Count',
     main = 'Histogram of Cut(Fair) Vs Price distribution')
hist(subset(diamonds, cut == 'Ideal')$price, col = 'green', xlab = 'Price', ylab = 'Count',
     main = 'Histogram of Cut(Ideal) Vs Price distribution')
- It’s just a simple illustration of price versus quality of diamonds.
- You can observe that good-quality (Ideal) diamonds tend to cost less. It’s a strange pattern, but it is what the data shows.
- The low-quality (Fair) diamonds tend to cost more.
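The same comparison can be extended to every cut level. The sketch below assumes ggplot2 is installed; the shared $1,000 bin width and the 2-by-3 panel layout are arbitrary choices for illustration. It also saves and restores the graphics settings, so the layout change does not leak into later plots.

```r
library(ggplot2)

# Common bin edges so every panel uses the same x-axis
# (the $1,000 bin width is an arbitrary choice for illustration)
common_breaks <- seq(0, 19000, by = 1000)

# Save the current graphics settings and switch to a 2 x 3 layout
op <- par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))

for (lvl in levels(diamonds$cut)) {
  hist(diamonds$price[diamonds$cut == lvl], breaks = common_breaks,
       col = 'green', xlab = 'Price', ylab = 'Count', main = lvl)
}

# Restore the previous layout so later plots are not affected
par(op)

# Median price per cut, as a numeric companion to the panels
price_median <- tapply(diamonds$price, diamonds$cut, median)
print(price_median)
```

Using identical breaks in every panel keeps the bars comparable across groups, which separate hist() calls with default breaks do not guarantee.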
Multiple Scatter plots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(diamonds, cut == "Fair"), plot(carat, price, main = "Fair", col = 'green'))
with(subset(diamonds, cut == "Ideal"), plot(carat, price, main = "Ideal", col = 'green'))
- Here you can see that the price of the diamonds depends on their weight / size.
- Above all, the quality of the diamonds contributes less to the price than their weight does.
- Multiple scatter plots are handy for plotting the relationship between two variables in EDA in R.
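The relationship visible in the scatter plots can be given a rough number with cor(). This is a minimal sketch assuming ggplot2 is installed; it uses the default Pearson correlation, which only measures linear association, and the variable names are illustrative.

```r
library(ggplot2)

# Overall linear correlation between weight and price
cor_all <- cor(diamonds$carat, diamonds$price)

# The same correlation computed separately within each cut level
cor_by_cut <- sapply(split(diamonds, diamonds$cut),
                     function(d) cor(d$carat, d$price))

cat("overall correlation:", round(cor_all, 2), "\n")
print(round(cor_by_cut, 2))
```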
As the saying goes, exploratory graphs are “quick and dirty”. In conclusion, EDA is a major part of any analysis using R. It lets you know your data better before modeling, and you can see many relationships between variables and their contribution to the target variable.
In this process, we saw that low-quality diamonds are priced higher than good ones. However, there is also a strong relationship between weight and price, and low-quality diamonds tend to be heavier, which helps explain the pattern. You may need to account for this relationship, for example by transforming the data, to get it into the right form for modeling.
That’s all for now. Happy R!!!
More R graphs: R graph gallery