Descriptive Statistics Using R language


Hello folks! R is best known for its statistical analysis abilities, so let’s use it to understand data a little better. Data is often called the fuel of the modern world, so we are going to dig into what data is, what forms it takes, and how to measure it statistically. This is called descriptive statistics, and working through it in R is fun!

What is Data?

I know it is a very common question, and people have many answers to it. In simple words, data is a collection of facts that can include words, numbers, and much more.

If I had to define “data”, I would say that data is a distinct piece of information.

Those pieces have to be organized and analyzed before they yield insights that make sense. Today, almost every organization uses data extensively to make key decisions.

Descriptive statistics – Types of Data

Before listing the types of data, I want to tell you a story. By the end of it, you will understand the types of data in descriptive statistics.

I often visit a nearby cafe for a coffee. While I enjoy my coffee, I tend to observe a lot of things, and among them, I usually keep a count of the cars that pass along that route.

I often wonder how many cars I see, on which days, and what brands they are. This brings out two aspects of the data.

  • Quantitative data: Data that takes numerical values, which allows us to perform mathematical computations such as 3 + 4 = 7. Eg: Number of cars (don’t forget the story I told you).
  • Categorical data: Data that is used to label something. Eg: Brands of cars and colors of cars.
  • Quantitative data is measurable. Height, weight, income, age, and similar values fall under this data type, and arithmetic on them is meaningful. Eg: Income = 20k + 35k = 55k.
  • Categorical data is data for which arithmetic makes no sense. Eg: Car brands = Ferrari + Skoda = ???.

Then why does a zip code fall under the categorical data type?

The answer is simple. A zip code is made of digits, but it only labels a location: if you add two zip codes, the result doesn’t mean anything.
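
In R, the difference shows up in how the values are stored. Here is a minimal sketch; the variable names and values are made up for illustration and are not part of the cafe story.

```r
# Quantitative data: a numeric vector, so arithmetic is meaningful
income <- c(20000, 35000)
total_income <- sum(income)        # 55000

# Categorical data: a factor, used for labels rather than arithmetic
brand <- factor(c("Ferrari", "Skoda", "Skoda"))
table(brand)                       # counting the labels is still fine

# Zip codes look numeric but are labels, so keep them as text
zip <- c("10001", "94105")
is.numeric(income)                 # TRUE
is.numeric(zip)                    # FALSE
```

Storing zip codes as character strings (or factors) keeps anyone from accidentally averaging or summing them.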

Types of Quantitative data

We can further divide quantitative data into two types.

  • Continuous data: Data that can be divided into ever smaller units and still be meaningful. Eg: Age. You can measure age in decades, years, months, and even days, and you are still left with smaller units such as hours, minutes, and more.
  • Discrete data: Data that is countable and takes separate, distinct values. Eg: Number of cars.

Types of Categorical data

Like quantitative data, you can divide categorical data into 2 types.

  • Ordinal data: Data whose values have a ranking or order. Eg: Movie reviews rated Poor, Good, or Excellent, where Poor < Good < Excellent.
  • Nominal data: Data whose values have no ranking. Eg: Colours. You cannot say Black < Blue < White.
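
The ordinal and nominal examples above can be sketched in R with factors. This is only an illustration; the variable names review and colour are assumptions, not something defined elsewhere in this article.

```r
# Ordinal: the levels have an order, so comparisons work
review <- factor(c("Good", "Poor", "Excellent"),
                 levels = c("Poor", "Good", "Excellent"),
                 ordered = TRUE)
review[1] > review[2]    # TRUE, because Good > Poor
max(review)              # Excellent

# Nominal: the levels are plain labels with no order
colour <- factor(c("Black", "Blue", "White"))
is.ordered(colour)       # FALSE
```

Comparing two values of the unordered colour factor with < is not meaningful, which matches the idea that nominal data has no ranking.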

Analyzing Quantitative Data in R

I hope that by now you have a better understanding of data and its types. There are four aspects to analyzing quantitative data. Let’s see what they are and how they work.

  • Measure of Center
  • Measure of Spread
  • The shape of the data
  • Outliers in data

1. Measure of Center

The measure of center includes three aspects: mean, median, and mode.

Mean: In simple words, the mean is the average value of the data. You get it by adding all the values and dividing the sum by the total number of values.

Let’s compute “mean” using R programming.

#Create a vector and pass it to mean function 
x <- c(23,3,45,65,45,789,4.6,0.897,45)
mean(x)

Output: 113.3886
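
To connect the definition to the function, here is a small sketch that computes the same mean by hand. The name manual_mean is just an illustrative choice.

```r
x <- c(23, 3, 45, 65, 45, 789, 4.6, 0.897, 45)

# Add all the values and divide by how many there are
manual_mean <- sum(x) / length(x)
manual_mean                        # 113.3886

# The hand-computed value agrees with the built-in function
all.equal(manual_mean, mean(x))    # TRUE
```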

Median: The median is the exact middle number, or the 50th percentile, of the sorted data.

  • Median of odd: If the data includes an odd number of observations, the median is simply the middle value. Eg: If the data has 5 observations, the median is the 3rd value in sorted order.
  • Median of even: If the data includes an even number of observations, the median is the average of the two middle values. Eg: If the data has 10 observations such as 1,2,3,4,5,6,7,8,9,10, the two middle values are 5 and 6, so the median is (5+6)/2 = 5.5.

(Note that to compute the median by hand, you should sort the data first. R’s median function does this for you.)

Median in R:

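#Create a vector and pass it to the median function
x <- c(4,56,34,67,54,23,56,98,7,6,56,7,89,50)
#Optional: print the values in ascending order (x itself is not modified)
sort(x, decreasing = FALSE)
#median() sorts internally, so the unsorted vector works too
median(x)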

Output: 52
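
The odd and even rules can also be written out by hand. The function below is a sketch for illustration only (manual_median is a made-up name); in practice the built-in median function is the one to use.

```r
manual_median <- function(v) {
  s <- sort(v)                  # the data must be in order first
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]              # odd count: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even count: average of the two middle values
  }
}

manual_median(c(9, 2, 7, 4, 5))    # 5
manual_median(1:10)                # 5.5

x <- c(4, 56, 34, 67, 54, 23, 56, 98, 7, 6, 56, 7, 89, 50)
manual_median(x)                   # 52, same as median(x)
```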

Mode: The mode is the most frequently occurring value in the data. A data set can have several modes, and it can also have no mode at all.

Mode in R: Note that R doesn’t have a built-in function to compute the statistical mode (the built-in mode function reports how an object is stored instead). We have to define a function for it, as shown below.

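x <- c(4,56,34,67,54,23,56,98,7,6,56,7,89,50)
#Returns the most frequent value (the first one found if there is a tie)
modefunction <- function(x) {
  uniqv <- unique(x)
  #Count how often each unique value occurs and pick the largest count
  uniqv[which.max(tabulate(match(x, uniqv)))]
}
modefunction(x)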

Output: 56
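
Since data can have several modes or none, a variant that returns every mode may be handy. This is a sketch under one assumption: when every value occurs exactly once, it treats the data as having no mode and returns an empty vector. The name all_modes is hypothetical.

```r
all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each unique value
  if (max(counts) == 1) {
    return(uniqv[0])                    # nothing repeats: report "no mode"
  }
  uniqv[counts == max(counts)]          # keep every value tied for the top count
}

all_modes(c(4, 56, 34, 67, 54, 23, 56, 98, 7, 6, 56, 7, 89, 50))  # 56
all_modes(c(1, 1, 2, 2, 3))                                       # 1 2
all_modes(c(1, 2, 3))                                             # numeric(0)
```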

2. Measure of Spread

The measure of spread is used to understand how far the data points are spread out from each other and from the center. It includes –

  • Range
  • Interquartile Range (IQR)
  • Standard deviation
  • Variance

Range: The range describes the span between the minimum and maximum values in the data. In R, the range function returns these two values.

#Computes the min and max values in the data
x <- c(4,56,34,67,54,23,56,98,7,6,56,7,89)
range(x)

Output: 4 98 -> 4 is the minimum value and 98 is the maximum value.
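
If you want the range as a single number (maximum minus minimum), one possible way is to take the difference of the two values that range returns. The variable name r is just for illustration.

```r
x <- c(4, 56, 34, 67, 54, 23, 56, 98, 7, 6, 56, 7, 89)

r <- range(x)      # 4 98
diff(r)            # 94, the width of the range
max(x) - min(x)    # 94, the same value computed directly
```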

Interquartile Range (IQR): To understand this, you should first know about the 5-number summary. To understand the spread of the data, we usually look at this summary, which includes the min, Q1, Q2, Q3, and max values of the data.

Here, Q1 is the 25th percentile of the data, Q2 is the 50th percentile (the median), and Q3 is the 75th percentile. The interquartile range (IQR) is the difference between Q3 and Q1.

Let’s compute this using R.

x <- c(4,56,34,67,54,23,56,98,7,6,56,7,89,50)
summary(x)

Output:

Min.   First quartile (Q1)   Median   Mean    Third quartile (Q3)   Max.
4.00   11.00                 52.00    43.36   56.00                 98.00

So, based on the IQR definition, it’s the difference between Q3 and Q1. We can use the IQR function in R to compute this.

#Computes the IQR values of the data
#Q3-Q1 i.e. 56-11 = 45
IQR(x)

Output: 45
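
The quartiles themselves can be pulled out with the quantile function, which makes the Q3 minus Q1 step explicit. A small sketch (q and manual_iqr are illustrative names):

```r
x <- c(4, 56, 34, 67, 54, 23, 56, 98, 7, 6, 56, 7, 89, 50)

# 25th and 75th percentiles, i.e. Q1 and Q3
q <- quantile(x, probs = c(0.25, 0.75))
q                                  # 25% is 11, 75% is 56

# IQR by hand: Q3 - Q1
manual_iqr <- unname(q[2] - q[1])
manual_iqr                         # 45
manual_iqr == IQR(x)               # TRUE
```

Note that quantile supports several quartile definitions through its type argument. The default is used here, which is also what summary and IQR use, so the numbers agree.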

Standard deviation: The standard deviation is one of the most common measures of spread. It describes the typical distance of a data point from the mean value and is the square root of the variance.

#computes the standard deviation for the data. 
x <- c(4,56,34,67,54,23,56,98,7,6,56,7,89,50)
sd(x)

Output: 30.82858
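
To see what sd does, here is a hand computation as a sketch. R’s sd uses the sample formula, which divides by n - 1 rather than n. The names squared_dev and manual_sd are made up for illustration.

```r
x <- c(4, 56, 34, 67, 54, 23, 56, 98, 7, 6, 56, 7, 89, 50)
n <- length(x)

# Squared distance of each value from the mean
squared_dev <- (x - mean(x))^2

# Sample standard deviation: divide by n - 1, then take the square root
manual_sd <- sqrt(sum(squared_dev) / (n - 1))
manual_sd                        # 30.82858
all.equal(manual_sd, sd(x))      # TRUE
```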

Variance: The variance measures the spread of the data around its mean, based on the squared deviations from the mean. We can use the var function in R to compute it.

#computes the variance of data around the mean
x <- c(23,34,28,26,30,31)
var(x)

Output: 15.06667

Note: We usually report the standard deviation rather than the variance, because the standard deviation is in the same unit as our data, while the variance is in squared units.
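
The relationship between the two measures is easy to check in R. A minimal sketch using the same vector as above:

```r
x <- c(23, 34, 28, 26, 30, 31)

var(x)                          # 15.06667, in squared units
sd(x)                           # about 3.88, in the units of the data

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))  # TRUE
```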

3. Shape of the data

It is very important to understand the shape of the data in an analysis, and nothing serves this purpose better than a histogram. There are three shapes of data –

  • Right skewed
  • Left skewed
  • Symmetric (Normal distribution)

Right skewed: In right-skewed data, the shorter bars are on the right side, forming a long tail, and the taller bars are on the left side. You can see this in the picture below.

Figure: Right Skewed

Left skewed: In left-skewed data, the shorter bars are on the left side, forming a long tail, and the taller bars are on the right side. You can see this in the picture below.

Figure: Left Skewed

Symmetric (Normal distribution): A distribution is symmetric when you can draw a line down the middle and the left and right sides mirror each other. The picture below shows this.

Figure: Symmetric

Key points to remember:

  • In a symmetric distribution with a single peak, Mean = Median = Mode (approximately, in real data).
  • In a right-skewed distribution, typically Mean > Median.
  • In a left-skewed distribution, typically Mean < Median.
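
These key points can be checked on small hand-made vectors. The data below is invented purely for illustration; each vector is built to have the shape its name suggests.

```r
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 4, 20)        # long tail on the right
left_skewed  <- c(1, 17, 17, 18, 18, 18, 19, 19, 20) # long tail on the left
symmetric    <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)         # mirror image around 3

mean(right_skewed) > median(right_skewed)   # TRUE: the tail pulls the mean up
mean(left_skewed)  < median(left_skewed)    # TRUE: the tail pulls the mean down
mean(symmetric) == median(symmetric)        # TRUE: both are 3
```

Calling hist(right_skewed), hist(left_skewed), or hist(symmetric) draws the corresponding histogram if you want to see the shapes.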

4. Outliers in data

Outliers are data points that lie far away from the rest of the data. They can strongly influence measures such as the mean and standard deviation. This is our last aspect of descriptive statistics.

There are some common and effective practices that will help you work with outliers –

  • Fix typos, if any.
  • Understand why they exist and analyze their impact on your questions.
  • Report them; reporting is the key.
  • When the data has outliers, use the 5-number summary as the measure.
  • When the data is normally distributed, use the mean and standard deviation as measures.

You can use box plots to spot outliers effectively.

Figure: Box Plot

With the help of a box plot, you can easily identify the quartiles, median, min and max values, and outliers, as shown in the picture above.
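
Here is a sketch of how outliers can be flagged numerically. It assumes the common 1.5 × IQR rule, which this article has not introduced but which is what R’s box plots use by default to mark points as outliers. The vector is the one from the mean example; fence_low, fence_high, and outliers are illustrative names.

```r
x <- c(23, 3, 45, 65, 45, 789, 4.6, 0.897, 45)

# Fences at 1.5 * IQR below Q1 and above Q3 (assumed rule, see above)
q <- unname(quantile(x, probs = c(0.25, 0.75)))
fence_low  <- q[1] - 1.5 * IQR(x)
fence_high <- q[2] + 1.5 * IQR(x)

outliers <- x[x < fence_low | x > fence_high]
outliers                  # 789

# boxplot.stats reports the points a box plot would draw as outliers
boxplot.stats(x)$out      # 789

# The outlier drags the mean far above the median
mean(x)                   # 113.3886
median(x)                 # 45
```

Calling boxplot(x) draws the plot itself, with 789 shown as a separate point above the upper whisker.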

Wrapping Up

Descriptive statistics is a key part of any data analysis project, because if you know the data better, you can analyze it better. It helps you understand data types, measures of center and spread, the shape of the data, and outliers. This gives you solid knowledge of the data you are working on. I have used R for all the computations here, and I hope you now have a better intuition for statistics using R. That’s all for now. Happy learning!

More read: Stats and R
