Let’s continue our R programming tutorial series with data frames. If you have ever handled data in databases, you will be familiar with the idea of records. A record is a collection of variables that describe one observation. For example, a student record could contain the student’s roll number, name, age, and gender. Database management systems store collections of such records as a table. R has a similar structure, known as the data frame.
Unlike matrices or vectors, data frames place no restriction on mixing data types: each column can hold a different type, such as numeric values, strings, or factors. The only rule for building a data frame in R is that all the columns must be of the same length. Data frames come with several functions for handling large amounts of data for statistical processing. Let us get started with data frames.
Creating a Data Frame in R Programming
A data frame can be created using the data.frame() function in R. This function can take any number of equal-length vectors as arguments, along with an optional argument, stringsAsFactors, which we will discuss shortly. The following is an example of a simple data frame creation.
# Create a string vector to hold the names.
> names <- c("Adam","Antony","Brian","Carl","Doug")
# Create a numeric vector to hold the respective ages.
> ages <- c(23,22,24,25,26)
# Build a data frame from the two vectors.
> playerdata <- data.frame(names,ages,stringsAsFactors = FALSE)
# Display the data frame.
> playerdata
   names ages
1   Adam   23
2 Antony   22
3  Brian   24
4   Carl   25
5   Doug   26
Notice how the data frame holds each name and its associated age together in one structure. A data frame can have any number of columns like this. R also assigns a row number to each record in the data frame, as shown.
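The argument stringsAsFactors is set to FALSE so that the names are stored as plain character strings. If it were TRUE, which was the default before R 4.0.0, R would convert the names into a factor and treat each one as a level of a categorical variable, as we saw in the earlier factors tutorial.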
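To see the effect of stringsAsFactors directly, here is a minimal sketch that builds the same data twice and compares the type of the names column. The variable names player_names, player_ages, as_chr and as_fct are illustrative and not part of the tutorial's own code.

```r
# Illustrative vectors mirroring the example above.
player_names <- c("Adam", "Antony", "Brian", "Carl", "Doug")
player_ages  <- c(23, 22, 24, 25, 26)

# Strings kept as plain character data.
as_chr <- data.frame(names = player_names, ages = player_ages,
                     stringsAsFactors = FALSE)

# Strings converted to a factor (categorical) column.
as_fct <- data.frame(names = player_names, ages = player_ages,
                     stringsAsFactors = TRUE)

class(as_chr$names)    # "character"
class(as_fct$names)    # "factor"
nlevels(as_fct$names)  # 5, one level per distinct name
```

Passing the argument explicitly, as done here, keeps the behaviour the same regardless of which R version runs the code.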
Accessing Records from Data Frames in R Language
The components of a data frame can be accessed via index numbers or column names. Columns are indexed using the double square bracket notation, [[ ]]. To access a column by name, precede the name with a dollar sign, $.
> playerdata[[2]]
[1] 23 22 24 25 26
> playerdata$names
[1] "Adam"   "Antony" "Brian"  "Carl"   "Doug"
When you wish to access the value at a specific location, such as the 3rd item in the 2nd column, you can use matrix-style indexing with the row number first and the column number second. Let us look at an example.
> playerdata[3,2]
[1] 24
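The different access styles return different kinds of objects, which is easy to miss. The following sketch rebuilds playerdata and compares them; the result variable names (ages_vec, ages_df, by_name, cell) are illustrative only.

```r
playerdata <- data.frame(
  names = c("Adam", "Antony", "Brian", "Carl", "Doug"),
  ages  = c(23, 22, 24, 25, 26),
  stringsAsFactors = FALSE
)

ages_vec <- playerdata[[2]]        # double brackets: a plain numeric vector
ages_df  <- playerdata[2]          # single brackets: a one-column data frame
by_name  <- playerdata[["ages"]]   # double brackets also accept a column name
cell     <- playerdata[3, "ages"]  # row 3 of the ages column

is.numeric(ages_vec)          # TRUE
is.data.frame(ages_df)        # TRUE
identical(ages_vec, by_name)  # TRUE
cell                          # 24
```

Use double brackets or $ when you need the values themselves, and single brackets when you want to keep working with a data frame.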
You can check the size of your data frame using the dim(), nrow() and ncol() functions.
> names <- c("Akash","Amulya","Raju","Charita","Lokesh","Deepa","Ravi")
> sex <- factor(c("M","F","M","F","M","F","M"))
> age <- c(23,24,34,30,45,33,25)
> emp <- data.frame(names,sex,age, stringsAsFactors = FALSE)
> emp
    names sex age
1   Akash   M  23
2  Amulya   F  24
3    Raju   M  34
4 Charita   F  30
5  Lokesh   M  45
6   Deepa   F  33
7    Ravi   M  25
# Check the dimensions of the data frame.
> dim(emp)
[1] 7 3
> nrow(emp)
[1] 7
> ncol(emp)
[1] 3
Extending Data Frames in R
Often, real-world data is dynamic in nature. The structure of the data keeps changing as new variables get added, and its length keeps changing as more observations are made. To accommodate this, R provides ways to add and remove both rows and columns in data frames.
Let us try adding a new record to the emp data frame created above. To do this, we first create the record to be added as a separate data frame. Say we need to add a single record.
> newdata <-data.frame(names="Indu",sex="F",age=29)
Now we add this row to the existing emp data frame as follows.
> emp <- rbind(emp,newdata)
> emp
    names sex age
1   Akash   M  23
2  Amulya   F  24
3    Raju   M  34
4 Charita   F  30
5  Lokesh   M  45
6   Deepa   F  33
7    Ravi   M  25
8    Indu   F  29
Columns can be added to the data frame using the cbind() function instead.
> salary <- c(10000,12000,20000,12000,21000,15000,13000,10000)
> emp <- cbind(emp,salary)
> emp
    names sex age salary
1   Akash   M  23  10000
2  Amulya   F  24  12000
3    Raju   M  34  20000
4 Charita   F  30  12000
5  Lokesh   M  45  21000
6   Deepa   F  33  15000
7    Ravi   M  25  13000
8    Indu   F  29  10000
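Removing rows and columns works along similar lines. The sketch below uses a cut-down, three-row version of emp (an assumption made to keep the example short) and shows one common way to drop a row, drop a column, and add a column without cbind(). The names emp_fewer_rows, emp_fewer_cols and bonus are hypothetical.

```r
# A reduced version of the emp data frame, for illustration only.
emp <- data.frame(
  names  = c("Akash", "Amulya", "Raju"),
  age    = c(23, 24, 34),
  salary = c(10000, 12000, 20000),
  stringsAsFactors = FALSE
)

# Drop the second row with a negative row index.
emp_fewer_rows <- emp[-2, ]

# Drop a column by assigning NULL to it.
emp_fewer_cols <- emp
emp_fewer_cols$salary <- NULL

# Add a column directly with $, as an alternative to cbind().
emp$bonus <- emp$salary * 0.1

nrow(emp_fewer_rows)    # 2
names(emp_fewer_cols)   # "names" "age"
ncol(emp)               # 4
```

Note that none of these operations modify a data frame in place unless you assign the result back, just as with rbind() and cbind() above.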
You might be familiar with SQL (Structured Query Language), which is used to query database tables using logical conditions. R offers similar capabilities to query data frames and extract logical subsets of larger data frames.
Suppose that I wish to extract all the records from the emp data frame that belong to male employees. I can do that using the following line of code.
> emp[emp$sex=="M",]
   names sex age salary
1  Akash   M  23  10000
3   Raju   M  34  20000
5 Lokesh   M  45  21000
7   Ravi   M  25  13000
The expression emp$sex=="M" gives a logical vector indicating, for each row, whether the value of sex is M. We use this logical vector to index the emp data frame. The comma that follows is necessary to specify matrix-like indexing: the part before the comma selects rows and the part after it selects columns. Since we left the column part blank, all the columns are selected. We could also choose to display only the names and sex instead.
> emp[emp$sex=="M",1:2]
   names sex
1  Akash   M
3   Raju   M
5 Lokesh   M
7   Ravi   M
Instead of supplying the index numbers of the columns, you can also give the column names, as shown below.
> emp[emp$sex=='F',c("names","sex")]
    names sex
2  Amulya   F
4 Charita   F
6   Deepa   F
8    Indu   F
You can also combine several logical conditions when indexing a data frame. Suppose you wish to extract the records of all the female employees with a salary greater than 12000. You can do that as follows.
> emp[emp$sex=='F' & emp$salary>12000,]
  names sex age salary
6 Deepa   F  33  15000
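The same kind of query can also be written with the base R subset() function, which lets you refer to columns by name without repeating emp$. The sketch below rebuilds the full emp data frame so that it runs on its own, and also shows an "or" condition using |. The result names high_paid_women and young_or_top are illustrative.

```r
emp <- data.frame(
  names  = c("Akash", "Amulya", "Raju", "Charita",
             "Lokesh", "Deepa", "Ravi", "Indu"),
  sex    = factor(c("M", "F", "M", "F", "M", "F", "M", "F")),
  age    = c(23, 24, 34, 30, 45, 33, 25, 29),
  salary = c(10000, 12000, 20000, 12000, 21000, 15000, 13000, 10000),
  stringsAsFactors = FALSE
)

# Equivalent to emp[emp$sex == "F" & emp$salary > 12000, ]
high_paid_women <- subset(emp, sex == "F" & salary > 12000)

# | means "or": anyone younger than 25 or earning at least 20000.
young_or_top <- emp[emp$age < 25 | emp$salary >= 20000,
                    c("names", "age", "salary")]

high_paid_women$names   # "Deepa"
young_or_top$names      # "Akash" "Amulya" "Raju" "Lokesh"
```

subset() is convenient for interactive work; the bracket form shown earlier makes it more explicit which data frame each column comes from.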
Useful Data Frame Functions
Apart from the utilities listed above, there are some handy functions you may need while handling data frames. This section lists a few of them.
Sorting a Data Frame
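Sorting can be done by using the order() function inside the row index. order() returns the row positions in sorted order, and indexing with them rearranges the rows.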
# Sort by decreasing order of salaries.
> emp[order(emp$salary,decreasing = TRUE),]
    names sex age salary
5  Lokesh   M  45  21000
3    Raju   M  34  20000
6   Deepa   F  33  15000
7    Ravi   M  25  13000
2  Amulya   F  24  12000
4 Charita   F  30  12000
1   Akash   M  23  10000
8    Indu   F  29  10000
# Sort by ascending alphabetical order of names.
> emp[order(emp$names,decreasing = FALSE),]
    names sex age salary
1   Akash   M  23  10000
2  Amulya   F  24  12000
4 Charita   F  30  12000
6   Deepa   F  33  15000
8    Indu   F  29  10000
5  Lokesh   M  45  21000
3    Raju   M  34  20000
7    Ravi   M  25  13000
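order() also accepts more than one key, which is useful when the first key has ties. Here is a minimal sketch on a four-row version of emp (a reduced data set assumed for brevity) that sorts by sex and then by salary in decreasing order within each sex. The name sorted is illustrative.

```r
# A reduced version of the emp data frame, for illustration only.
emp <- data.frame(
  names  = c("Akash", "Amulya", "Raju", "Charita"),
  sex    = c("M", "F", "M", "F"),
  salary = c(10000, 12000, 20000, 12000),
  stringsAsFactors = FALSE
)

# First key: sex, ascending. Second key: salary, descending.
# Negating a numeric column reverses its sort order.
sorted <- emp[order(emp$sex, -emp$salary), ]

sorted$names   # "Amulya" "Charita" "Raju" "Akash"
```

Rows that are tied on every key, such as Amulya and Charita here, keep their original relative order.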
The head() and tail() functions are used to get the first few or last few rows of a data frame, respectively. They are especially useful when you have a huge dataset, because they let you examine the characteristics of the data without printing the entire dataset to the console.
# Get the top 2 rows of the dataset.
> head(emp,2)
   names sex age salary
1  Akash   M  23  10000
2 Amulya   F  24  12000
# Get the last 3 rows of the dataset.
> tail(emp,3)
  names sex age salary
6 Deepa   F  33  15000
7  Ravi   M  25  13000
8  Indu   F  29  10000
When no number is specified, head() and tail() return 6 rows by default.
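The default can be checked on any data frame with more than 6 rows. The sketch below uses a made-up ten-row data frame called df, and also shows that a negative count is accepted, in which case head() drops rows from the end and tail() drops rows from the start.

```r
# A made-up ten-row data frame, for illustration only.
df <- data.frame(id = 1:10, square = (1:10)^2)

nrow(head(df))     # 6, the default
nrow(tail(df))     # 6, the default

head(df, -8)$id    # 1 2   (everything except the last 8 rows)
tail(df, -8)$id    # 9 10  (everything except the first 8 rows)
```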
Merging Two Data Frames
Merging data frames is similar to performing database joins on tables. When a separate data frame holds additional information keyed on a column that both data frames share, we can merge the two using that common column. For example, suppose we have the marital status of some of the employees available as below.
> mar <- data.frame(names=c("Akash","Amulya","Raju","Lokesh","Ravi","Indu"),
+                   marital=c("single","married","single","single","single","married"))
We can now merge this with our emp data frame to get the combined information.
> merge(emp,mar,by="names",all=TRUE)
    names sex age salary marital
1   Akash   M  23  10000  single
2  Amulya   F  24  12000 married
3 Charita   F  30  12000    <NA>
4   Deepa   F  33  15000    <NA>
5    Indu   F  29  10000 married
6  Lokesh   M  45  21000  single
7    Raju   M  34  20000  single
8    Ravi   M  25  13000  single
Where we have no marital information for an employee, the merged data frame is filled in with NA, because all=TRUE keeps every row from both data frames. The result is also sorted by the common column.
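The all argument controls which kind of join merge() performs. The sketch below contrasts three variants on two small data frames. The data is made up for illustration, including the name "Zoya", which appears only in mar, and the result names inner, left and full are hypothetical.

```r
# Small made-up data frames, for illustration only.
emp <- data.frame(
  names = c("Akash", "Amulya", "Charita", "Deepa"),
  age   = c(23, 24, 30, 33),
  stringsAsFactors = FALSE
)
mar <- data.frame(
  names   = c("Akash", "Amulya", "Zoya"),
  marital = c("single", "married", "single"),
  stringsAsFactors = FALSE
)

# Default: keep only the names present in both (like an inner join).
inner <- merge(emp, mar, by = "names")

# all.x = TRUE: keep every row of emp (like a left join).
left <- merge(emp, mar, by = "names", all.x = TRUE)

# all = TRUE: keep every row of either data frame (like a full outer join).
full <- merge(emp, mar, by = "names", all = TRUE)

nrow(inner)   # 2
nrow(left)    # 4
nrow(full)    # 5
```

In the left and full variants, the cells with no matching information are filled with NA, just as in the example above.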