R for beginners | part 2

Youtube playlist for this tutorial series https://www.youtube.com/playlist?list=PLn-5J2322iDh2Rf6BUR836bAkYTXHB65O

Learning objectives

  • Learn how to preprocess data in R for analysis and plotting

Content

  • Data types in R: vector, matrix, dataframe, factor, character
  • Read data from multiple folders
  • Replace/remove missing values
  • Manipulate subset of dataframe
  • Calculate statistics by group
  • Row/column-wise statistics
  • Order dataframe values
  • Reshape dataframe
  • Randomly split data
  • Standardize and normalize data
  • For loop and if/else condition
  • Join data with common column

Data/code for this tutorial

Download data for this tutorial from https://raw.githubusercontent.com/szalam/R-tutorial/master/R_tutorial_part2_data.zip. You may also download thd data to your computer using R code

#web link
url.data = 'https://raw.githubusercontent.com/szalam/R-tutorial/master/R_tutorial_part2_data.zip'

#destinatin in computer where the data will be downloaded
setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/')

#download data
download.file(url = url.data, destfile = 'R_tutorial_part2_data.zip', method="auto")

#unzip data
unzip(zipfile='R_tutorial_part2_data.zip')

Data types in R

Vector: Sequence of data. Can be told as 1-dimensional data.

#Vector
v1 = c(1,2,3,4)
v2 = c('A', 'B', 'C', 'D')

print(v1)
## [1] 1 2 3 4
print(v2)
## [1] "A" "B" "C" "D"

Matrix: 2-dimensional data with row and columns. Data type (or class) of all the matrix element should be same (numeric, character)

#Matrix. All values are numeric
m = matrix(1:10, nrow = 5, ncol = 2)
m
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

Dataframe: 2-dimensional data with row and columns. Data type of different columns can be different. One column can be numeric and another can be character.

#Dataframe. Column 1 and 2 numeric, and column 2 character
c1 = c(1:5)
c2 = c(11:15)
c3 = c('A', 'B', 'C', 'D', 'E')
df = data.frame(c1, c2, c3)

List: Similar to 3-dimensional data. For example, a list with two objects (or lists) can have dataframe in the first list and vector in the second list. It can store different types of objects in different list.

#Lists
l = list()
l[[1]] = df
l[[2]] = v1
str(l)
## List of 2
##  $ :'data.frame':    5 obs. of  3 variables:
##   ..$ c1: int [1:5] 1 2 3 4 5
##   ..$ c2: int [1:5] 11 12 13 14 15
##   ..$ c3: chr [1:5] "A" "B" "C" "D" ...
##  $ : num [1:4] 1 2 3 4

Factor: Used to categorize data. Can be integer or character. Factor is very usefull during categorical plots.

setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data1/')
df = read.csv('rainfall_data.csv')
head(df)
##        Date Month Year      loc_A       loc_B       loc_C
## 1 1/15/1981     1 1981 188.682179 83.57845383 164.6300596
## 2 2/15/1981     2 1981  83.090653 42.52718642  46.4600436
## 3 3/15/1981     3 1981 136.694246 99.88195830 130.9438164
## 4 4/15/1981     4 1981  40.223348          NA  37.0420120
## 5 5/15/1981     5 1981  46.528754 13.56783115  18.1241447
## 6 6/15/1981     6 1981   4.081162  0.01092273   0.1125636
#lets plot year vs rainfal in loc_A in the original dataframe 'df'
df.mod = df[!is.na(df$loc_A),]
#plot(df.mod$Year,df.mod$loc_A)

#convert the Year column of df into factor
df.mod2 = df.mod
df.mod2$Year = as.factor(df.mod2$Year)
# plot(df.mod2$Year,df.mod2$loc_A)

Read data from multiple folders

A commonly used technique to read csv/txt files from different folders is shown below. Here, we read one .csv file from folder “data1”" and 1 .txt file from folder “data2”.

fold1 = 'C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data1/'
fold2 = 'C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data2/'

#read csv from folder data1
df1 = read.csv(paste0(fold1,'rainfall_data.csv'))
df2 = read.table(paste0(fold2,'d1.txt'),fill =T)

Remove/replace missing values

Here we read a csv file with missing data (which is generally read as “NA” in R). Then remove/replace NA values.

#set working directory
setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data1')

#read rainfal data
df = read.csv('rainfall_data.csv')
summary(df)
##      Date               Month            Year          loc_A         
##  Length:144         Min.   : 1.00   Min.   :1981   Min.   :  0.0393  
##  Class :character   1st Qu.: 3.75   1st Qu.:1984   1st Qu.: 11.8359  
##  Mode  :character   Median : 6.50   Median :1986   Median : 42.6371  
##                     Mean   : 6.50   Mean   :1986   Mean   : 73.8314  
##                     3rd Qu.: 9.25   3rd Qu.:1989   3rd Qu.:112.1323  
##                     Max.   :12.00   Max.   :1992   Max.   :416.9089  
##                                                    NA's   :3         
##      loc_B               loc_C         
##  Min.   :  0.01092   Min.   :  0.0719  
##  1st Qu.:  2.93055   1st Qu.:  5.6136  
##  Median : 17.68898   Median : 27.4605  
##  Mean   : 34.57416   Mean   : 55.3813  
##  3rd Qu.: 50.15615   3rd Qu.: 80.5734  
##  Max.   :245.07228   Max.   :342.1296  
##  NA's   :3           NA's   :2
#calculate mean
mean(df$loc_A)
## [1] NA
mean(df$loc_A, na.rm = T)
## [1] 73.83143
#find ids of NAs for loc_A, loc_B, loc_C
id.na.a = which(is.na(df$loc_A)==T)
id.na.a = which(is.na(df$loc_A)==T)
id.na.a = which(is.na(df$loc_A)==T)

#replace NA with -999
df[id.na.a,4] = -999

#remove rows with NA values in loc_A
df2 = df[-id.na.a,]

#another way to remove rows with NA in a single line
df3 = df[!is.na(df$loc_A),]

Similar to rows, specific column can also be removed from the dataframe

#removing column 2 from df
df.rm = df[,-2]

#removinb column 2 and 3
df.rm.2 = df[,c(-2,-3)]

Replace character/numeric

Replacing character/numeric value is a neccesary is skill. For example, a data may have spelling mistake (“Californiaa”" instread of “California”). Here, we will learn how to replace character/numreic values.

#set working directory
setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data1')

#read data and check unique Regions
df.tmp = read.csv('replace_char_num.csv')
unique(df.tmp$Region)
## [1] "California"  "Ohio"        "Nevada"      "Californiaa"
#looks like there is a spelling mistake. "Californiaa" should be "California". Lets correct it
df.tmp[df.tmp$Region == 'Californiaa',] = 'California'
unique(df.tmp$Region)
## [1] "California" "Ohio"       "Nevada"

Using ifelse function can be an efficient way to do the replacing job.

#here we will replace Region Ohio with Washington. The code replace where region is Ohio, otherwise keep as it is.
df.tmp$Region = ifelse(df.tmp$Region == "Ohio", "Washington", df.tmp$Region)   
unique(df.tmp$Region)
## [1] "California" "Washington" "Nevada"

The method I showed above are also applicable for numeric values or signs. For example, you might want to replace numeric values, or signs like “#” or “?” or even blank spaces " “. The above codes are extremely useful for data cleaning and preparation for analysis.

Manipulate subset of dataframe

I will show three ways to select part of a dataframe that follow user-specified condition. Let us separate data with months 1,3 and 7

#method 1: identify the indices, then separate
id.tmp = which(df$Month == 1 | df$Month == 3 | df$Month == 7)
df.new1 = df[id.tmp,]

#method 2: using %in% 
mon.tmp = c(1,3,7)
df.new2 = df[df$Month %in% mon.tmp,]

#method 3: using filter funtion of the library dplyr. Call dplyr library first
library(dplyr) # if you don't have the library than install using install.packages('dplyr')
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df.new3 = filter(df, Month == c(1,3,7))
#to select subset within a range (2 to 8)
df.new4 = filter(df, Month <=8 & Month >=2)

Let us multiply 1.5 with all the rainfall in month 2 and 4 that occurred in loc_A

#identify the indices of rows with month 2 and 4
id.tmp = which(df$Month == 2 | df$Month == 4)

#multiply 1.5 with the loc_A column having row id as id.tmp
df[id.tmp,4] = df[id.tmp,4] * 1.5

Calculate statistics by group

Let us calculate mean rainfall in all the January months (month = 1).

#using aggregate function
df.gr1 = aggregate(df[,c(4,5,6)],by = list(df[,2]),FUN = mean,na.rm = T)
head(df.gr1)
##   Group.1     loc_A     loc_B      loc_C
## 1       1  26.42661 59.006145  85.862767
## 2       2 202.93061 65.025331 104.953206
## 3       3 168.03628 91.019927 136.666929
## 4       4 -49.59144 28.554692  45.024609
## 5       5  36.71198 10.733013  19.447401
## 6       6  17.97584  4.148091   7.806769

Row/column-wise statistics

Calculate mean value of all rows/columns

#calculate mean along selected columns (here, column 4, 5 and 6)
df.gr2.mean = apply(df[,c(4,5,6)], 2, mean,na.rm = T) # here 2 means column-wise

#also calculate column-wise standard deviation
df.gr2.sd = apply(df[,c(4,5,6)], 2, sd,na.rm=T)

#calculate row-wise mean in selected columns (here, column 4, 5 and 6)
df.gr3 = apply(df[,c(4,5,6)], 1, mean,na.rm = T) # here 1 indicate row
# we can also do this using colMeans() and rowMeans() functions
df.gr2_2 = colMeans(df[,c(4,5,6)], na.rm =T)
df.gr2_2
##    loc_A    loc_B    loc_C 
## 55.74019 34.57416 55.38134
df.gr3_2 = rowMeans(df[,c(4,5,6)], na.rm =T)

Order dataframe values

Let us sort the datframe in ascending/descending order

#sort in ascending order
df.tmp = df[order(df$loc_A),]
head(df.tmp)
##           Date Month Year         loc_A      loc_B        loc_C
## 88   4/15/1988     4 1988 -1498.5000000 60.2571635  77.04603099
## 11  11/15/1981    11 1981  -999.0000000 50.1490205 162.31109130
## 49   1/15/1985     1 1985  -999.0000000 28.2914828  26.01976957
## 8    8/15/1981     8 1981     0.0393286  0.5085948   0.16317945
## 103  7/15/1989     7 1989     0.1020899  0.2536141   0.07192623
## 68   8/15/1986     8 1986     0.1706050  0.8371909   0.50209573
#sort in descending order
df.tmp2 = df[order(-df$loc_A),]

#plot
plot(df$loc_A, ylab = 'Rainfall', xlab = 'Months')
plot(df.tmp$loc_A, ylab = 'Rainfall', xlab = 'Months')
plot(df.tmp2$loc_A, ylab = 'Rainfall', xlab = 'Months')

Reshape dataframe

Dataframe can be reshaped in different ways. I am showing few of them below,

#transpose
tr = t(df[,c(4,5,6)]) 
#after transpose dataframe became matrix. to convert to dataframe 
df.t = as.data.frame(tr)

Melt dataframe to obtain unique ids. This is specially usefull while using ggplot

#melt dataframe. This creates unique ids. we need 'reshape2' package for melt function
library(reshape2)
df.m = melt(df, id = c('Date','Year','Month'))
#all rainfall is moved to a single column 'value' and location names to column 'variable'
head(df.m)
##        Date Year Month variable      value
## 1 1/15/1981 1981     1    loc_A 188.682179
## 2 2/15/1981 1981     2    loc_A 124.635979
## 3 3/15/1981 1981     3    loc_A 136.694246
## 4 4/15/1981 1981     4    loc_A  60.335022
## 5 5/15/1981 1981     5    loc_A  46.528754
## 6 6/15/1981 1981     6    loc_A   4.081162

Randomly splitting data

A lot of application requires data to be randomly splitted. One of which is to train a model and another for validation. Here, we split the dataframe into two groups where the first group has 60% data and the second has 40%.

id = sample(2, nrow(df), replace = TRUE, prob = c(0.6, 0.4))

#separate rows where id == 1 and id == 2 in two variables
grp1 = df[id==1,]
grp2 = df[id==2,]

dim(df)
## [1] 144   6
dim(grp1)
## [1] 98  6
dim(grp2)
## [1] 46  6

Standardize and normalize data

Standardization: Removing the mean and variability from a data. Mean becomes 0 and standard deviation 1. Normalization: There are different ways to normalize a data. Common way is to subtract mean and divide by the range (maximum - minimum).

#standardization of rainfall in location A
scale.df= scale(df$loc_A)

#normalize rainfall in location A
nor.locA = (df$loc_A - min(df$loc_A, na.rm = T))/(max(df$loc_A, na.rm = T) - min(df$loc_A, na.rm = T))

For loop and if/else

We will read three text files in loop, apply id/else. To do this download ‘locA.txt’, ‘locB.txt’ and ’locC.txt from my github link.

#change working directory to the data location
setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data2/')

#list down files with .txt extension
files = list.files(pattern='.txt')

#total files listed
length(files)
## [1] 3

Read files in loop and apply if/else condition. I am creating two examples for this

Example 1:

Read texts in loop and print texts if missing value is more than or equal 9 in a column

#change working directory to the data location
setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data2/')

#for loop
for(i in 1:length(files)){ # iterate 3 times
  
  #read txt file. fill = T means the missing value is filled
  df.tmp = read.table(files[i],fill = T,header = T)
  
  if(length(which(is.na(df.tmp[,4])==T))<=9){
    print(paste0('i=',i,' Note:less than or equal 9 missing values'))
  }else{
      print(paste0('i=',i,' Note:more than 9 missing values'))
    }
  }
## [1] "i=1 Note:less than or equal 9 missing values"
## [1] "i=2 Note:more than 9 missing values"
## [1] "i=3 Note:less than or equal 9 missing values"

Example 2:

Same as the previous example, but skipping a loop when missing value greater than 9

#change working directory to the data location
setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data2/')

#for loop
for(i in 1:length(files)){ # iterate 3 times
  
  #read txt file. fill = T means the missing value is filled
  df.tmp = read.table(files[[i]],fill = T,header = T)
  if(length(which(is.na(df.tmp[,4])==T))<=9){
    print(paste0('i=',i,' Note:less than or equal 9 missing values'))
  }else{
      next()
    }
  }
## [1] "i=1 Note:less than or equal 9 missing values"
## [1] "i=3 Note:less than or equal 9 missing values"

Join data with common column

Let us read two sets of data that have date as a common axis. We join two dataframes using full_join command of dplyr library. Here, I show joining based on date, but this can be used for joining based on different types of values (like numeric and factor)

setwd('C:/sarfaraz/Project_R_tutorials/R-tutorial/R_beginner_part2_files/data2/')

#read two datasets
d1 = read.table('d1.txt',header = T, fill =T)
d2 = read.table('d2.txt',header = T, fill =T)

#call dplyr
library(dplyr)
df.new.join = full_join(d1, d2, by = "Date")
head(df.new.join)
##        Date Month.x Year.x      loc_A Month.y Year.y       loc_B
## 1 1/15/1981       1   1981 188.682179       1   1981 83.57845383
## 2 2/15/1981       2   1981  83.090653       2   1981 42.52718642
## 3 3/15/1981       3   1981 136.694246       3   1981 99.88195830
## 4 4/15/1981       4   1981  40.223348       4   1981          NA
## 5 5/15/1981       5   1981  46.528754       5   1981 13.56783115
## 6 6/15/1981       6   1981   4.081162       6   1981  0.01092273