The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.
RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
To make it easier to interact with R, we will use RStudio. RStudio is the most popular IDE (Integrated Development Environment) for R. An IDE is a piece of software that provides tools to make programming easier.
Reasons to use R:
Start by giving our document a title using comments, which are denoted by a #
#Introduction to R and RStudio Class December 9, 2019
#Instructor: Tess Grynoch
#Notes by Name
#Comments allow you to add notes as you write your code
Notice how after we start making changes the file name in the tab turns red and has an * beside it. It’s a reminder that we have not saved the new changes to the document yet. As we go along, remember to hit the save icon or control + s (On PC) or command + s (On Mac).
R and RStudio have a number of built-in functions and arithmetic operators that make them an excellent calculator.
| Operator | Operation |
|---|---|
| + | Add |
| - | Subtract |
| * | Multiply |
| / | Divide |
| ^ | Exponents |
R also follows the standard order of operations. Use parentheses () to control which operations are evaluated first.
To run a line or block of code, move your cursor to anywhere on the line or within the block and press control + enter (On PC) or command + return (On Mac).
2+2
## [1] 4
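As a small illustration of the order of operations (the numbers are arbitrary), compare the same calculation with and without parentheses:

```r
# Multiplication is evaluated before addition
2 + 3 * 4
## [1] 14

# Parentheses force the addition to happen first
(2 + 3) * 4
## [1] 20

# Exponents are evaluated before multiplication
2 * 3^2
## [1] 18
```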
Comparison operations return a logical (Boolean) value: TRUE or FALSE.
#greater than >
4>2
## [1] TRUE
#less than <
2<4
## [1] TRUE
#equals ==
2==4
## [1] FALSE
#does not equal !=
2!=4
## [1] TRUE
#less than or equal to <=
2<=2
## [1] TRUE
#greater than or equal to >=
2>=2
## [1] TRUE
You can store values in variables using <- to point to the name you want to use. This allows you to use the variable name in place of the value. Values that can be stored in variables include numbers, vectors, tables, functions, and plots.
4+2
## [1] 6
y <- 4
y+2
## [1] 6
R is also case sensitive so make sure to spell correctly.
y+2
Y+2 #You will get an error saying object 'Y' not found (Note you can also add a comment to the end of a line of code)
You can also overwrite the value stored in a variable
y <- 6
y+2
## [1] 8
The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
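Because the right hand side is evaluated first, the same variable can appear on both sides of an assignment. A minimal sketch (the name count is only for illustration):

```r
count <- 4

# The right hand side (count + 1) is evaluated with the old value, 4,
# and only then is the result stored back in count
count <- count + 1
count
## [1] 5
```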
Variable names can contain letters, numbers, underscores and periods. They cannot start with a number or contain spaces. Different people use different conventions for long variable names; these include
R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes a set of values of the same data type in a certain order. The 6 basic data types are:
#two different ways to create vectors
1:5
## [1] 1 2 3 4 5
c(1,2,3,4,5)
## [1] 1 2 3 4 5
x <- 1:5 #we'll assign this vector as the variable x
You may not have realized it, but we just used our first function. c denotes concatenate in R. To find out more about any function or library, put a question mark in front of its name and run the line to open the help documentation, which includes the arguments the function accepts and examples of its use.
?c
You can also apply math operations and functions to vectors
2*x
## [1] 2 4 6 8 10
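A few more examples of vectorized operations, using the same x as above. This is only a sketch of what is possible; most arithmetic and comparison operators behave this way:

```r
x <- 1:5

# Arithmetic between two vectors of the same length works element by element
x + x
## [1]  2  4  6  8 10

# Comparisons are vectorized too and return one TRUE/FALSE per element
x > 2
## [1] FALSE FALSE  TRUE  TRUE  TRUE

# Functions such as sum() collapse the vector to a single value
sum(x)
## [1] 15
```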
Be careful: if you add a character value to a numeric vector, every element of the vector is converted to character.
y <- c(1,2,3,4,"5")
#Use the class function to find out the type of vector you have
class(y)
## [1] "character"
Never fear! You can convert your vector to a different data type
y <- as.numeric(y) #overwriting a variable
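The conversion only works for values that look like numbers. As a hedged illustration with made-up values, anything that cannot be read as a number becomes NA (R normally prints a warning, which is silenced here so the example runs quietly):

```r
# "5" can be read as a number, so the conversion succeeds
as.numeric(c("1", "2", "5"))
## [1] 1 2 5

# "five" cannot, so R returns NA for that element
# (suppressWarnings() hides the warning R would otherwise print)
converted <- suppressWarnings(as.numeric(c("1", "2", "five")))
converted
## [1]  1  2 NA

# is.na() shows which elements failed to convert
is.na(converted)
## [1] FALSE FALSE  TRUE
```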
RStudio prints the output of functions in the console and displays variables in the top right pane. If you are using R through the command line, you can use print statements to view a variable. Print statements are also useful inside complicated functions to explicitly state the result.
print(x)
## [1] 1 2 3 4 5
z <- 8/4
print(c('Solution A is',z,'times as strong as solution B'))
## [1] "Solution A is" "2"
## [3] "times as strong as solution B"
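The output above is a character vector with three separate elements. If a single sentence is wanted instead, one option is the base R paste function, sketched here with the same z:

```r
z <- 8/4

# paste() joins its arguments into one string, separated by spaces by default
sentence <- paste('Solution A is', z, 'times as strong as solution B')
print(sentence)
## [1] "Solution A is 2 times as strong as solution B"
```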
To expand on what is already available in R, we need to install packages. Today we’ll be using tidyverse, which is a collection of R packages for data science. For more information on the packages, visit the tidyverse website.
install.packages("tidyverse")
Once we install the package, we need to load it with the library function so that R knows we will be using it in this session
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
To use data in R and RStudio we must first read it into the environment and assign it a variable name.
sample <- read.csv("data/sample.csv")
I’ve used the relative path to the file above. The full (absolute) path would look something like “C://Users/grynochc/Desktop/Intro-R-class-20191209/Data/sample.csv”. You can read other file types into R using similar functions, such as:
#Excel files
library(readxl)
read_xlsx()
#SAS, SPSS, STATA files
library(haven)
read_sas()
read_spss()
read_stata()
Once you bring your data into RStudio, one of the first things you should do is get oriented to it, even if you collected the data yourself or are already familiar with it in its original format. I recommend running through some basic functions and exploratory statistics to check how the data was imported and to make sure there are no surprises. You can also view the full table by clicking on the table icon to the right of the data, but this is not always useful with large datasets and I also don’t like moving between tabs.
head(sample)
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2
## 1 Sub001 m Control 132 16.0 114 140
## 2 Sub002 m Treatment2 139 17.2 148 209
## 3 Sub003 m Treatment2 130 19.5 196 251
## 4 Sub004 f Treatment1 105 15.7 199 140
## 5 Sub005 m Treatment1 125 19.9 188 120
## 6 Sub006 M Treatment2 112 14.3 260 266
## Aneurisms_q3 Aneurisms_q4
## 1 202 237
## 2 248 248
## 3 122 177
## 4 233 220
## 5 222 228
## 6 320 294
tail(sample)
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2
## 95 Sub095 m Control 108 13.6 111 118
## 96 Sub096 m Control 102 14.6 148 132
## 97 Sub097 F Treatment2 90 19.6 141 196
## 98 Sub098 m Treatment1 133 17.0 193 112
## 99 Sub099 M Treatment2 83 16.2 130 226
## 100 Sub100 M Treatment1 122 18.4 126 157
## Aneurisms_q3 Aneurisms_q4
## 95 173 191
## 96 200 194
## 97 322 273
## 98 123 181
## 99 286 281
## 100 129 160
str(sample) #gives class and head of each column/variable
## 'data.frame': 100 obs. of 9 variables:
## $ ID : Factor w/ 100 levels "Sub001","Sub002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 4 levels "f","F","m","M": 3 3 3 1 3 4 1 3 3 1 ...
## $ Group : Factor w/ 3 levels "Control","Treatment1",..: 1 3 3 2 2 3 1 3 3 1 ...
## $ BloodPressure: int 132 139 130 105 125 112 173 108 131 129 ...
## $ Age : num 16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
## $ Aneurisms_q1 : int 114 148 196 199 188 260 135 216 117 188 ...
## $ Aneurisms_q2 : int 140 209 251 140 120 266 98 238 215 144 ...
## $ Aneurisms_q3 : int 202 248 122 233 222 320 154 279 181 192 ...
## $ Aneurisms_q4 : int 237 248 177 220 228 294 245 251 272 185 ...
names(sample) #column names
## [1] "ID" "Gender" "Group" "BloodPressure"
## [5] "Age" "Aneurisms_q1" "Aneurisms_q2" "Aneurisms_q3"
## [9] "Aneurisms_q4"
dim(sample) #dimensions
## [1] 100 9
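The str() output above lists the text columns (ID, Gender, Group) as factors. That was the read.csv default in R versions before 4.0.0; from R 4.0.0 on, text columns are read as character unless you ask for factors. The sketch below uses a tiny temporary file with made-up rows (not the class data) to show how to request either behaviour explicitly:

```r
# Write a tiny example file to a temporary location (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Gender", "Sub001,m", "Sub002,F", "Sub003,f"), tmp)

# Ask explicitly for factors so the result is the same in any R version
as_factors <- read.csv(tmp, stringsAsFactors = TRUE)
class(as_factors$Gender)
## [1] "factor"

# Ask explicitly for plain text instead
as_text <- read.csv(tmp, stringsAsFactors = FALSE)
class(as_text$Gender)
## [1] "character"
```

If your own str() output shows chr instead of Factor, the summary() counts per level shown later will look different unless the column is converted with factor().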
We can also examine a particular column by using $ColumnName to call that column.
sample$Age
## [1] 16.0 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 14.8 15.3 16.5 12.6
## [15] 14.3 15.9 18.4 18.3 15.4 14.3 12.7 15.4 17.2 17.3 16.7 19.6 15.0 16.1
## [29] 17.6 18.6 18.3 16.7 12.5 14.3 19.7 17.6 17.0 12.2 15.1 17.7 19.0 14.7
## [43] 15.2 15.3 12.9 18.4 18.1 15.6 19.5 13.5 13.5 13.7 18.7 12.2 16.9 19.5
## [57] 12.1 17.0 19.2 14.7 20.0 14.1 14.7 16.6 15.0 15.0 13.8 14.8 19.1 18.9
## [71] 17.7 17.4 15.5 13.1 12.2 17.0 17.7 19.5 19.5 12.8 17.6 17.7 14.2 19.2
## [85] 16.0 15.2 17.6 17.6 15.1 17.8 16.2 16.6 19.1 17.2 13.6 14.6 19.6 17.0
## [99] 16.2 18.4
R has a number of built-in statistical functions, and these can be further extended with specific libraries.
mean(sample$Age)
## [1] 16.42
min(sample$Age)
## [1] 12.1
max(sample$Age)
## [1] 20
median(sample$Age)
## [1] 16.65
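The Age column has no missing values, so the functions above work directly. When a column does contain NA, these functions return NA unless told to ignore missing values. A short sketch with a hypothetical vector:

```r
# A hypothetical vector of ages with one missing value
ages <- c(16.0, 17.2, NA, 15.7)

# By default a single NA makes the result NA
mean(ages)
## [1] NA

# na.rm = TRUE drops missing values before calculating
mean(ages, na.rm = TRUE)
## [1] 16.3

# sd() gives the standard deviation and accepts the same argument
sd(ages, na.rm = TRUE)
```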
Instead of running these basic statistics for each column, we can run a summary:
summary(sample)
## ID Gender Group BloodPressure Age
## Sub001 : 1 f:35 Control :30 Min. : 62.0 Min. :12.10
## Sub002 : 1 F: 4 Treatment1:35 1st Qu.:107.5 1st Qu.:14.78
## Sub003 : 1 m:46 Treatment2:35 Median :117.5 Median :16.65
## Sub004 : 1 M:15 Mean :118.6 Mean :16.42
## Sub005 : 1 3rd Qu.:133.0 3rd Qu.:18.30
## Sub006 : 1 Max. :173.0 Max. :20.00
## (Other):94
## Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
## Min. : 65.0 Min. : 80.0 Min. :105.0 Min. :116.0
## 1st Qu.:118.0 1st Qu.:131.5 1st Qu.:182.5 1st Qu.:186.8
## Median :158.0 Median :162.5 Median :217.0 Median :219.0
## Mean :158.8 Mean :168.0 Mean :219.8 Mean :217.9
## 3rd Qu.:188.0 3rd Qu.:196.8 3rd Qu.:248.2 3rd Qu.:244.2
## Max. :260.0 Max. :283.0 Max. :323.0 Max. :315.0
##
As in other tools such as Excel and Python, data frames are navigated using row and column numbers. In R the notation is [rows, columns]. Leaving the rows blank returns all rows, and leaving the columns blank returns all columns. (Note for those familiar with Python: counting starts at 1, not 0.) For example, if I wanted to view the 4th row:
sample[4,]
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2
## 4 Sub004 f Treatment1 105 15.7 199 140
## Aneurisms_q3 Aneurisms_q4
## 4 233 220
We can also call a single datapoint using the same technique
sample[4,3]
## [1] Treatment1
## Levels: Control Treatment1 Treatment2
Or we can call a section of datapoints
sample[1:4,c(3, 5)]
## Group Age
## 1 Control 16.0
## 2 Treatment2 17.2
## 3 Treatment2 19.5
## 4 Treatment1 15.7
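The same [rows, columns] notation also accepts column names and logical conditions. The sketch below uses a small stand-in data frame (df and its values are illustrative, not the class data):

```r
# A small stand-in for the class data (values are illustrative)
df <- data.frame(
  ID = c("Sub001", "Sub002", "Sub003"),
  Group = c("Control", "Treatment2", "Treatment2"),
  Age = c(16.0, 17.2, 19.5),
  stringsAsFactors = FALSE
)

# Columns can be selected by name instead of position
first_two <- df[1:2, c("Group", "Age")]
names(first_two)
## [1] "Group" "Age"

# A logical condition in the row position keeps only the matching rows
older <- df[df$Age > 17, ]
older$ID
## [1] "Sub002" "Sub003"
```

Selecting by name is often safer than by number, since the code keeps working if the column order changes.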
We noticed that there were some capital M and F values in the Gender column. To ensure accurate analysis, we want to turn those into lowercase m and f to match the majority of the cells. Let’s start by cleaning up the Ms. I’m going to break the operation down into its components.
sample[sample$Gender=='M', ] #gives me all the rows with Gender M
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2
## 6 Sub006 M Treatment2 112 14.3 260 266
## 26 Sub026 M Treatment1 123 19.6 115 160
## 27 Sub027 M Treatment2 126 15.0 128 249
## 39 Sub039 M Treatment1 113 15.1 132 137
## 47 Sub047 M Control 125 18.1 192 141
## 48 Sub048 M Control 99 15.6 178 180
## 64 Sub064 M Treatment1 149 16.6 189 101
## 69 Sub069 M Treatment2 148 19.1 222 199
## 77 Sub077 M Treatment1 151 17.7 168 184
## 78 Sub078 M Treatment1 121 19.5 118 170
## 79 Sub079 M Treatment1 116 19.5 169 114
## 82 Sub082 M Treatment1 62 17.7 188 108
## 83 Sub083 M Treatment2 124 14.2 169 168
## 99 Sub099 M Treatment2 83 16.2 130 226
## 100 Sub100 M Treatment1 122 18.4 126 157
## Aneurisms_q3 Aneurisms_q4
## 6 320 294
## 26 158 228
## 27 294 315
## 39 193 206
## 47 180 225
## 48 169 183
## 64 193 172
## 69 280 196
## 77 184 229
## 78 249 249
## 79 248 233
## 82 180 136
## 83 180 211
## 99 286 281
## 100 129 160
sample[sample$Gender=='M', ]$Gender #isolates the gender column of those rows
## [1] M M M M M M M M M M M M M M M
## Levels: f F m M
sample[sample$Gender=='M', ]$Gender <- 'm' #replace with small m (This is our final operation)
#let's check that it worked
summary(sample$Gender)
## f F m M
## 35 4 61 0
Run the same operation for the Fs
sample[sample$Gender=='F', ]$Gender <- 'f'
#let's check that it worked
summary(sample$Gender)
## f F m M
## 39 0 61 0
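The summary above still lists F and M with a count of 0 because a factor keeps its levels even when no value uses them. As a hedged alternative to the step-by-step replacement, the sketch below uses a small made-up factor (gender is a hypothetical stand-in for the Gender column) to show two base R options: converting everything to lower case at once with tolower, and removing unused levels with droplevels.

```r
# A small factor standing in for the Gender column (illustrative values)
gender <- factor(c("m", "M", "f", "F", "m"))

# tolower() converts every value to lower case in one step;
# factor() then rebuilds the levels from the values that remain
gender_clean <- factor(tolower(as.character(gender)))
summary(gender_clean)
## f m
## 2 3

# If the values are replaced one level at a time instead,
# droplevels() removes the levels that are no longer used
gender[gender == "M"] <- "m"
gender[gender == "F"] <- "f"
levels(droplevels(gender))
## [1] "f" "m"
```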
To save the cleaning that we performed on the data, we can export the new sample dataframe as a file
write.csv(sample, "./data_output/sample_v2.csv")
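By default write.csv also writes the row names (here the row numbers) as an extra first column, which shows up as a column named X when the file is read back in. A minimal sketch with an illustrative data frame and a temporary file shows how to leave it out:

```r
# Illustrative data frame and a temporary output file
df <- data.frame(ID = c("Sub001", "Sub002"), Age = c(16.0, 17.2))
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(df, out_file, row.names = FALSE)

# Reading the file back gives the same two columns
names(read.csv(out_file))
## [1] "ID"  "Age"
```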
ggplot2 comes as part of tidyverse and can be used to create a number of different plots. Today we’ll cover a couple of basic ones, bar charts and dot plots, but the premise behind building a chart in ggplot is the same for all charts.
With any ggplot chart, you start with what data you will be using.
ggplot(data=sample)
Next, we’ll define our plot area (mapping) using the aesthetic (aes) function. This generates a blank plot.
ggplot(data=sample, mapping=aes(x=Gender))
We add the various elements to the chart as geoms. Put a + symbol at the end of the previous line; RStudio will automatically indent the next line, which keeps all our code in one block that runs together.
ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar()
The mapping can also be assigned to a particular geom
ggplot(data=sample)+
  geom_bar(mapping=aes(x=Gender))
You can find out the other arguments and aesthetics that can be used in the geom by viewing the help documentation
?geom_bar
For example, we can adjust the width of the columns
ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar(width=0.4)
Dot plot
ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))
?geom_point
q1results <- ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))
To save the chart you created use the ggsave function
ggsave("./outputs/q1results.png", q1results)
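Further layers are added to a chart in the same way as geoms. The sketch below assumes ggplot2 is installed and uses a small made-up data frame (df, the title text and the output file name are all illustrative) to add a title and axis labels with labs, then saves the chart at a fixed size:

```r
library(ggplot2)

# Illustrative data standing in for the class dataset
df <- data.frame(
  Age = c(16.0, 17.2, 19.5, 15.7),
  Aneurisms_q1 = c(114, 148, 196, 199),
  Group = c("Control", "Treatment2", "Treatment2", "Treatment1")
)

# labs() adds a title and readable axis labels to the plot
labelled_plot <- ggplot(data = df, mapping = aes(x = Age, y = Aneurisms_q1)) +
  geom_point(aes(colour = Group)) +
  labs(title = "Aneurisms_q1 by Age", x = "Age", y = "Aneurisms_q1")

# width and height are in inches by default;
# the file is written to a temporary folder in this example
out_file <- file.path(tempdir(), "q1results_labelled.png")
ggsave(out_file, labelled_plot, width = 6, height = 4)
```

Setting width and height explicitly makes the saved image the same size every time, rather than depending on the size of the plot window.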
Materials from a number of different sources were used in the creation of this class including: