The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.
RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
To make it easier to interact with R, we will use RStudio. RStudio is the most popular IDE (Integrated Development Environment) for R. An IDE is a piece of software that provides tools to make programming easier.
Reasons to use R:
Let’s take a quick tour of RStudio.
RStudio is divided into four “panes”. The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
The Default Layout is:
To expand on the functions already available in the base R, we need to install packages. Today we’ll be using tidyverse which is a collection of R packages for data science. For more information on the packages, visit the tidyverse website. Type the following code into the console or use the install button on the packages tab to install tidyverse.
install.packages("tidyverse")
Start by giving our document a title using a comment. Comments are denoted by a #.
#Introduction to R and RStudio Class April 2024
#Instructor: Tess Grynoch
#Notes by Name
#Comments allow you to add notes as you write your code
Notice how after we start making changes the file name in the tab turns red and has an * beside it. It’s a reminder that we have not saved the new changes to the document yet. As we go along, remember to hit the save icon or control + s (On PC) or command + s (On Mac).
R and RStudio have a number of basic built-in functions and arithmetic operators that make them an excellent calculator.
| Operator | Meaning |
|---|---|
| + | Add |
| - | Subtract |
| * | Multiply |
| / | Divide |
| ^ | Exponents |
R also follows the standard order of operations, and parentheses () can be used to control which part of an expression is evaluated first.
To run a line or block of code, move your cursor to anywhere on the line or within the block and press control + enter (On PC) or command + return (On Mac).
2+2
## [1] 4
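As a quick check of the order of operations, the lines below are a minimal sketch (the expressions are illustrative and were not part of the original class script). They show that multiplication is evaluated before addition, that parentheses override this, and that an exponent is applied before a leading minus sign.

```r
# Multiplication is evaluated before addition
2 + 3 * 4
## [1] 14
# Parentheses force the addition to happen first
(2 + 3) * 4
## [1] 20
# The exponent is applied before the leading minus, so this is -(2^2)
-2^2
## [1] -4
# Wrap the negative number in parentheses to square it
(-2)^2
## [1] 4
```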
Comparison operators test a condition and return a logical (Boolean) value, either TRUE or FALSE.
#greater than >
4>2
## [1] TRUE
#less than <
2<4
## [1] TRUE
#equals ==
2==4
## [1] FALSE
#does not equal !=
2!=4
## [1] TRUE
#less than or equal to <=
2<=2
## [1] TRUE
#greater than or equal to >=
2>=2
## [1] TRUE
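One caution with == that the whole-number examples above do not show: decimal numbers are stored approximately, so an exact comparison can give a surprising answer. The sketch below uses illustrative values only and shows all.equal(), a base R function that compares within a small tolerance.

```r
# Decimal fractions are stored approximately, so an exact test can fail
0.1 + 0.2 == 0.3
## [1] FALSE
# all.equal() compares within a small tolerance;
# isTRUE() turns its result into a plain TRUE or FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```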
You can store values in variables using <- to point to the name you want to use. This allows you to use the variable name in place of the value. Values that can be stored in variables include numbers, vectors, tables, functions, and plots.
4+2
## [1] 6
y <- 4
y+2 #using the variable in place of the value
## [1] 6
y #to print the value of the object, type its name
## [1] 4
(y <- 4) #or put parentheses around the assignment
## [1] 4
R is also case sensitive, so y and Y are treated as different names. Make sure to type variable names exactly as you created them.
y+2
Y+2 #You will get an error saying object 'Y' not found (Note you can also add a comment to the end of a line of code)
You can also overwrite the value stored in a variable by assigning to the same name again.
y <- 6
y+2
## [1] 8
#assign the results to a new variable
x <- y+2
#and change x
x <- 10
#Activity: What is the current content of y? 8 or 6?
The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
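A minimal sketch of what "fully evaluated before the assignment" means in practice. The names count and total are hypothetical and are used only for this illustration.

```r
# 'count' and 'total' are hypothetical names used only for this illustration
count <- 1
# The right-hand side (count + 1) is evaluated first, using the old value
# of count, and only then is the result stored under the same name
count <- count + 1
count
## [1] 2
# Any valid expression can sit on the right-hand side, including function calls
total <- sqrt(16) + count
total
## [1] 6
```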
Variable names can contain letters, numbers, underscores and periods. They cannot start with a number nor contain spaces at all. Different people use different conventions for long variable names, these include
R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes a set of values in a certain order of the same data type. The 6 basic data types are:
#two different ways to create vectors
1:5
## [1] 1 2 3 4 5
c(1,2,3,4,5)
## [1] 1 2 3 4 5
x <- 1:5 #we'll assign this vector as the variable x
You may not have realized it, but we just used our first function: c() stands for combine in R. To find out more about any function or library, put a question mark in front of its name and run the line. This opens the help documentation, which includes the arguments the function accepts and examples of its use.
?c
You can also apply math operations and functions to vectors
2*x
## [1] 2 4 6 8 10
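A few more vectorized operations, as a small sketch. It recreates the same x so that it runs on its own; the particular functions shown (sum() and length()) are just common base R examples, not a list from the original class.

```r
x <- 1:5
x + 10      # the operation is applied to every element
## [1] 11 12 13 14 15
x > 2       # comparisons are element-wise too
## [1] FALSE FALSE  TRUE  TRUE  TRUE
sum(x)      # some functions collapse a vector to a single value
## [1] 15
length(x)   # number of elements in the vector
## [1] 5
```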
Be careful: if you add a character value to a numeric vector, every element of the vector is converted to character.
y <- c(1,2,3,4,"5")
#Use the class function to find out the type of vector you have
class(y)
## [1] "character"
#Activity: What class is each of these vectors with a mixture of values?
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
Never fear! You can convert your vector to a different data type
y <- as.numeric(y) #overwriting a variable
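Conversion only works when the text actually looks like a number. The sketch below uses a hypothetical vector, readings, to show what happens otherwise: R returns NA for the value it cannot convert and prints a warning about NAs introduced by coercion.

```r
# 'readings' is a hypothetical vector used only for this illustration
readings <- c("1", "2", "three")
# "three" cannot be read as a number, so it becomes NA (with a warning)
converted <- as.numeric(readings)
converted
## [1]  1  2 NA
# is.na() shows which elements failed to convert
is.na(converted)
## [1] FALSE FALSE  TRUE
```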
RStudio prints the output of functions in the console and displays variables in the top right. If you are running R from the command line or from a script, you can use print statements to view a variable. Print statements are also useful inside complicated functions to state the result explicitly.
print(x)
## [1] 1 2 3 4 5
z <- 8/4
print(c('Solution A is',z,'times as strong as solution B'))
## [1] "Solution A is" "2"
## [3] "times as strong as solution B"
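Because c() builds a vector, the print statement above shows three separate strings. If a single sentence is wanted, one option is the base R function paste(), which joins its arguments with a space by default. This is a suggested alternative, not part of the original script.

```r
z <- 8/4
# paste() converts z to text and joins the pieces into one string
paste('Solution A is', z, 'times as strong as solution B')
## [1] "Solution A is 2 times as strong as solution B"
```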
While you only need to install packages once per computer, you need to indicate which libraries you are using with each R session. So, it’s good practice to list all the libraries you will be using at the top of your R script.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
To use data in R and RStudio we must first read it into the environment and assign it a variable name.
sample <- read_csv("data/sample.csv")
## Rows: 100 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ID, Gender, Group
## dbl (6): BloodPressure, Age, Aneurisms_q1, Aneurisms_q2, Aneurisms_q3, Aneur...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
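If read_csv() reports that the file cannot be found, it usually helps to check where R is looking. The sketch below uses base R only; csv_path is a hypothetical name, and the result of file.exists() depends on your own folder layout, so no output is shown for it.

```r
# Show the folder that relative paths are resolved against
getwd()
# Build the relative path from its parts ('csv_path' is a hypothetical name)
csv_path <- file.path("data", "sample.csv")
csv_path
## [1] "data/sample.csv"
# TRUE if the file is where we expect it, FALSE otherwise
file.exists(csv_path)
```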
I’ve used the relative path to the file above, but if I were to use the full path it would look something like “C://Users/grynochc/Desktop/Intro-R-class-20191209/Data/sample.csv”. You can read other file types into R using similar functions such as:
#Excel files
library(readxl)
read_xlsx()
#SAS, SPSS, STATA files
library(haven)
read_sas()
read_spss()
read_stata()
Once you bring your data into RStudio, one of the first things you should do is get oriented to it. Even if you collected the data yourself or are already familiar with the data in its original format, I recommend running through some basic functions and exploratory statistics to check how the data was imported and ensure that there are no surprises. You can also view the full table by clicking on the table icon to the right of the data, but this is not always useful with large datasets and I also don’t like moving between tabs.
head(sample)
## # A tibble: 6 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub001 m Cont… 132 16 114 140 202
## 2 Sub002 m Trea… 139 17.2 148 209 248
## 3 Sub003 m Trea… 130 19.5 196 251 122
## 4 Sub004 f Trea… 105 15.7 199 140 233
## 5 Sub005 m Trea… 125 19.9 188 120 222
## 6 Sub006 M Trea… 112 14.3 260 266 320
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
tail(sample)
## # A tibble: 6 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub095 m Cont… 108 13.6 111 118 173
## 2 Sub096 m Cont… 102 14.6 148 132 200
## 3 Sub097 F Trea… 90 19.6 141 196 322
## 4 Sub098 m Trea… 133 17 193 112 123
## 5 Sub099 M Trea… 83 16.2 130 226 286
## 6 Sub100 M Trea… 122 18.4 126 157 129
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
str(sample) #gives class and head of each column/variable
## spc_tbl_ [100 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : chr [1:100] "Sub001" "Sub002" "Sub003" "Sub004" ...
## $ Gender : chr [1:100] "m" "m" "m" "f" ...
## $ Group : chr [1:100] "Control" "Treatment2" "Treatment2" "Treatment1" ...
## $ BloodPressure: num [1:100] 132 139 130 105 125 112 173 108 131 129 ...
## $ Age : num [1:100] 16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
## $ Aneurisms_q1 : num [1:100] 114 148 196 199 188 260 135 216 117 188 ...
## $ Aneurisms_q2 : num [1:100] 140 209 251 140 120 266 98 238 215 144 ...
## $ Aneurisms_q3 : num [1:100] 202 248 122 233 222 320 154 279 181 192 ...
## $ Aneurisms_q4 : num [1:100] 237 248 177 220 228 294 245 251 272 185 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_character(),
## .. Gender = col_character(),
## .. Group = col_character(),
## .. BloodPressure = col_double(),
## .. Age = col_double(),
## .. Aneurisms_q1 = col_double(),
## .. Aneurisms_q2 = col_double(),
## .. Aneurisms_q3 = col_double(),
## .. Aneurisms_q4 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
names(sample) #column names
## [1] "ID" "Gender" "Group" "BloodPressure"
## [5] "Age" "Aneurisms_q1" "Aneurisms_q2" "Aneurisms_q3"
## [9] "Aneurisms_q4"
dim(sample) #dimensions
## [1] 100 9
nrow(sample) #number of rows
## [1] 100
ncol(sample) #number of columns
## [1] 9
We can also examine a particular column by using $ColumnName to call that column.
sample$Age
## [1] 16.0 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 14.8 15.3 16.5 12.6 14.3
## [16] 15.9 18.4 18.3 15.4 14.3 12.7 15.4 17.2 17.3 16.7 19.6 15.0 16.1 17.6 18.6
## [31] 18.3 16.7 12.5 14.3 19.7 17.6 17.0 12.2 15.1 17.7 19.0 14.7 15.2 15.3 12.9
## [46] 18.4 18.1 15.6 19.5 13.5 13.5 13.7 18.7 12.2 16.9 19.5 12.1 17.0 19.2 14.7
## [61] 20.0 14.1 14.7 16.6 15.0 15.0 13.8 14.8 19.1 18.9 17.7 17.4 15.5 13.1 12.2
## [76] 17.0 17.7 19.5 19.5 12.8 17.6 17.7 14.2 19.2 16.0 15.2 17.6 17.6 15.1 17.8
## [91] 16.2 16.6 19.1 17.2 13.6 14.6 19.6 17.0 16.2 18.4
R has a number of built-in statistical functions, and these can be further extended with specific libraries.
mean(sample$Age)
## [1] 16.42
min(sample$Age)
## [1] 12.1
max(sample$Age)
## [1] 20
median(sample$Age)
## [1] 16.65
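The Age column has no missing values, so these functions return a number straight away. With real data that is often not the case. The sketch below uses a hypothetical vector, ages_with_gap, to show that a single NA makes the result NA unless na.rm = TRUE is supplied.

```r
# 'ages_with_gap' is a hypothetical vector with one missing value
ages_with_gap <- c(16.0, 17.2, NA, 15.7)
# A missing value makes the mean unknown
mean(ages_with_gap)
## [1] NA
# na.rm = TRUE drops missing values before calculating
mean(ages_with_gap, na.rm = TRUE)
## [1] 16.3
```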
Instead of running these basic statistics for each column, we can run a summary:
summary(sample)
## ID Gender Group BloodPressure
## Length:100 Length:100 Length:100 Min. : 62.0
## Class :character Class :character Class :character 1st Qu.:107.5
## Mode :character Mode :character Mode :character Median :117.5
## Mean :118.6
## 3rd Qu.:133.0
## Max. :173.0
## Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## Min. :12.10 Min. : 65.0 Min. : 80.0 Min. :105.0
## 1st Qu.:14.78 1st Qu.:118.0 1st Qu.:131.5 1st Qu.:182.5
## Median :16.65 Median :158.0 Median :162.5 Median :217.0
## Mean :16.42 Mean :158.8 Mean :168.0 Mean :219.8
## 3rd Qu.:18.30 3rd Qu.:188.0 3rd Qu.:196.8 3rd Qu.:248.2
## Max. :20.00 Max. :260.0 Max. :283.0 Max. :323.0
## Aneurisms_q4
## Min. :116.0
## 1st Qu.:186.8
## Median :219.0
## Mean :217.9
## 3rd Qu.:244.2
## Max. :315.0
Just as in other programs such as Excel and Python, dataframes are navigated using rows and column numbers. In R the notation is [rows, columns]. Leaving the rows blank will bring back all rows and leaving the columns blank will bring back all columns. (Note: counting starts at 1, not 0 for those familiar with Python) For example, if I wanted to view the 4th row:
sample[4,]
## # A tibble: 1 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub004 f Trea… 105 15.7 199 140 233
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
We can also call a single datapoint using the same technique
sample[4,3]
## # A tibble: 1 × 1
## Group
## <chr>
## 1 Treatment1
Or we can call a section of datapoints
sample[1:4,c(3, 5)]
## # A tibble: 4 × 2
## Group Age
## <chr> <dbl>
## 1 Control 16
## 2 Treatment2 17.2
## 3 Treatment2 19.5
## 4 Treatment1 15.7
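The same [rows, columns] notation accepts column names and negative numbers as well. The sketch below uses a small made-up base R data frame called patients rather than the class data, so that it runs on its own. Note one difference: a base data frame returns a plain vector when a single column is selected, whereas a tibble such as sample keeps returning a tibble, as seen above.

```r
# 'patients' is a small made-up data frame used only for this illustration
patients <- data.frame(
  ID = c("P1", "P2", "P3", "P4"),
  Group = c("Control", "Treatment1", "Treatment2", "Control"),
  Age = c(16, 17.2, 19.5, 15.7)
)
patients[2, ]                    # second row, all columns
patients[, "Age"]                # columns can be named instead of numbered
## [1] 16.0 17.2 19.5 15.7
patients[-1, c("ID", "Age")]     # a negative index drops that row
patients$Group[4]                # a single value from one column
## [1] "Control"
```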
We noticed that there were some capital M and F values in the Gender column. To ensure accurate analysis, we want to turn those into lowercase m and f to match the majority of the cells. Let’s start with cleaning up the Ms. I’m going to break down the operation into its components.
sample[sample$Gender=='M', ] #gives me all the rows with Gender M
## # A tibble: 15 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub0… M Trea… 112 14.3 260 266 320
## 2 Sub0… M Trea… 123 19.6 115 160 158
## 3 Sub0… M Trea… 126 15 128 249 294
## 4 Sub0… M Trea… 113 15.1 132 137 193
## 5 Sub0… M Cont… 125 18.1 192 141 180
## 6 Sub0… M Cont… 99 15.6 178 180 169
## 7 Sub0… M Trea… 149 16.6 189 101 193
## 8 Sub0… M Trea… 148 19.1 222 199 280
## 9 Sub0… M Trea… 151 17.7 168 184 184
## 10 Sub0… M Trea… 121 19.5 118 170 249
## 11 Sub0… M Trea… 116 19.5 169 114 248
## 12 Sub0… M Trea… 62 17.7 188 108 180
## 13 Sub0… M Trea… 124 14.2 169 168 180
## 14 Sub0… M Trea… 83 16.2 130 226 286
## 15 Sub1… M Trea… 122 18.4 126 157 129
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
sample[sample$Gender=='M', ]$Gender #isolates the gender column of those rows
## [1] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M"
sample[sample$Gender=='M', ]$Gender <- 'm' #replace with small m (This is our final operation)
#let's check that it worked
table(sample$Gender)
##
## f F m
## 35 4 61
Run the same operation for the Fs
sample[sample$Gender=='F', ]$Gender <- 'f'
#let's check that it worked
table(sample$Gender)
##
## f m
## 39 61
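When the only problem is inconsistent capitalization, a shorter alternative is the base R function tolower(), which converts every value in one step. The sketch below uses a hypothetical vector, gender_codes; for the class data the equivalent line would be sample$Gender <- tolower(sample$Gender). This is a suggested alternative, not the method used in the class.

```r
# 'gender_codes' is a hypothetical vector used only for this illustration
gender_codes <- c("m", "M", "f", "F", "m")
# tolower() converts every element to lowercase
gender_codes <- tolower(gender_codes)
# Check that only two categories remain
table(gender_codes)
```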
Another way to subset your data with the tidyverse package is to use the select and filter functions. To select columns of a dataframe, use select(). The first argument to this function is the dataframe (sample), and the subsequent arguments are the columns to keep, separated by commas.
#select columns throughout the dataframe
select(sample, Group, Gender, Aneurisms_q1)
## # A tibble: 100 × 3
## Group Gender Aneurisms_q1
## <chr> <chr> <dbl>
## 1 Control m 114
## 2 Treatment2 m 148
## 3 Treatment2 m 196
## 4 Treatment1 f 199
## 5 Treatment1 m 188
## 6 Treatment2 m 260
## 7 Control f 135
## 8 Treatment2 m 216
## 9 Treatment2 m 117
## 10 Control f 188
## # ℹ 90 more rows
#select series of connected columns
select(sample, Group, Gender, Aneurisms_q1:Aneurisms_q4)
## # A tibble: 100 × 6
## Group Gender Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Control m 114 140 202 237
## 2 Treatment2 m 148 209 248 248
## 3 Treatment2 m 196 251 122 177
## 4 Treatment1 f 199 140 233 220
## 5 Treatment1 m 188 120 222 228
## 6 Treatment2 m 260 266 320 294
## 7 Control f 135 98 154 245
## 8 Treatment2 m 216 238 279 251
## 9 Treatment2 m 117 215 181 272
## 10 Control f 188 144 192 185
## # ℹ 90 more rows
To choose rows based on specific criteria, we can use the filter() function. The argument after the dataframe is the condition that the rows of our final dataframe must meet (e.g. Gender is female).
#filter records for Female patients
filter(sample, Gender == "f")
## # A tibble: 39 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub0… f Trea… 105 15.7 199 140 233
## 2 Sub0… f Cont… 173 17.7 135 98 154
## 3 Sub0… f Cont… 129 18.8 188 144 192
## 4 Sub0… f Trea… 96 15.3 152 177 323
## 5 Sub0… f Cont… 77 16.5 112 220 225
## 6 Sub0… f Trea… 147 18.4 165 157 200
## 7 Sub0… f Trea… 92 14.3 107 188 167
## 8 Sub0… f Cont… 111 12.7 174 160 203
## 9 Sub0… f Trea… 97 17.2 187 239 281
## 10 Sub0… f Trea… 118 17.3 188 191 256
## # ℹ 29 more rows
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
#can also specify multiple conditions with "and" or "or" statements
#"and" operator is a comma or &
#"or" operator is a vertical bar |
filter(sample, Gender == "f", Age > 18)
## # A tibble: 11 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub0… f Cont… 129 18.8 188 144 192
## 2 Sub0… f Trea… 147 18.4 165 157 200
## 3 Sub0… f Cont… 108 18.6 103 148 219
## 4 Sub0… f Trea… 133 18.3 132 151 234
## 5 Sub0… f Trea… 142 19 140 184 239
## 6 Sub0… f Trea… 109 18.4 231 240 260
## 7 Sub0… f Trea… 113 18.7 153 153 236
## 8 Sub0… f Trea… 123 19.5 199 119 183
## 9 Sub0… f Cont… 94 20 166 167 232
## 10 Sub0… f Cont… 116 19.1 209 142 199
## 11 Sub0… f Trea… 90 19.6 141 196 322
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
filter(sample, Aneurisms_q1 >200 | Aneurisms_q2 >200)
## # A tibble: 30 × 9
## ID Gender Group BloodPressure Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub0… m Trea… 139 17.2 148 209 248
## 2 Sub0… m Trea… 130 19.5 196 251 122
## 3 Sub0… m Trea… 112 14.3 260 266 320
## 4 Sub0… m Trea… 108 19.8 216 238 279
## 5 Sub0… m Trea… 131 19.4 117 215 181
## 6 Sub0… f Cont… 77 16.5 112 220 225
## 7 Sub0… m Trea… 130 18.3 158 265 243
## 8 Sub0… f Trea… 97 17.2 187 239 281
## 9 Sub0… m Trea… 126 15 128 249 294
## 10 Sub0… f Trea… 94 16.1 112 230 281
## # ℹ 20 more rows
## # ℹ 1 more variable: Aneurisms_q4 <dbl>
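Behind the scenes, each condition produces one TRUE or FALSE per row, and & and | combine them element by element. The sketch below shows this with two hypothetical vectors in base R, so it does not need the class data or the tidyverse.

```r
# 'age_values' and 'gender_values' are hypothetical vectors for illustration
age_values <- c(19, 17, 20, 15)
gender_values <- c("f", "f", "m", "f")
# "and": TRUE only where both conditions hold
gender_values == "f" & age_values > 18
## [1]  TRUE FALSE FALSE FALSE
# "or": TRUE where at least one condition holds
gender_values == "m" | age_values > 18
## [1]  TRUE FALSE  TRUE FALSE
```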
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
#Intermediate steps
sample2 <- select(sample, Group, Gender, Aneurisms_q1:Aneurisms_q4)
sample3 <- filter(sample2, Gender == "f")
#Nested functions
sample4 <- filter(select(sample, Group, Gender, Aneurisms_q1:Aneurisms_q4), Gender == "f")
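The last option is pipes. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. The %>% pipe used below comes from the magrittr package and is loaded with the tidyverse; R 4.1.0 and later also include a native pipe, |>.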
sample %>%
  filter(Gender == "f") %>%
  select(Group, Gender, Aneurisms_q1:Aneurisms_q4)
## # A tibble: 39 × 6
## Group Gender Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Treatment1 f 199 140 233 220
## 2 Control f 135 98 154 245
## 3 Control f 188 144 192 185
## 4 Treatment2 f 152 177 323 245
## 5 Control f 112 220 225 195
## 6 Treatment1 f 165 157 200 193
## 7 Treatment1 f 107 188 167 218
## 8 Control f 174 160 203 183
## 9 Treatment2 f 187 239 281 214
## 10 Treatment2 f 188 191 256 265
## # ℹ 29 more rows
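For comparison, here is a minimal sketch of the same filter-then-select idea using R's native pipe, |>, and the base R function subset(). It assumes R 4.1.0 or later and uses a small made-up data frame, patients, instead of the class data.

```r
# 'patients' is a small made-up data frame used only for this illustration
patients <- data.frame(
  Group = c("Control", "Treatment1", "Treatment2"),
  Gender = c("f", "m", "f"),
  Age = c(16, 17.2, 19.5)
)
# Keep the female patients, then keep only the Group and Age columns
female_patients <- patients |>
  subset(Gender == "f") |>
  subset(select = c(Group, Age))
female_patients
```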
Activity: Using pipes, subset the sample data to include patients who are over 18 and had more than 200 aneurisms in Q1, and retain only the columns for Gender, Group, and Blood Pressure.
To save the cleaning that we performed on the data, we can export the new sample dataframe as a file
write.csv(sample, "./data_output/sample_v2.csv")
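Two details are worth knowing about write.csv(): the output folder must already exist, and by default an extra column of row numbers is written to the file. The sketch below shows the row.names = FALSE option using a made-up data frame and a temporary file (both hypothetical), then reads the file back to check the columns.

```r
# 'patients' and 'out_file' are hypothetical names used for illustration
patients <- data.frame(ID = c("P1", "P2"), Age = c(16, 17.2))
# A temporary file avoids depending on a particular folder existing
out_file <- tempfile(fileext = ".csv")
# row.names = FALSE leaves out the extra column of row numbers
write.csv(patients, out_file, row.names = FALSE)
# Read the file back to confirm that only the original columns were saved
names(read.csv(out_file))
## [1] "ID"  "Age"
```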
ggplot2 comes as part of tidyverse and can be used to create a number of different plots. Today we’ll cover a couple of basic ones, bar charts and dot plots, but the premise behind building a chart in ggplot is the same for all charts.
Bar chart
With any ggplot chart, you start with what data you will be using.
ggplot(data=sample)
Next, we’ll define our plot area (mapping) using the aesthetic (aes) function. This generates a blank plot.
ggplot(data=sample, mapping=aes(x=Gender))
We add elements to the chart as geoms. Ending a line with a + symbol tells R that the plot continues on the next line. RStudio automatically indents that next line, which keeps all our code together as a block to run.
ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar()
The mapping can also be assigned to a particular geom
ggplot(data=sample)+
  geom_bar(mapping=aes(x=Gender))
You can find out the other arguments and aesthetics that can be used in the geom by viewing the help documentation
?geom_bar
For example, we can adjust the width of the columns
ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar(width=0.4)
Dot plot
ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))
?geom_point
q1results <- ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))
To save the chart you created, use the ggsave function.
ggsave("./fig_output/q1results.png", q1results)
Materials from a number of different sources were used in the creation
of this class including: