Brief Introduction to R and RStudio

The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.

RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

To make it easier to interact with R, we will use RStudio. RStudio is the most popular IDE (Integrated Development Environment) for R. An IDE is a piece of software that provides tools to make programming easier.

Reasons to use R:

Create a new project

Start by giving our document a title using comments, which are denoted by a #:

#Introduction to R and RStudio Class December 9, 2019
#Instructor: Tess Grynoch
#Notes by Name

#Comments allow you to add notes as you write your code

Notice how after we start making changes the file name in the tab turns red and has an * beside it. It’s a reminder that we have not saved the new changes to the document yet. As we go along, remember to hit the save icon or control + s (On PC) or command + s (On Mac).

Math operations

R and RStudio have a number of basic built-in functions and arithmetic operators that make R an excellent calculator.

+ Add
- Subtract
* Multiply
/ Divide
^ Exponents

R also follows the standard order of operations. Use parentheses () to control which part of a calculation is evaluated first.

To run a line or block of code, move your cursor to anywhere on the line or within the block and press control + enter (On PC) or command + return (On Mac).

2+2
## [1] 4
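As a quick sketch of how the order of operations plays out (the variable names are just for illustration and are not part of the class script; storing results with <- is covered in the variable creation section), parentheses change which part is calculated first:

```r
# Without parentheses, multiplication happens before addition
no_parens <- 2 + 3 * 4        # 3 * 4 first, then + 2

# Parentheses force the addition to happen first
with_parens <- (2 + 3) * 4    # 2 + 3 first, then * 4

# Exponents are evaluated before a leading minus sign
neg_square <- -2 ^ 2          # same as -(2 ^ 2)

print(no_parens)    # 14
print(with_parens)  # 20
print(neg_square)   # -4
```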

Comparison operations

Comparison operations return TRUE or FALSE (logical, also called Boolean, values) depending on whether the comparison holds.

#greater than >
4>2
## [1] TRUE
#less than <
2<4
## [1] TRUE
#equals ==
2==4
## [1] FALSE
#does not equal !=
2!=4
## [1] TRUE
#less than or equal to <=
2<=2
## [1] TRUE
#greater than or equal to >=
2>=2
## [1] TRUE
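One thing to watch for with == is decimal numbers. The sketch below is an addition to the class notes; it uses the base R functions all.equal() and isTRUE(), and the variable names are made up for illustration:

```r
# Decimal numbers are stored approximately, so exact equality can surprise you
exact_match <- (0.1 + 0.2) == 0.3

# all.equal() compares with a small tolerance;
# isTRUE() turns its result into a plain TRUE or FALSE
near_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))

print(exact_match)  # FALSE
print(near_match)   # TRUE
```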

Variable creation

You can store values in variables using <- to point to the name you want to use. This allows you to use the variable name in place of the value. Values that can be stored in variables include numbers, vectors, tables, functions, and plots.

4+2
## [1] 6
y <- 4
y+2
## [1] 6

R is also case sensitive, so make sure to type variable names exactly as you created them.

y+2
Y+2 #You will get an error saying object 'Y' not found (Note you can also add a comment to the end of a line of code)

You can also overwrite a variable by assigning a new value to the same name

y <- 6
y+2
## [1] 8

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
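Because the right hand side is evaluated first, a variable can be used in its own update. A minimal sketch (the name total is just an example):

```r
total <- 10

# The right side is worked out using the old value (10),
# and only then is the result stored back into total
total <- total * 2 + 1

print(total)  # 21
```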

Variable names can contain letters, numbers, underscores and periods. They cannot start with a number or an underscore, and they cannot contain spaces. Different people use different conventions for long variable names. These include:

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.
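To see the three conventions side by side, here is a short sketch. The variable names are invented for illustration, and make.names() is a base R function that shows how R would repair a name that breaks the rules:

```r
# The same value stored under each naming convention; all three are valid names
blood.pressure.mean <- 118.6
blood_pressure_mean <- 118.6
bloodPressureMean   <- 118.6

# A name that starts with a number and contains a space is not valid.
# make.names() shows what a repaired version would look like.
fixed_name <- make.names("2nd value")
print(fixed_name)  # "X2nd.value"
```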

Vectors

R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R is an ordered set of values that all share the same data type. R has 6 basic data types; the 5 you are most likely to meet are listed below (the sixth, raw, stores raw bytes and is rarely needed):

  • character: “a”, “fish”
  • numeric (real or decimal): 2, 15.5
  • integer: 2L (the L tells R to store this as an integer)
  • logical: TRUE, FALSE
  • complex: 1+4i (complex numbers with real and imaginary parts)

#two different ways to create vectors
1:5
## [1] 1 2 3 4 5
c(1,2,3,4,5)
## [1] 1 2 3 4 5
x <- 1:5 #we'll assign this vector as the variable x

You may not have realized it, but we just used our first function. c stands for combine (or concatenate) in R. To find out more about any function or package, you can put a question mark in front of its name and run the line to open the help documentation, which includes the arguments the function accepts and examples of its use.

?c

You can also apply math operations and functions to vectors

2*x
## [1]  2  4  6  8 10
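A few more operations on the same vector, as a sketch. All of the functions are base R; the names on the left are only examples:

```r
x <- 1:5

doubled   <- 2 * x       # each value multiplied by 2
squared   <- x ^ 2       # each value squared
total     <- sum(x)      # adds all the values together
average   <- mean(x)     # arithmetic mean
n_values  <- length(x)   # how many values are in the vector
above_two <- x > 2       # comparison applied to every value

print(squared)    # 1 4 9 16 25
print(total)      # 15
print(average)    # 3
print(above_two)  # FALSE FALSE TRUE TRUE TRUE
```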

Be careful: if you add a character value to a numeric vector, every value in the vector is converted to character.

y <- c(1,2,3,4,"5")
#Use the class function to find out the type of vector you have
class(y)
## [1] "character"

Never fear! You can convert your vector to a different data type

y <- as.numeric(y) #overwriting a variable
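The sketch below repeats the conversion and checks the result. It also shows what happens when a value cannot be read as a number; the vector z is a made-up example, and suppressWarnings() is only used to keep the output quiet:

```r
y <- c(1, 2, 3, 4, "5")
class(y)             # "character"

y <- as.numeric(y)
class(y)             # "numeric"
sum(y)               # 15, math works again

# Text that is not a number becomes NA
# (R normally prints a warning here, silenced for this example)
z <- suppressWarnings(as.numeric(c("1", "2", "three")))
print(z)             # 1 2 NA
```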

Install packages and add libraries

To expand on what is already available in base R, we need to install packages. Today we’ll be using tidyverse, which is a collection of R packages for data science. For more information on the packages, visit the tidyverse website.

install.packages("tidyverse")
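A package only needs to be installed once per computer. If you rerun a script often, one possible approach is to check first whether the package is already there. This is a sketch, not part of the class script: is_installed is a hypothetical helper name, while requireNamespace() is a base R function.

```r
# Hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # TRUE, because stats ships with R

# Only download tidyverse when it is missing (uncomment to use)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```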

Once we install the package, we need to load it with library() in each new R session so that its functions are available:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.1     v dplyr   0.8.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Reading files into R

To use data in R and RStudio we must first read it into the environment and assign it a variable name.

sample <- read.csv("data/sample.csv")

I’ve used the relative path to the file above, but if I were to use the full path it would look something like “C:/Users/grynochc/Desktop/Intro-R-class-20191209/Data/sample.csv”.

You can read other files into R using similar functions such as:

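#Excel files
library(readxl)
read_xlsx("path/to/file.xlsx") #placeholder path; replace with your own file

#SAS, SPSS, STATA files (the paths below are placeholders as well)
library(haven)
read_sas("path/to/file.sas7bdat")
read_spss("path/to/file.sav")
read_stata("path/to/file.dta")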
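Relative paths are resolved from the current working directory, so it can help to check where R is looking before reading a file. A small sketch using base R functions (csv_path and file_found are example names):

```r
# Show the folder that relative paths start from
print(getwd())

# file.path() joins folder and file names with "/"
csv_path <- file.path("data", "sample.csv")
print(csv_path)   # "data/sample.csv"

# Check that the file is where you expect before trying to read it
file_found <- file.exists(csv_path)
print(file_found)
```

If file_found is FALSE, either the working directory or the path needs adjusting.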

Data exploration and exploratory statistics

Once you bring your data into RStudio, one of the first things you should do is get oriented to it. Even if you collected the data yourself or are already familiar with the data in its original format, I recommend running through some basic functions and exploratory statistics to check how the data was imported and ensure that there are no surprises. You can also view the full table by clicking on the table icon to the right of the dataset's name in the Environment pane, but this is not always useful with large datasets and I also don’t like moving between tabs.

head(sample)
##       ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
## 1 Sub001      m    Control           132 16.0          114          140
## 2 Sub002      m Treatment2           139 17.2          148          209
## 3 Sub003      m Treatment2           130 19.5          196          251
## 4 Sub004      f Treatment1           105 15.7          199          140
## 5 Sub005      m Treatment1           125 19.9          188          120
## 6 Sub006      M Treatment2           112 14.3          260          266
##   Aneurisms_q3 Aneurisms_q4
## 1          202          237
## 2          248          248
## 3          122          177
## 4          233          220
## 5          222          228
## 6          320          294
tail(sample)
##         ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
## 95  Sub095      m    Control           108 13.6          111          118
## 96  Sub096      m    Control           102 14.6          148          132
## 97  Sub097      F Treatment2            90 19.6          141          196
## 98  Sub098      m Treatment1           133 17.0          193          112
## 99  Sub099      M Treatment2            83 16.2          130          226
## 100 Sub100      M Treatment1           122 18.4          126          157
##     Aneurisms_q3 Aneurisms_q4
## 95           173          191
## 96           200          194
## 97           322          273
## 98           123          181
## 99           286          281
## 100          129          160
str(sample) #gives class and head of each column/variable 
## 'data.frame':    100 obs. of  9 variables:
##  $ ID           : Factor w/ 100 levels "Sub001","Sub002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender       : Factor w/ 4 levels "f","F","m","M": 3 3 3 1 3 4 1 3 3 1 ...
##  $ Group        : Factor w/ 3 levels "Control","Treatment1",..: 1 3 3 2 2 3 1 3 3 1 ...
##  $ BloodPressure: int  132 139 130 105 125 112 173 108 131 129 ...
##  $ Age          : num  16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
##  $ Aneurisms_q1 : int  114 148 196 199 188 260 135 216 117 188 ...
##  $ Aneurisms_q2 : int  140 209 251 140 120 266 98 238 215 144 ...
##  $ Aneurisms_q3 : int  202 248 122 233 222 320 154 279 181 192 ...
##  $ Aneurisms_q4 : int  237 248 177 220 228 294 245 251 272 185 ...
names(sample) #column names
## [1] "ID"            "Gender"        "Group"         "BloodPressure"
## [5] "Age"           "Aneurisms_q1"  "Aneurisms_q2"  "Aneurisms_q3" 
## [9] "Aneurisms_q4"
dim(sample) #dimensions
## [1] 100   9
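The Factor columns in the str() output above reflect the R version used in class (3.5.3, according to the package warnings). Starting with R 4.0.0, read.csv() keeps text columns as character unless you ask for factors, so your output may look different. The sketch below uses a made-up three-row file written to a temporary location, not the class dataset:

```r
# Write a tiny example file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Gender,Age",
             "Sub001,m,16.0",
             "Sub002,M,17.2",
             "Sub003,f,19.5"), tmp)

# Ask explicitly for factors (the default behaviour before R 4.0.0)
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
class(as_factor$Gender)    # "factor"
nlevels(as_factor$Gender)  # 3

# Keep text as character (the default from R 4.0.0 onwards)
as_text <- read.csv(tmp, stringsAsFactors = FALSE)
class(as_text$Gender)      # "character"
```

Setting stringsAsFactors explicitly makes a script behave the same way on either version.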

We can also examine a particular column by using $ColumnName to call that column.

sample$Age
##   [1] 16.0 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 14.8 15.3 16.5 12.6
##  [15] 14.3 15.9 18.4 18.3 15.4 14.3 12.7 15.4 17.2 17.3 16.7 19.6 15.0 16.1
##  [29] 17.6 18.6 18.3 16.7 12.5 14.3 19.7 17.6 17.0 12.2 15.1 17.7 19.0 14.7
##  [43] 15.2 15.3 12.9 18.4 18.1 15.6 19.5 13.5 13.5 13.7 18.7 12.2 16.9 19.5
##  [57] 12.1 17.0 19.2 14.7 20.0 14.1 14.7 16.6 15.0 15.0 13.8 14.8 19.1 18.9
##  [71] 17.7 17.4 15.5 13.1 12.2 17.0 17.7 19.5 19.5 12.8 17.6 17.7 14.2 19.2
##  [85] 16.0 15.2 17.6 17.6 15.1 17.8 16.2 16.6 19.1 17.2 13.6 14.6 19.6 17.0
##  [99] 16.2 18.4

R has a number of built-in statistical functions, and these can be further extended with specific packages.

mean(sample$Age)
## [1] 16.42
min(sample$Age)
## [1] 12.1
max(sample$Age)
## [1] 20
median(sample$Age)
## [1] 16.65
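A few more built-in functions, sketched on a small vector so the results are easy to check by hand. The values are the first six ages shown by head(sample); the missing value (NA) is added here only to illustrate the na.rm argument:

```r
ages <- c(16.0, 17.2, 19.5, 15.7, 19.9, 14.3)

mean(ages)     # 17.1
median(ages)   # 16.6
range(ages)    # 14.3 19.9 (minimum and maximum together)
sd(ages)       # standard deviation

# A single missing value makes the result NA unless you remove it
ages_with_gap <- c(ages, NA)
mean(ages_with_gap)                # NA
mean(ages_with_gap, na.rm = TRUE)  # 17.1
```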

Instead of running these basic statistics for each column, we can run a summary:

summary(sample)
##        ID     Gender        Group    BloodPressure        Age       
##  Sub001 : 1   f:35   Control   :30   Min.   : 62.0   Min.   :12.10  
##  Sub002 : 1   F: 4   Treatment1:35   1st Qu.:107.5   1st Qu.:14.78  
##  Sub003 : 1   m:46   Treatment2:35   Median :117.5   Median :16.65  
##  Sub004 : 1   M:15                   Mean   :118.6   Mean   :16.42  
##  Sub005 : 1                          3rd Qu.:133.0   3rd Qu.:18.30  
##  Sub006 : 1                          Max.   :173.0   Max.   :20.00  
##  (Other):94                                                         
##   Aneurisms_q1    Aneurisms_q2    Aneurisms_q3    Aneurisms_q4  
##  Min.   : 65.0   Min.   : 80.0   Min.   :105.0   Min.   :116.0  
##  1st Qu.:118.0   1st Qu.:131.5   1st Qu.:182.5   1st Qu.:186.8  
##  Median :158.0   Median :162.5   Median :217.0   Median :219.0  
##  Mean   :158.8   Mean   :168.0   Mean   :219.8   Mean   :217.9  
##  3rd Qu.:188.0   3rd Qu.:196.8   3rd Qu.:248.2   3rd Qu.:244.2  
##  Max.   :260.0   Max.   :283.0   Max.   :323.0   Max.   :315.0  
## 

Dataframe basics

Just as in other programs such as Excel and Python, dataframes are navigated using row and column numbers. In R the notation is [rows, columns]. Leaving the rows blank will bring back all rows and leaving the columns blank will bring back all columns. (Note for those familiar with Python: counting starts at 1, not 0.)

For example, if I wanted to view the 4th row:

sample[4,]
##       ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
## 4 Sub004      f Treatment1           105 15.7          199          140
##   Aneurisms_q3 Aneurisms_q4
## 4          233          220

We can also call a single datapoint using the same technique

sample[4,3]
## [1] Treatment1
## Levels: Control Treatment1 Treatment2

Or we can call a section of datapoints

sample[1:4,c(3, 5)]
##        Group  Age
## 1    Control 16.0
## 2 Treatment2 17.2
## 3 Treatment2 19.5
## 4 Treatment1 15.7
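Columns can also be selected by name, and rows by a condition, using the same [rows, columns] notation. The sketch below builds a small stand-in dataframe (called demo here) from the first four rows shown above, so it runs without the class file:

```r
demo <- data.frame(
  ID    = c("Sub001", "Sub002", "Sub003", "Sub004"),
  Group = c("Control", "Treatment2", "Treatment2", "Treatment1"),
  Age   = c(16.0, 17.2, 19.5, 15.7)
)

# Row 2 of the Age column, selected by column name
demo[2, "Age"]           # 17.2

# All rows, two columns selected by name
demo[, c("ID", "Age")]

# Only the rows where Age is greater than 16
older <- demo[demo$Age > 16, c("ID", "Age")]
print(older)             # Sub002 and Sub003
```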

Data cleaning

We noticed that there were some capital M and F values in the Gender column. To ensure accurate analysis, we want to turn those into lowercase m and f to match the majority of the cells. Let’s start with cleaning up the Ms. I’m going to break the operation down into its components:

sample[sample$Gender=='M', ] #gives me all the rows with Gender M
##         ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
## 6   Sub006      M Treatment2           112 14.3          260          266
## 26  Sub026      M Treatment1           123 19.6          115          160
## 27  Sub027      M Treatment2           126 15.0          128          249
## 39  Sub039      M Treatment1           113 15.1          132          137
## 47  Sub047      M    Control           125 18.1          192          141
## 48  Sub048      M    Control            99 15.6          178          180
## 64  Sub064      M Treatment1           149 16.6          189          101
## 69  Sub069      M Treatment2           148 19.1          222          199
## 77  Sub077      M Treatment1           151 17.7          168          184
## 78  Sub078      M Treatment1           121 19.5          118          170
## 79  Sub079      M Treatment1           116 19.5          169          114
## 82  Sub082      M Treatment1            62 17.7          188          108
## 83  Sub083      M Treatment2           124 14.2          169          168
## 99  Sub099      M Treatment2            83 16.2          130          226
## 100 Sub100      M Treatment1           122 18.4          126          157
##     Aneurisms_q3 Aneurisms_q4
## 6            320          294
## 26           158          228
## 27           294          315
## 39           193          206
## 47           180          225
## 48           169          183
## 64           193          172
## 69           280          196
## 77           184          229
## 78           249          249
## 79           248          233
## 82           180          136
## 83           180          211
## 99           286          281
## 100          129          160
sample[sample$Gender=='M', ]$Gender #isolates the gender column of those rows
##  [1] M M M M M M M M M M M M M M M
## Levels: f F m M
sample[sample$Gender=='M', ]$Gender <- 'm' #replace with small m (This is our final operation)
#let's check that it worked
summary(sample$Gender)
##  f  F  m  M 
## 35  4 61  0

Run the same operation for the Fs

sample[sample$Gender=='F', ]$Gender <- 'f'
#let's check that it worked
summary(sample$Gender)
##  f  F  m  M 
## 39  0 61  0
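Notice that F and M are still listed in the summary with a count of 0. When a column is a factor, the unused levels stay attached until you drop them. A sketch on a small made-up factor (not the class data) using the base R function droplevels(), plus an alternative that lowercases everything in one step:

```r
gender <- factor(c("m", "M", "f", "F", "m"))

# Same replacement idea as above, applied to a stand-alone factor
gender[gender == "M"] <- "m"
gender[gender == "F"] <- "f"
summary(gender)               # F and M still appear with a count of 0

# Remove the levels that are no longer used
gender <- droplevels(gender)
summary(gender)               # f: 2, m: 3

# Alternative: convert to text, lowercase everything, and rebuild the factor
gender2 <- factor(tolower(as.character(c("m", "M", "f", "F", "m"))))
summary(gender2)              # f: 2, m: 3
```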

Writing files from R

To save the cleaning that we performed on the data, we can export the new sample dataframe as a file

write.csv(sample, "./data_output/sample_v2.csv")
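By default, write.csv() also writes the row numbers as an extra first column, and it will fail if the output folder does not exist yet. The sketch below shows one way to handle both. It writes a small made-up dataframe to a temporary folder rather than the class dataset, and the object names are examples only:

```r
# Create the output folder if it is not there yet (here inside a temporary folder)
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)

demo <- data.frame(ID = c("Sub001", "Sub002"), Age = c(16.0, 17.2))
out_file <- file.path(out_dir, "sample_v2.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(demo, out_file, row.names = FALSE)

# Read the file back in to confirm the columns are unchanged
back <- read.csv(out_file)
names(back)   # "ID" "Age"
```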

Basic plots

ggplot2 comes as part of tidyverse and can be used to create a number of different plots. Today we’ll cover a couple of basic ones, bar charts and dot plots (scatter plots drawn with geom_point), but the premise behind building a chart in ggplot2 is the same for all charts.

Bar chart

With any ggplot chart, you start with what data you will be using.

ggplot(data=sample)

Next, we’ll define our plot area (mapping) using the aesthetic (aes) function. This generates an empty plot: the axes are set up, but no data is drawn yet.

ggplot(data=sample, mapping=aes(x=Gender))

We add the various elements to the chart as geoms. Put a + symbol at the end of the previous line; RStudio will then automatically indent the next line, and all of the code will run together as one block.

ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar()

The mapping can also be assigned to a particular geom

ggplot(data=sample)+
  geom_bar(mapping=aes(x=Gender))

You can find out the other arguments and aesthetics that can be used in the geom by viewing the help documentation

?geom_bar

For example, we can adjust the width of the columns

ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar(width=0.4)
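Other layers are added the same way, with a + at the end of the line. As a sketch, labs() from ggplot2 adds a title and axis labels. The small dataframe and the label text are made up for illustration, and the example assumes ggplot2 is installed:

```r
library(ggplot2)

# Small stand-in dataset so the example runs without the class file
demo <- data.frame(Gender = c("m", "m", "f", "m", "f"))

labelled_bar <- ggplot(data = demo, mapping = aes(x = Gender)) +
  geom_bar(width = 0.4) +
  labs(title = "Participants by gender",
       x = "Gender",
       y = "Number of participants")

# Draw the chart
print(labelled_bar)
```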

Dot plot

ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))

?geom_point
q1results <- ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))

To save the chart you created, use the ggsave function

ggsave("./outputs/q1results.png", q1results)

Further resources for learning R

 
Materials from a number of different sources were used in the creation of this class including: