Introduction to R and RStudio Workshop

Brief Introduction to R and RStudio

The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.

RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

To make it easier to interact with R, we will use RStudio. RStudio is the most popular IDE (Integrated Development Environment) for R. An IDE is a piece of software that provides tools to make programming easier.

Reasons to use R:

Easy to reproduce (less pointing and clicking)
Able to analyze data and create visualizations without altering the data we input into our script regardless of the analysis we run
Analysis and visualization possibilities are limitless and cross disciplinary boundaries
Works on all types and sizes of data
Open source
Large and welcoming community

Quick tour of RStudio

Let’s take a quick tour of RStudio.

Screenshot of the RStudio_startup screen

RStudio is divided into four “panes”. The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).

The Default Layout is:

Top Left - Source: your scripts and documents
Bottom Left - Console: what R would look and be like without RStudio
Top Right - Environment/History: look here to see what you have done
Bottom Right - Files and more: see the contents of the project/working directory here, like your Script.R file

Installing packages

To expand on the functions already available in base R, we need to install packages. Today we’ll be using tidyverse, which is a collection of R packages for data science. For more information on the packages, visit the tidyverse website. Type the following code into the console or use the install button on the packages tab to install tidyverse.

install.packages("tidyverse")

Create a new project

Under the File menu > New Project > choose Existing directory (if you already have a folder you want to use for today’s workshop) or New directory (if you want to create the folder) > Create project. This is now our working directory.
It’s good practice to keep all related data, analysis, and text in the same folder so that you can use relative paths instead of the full path to a file’s location. Another best practice is to create separate folders for your data, data outputs, and figure outputs. So let’s create three folders in our working directory called:
- data
- data_output
- fig_output
Download the data we will be working with today and add it to the data folder.
Create a new file where we will type our scripts. Go to File > New File > R Script OR click the plus sign hovering over the white rectangle in the toolbar. Click the save icon on your toolbar, or, in the menu, File > Save As and save your script as “Intro-R-class-20240426.R”

Start by giving our document a title using comments, which are denoted by a #

#Introduction to R and RStudio Class April 2024
#Instructor: Tess Grynoch
#Notes by Name

#Comments allow you to add notes as you write your code

Notice how after we start making changes the file name in the tab turns red and has an * beside it. It’s a reminder that we have not saved the new changes to the document yet. As we go along, remember to hit the save icon or control + s (On PC) or command + s (On Mac).

Math operations

R and RStudio have a number of basic built-in functions and arithmetic operators that make R an excellent calculator.

+	Add
-	Subtract
*	Multiply
/	Divide
^	Exponents

R also follows the standard order of operations, and parentheses () can be used to change that order.

To run a line or block of code, move your cursor to anywhere on the line or within the block and press control + enter (On PC) or command + return (On Mac).

2+2

## [1] 4
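As a quick check of the order of operations, the sketch below runs the same kind of calculation with and without parentheses. The numbers are made up for illustration.

```r
#multiplication happens before addition
2+3*4     #14
#parentheses force the addition to happen first
(2+3)*4   #20
#exponents are evaluated before multiplication
2*3^2     #18
```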

Comparison operations

Comparison operations return TRUE or FALSE (logical, or Boolean, values) depending on whether the comparison holds.

#greater than >
4>2

## [1] TRUE

#less than <
2<4

## [1] TRUE

#equals ==
2==4

## [1] FALSE

#does not equal !=
2!=4

## [1] TRUE

#less than or equal to <=
2<=2

## [1] TRUE

#greater than or equal to >=
2>=2

## [1] TRUE

Variable/Object creation

You can store values in variables using a <- to point to the name you want to use. This allows you to use the variable name in place of the value. Values that can be stored in variables include numbers, vectors, tables, functions, and plots.

4+2

## [1] 6

y <- 4
y+2 #using the variable in place of the value

## [1] 6

y #to print the value of the object, type its name

## [1] 4

(y <- 4) #or put parentheses around the assignment

## [1] 4

R is also case sensitive, so make sure to type names exactly as they were created.

y+2
Y+2 #You will get an error saying object 'Y' not found (Note you can also add a comment to the end of a line of code)

You can also overwrite a variable name

y <- 6
y+2

## [1] 8

#assign the results to a new variable
x <- y+2

#and change x
x <- 10

#Activity: What is the current content of y? 8 or 6?

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.

Variable names can contain letters, numbers, underscores and periods. They cannot start with a number or an underscore, and they cannot contain spaces. Different people use different conventions for long variable names. These include:

periods.between.words
underscores_between_words
camelCaseToSeparateWords

What you use is up to you, but be consistent.

Vectors

R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes a set of values in a certain order of the same data type. R has 6 basic data types. The five you are most likely to meet are listed below (the sixth, raw, stores bytes and is rarely needed):

character: “a”, “fish”
numeric (real or decimal): 2, 15.5
integer: 2L (the L tells R to store this as an integer)
logical: TRUE, FALSE
complex: 1+4i (complex numbers with real and imaginary parts)

#two different ways to create vectors
1:5

## [1] 1 2 3 4 5

c(1,2,3,4,5)

## [1] 1 2 3 4 5

x <- 1:5 #we'll assign this vector as the variable x

You may not have realized it, but we just used our first function. c stands for combine in R. To find out more about any function or library, you can put a question mark in front of the function or library name and run the line to get the help documentation, which includes the arguments the function accepts and examples of its use.

?c

You can also apply math operations and functions to vectors

2*x

## [1]  2  4  6  8 10

Be careful: if you add a character value to a numeric vector, the whole vector becomes character.

y <- c(1,2,3,4,"5")
#Use the class function to find out the type of vector you have
class(y)

## [1] "character"
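The class function also lets you check the data types listed at the start of this section. The sketch below is just an illustration using example values; the related typeof function gives a lower-level answer (it reports "double" for decimal numbers).

```r
class("fish")  #"character"
class(15.5)    #"numeric"
class(2L)      #"integer"
class(2)       #"numeric": without the L, whole numbers are stored as numeric
class(TRUE)    #"logical"
class(1+4i)    #"complex"
typeof(15.5)   #"double"
```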

#Activity: What class is each of these vectors with a mixture of values?
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)

Never fear! You can convert your vector to a different data type

y <- as.numeric(y) #overwriting a variable
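One thing to watch for, sketched below with made-up values: as.numeric() can only convert text that looks like a number. Anything else becomes NA (R's marker for a missing value) and R gives a warning, so it is worth checking the result after converting.

```r
#"three" does not look like a number, so R warns and stores NA in its place
converted <- as.numeric(c("1", "2", "three"))
converted
#is.na() returns TRUE for the values that could not be converted
is.na(converted)
```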

Print statements

RStudio prints the output of functions in the console and displays variables in the top right. In situations where R does not print a value automatically, such as inside a loop or a function, you can use print statements to view the variable. Print statements are also useful in complicated functions to explicitly state the result.

print(x)

## [1] 1 2 3 4 5

z <- 8/4
print(c('Solution A is',z,'times as strong as solution B'))

## [1] "Solution A is"                 "2"                            
## [3] "times as strong as solution B"
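Notice that c() turned the number into the character "2" and printed three separate pieces. If you would rather see a single sentence, one option is the base R paste() function, which joins its arguments into one string separated by spaces. A minimal sketch using the same z:

```r
z <- 8/4
#paste() converts z to text and joins the pieces with spaces
sentence <- paste('Solution A is', z, 'times as strong as solution B')
print(sentence)
```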

Add libraries

While you only need to install packages once per computer, you need to indicate which libraries you are using with each R session. So, it’s good practice to list all the libraries you will be using at the top of your R script.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Reading files into R

To use data in R and RStudio we must first read it into the environment and assign it a variable name.

sample <- read_csv("data/sample.csv")

## Rows: 100 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ID, Gender, Group
## dbl (6): BloodPressure, Age, Aneurisms_q1, Aneurisms_q2, Aneurisms_q3, Aneur...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I’ve used the relative path to the file above, but if I were to use the full path it would look something like “C://Users/grynochc/Desktop/Intro-R-class-20191209/Data/sample.csv”. You can read other files into R using similar functions such as:

#Each function below takes the path to your file as its first argument

#Excel files
library(readxl)
read_xlsx()

#SAS, SPSS, Stata files
library(haven)
read_sas()
read_spss()
read_stata()

Data exploration and exploratory statistics

Once you bring your data into RStudio, one of the first things you should do is get oriented to it. Even if you collected the data yourself or are already familiar with the data in its original format, I recommend running through some basic functions and exploratory statistics to check how the data was imported and ensure that there are no surprises. You can also view the full table by clicking on the table icon to the right of the data in the Environment pane, but this is not always useful with large datasets and I also don’t like moving between tabs.

head(sample)

## # A tibble: 6 × 9
##   ID     Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##   <chr>  <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
## 1 Sub001 m      Cont…           132  16            114          140          202
## 2 Sub002 m      Trea…           139  17.2          148          209          248
## 3 Sub003 m      Trea…           130  19.5          196          251          122
## 4 Sub004 f      Trea…           105  15.7          199          140          233
## 5 Sub005 m      Trea…           125  19.9          188          120          222
## 6 Sub006 M      Trea…           112  14.3          260          266          320
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

tail(sample)

## # A tibble: 6 × 9
##   ID     Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##   <chr>  <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
## 1 Sub095 m      Cont…           108  13.6          111          118          173
## 2 Sub096 m      Cont…           102  14.6          148          132          200
## 3 Sub097 F      Trea…            90  19.6          141          196          322
## 4 Sub098 m      Trea…           133  17            193          112          123
## 5 Sub099 M      Trea…            83  16.2          130          226          286
## 6 Sub100 M      Trea…           122  18.4          126          157          129
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

str(sample) #gives class and head of each column/variable

## spc_tbl_ [100 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ID           : chr [1:100] "Sub001" "Sub002" "Sub003" "Sub004" ...
##  $ Gender       : chr [1:100] "m" "m" "m" "f" ...
##  $ Group        : chr [1:100] "Control" "Treatment2" "Treatment2" "Treatment1" ...
##  $ BloodPressure: num [1:100] 132 139 130 105 125 112 173 108 131 129 ...
##  $ Age          : num [1:100] 16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
##  $ Aneurisms_q1 : num [1:100] 114 148 196 199 188 260 135 216 117 188 ...
##  $ Aneurisms_q2 : num [1:100] 140 209 251 140 120 266 98 238 215 144 ...
##  $ Aneurisms_q3 : num [1:100] 202 248 122 233 222 320 154 279 181 192 ...
##  $ Aneurisms_q4 : num [1:100] 237 248 177 220 228 294 245 251 272 185 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_character(),
##   ..   Gender = col_character(),
##   ..   Group = col_character(),
##   ..   BloodPressure = col_double(),
##   ..   Age = col_double(),
##   ..   Aneurisms_q1 = col_double(),
##   ..   Aneurisms_q2 = col_double(),
##   ..   Aneurisms_q3 = col_double(),
##   ..   Aneurisms_q4 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

names(sample) #column names

## [1] "ID"            "Gender"        "Group"         "BloodPressure"
## [5] "Age"           "Aneurisms_q1"  "Aneurisms_q2"  "Aneurisms_q3" 
## [9] "Aneurisms_q4"

dim(sample) #dimensions

## [1] 100   9

nrow(sample) #number of rows

## [1] 100

ncol(sample) #number of columns

## [1] 9

We can also examine a particular column by using $ColumnName to call that column.

sample$Age

##   [1] 16.0 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 14.8 15.3 16.5 12.6 14.3
##  [16] 15.9 18.4 18.3 15.4 14.3 12.7 15.4 17.2 17.3 16.7 19.6 15.0 16.1 17.6 18.6
##  [31] 18.3 16.7 12.5 14.3 19.7 17.6 17.0 12.2 15.1 17.7 19.0 14.7 15.2 15.3 12.9
##  [46] 18.4 18.1 15.6 19.5 13.5 13.5 13.7 18.7 12.2 16.9 19.5 12.1 17.0 19.2 14.7
##  [61] 20.0 14.1 14.7 16.6 15.0 15.0 13.8 14.8 19.1 18.9 17.7 17.4 15.5 13.1 12.2
##  [76] 17.0 17.7 19.5 19.5 12.8 17.6 17.7 14.2 19.2 16.0 15.2 17.6 17.6 15.1 17.8
##  [91] 16.2 16.6 19.1 17.2 13.6 14.6 19.6 17.0 16.2 18.4

R has a number of built-in statistical functions, and these can be further extended with specific libraries.

mean(sample$Age)

## [1] 16.42

min(sample$Age)

## [1] 12.1

max(sample$Age)

## [1] 20

median(sample$Age)

## [1] 16.65

Instead of running these basic statistics for each column, we can run a summary:

summary(sample)

##       ID               Gender             Group           BloodPressure  
##  Length:100         Length:100         Length:100         Min.   : 62.0  
##  Class :character   Class :character   Class :character   1st Qu.:107.5  
##  Mode  :character   Mode  :character   Mode  :character   Median :117.5  
##                                                           Mean   :118.6  
##                                                           3rd Qu.:133.0  
##                                                           Max.   :173.0  
##       Age         Aneurisms_q1    Aneurisms_q2    Aneurisms_q3  
##  Min.   :12.10   Min.   : 65.0   Min.   : 80.0   Min.   :105.0  
##  1st Qu.:14.78   1st Qu.:118.0   1st Qu.:131.5   1st Qu.:182.5  
##  Median :16.65   Median :158.0   Median :162.5   Median :217.0  
##  Mean   :16.42   Mean   :158.8   Mean   :168.0   Mean   :219.8  
##  3rd Qu.:18.30   3rd Qu.:188.0   3rd Qu.:196.8   3rd Qu.:248.2  
##  Max.   :20.00   Max.   :260.0   Max.   :283.0   Max.   :323.0  
##   Aneurisms_q4  
##  Min.   :116.0  
##  1st Qu.:186.8  
##  Median :219.0  
##  Mean   :217.9  
##  3rd Qu.:244.2  
##  Max.   :315.0
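Notice that summary() only reports the length and class of character columns such as Gender. To see which values a character column actually contains, base R's unique() and table() functions can help. The sketch below uses a small made-up data frame called demo (not the workshop data) so that it runs on its own.

```r
#a small made-up data frame standing in for the workshop data
demo <- data.frame(ID = c("Sub001", "Sub002", "Sub003", "Sub004"),
                   Gender = c("m", "f", "M", "m"),
                   stringsAsFactors = FALSE)
#each distinct value, listed once
unique(demo$Gender)
#how many times each value appears
gender_counts <- table(demo$Gender)
gender_counts
```

Running table() on a column like this is a quick way to spot inconsistent spellings or capitalization before analysis.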

Dataframe basics - that also work with tibbles

Just as in other programs such as Excel and Python, dataframes are navigated using row and column numbers. In R the notation is [rows, columns]. Leaving the rows blank will bring back all rows and leaving the columns blank will bring back all columns. (Note: counting starts at 1, not 0, for those familiar with Python.) For example, if I wanted to view the 4th row:

sample[4,]

## # A tibble: 1 × 9
##   ID     Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##   <chr>  <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
## 1 Sub004 f      Trea…           105  15.7          199          140          233
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

We can also call a single datapoint using the same technique

sample[4,3]

## # A tibble: 1 × 1
##   Group     
##   <chr>     
## 1 Treatment1

Or we can call a section of datapoints

sample[1:4,c(3, 5)]

## # A tibble: 4 × 2
##   Group        Age
##   <chr>      <dbl>
## 1 Control     16  
## 2 Treatment2  17.2
## 3 Treatment2  19.5
## 4 Treatment1  15.7
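Column positions can shift if the file changes, so it can be safer to refer to columns by name. The rows position also accepts a logical condition instead of numbers. A minimal sketch with a small made-up data frame (demo is hypothetical, not part of the workshop data):

```r
#a small made-up data frame standing in for the workshop data
demo <- data.frame(Group = c("Control", "Treatment1", "Treatment2"),
                   Age = c(16, 17.2, 19.5),
                   stringsAsFactors = FALSE)
#rows 1 to 2, with the columns chosen by name
demo[1:2, c("Group", "Age")]
#a condition in the rows position keeps only the rows where it is TRUE
demo[demo$Age > 17, ]
#both at once: the Group values for rows where Age is over 17
#(a base data frame returns a vector here; a tibble would return a one-column tibble)
older <- demo[demo$Age > 17, "Group"]
older
```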

Data cleaning

We noticed that there were some capital M and F values in the Gender column. To ensure accurate analysis, we want to turn those into lowercase m and f to match the majority of the cells. Let’s start with cleaning up the Ms. I’m going to break down the operation into its components.

sample[sample$Gender=='M', ] #gives me all the rows with Gender M

## # A tibble: 15 × 9
##    ID    Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##    <chr> <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
##  1 Sub0… M      Trea…           112  14.3          260          266          320
##  2 Sub0… M      Trea…           123  19.6          115          160          158
##  3 Sub0… M      Trea…           126  15            128          249          294
##  4 Sub0… M      Trea…           113  15.1          132          137          193
##  5 Sub0… M      Cont…           125  18.1          192          141          180
##  6 Sub0… M      Cont…            99  15.6          178          180          169
##  7 Sub0… M      Trea…           149  16.6          189          101          193
##  8 Sub0… M      Trea…           148  19.1          222          199          280
##  9 Sub0… M      Trea…           151  17.7          168          184          184
## 10 Sub0… M      Trea…           121  19.5          118          170          249
## 11 Sub0… M      Trea…           116  19.5          169          114          248
## 12 Sub0… M      Trea…            62  17.7          188          108          180
## 13 Sub0… M      Trea…           124  14.2          169          168          180
## 14 Sub0… M      Trea…            83  16.2          130          226          286
## 15 Sub1… M      Trea…           122  18.4          126          157          129
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

sample[sample$Gender=='M', ]$Gender #isolates the gender column of those rows

##  [1] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M"

sample[sample$Gender=='M', ]$Gender <- 'm' #replace with small m (This is our final operation)
#let's check that it worked
table(sample$Gender)

## 
##  f  F  m 
## 35  4 61

Run the same operation for the Fs

sample[sample$Gender=='F', ]$Gender <- 'f'
#let's check that it worked
table(sample$Gender)

## 
##  f  m 
## 39 61
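Since both fixes only change the case of the letters, another option is base R's tolower() function, which lowercases every value in the column in one step. A sketch on a small made-up data frame (demo is hypothetical, not the workshop data):

```r
#a small made-up data frame with mixed capitalization
demo <- data.frame(Gender = c("m", "F", "M", "f"), stringsAsFactors = FALSE)
#overwrite the column with its lowercase version
demo$Gender <- tolower(demo$Gender)
#let's check that it worked
table(demo$Gender)
```

This shortcut only suits columns where capitalization is the only inconsistency. For other kinds of typos, the row-by-row replacement shown above is still the tool to reach for.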

Selecting columns and filtering rows with dplyr

Another way to subset your data with the tidyverse package is to use the select and filter functions from dplyr. To select columns of a dataframe, use select(). The first argument to this function is the dataframe (sample), and the subsequent arguments are the columns to keep, separated by commas.

#select columns throughout the dataframe
select(sample, Group, Gender, Aneurisms_q1)

## # A tibble: 100 × 3
##    Group      Gender Aneurisms_q1
##    <chr>      <chr>         <dbl>
##  1 Control    m               114
##  2 Treatment2 m               148
##  3 Treatment2 m               196
##  4 Treatment1 f               199
##  5 Treatment1 m               188
##  6 Treatment2 m               260
##  7 Control    f               135
##  8 Treatment2 m               216
##  9 Treatment2 m               117
## 10 Control    f               188
## # ℹ 90 more rows

#select series of connected columns
select(sample, Group, Gender, Aneurisms_q1:Aneurisms_q4)

## # A tibble: 100 × 6
##    Group      Gender Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
##    <chr>      <chr>         <dbl>        <dbl>        <dbl>        <dbl>
##  1 Control    m               114          140          202          237
##  2 Treatment2 m               148          209          248          248
##  3 Treatment2 m               196          251          122          177
##  4 Treatment1 f               199          140          233          220
##  5 Treatment1 m               188          120          222          228
##  6 Treatment2 m               260          266          320          294
##  7 Control    f               135           98          154          245
##  8 Treatment2 m               216          238          279          251
##  9 Treatment2 m               117          215          181          272
## 10 Control    f               188          144          192          185
## # ℹ 90 more rows

To choose rows based on specific criteria, we can use the filter() function. The argument after the dataframe is the condition we want our final dataframe to adhere to (e.g. Gender is Female)

#filter records for Female patients
filter(sample, Gender == "f")

## # A tibble: 39 × 9
##    ID    Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##    <chr> <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
##  1 Sub0… f      Trea…           105  15.7          199          140          233
##  2 Sub0… f      Cont…           173  17.7          135           98          154
##  3 Sub0… f      Cont…           129  18.8          188          144          192
##  4 Sub0… f      Trea…            96  15.3          152          177          323
##  5 Sub0… f      Cont…            77  16.5          112          220          225
##  6 Sub0… f      Trea…           147  18.4          165          157          200
##  7 Sub0… f      Trea…            92  14.3          107          188          167
##  8 Sub0… f      Cont…           111  12.7          174          160          203
##  9 Sub0… f      Trea…            97  17.2          187          239          281
## 10 Sub0… f      Trea…           118  17.3          188          191          256
## # ℹ 29 more rows
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

#can also specify multiple conditions with "and" or "or" statements
#"and" operator is a comma or &
#"or" operator is a vertical bar |
filter(sample, Gender == "f", Age > 18)

## # A tibble: 11 × 9
##    ID    Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##    <chr> <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
##  1 Sub0… f      Cont…           129  18.8          188          144          192
##  2 Sub0… f      Trea…           147  18.4          165          157          200
##  3 Sub0… f      Cont…           108  18.6          103          148          219
##  4 Sub0… f      Trea…           133  18.3          132          151          234
##  5 Sub0… f      Trea…           142  19            140          184          239
##  6 Sub0… f      Trea…           109  18.4          231          240          260
##  7 Sub0… f      Trea…           113  18.7          153          153          236
##  8 Sub0… f      Trea…           123  19.5          199          119          183
##  9 Sub0… f      Cont…            94  20            166          167          232
## 10 Sub0… f      Cont…           116  19.1          209          142          199
## 11 Sub0… f      Trea…            90  19.6          141          196          322
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

filter(sample, Aneurisms_q1 >200 | Aneurisms_q2 >200)

## # A tibble: 30 × 9
##    ID    Gender Group BloodPressure   Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
##    <chr> <chr>  <chr>         <dbl> <dbl>        <dbl>        <dbl>        <dbl>
##  1 Sub0… m      Trea…           139  17.2          148          209          248
##  2 Sub0… m      Trea…           130  19.5          196          251          122
##  3 Sub0… m      Trea…           112  14.3          260          266          320
##  4 Sub0… m      Trea…           108  19.8          216          238          279
##  5 Sub0… m      Trea…           131  19.4          117          215          181
##  6 Sub0… f      Cont…            77  16.5          112          220          225
##  7 Sub0… m      Trea…           130  18.3          158          265          243
##  8 Sub0… f      Trea…            97  17.2          187          239          281
##  9 Sub0… m      Trea…           126  15            128          249          294
## 10 Sub0… f      Trea…            94  16.1          112          230          281
## # ℹ 20 more rows
## # ℹ 1 more variable: Aneurisms_q4 <dbl>

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

#Intermediate steps
sample2 <- select(sample, Group, Gender, Aneurisms_q1:Aneurisms_q4)
sample3 <- filter(sample2, Gender == "f")

#Nested functions
sample4 <- filter(select(sample, Group, Gender, Aneurisms_q1:Aneurisms_q4), Gender == "f")

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

sample %>% 
  filter(Gender == "f") %>% 
  select(Group, Gender, Aneurisms_q1:Aneurisms_q4)

## # A tibble: 39 × 6
##    Group      Gender Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
##    <chr>      <chr>         <dbl>        <dbl>        <dbl>        <dbl>
##  1 Treatment1 f               199          140          233          220
##  2 Control    f               135           98          154          245
##  3 Control    f               188          144          192          185
##  4 Treatment2 f               152          177          323          245
##  5 Control    f               112          220          225          195
##  6 Treatment1 f               165          157          200          193
##  7 Treatment1 f               107          188          167          218
##  8 Control    f               174          160          203          183
##  9 Treatment2 f               187          239          281          214
## 10 Treatment2 f               188          191          256          265
## # ℹ 29 more rows
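The %>% pipe used above comes with the tidyverse. R 4.1 and later also include a built-in pipe, written |>, that works the same way for simple chains like this one. A minimal sketch using only base R functions and made-up numbers, so it runs without any packages:

```r
values <- c(1, 4, 9)
#nested version: read from the inside out
sum(sqrt(values))
#piped version: read from left to right
total <- values |> sqrt() |> sum()
total
```

Both versions take the square root of each value and then add the results. The piped version lists the steps in the order they happen.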

Activity: Using pipes, subset the sample data to include patients who are over 18 and had more than 200 aneurisms in Q1, and retain only the columns for Gender, Group, and BloodPressure.

Writing files from R

To save the cleaning that we performed on the data, we can export the new sample dataframe as a file

write.csv(sample, "./data_output/sample_v2.csv")
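One detail to be aware of: by default write.csv() also writes the row numbers as an extra first column. If you don't want that column, you can set row.names = FALSE. The sketch below uses a made-up data frame and a temporary file (both hypothetical, not the workshop files) to show the difference:

```r
#a small made-up data frame and a throwaway file location
demo <- data.frame(ID = c("Sub001", "Sub002"), Gender = c("m", "f"))
path <- tempfile(fileext = ".csv")

#default behaviour: the row numbers are written too
write.csv(demo, path)
names(read.csv(path)) #an extra column, named X, appears in front of ID and Gender

#leave the row numbers out
write.csv(demo, path, row.names = FALSE)
cols <- names(read.csv(path))
cols #only ID and Gender
```

The tidyverse also provides write_csv(), which does not write row names.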

Basic plots

ggplot2 comes as part of tidyverse and can be used to create a number of different plots. Today we’ll cover a couple of basic ones, bar charts and dot plots, but the premise behind building a chart in ggplot is the same for all charts.

Bar chart

With any ggplot chart, you start with what data you will be using.

ggplot(data=sample)

Next, we’ll define our plot area (mapping) using the aesthetic (aes) function. This generates a blank plot.

ggplot(data=sample, mapping=aes(x=Gender))

We add the various elements to the chart as geoms. End the previous line with a + symbol; RStudio will then automatically indent the next line, which keeps all our code in one block that runs together.

ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar()

The mapping can also be assigned to a particular geom

ggplot(data=sample)+
  geom_bar(mapping=aes(x=Gender))

You can find out the other arguments and aesthetics that can be used in the geom by viewing the help documentation

?geom_bar

For example, we can adjust the width of the columns

ggplot(data=sample, mapping=aes(x=Gender))+
  geom_bar(width=0.4)

Dot plot

ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))

?geom_point

q1results <- ggplot(data=sample, mapping=aes(x=Age, y=Aneurisms_q1))+
  geom_point(aes(colour=Group))

To save the chart you created use the ggsave function

ggsave("./fig_output/q1results.png", q1results)

Further resources for learning R

R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, & Garrett Grolemund
Software Carpentry Lessons
Data Carpentry Lessons
Sage Research Methods R Tutorials. There are two video series in particular and I’ve linked the first video from each series
- Practical Data Management with R
- Introduction to Data Science with R
R/Medicine Conference June 10-14 online. Early bird pricing ends May 14.

R Cheat Sheets

Posit Cheat Sheets - ggplot2, tidyr, dplyr
Base R Cheat Sheet

Materials from a number of different sources were used in the creation of this class including:

The Carpentries’ (previously Data Carpentry’s) R for Social Scientists used under the CC-BY 4.0 license.
The Carpentries’ (previously Software Carpentry) Programming with R used under the CC-BY 4.0 license.
The Carpentries’ (previously Software Carpentry) R for Reproducible Scientific Analysis used under the CC-BY 4.0 license.
R Course by Nick Hathaway as part of the UMMS Bootstrappers Courses
This work is licensed under a Creative Commons Attribution 4.0 International License by Tess Grynoch