Hi there. In this page, I showcase how to read in various file types into R. This R code and work is experimental in nature as I was testing out stuff.
Note that the first three sections deal with R’s haven
package.
Not all places keep up to date with software and use R. There may be certain places that use STATA.
I have once used STATA for a survival analysis course where the instructor was familiar with STATA and not R. A classmate told me that STATA is used a lot in econometrics (statistics and economics).
There are two ways to load in a STATA file into R. You can use read_stata()
or read_dta()
. The code below deals with url links.
# Reading Files In R
library(haven) # For SPSS, STATA and SAS files
library(readxl) # For excel files.
### 1) STATA file
# http://www.principlesofeconometrics.com/stata.htm
# read_stata() function:
quizzes_data <- read_stata("http://www.principlesofeconometrics.com/stata/quizzes.dta")
head(quizzes_data) # Preview data
## # A tibble: 6 x 2
## midterm missed
## <dbl> <dbl>
## 1 92.5 0
## 2 86.8 0
## 3 81.1 0
## 4 75.5 0
## 5 86.8 0
## 6 92.5 0
# read_dta() function:
housing_df <- data.frame(read_dta("http://www.principlesofeconometrics.com/stata/housing.dta"))
head(housing_df) # Preview data
## house ir
## 1 1711 13.08
## 2 1632 12.92
## 3 1800 13.17
## 4 1821 13.20
## 5 1680 12.91
## 6 1676 12.22
The main function for reading in an SPSS file is the read_spss()
function from the haven package in R.
### 2) SPSS file
# http://staff.bath.ac.uk/pssiw/stats2/page16/page16.html
# relates people's body temperatures and heart rates
spss_url <- "http://staff.bath.ac.uk/pssiw/stats2/temprate.sav"
# Use read_sav or read_spss()
temp_rate_df <- data.frame(read_spss(spss_url))
head(temp_rate_df)
## temp hrtrate
## 1 35.72222 70
## 2 35.94444 71
## 3 36.05556 74
## 4 36.11111 80
## 5 36.16667 73
## 6 36.16667 75
Loading in a SAS file into R is pretty straightforward as well. The key function here is the read_sas()
function.
### 3) SAS files
# http://libguides.library.kent.edu/SAS/SubsetData
sas_url <- "http://libguides.library.kent.edu/ld.php?content_id=11205331"
sas_dataset <- data.frame(read_sas(sas_url))
head(sas_dataset) # Preview data
## ids bday enrolldate expgradate Rank Major
## 1 43783 1995-03-22 NA
## 2 20278 1995-01-01 NA Philosophy
## 3 20389 1994-12-31 NA
## 4 22820 1994-12-01 NA business administration
## 5 24559 1994-11-10 NA
## 6 28980 1994-09-17 NA astrophysics
## Gender Athlete Height Weight Smoking Sprint MileMinDur English
## 1 0 0 72.35 NA 0 7.978 NA 88.24
## 2 0 0 70.66 179.20 0 8.004 00:06:21.249567 89.45
## 3 0 0 70.68 198.52 0 NA 00:07:00.966993 96.73
## 4 1 0 NA 198.34 NA 8.473 00:12:44.740049 74.06
## 5 1 1 67.43 128.17 2 NA 00:06:25.506137 82.61
## 6 0 1 68.45 171.61 0 4.650 00:07:24.585445 70.10
## Reading Math Writing State LiveOnCampus HowCommute CommuteTime
## 1 81.50 60.02 81.44 In state 1 NA NA
## 2 85.25 70.19 73.27 1 NA NA
## 3 86.88 71.20 84.24 In state NA NA NA
## 4 88.68 55.89 73.16 In state 1 NA NA
## 5 77.30 65.52 80.45 Out of state 1 NA NA
## 6 NA 61.40 77.48 In state 1 NA NA
## SleepTime StudyTime
## 1 7 1
## 2 5 2
## 3 8 7
## 4 2 6
## 5 7 3
## 6 8 3
From experimentation, I have found loading in Excel files into R somewhat difficult. I have found out that (I think) there is no way to load in an .xlsx file from a URL alone. The solution I propose is to save the file into a folder. Then you would set that folder as the working directory in RStudio with the Set As Working Directory from the More drop down.
Once you have set your working directory folder, you can load in the Excel file into R. Be mindful of spelling and punctuation.
excel_data = data.frame(read_excel("SuperStoreUS_2015.xlsx"))
head(excel_data, n = 10)
Loading in .csv files is not very difficult. Do make sure to include header = TRUE if the first rows are column titles and the sep = argument
.
### 5) .csv files
# Ref: http://introcs.cs.princeton.edu/java/data/
countries_data <- read.csv("http://introcs.cs.princeton.edu/java/data/countries.csv",
header = TRUE, sep = ",")
head(countries_data)
## UN.code.number X2.letter.ISO.abbreviation X3.letter.ISO.abbreviation
## 1 4 AF AFG
## 2 8 AL ALB
## 3 12 DZ DZA
## 4 20 AD AND
## 5 24 AO AGO
## 6 28 AG ATG
## name capital
## 1 Afghanistan Kabul
## 2 Albania Tirana
## 3 Algeria Algiers
## 4 Andorra Andorra la Vella
## 5 Angola Luanda
## 6 Antigua and Barbuda St. John's
From time to time I do play around with some datasets from the UCI Machine Learning Repository website. Some of them have this .data file extension. For this case, the read.table() function is used.
### 6) .data files
# Source: UCI Machine Learning Repository
# Read in the .data file:
# https://stackoverflow.com/questions/21101927/read-files-with-extension-data-into-r
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult-stretch.data"
balloons_data <- read.table(url, sep = "\t", dec = ",") # Reading in a .data file.
head(balloons_data)
## V1
## 1 YELLOW,SMALL,STRETCH,ADULT,T
## 2 YELLOW,SMALL,STRETCH,CHILD,T
## 3 YELLOW,SMALL,DIP,ADULT,T
## 4 YELLOW,SMALL,DIP,CHILD,F
## 5 YELLOW,SMALL,DIP,CHILD,F
## 6 YELLOW,LARGE,STRETCH,ADULT,T
It appears that this balloons_data comes in as a single column with values that have four commas. This single column could be separated with the separate function from the tidyr package in R.
library(tidyr)
col_names <- c("Color", "size", "act", "age", "inflated")
sep_data <- separate(balloons_data, col = "V1", into = col_names, sep = ",")
head(sep_data)
## Color size act age inflated
## 1 YELLOW SMALL STRETCH ADULT T
## 2 YELLOW SMALL STRETCH CHILD T
## 3 YELLOW SMALL DIP ADULT T
## 4 YELLOW SMALL DIP CHILD F
## 5 YELLOW SMALL DIP CHILD F
## 6 YELLOW LARGE STRETCH ADULT T