-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathSQLite-GettingStarted.Rmd
More file actions
212 lines (165 loc) · 5.26 KB
/
SQLite-GettingStarted.Rmd
File metadata and controls
212 lines (165 loc) · 5.26 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
---
title: 'Airline delay in R: setting up your own SQLite database using dplyr'
author: "Nicholas J. Horton and Randall Pruim"
date: "March 23, 2015"
output:
html_document:
fig_height: 4
fig_width: 6
pdf_document:
fig_height: 4
keep_tex: yes
word_document:
fig_height: 4
fig_width: 6
---
```{r include=FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
```
```{r include=FALSE}
# Some customization.
# You can alter or delete as desired (if you know what you are doing).
# This changes the default colors in lattice plots.
trellis.par.set(theme=theme.mosaic())
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
tidy=FALSE, # display code as typed
size="small" # slightly smaller font for code
)
```
This file describes how to create an SQLite database containing a subset
of the airline delays data using the `dplyr` package in R.
1. [Create a directory to store the data and database](#set_working_dir)
2. [Install necessary R packages](#prelim)
3. [Set up the database](#create_db)
4. [Test the setup](#test_db)
### Create a directory to store the database
<a name="set_working_dir"></a>
By default, the database will be created in the directory where you compile this
RMarkdown file, but you can edit the line below to choose a different location.
```{r set-working-dir}
dbdir <- "."
# example alternative
# dbdir <- "~/Desktop/AirlineDelays"
```
Let's create the directory (if it doesn't already exist) used to store the database, and change to that working directory.
```{r create-working-directory}
if (!file.exists(dbdir)) {
if (! try(dir.create(dbdir))) {
stop("ERROR: can't create ", dbdir, "\n")
}
}
setwd(dbdir)
old <- opts_knit$set(root.dir=getwd()) # also need to move within markdown
```
### Install necessary packages
<a name="prelim"></a>
Let's just check what version of R and RStudio that you are running:
```{r version-info}
try(rstudio::versionInfo())
sessionInfo()
```
This should be at least RStudio 0.98 and R 3.1 or later.
Before you begin, you need to install necessary packages.
```{r, load-packages, message=FALSE}
lapply(
c("RSQLite", "dplyr", "tidyr", "mosaic", "knitr", "lubridate"),
function(pkg) {
if (!require(pkg, character.only=TRUE)) try(install.packages(pkg))
}
)
```
### Set up the database using dplyr
<a name="create_db"></a>
We will read in some CSV data files, change some of the names to make them more
consistent, and the store the results in an SQLite database.
```{r, fetch-files, message=FALSE}
planes <-
read.file("https://raw.githubusercontent.com/ProjectMOSAIC/databases/master/Data/plane-data.csv") %>%
rename(
TailNum = tailnum,
Type = type,
Manufacturer = manufacturer,
IssueDate = issue_date,
Model = model,
Status = status,
AircraftType = aircraft_type,
EngineType = engine_type,
Year = year
)
airports <- read.file("https://raw.githubusercontent.com/ProjectMOSAIC/databases/master/Data/airports.csv") %>%
rename(
IATA = iata,
Airport = airport,
City = city,
State = state,
Country = country,
Latitude = lat,
Longitude = long
)
# finally one with names we can keep as is
carriers <-
read.file("https://raw.githubusercontent.com/ProjectMOSAIC/databases/master/Data/carriers.csv")
```
```{r fetch-flights}
# this is one breaking with a "back-end read error" when attempting from github
flights <- read.file("http://www.amherst.edu/~nhorton/precursors/files/2014-01.csv") %>%
rename(
DayOfMonth = DayofMonth
)
```
```{r build-database}
# create a new (empty) data base
if (file.exists("Airlines.sqlite3")) { file.remove("Airlines.sqlite3") }
Airlines_db <- src_sqlite("Airlines.sqlite3", create = TRUE)
# now pump in some tables, adding indices to make access faster
copy_to(Airlines_db, planes, temporary = FALSE,
indexes = list(
"tailnum")
)
copy_to(Airlines_db, airports, temporary=FALSE,
indexes = list(
"iata",
"airport",
c("city", "state", "country")
)
)
copy_to(Airlines_db, carriers, temporary=FALSE,
indexes = list(
"code"
)
)
copy_to(Airlines_db, flights, temporary=FALSE,
indexes = list(
c("Year", "Month", "DayofMonth"),
"DayOfWeek",
"UniqueCarrier",
"FlightNum",
"TailNum",
"Origin",
"Dest"
)
)
```
### Test the setup
<a name="test_db"></a>
```{r test-extraction}
my_db <- src_sqlite(path="Airlines.sqlite3")
flights <- tbl(my_db, "flights") # link to some useful tables
airports <- tbl(my_db, "airports")
carriers <- tbl(my_db, "carriers")
planes <- tbl(my_db, "planes")
airports %>%
filter(IATA %in% c('ALB', 'BDL', 'BTV')) # what are these airports?
```
If this displays information about Albany, Bradley, and Burlington airport, then you have
successfully set up a database system and at least the `airports` table would appear
to have the correct data in it.
```{r}
opts_knit$set(root.dir=old$root.dir)
```
To perform additional tests, download and knit [this additional Rmd file]( "https://raw.githubusercontent.com/ProjectMOSAIC/databases/master/SQLite-Testing.Rmd")
```