Hello. On this page, I will talk about displaying scatterplots using the statistical program R and the R package ggvis for data visualization.
This particular dataset is about internet use by country. The variables of interest are the number of internet users, the population of the country and the number of Facebook users.
In R, we load the libraries ggvis and dplyr. Loading ggvis gives us access to the plots, graphs and other visual tools in the library. The dplyr package allows us to use the %>% syntax with ggvis. More info on the %>% can be found here. The dataset is from here.
The code and output can be found below:
# ggvis
# The dataset:
# Load ggvis
library(ggvis)
library(dplyr)
url <- "http://sites.williams.edu/bklingen/files/2015/05/InternetUse.csv"
internet_data <- read.csv(url)
head(internet_data)
##   iso2c   country year Population.Size GDP.in.billions.of..US
## 1    AR Argentina 2012        41086927                    476
## 2    AU Australia 2012        22683600                   1532
## 3    BE   Belgium 2012        11142157                    483
## 4    BR    Brazil 2012       198656019                   2253
## 5    CA    Canada 2012        34880491                   1821
## 6    CL     Chile 2012        17464814                    270
##   GDP.per.Capita.in.thousands.of..US Internet.Users Internet.Penetration
## 1                                 12       22926505                55.80
## 2                                 68       18679842                82.35
## 3                                 43        9136569                82.00
## 4                                 11       99026051                49.85
## 5                                 52       30264360                86.77
## 6                                 15       10726566                61.42
##   Facebook.Users Facebook.Penetration Broadband.Subscribers
## 1       20048100                48.79               4475415
## 2       11680640                51.49               5743000
## 3        4922260                44.18               3679196
## 4       58565700                29.48              18186954
## 5       18090640                51.86              11405500
## 6        9687720                55.47               2166805
dim(internet_data)
## [1] 32 11
str(internet_data)
## 'data.frame': 32 obs. of 11 variables:
## $ iso2c : Factor w/ 32 levels "AR","AU","BE",..: 1 2 3 4 5 6 7 8 10 12 ...
## $ country : Factor w/ 32 levels "Argentina","Australia",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Population.Size : num 4.11e+07 2.27e+07 1.11e+07 1.99e+08 3.49e+07 ...
## $ GDP.in.billions.of..US : int 476 1532 483 2253 1821 270 8227 370 263 2613 ...
## $ GDP.per.Capita.in.thousands.of..US: int 12 68 43 11 52 15 6 8 3 40 ...
## $ Internet.Users : int 22926505 18679842 9136569 99026051 30264360 10726566 571345572 23367689 35574130 54528252 ...
## $ Internet.Penetration : num 55.8 82.3 82 49.9 86.8 ...
## $ Facebook.Users : int 20048100 11680640 4922260 58565700 18090640 9687720 633300 17322000 12173540 25624760 ...
## $ Facebook.Penetration : num 48.8 51.5 44.2 29.5 51.9 ...
## $ Broadband.Subscribers : int 4475415 5743000 3679196 18186954 11405500 2166805 175624800 3975161 2287249 23960000 ...
The output looks messy but we can still extract key information. There are 32 observations (rows) and 11 variables (columns). The 32 observations are countries with their own population size, number of internet users, number of Facebook users, GDP and the like.
As mentioned earlier, the variables of interest in this dataset are the population of the country, the number of internet users, and the number of Facebook users.
The population and user counts are very large, and plotting them as raw numbers would crowd the axis labels.
We therefore express the population and the numbers of users in units of one hundred thousand. Here is the code and output:
# We want to see the relationship between population size vs internet users.
# internet_users vs facebook_users.
# Subset data and scale into one hundred thousands:
pop_size <- internet_data$Population.Size / 100000
internet_users <- internet_data$Internet.Users / 100000
facebook_users <- internet_data$Facebook.Users / 100000
# Need just population size, internet users, facebook users (Extract 3 columns):
internet_user_data <- data.frame(cbind(pop_size, internet_users, facebook_users))
summary(internet_user_data)
## pop_size internet_users facebook_users
## Min. : 71.55 Min. : 52.09 Min. : 6.333
## 1st Qu.: 297.76 1st Qu.: 171.72 1st Qu.: 90.044
## Median : 560.54 Median : 317.71 Median : 172.590
## Mean : 1497.04 Mean : 612.71 Mean : 241.333
## 3rd Qu.: 1027.42 3rd Qu.: 546.51 3rd Qu.: 266.913
## Max. :13506.95 Max. :5713.46 Max. :1660.292
This creates a new dataset from the original one with just the population size, the number of internet users and the number of Facebook users.
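The same rescaling can also be done in one step. Below is a minimal sketch in base R, not the code used for the results on this page. It runs on a hand-typed copy of the six rows printed by head() above, so it works without downloading the file. The names internet_sample, count_columns and internet_user_sample are made up for this illustration.

```r
# Hand-typed excerpt: the six rows printed by head(internet_data) above.
internet_sample <- data.frame(
  country         = c("Argentina", "Australia", "Belgium", "Brazil", "Canada", "Chile"),
  Population.Size = c(41086927, 22683600, 11142157, 198656019, 34880491, 17464814),
  Internet.Users  = c(22926505, 18679842, 9136569, 99026051, 30264360, 10726566),
  Facebook.Users  = c(20048100, 11680640, 4922260, 58565700, 18090640, 9687720),
  stringsAsFactors = FALSE
)

# The three count columns to rescale.
count_columns <- c("Population.Size", "Internet.Users", "Facebook.Users")

# Divide every selected column by 100,000 in a single pass.
internet_user_sample <- as.data.frame(
  lapply(internet_sample[count_columns], function(x) x / 100000)
)

# Use the same short names as in the code above.
names(internet_user_sample) <- c("pop_size", "internet_users", "facebook_users")

internet_user_sample
```

With the full dataset, internet_data would take the place of internet_sample.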
After looking at our data and cleaning it up a bit, we plot the data.
We will draw two scatterplots. The first compares population size with the number of internet users. The second compares the number of internet users with the number of Facebook users.
Model 1
Here is the code and output of the first model.
# Plotting Linear Models:
# Source: http://ggvis.rstudio.com/cookbook.html
# Source: http://ggvis.rstudio.com/axes-legends.html
## Linear Model 1: Population size vs Internet Users:
internet_user_data %>%
  ggvis(x = ~pop_size, y = ~internet_users) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "red") %>%
  add_axis("x", title = "Population Size (In Hundred Thousands)", title_offset = 50) %>%
  add_axis("y", title = "Number of Internet Users (In Hundred Thousands)", title_offset = 50)
## Guessing formula = internet_users ~ pop_size
The plot above shows the scatterplot together with a fitted linear model (“line of best fit”). To get the linear model in a more mathematical form such as \(y = mx + b\), we run this code.
net_pop_model <- lm(data = internet_user_data, internet_users ~ pop_size)
coef(net_pop_model)
## (Intercept) pop_size
## 184.4877923 0.2860429
The fitted linear model is (units in hundred thousands):
\[\text{Number of Internet Users} = 0.2860429 \times \text{Population Size} + 184.4877923\]
According to the model, for every one-unit increase in population (100,000 people), the number of internet users increases by 0.2860429 units, which is 0.2860429 * 100,000, or about 28,604 users.
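To see the model in use, here is a worked prediction. The population of 50 million (500 units) is a hypothetical input chosen for illustration, not a row from the dataset:

\[\text{Number of Internet Users} = 0.2860429 \times 500 + 184.4877923 \approx 327.51\]

In other words, the model would predict roughly 32.8 million internet users for a country of that size.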
Notice the three rightmost points on the plot. In statistics, such extreme values are called outliers. There are formal ways of determining whether a point is an outlier, but they won’t be discussed here. As a rule of thumb, a point that is far away from the rest of the points and from the line is likely an outlier.
One way of finding the outliers here is to select the points with a population size over 3000 (in hundred thousands, that is, over 300 million people). The code is below.
# What are the three extreme points (outliers) from the scatterplot?
## Extract the three outliers:
three_outliers <- subset(internet_user_data, pop_size > 3000)
three_outliers
## pop_size internet_users facebook_users
## 7 13506.95 5713.456 6.3330
## 13 12366.87 1555.759 627.1368
## 31 3139.14 2543.495 1660.2924
# The outliers are rows 7, 13 and 31; look up their countries in the original data.
internet_data[c(7, 13, 31), "country"]
## [1] China India United States
## 32 Levels: Argentina Australia Belgium Brazil Canada Chile ... Venezuela, RB
Those three points, with large population sizes and large numbers of internet users, belong to China, India and the United States.
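Typing the row numbers by hand works, but it is easy to get wrong if the data changes. Here is a small sketch of an alternative, assuming the country name is kept as a column next to the scaled values (the data frame built earlier drops it). The data frame excerpt is a hand-typed stand-in using values printed above, not the full dataset.

```r
# Hand-typed excerpt of the scaled data: rows 1 and 4 from head(),
# plus the three rows with pop_size > 3000 shown above.
excerpt <- data.frame(
  country        = c("Argentina", "Brazil", "China", "India", "United States"),
  pop_size       = c(410.86927, 1986.56019, 13506.95, 12366.87, 3139.14),
  internet_users = c(229.26505, 990.26051, 5713.456, 1555.759, 2543.495),
  facebook_users = c(200.48100, 585.65700, 6.3330, 627.1368, 1660.2924),
  stringsAsFactors = FALSE
)

# Filter on population size; the country names come along with the rows,
# so no row indices need to be copied by hand.
large_countries <- subset(excerpt, pop_size > 3000)
large_countries$country
```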
Model 2
In the second model, we compare the number of internet users to the number of Facebook users.
### Linear Model 2: Internet Users vs Facebook Users:
internet_user_data %>%
  ggvis(x = ~internet_users, y = ~facebook_users) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "blue") %>%
  add_axis("x", title = "Number of Internet Users (In Hundred Thousands)", title_offset = 50) %>%
  add_axis("y", title = "Number of Facebook Users (In Hundred Thousands)", title_offset = 50)
## Guessing formula = facebook_users ~ internet_users
The slope here is positive (upward sloping) but less steep (flatter) than in the first model. Let’s find out what the linear model is.
net_fb_model <- lm(data = internet_user_data, facebook_users ~ internet_users)
coef(net_fb_model)
## (Intercept) internet_users
## 189.96371821 0.08383939
The linear model here is:
\[\text{Number of Facebook Users} = 0.08383939 \times \text{Number of Internet Users} + 189.96371821\]
Again we have three outliers in this second plot.
# What are the three extreme points (outliers) from the scatterplot?
## Extract the three outliers:
three_outliers_net_fb <- subset(internet_user_data, internet_users > 1500)
three_outliers_net_fb
## pop_size internet_users facebook_users
## 7 13506.95 5713.456 6.3330
## 13 12366.87 1555.759 627.1368
## 31 3139.14 2543.495 1660.2924
# The outliers are again rows 7, 13 and 31; look up their countries in the original data.
internet_data[c(7, 13, 31), "country"]
## [1] China India United States
## 32 Levels: Argentina Australia Belgium Brazil Canada Chile ... Venezuela, RB
The three outliers are from China, India and the United States once again.
I am aware that the linear model should go through the origin: if there is no population, then there cannot be any internet users or Facebook users. Likewise, no internet users means no Facebook users. However, I do not know at this time how to plot a linear model through the origin (intercept of zero) using ggvis.
In R, one can fit a linear model through the origin by adding 0 to the formula. As an example we can have:
net_fb_model_origin <- lm(data = internet_user_data, facebook_users ~ 0 + internet_users)
coef(net_fb_model_origin)
## internet_users
## 0.1636588
\[\text{Number of Facebook Users} = 0.1637 \times \text{Number of Internet Users}\]
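For the ggvis side, one possible workaround is sketched below. It is an assumption on my part rather than a documented recipe: fit the through-origin model with lm(), store the fitted values in a new column (called fit here, a name chosen for illustration), and draw that column as a line with layer_paths(). The data frame is a hand-typed excerpt of the six rows shown by head(), rescaled to hundred thousands, so its slope will differ from the 0.1637 of the full dataset.

```r
# Hand-typed excerpt: the six head() rows, scaled to hundred thousands.
sample_data <- data.frame(
  internet_users = c(229.26505, 186.79842, 91.36569, 990.26051, 302.64360, 107.26566),
  facebook_users = c(200.48100, 116.80640, 49.22260, 585.65700, 180.90640, 96.87720)
)

# Fit the model through the origin (the "0 +" removes the intercept).
origin_fit <- lm(facebook_users ~ 0 + internet_users, data = sample_data)

# Store the fitted values so they can be drawn as an ordinary line.
sample_data$fit <- fitted(origin_fit)

# Build the plot only when ggvis is installed; print origin_plot to display it.
if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  origin_plot <- sample_data %>%
    ggvis(x = ~internet_users, y = ~facebook_users) %>%
    layer_points() %>%
    layer_paths(y = ~fit, stroke := "blue")
}
```

With the full dataset, internet_user_data would replace sample_data. It may also be possible to pass the formula straight to layer_model_predictions() through its formula argument, but I have not tested that here.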