Hello. On this page, I will show how to display scatterplots in the statistical language R using ggvis, an R package for data visualization.

 


The Dataset

 

This dataset describes internet use by country (the rows shown below are from 2012). The variables of interest are the number of internet users, the population of the country and the number of Facebook users.

 

Looking at The Data

 

In R, we load the libraries ggvis and dplyr. Loading ggvis gives us access to the plots, graphs and other visual tools in the library. The dplyr package allows us to use the %>% (pipe) syntax with ggvis; the pipe passes the result on its left to the function on its right. More info on %>% can be found here. The dataset is read from the URL given in the code below.
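As a minimal illustration of the pipe, the sketch below computes the same value twice, once with nested calls and once with %>%. The numbers are made up for the example and are not part of the dataset.

```r
library(dplyr)  # provides the %>% pipe operator

# Hypothetical values, used only to show how the pipe works
values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right
piped <- values %>% sqrt() %>% sum()

# Both forms give the same result
identical(nested, piped)
```

The ggvis code later on this page uses the same left-to-right style to add layers to a plot one step at a time.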

The code and output can be found below:

 

# Load ggvis (plotting) and dplyr (pipe operator)
library(ggvis)
library(dplyr)

url <- "http://sites.williams.edu/bklingen/files/2015/05/InternetUse.csv"

internet_data <- read.csv(url)

head(internet_data)
##   iso2c   country year Population.Size GDP.in.billions.of..US
## 1    AR Argentina 2012        41086927                    476
## 2    AU Australia 2012        22683600                   1532
## 3    BE   Belgium 2012        11142157                    483
## 4    BR    Brazil 2012       198656019                   2253
## 5    CA    Canada 2012        34880491                   1821
## 6    CL     Chile 2012        17464814                    270
##   GDP.per.Capita.in.thousands.of..US Internet.Users Internet.Penetration
## 1                                 12       22926505                55.80
## 2                                 68       18679842                82.35
## 3                                 43        9136569                82.00
## 4                                 11       99026051                49.85
## 5                                 52       30264360                86.77
## 6                                 15       10726566                61.42
##   Facebook.Users Facebook.Penetration Broadband.Subscribers
## 1       20048100                48.79               4475415
## 2       11680640                51.49               5743000
## 3        4922260                44.18               3679196
## 4       58565700                29.48              18186954
## 5       18090640                51.86              11405500
## 6        9687720                55.47               2166805
dim(internet_data)
## [1] 32 11
str(internet_data)
## 'data.frame':    32 obs. of  11 variables:
##  $ iso2c                             : Factor w/ 32 levels "AR","AU","BE",..: 1 2 3 4 5 6 7 8 10 12 ...
##  $ country                           : Factor w/ 32 levels "Argentina","Australia",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year                              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Population.Size                   : num  4.11e+07 2.27e+07 1.11e+07 1.99e+08 3.49e+07 ...
##  $ GDP.in.billions.of..US            : int  476 1532 483 2253 1821 270 8227 370 263 2613 ...
##  $ GDP.per.Capita.in.thousands.of..US: int  12 68 43 11 52 15 6 8 3 40 ...
##  $ Internet.Users                    : int  22926505 18679842 9136569 99026051 30264360 10726566 571345572 23367689 35574130 54528252 ...
##  $ Internet.Penetration              : num  55.8 82.3 82 49.9 86.8 ...
##  $ Facebook.Users                    : int  20048100 11680640 4922260 58565700 18090640 9687720 633300 17322000 12173540 25624760 ...
##  $ Facebook.Penetration              : num  48.8 51.5 44.2 29.5 51.9 ...
##  $ Broadband.Subscribers             : int  4475415 5743000 3679196 18186954 11405500 2166805 175624800 3975161 2287249 23960000 ...

 

The output looks messy but we can still extract key information. There are 32 observations (rows) and 11 variables (columns). The 32 observations are countries with their own population size, number of internet users, number of Facebook users, GDP and the like.

As mentioned earlier, the variables of interest in this dataset are the population of the country, the number of internet users, and the number of Facebook users.

The population and user counts are very large (in the tens of millions or more), and numbers of that size would overcrowd the axis labels in the plots.

We therefore rescale the population and user counts into units of one hundred thousand. Here is the code and output:

# We want to see the relationship between population size vs internet users.
# internet_users vs facebook_users.

# Subset data and scale into one hundred thousands:

pop_size <- internet_data$Population.Size / 100000

internet_users <- internet_data$Internet.Users / 100000

facebook_users <- internet_data$Facebook.Users / 100000

# Keep just the three rescaled columns (population size, internet users, Facebook users):

internet_user_data <- data.frame(pop_size, internet_users, facebook_users)

summary(internet_user_data)
##     pop_size        internet_users    facebook_users    
##  Min.   :   71.55   Min.   :  52.09   Min.   :   6.333  
##  1st Qu.:  297.76   1st Qu.: 171.72   1st Qu.:  90.044  
##  Median :  560.54   Median : 317.71   Median : 172.590  
##  Mean   : 1497.04   Mean   : 612.71   Mean   : 241.333  
##  3rd Qu.: 1027.42   3rd Qu.: 546.51   3rd Qu.: 266.913  
##  Max.   :13506.95   Max.   :5713.46   Max.   :1660.292

 

The new data frame, internet_user_data, contains just the population size, the number of internet users and the number of Facebook users, each in units of one hundred thousand.

 

The Scatterplots and Linear Models

After looking at our data and cleaning it up a bit, we plot the data.

We will draw two scatterplots. The first plots population size (x) against the number of internet users (y). The second plots the number of internet users (x) against the number of Facebook users (y).

 

Model 1

 

Here is the code and output of the first model.

# Plotting Linear Models:
# Source: http://ggvis.rstudio.com/cookbook.html
# Source: http://ggvis.rstudio.com/axes-legends.html


## Linear Model 1: Population size vs Internet Users:

internet_user_data %>%
  ggvis(x = ~pop_size, y = ~internet_users) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "red") %>%
  add_axis("x", title = "Population Size (In Hundred Thousands)", title_offset = 50) %>%
  add_axis("y", title = "Number of Internet Users (In Hundred Thousands)", title_offset = 50)
## Guessing formula = internet_users ~ pop_size

 

Above is the visual of the scatterplot and a linear model (“line of best fit”). If we want the linear model in a more mathematical form such as \(y = mx + b\) we run this code.

 

net_pop_model <- lm(data = internet_user_data, internet_users ~ pop_size)

coef(net_pop_model)
## (Intercept)    pop_size 
## 184.4877923   0.2860429

 

The fitted linear model is (in units of one hundred thousand):

 

\[\text{Number of Internet Users} = 0.2860429 \times \text{Population Size} + 184.4877923\]

 

According to the model, for every one-unit increase in population (that is, 100,000 more people), the predicted number of internet users increases by 0.2860429 units, or about 28,604 users (0.2860429 × 100,000).
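The arithmetic behind that interpretation can be sketched in a few lines of R. The coefficients are copied from the coef() output above; the population of 50 million is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the coef(net_pop_model) output
intercept <- 184.4877923
slope <- 0.2860429

# One model unit is 100,000 people, so convert the slope back to a raw count
users_per_unit <- slope * 100000

# Hypothetical country of 50 million people (500 units of 100,000)
hypothetical_pop <- 500
predicted_users <- intercept + slope * hypothetical_pop

users_per_unit            # extra internet users per 100,000 extra people
predicted_users * 100000  # predicted internet users as a raw count
```

For this hypothetical country the model predicts roughly 32.75 million internet users. With the fitted model object available, predict(net_pop_model, data.frame(pop_size = 500)) should give the same value in model units.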

Notice the three rightmost points on the plot. In statistics, such extreme values are called outliers. There are formal ways of determining whether a point is an outlier, but they won’t be discussed in detail here. Informally, a point that lies far away from the rest of the points and from the fitted line is likely an outlier.
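For readers who want one concrete rule of thumb, the sketch below applies the common 1.5 × IQR rule to the population sizes. This is not the method used in this post; it is an illustration that reuses the quartiles printed by summary() and the population sizes of the three rightmost points reported in this post.

```r
# Quartiles of pop_size copied from the summary() output
q1 <- 297.76
q3 <- 1027.42

# Upper fence of the 1.5 * IQR rule of thumb
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# Population sizes (in hundred thousands) of the three rightmost points
rightmost <- c(13506.95, 12366.87, 3139.14)

upper_fence
rightmost > upper_fence  # TRUE means the point lies beyond the fence
```

All three points lie beyond the fence. On the full data, the quartiles could be computed directly with quantile(internet_user_data$pop_size, c(0.25, 0.75)) instead of being copied by hand.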

One way of finding the outliers here is to select the points with a population size over 3000 (in hundred thousands, i.e. over 300 million people). The code is below.

 

# What are the three extreme points (outliers) from the scatterplot?

## Extract the three outliers:

three_outliers <- subset(internet_user_data, pop_size > 3000)

three_outliers
##    pop_size internet_users facebook_users
## 7  13506.95       5713.456         6.3330
## 13 12366.87       1555.759       627.1368
## 31  3139.14       2543.495      1660.2924
# The outliers are rows 7, 13 and 31; look up their countries in the original data.

internet_data[c(7, 13, 31), "country"]
## [1] China         India         United States
## 32 Levels: Argentina Australia Belgium Brazil Canada Chile ... Venezuela, RB

 

Those three points, with large population sizes and large numbers of internet users, belong to China, India and the United States.
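The lookup above hard-codes the row numbers 7, 13 and 31. A sketch of an alternative that avoids typing indices by hand is shown below. The small data frame named excerpt is a stand-in built from values printed in this post (Argentina's population is rounded), not the full dataset.

```r
# Small excerpt standing in for the real data (pop_size in hundred thousands)
excerpt <- data.frame(
  country = c("Argentina", "China", "India", "United States"),
  pop_size = c(410.87, 13506.95, 12366.87, 3139.14)
)

# Logical vector marking the rows above the threshold
is_outlier <- excerpt$pop_size > 3000

# Use the logical vector to pick out the matching countries
outlier_countries <- excerpt$country[is_outlier]
outlier_countries
```

On the full data the same idea would be internet_data$country[internet_user_data$pop_size > 3000], which works because both data frames keep the same row order.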

 

Model 2

 

In the second model, we compare the number of internet users to the number of Facebook users.

 

### Linear Model 2: Internet Users vs Facebook Users:

internet_user_data %>%
  ggvis(x = ~internet_users, y = ~facebook_users) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "blue") %>%
  add_axis("x", title = "Number of Internet Users (In Hundred Thousands)", title_offset = 50) %>%
  add_axis("y", title = "Number of Facebook Users (In Hundred Thousands)", title_offset = 50)
## Guessing formula = facebook_users ~ internet_users

 

The fitted line here slopes upward, but it is flatter than the line in the first plot. Let’s find out what the linear model is.

 

net_fb_model <- lm(data = internet_user_data, facebook_users ~ internet_users)

coef(net_fb_model)
##    (Intercept) internet_users 
##   189.96371821     0.08383939

 

The linear model here is:

 

\[\text{Number of Facebook Users} = 0.08383939 \times \text{Number of Internet Users} + 189.96371821\]
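As with the first model, the slope can be read back in raw counts. The sketch below also compares China's observed Facebook count with the value the line predicts for it. The coefficients and the China row are copied from the output in this post; the suggestion that this one point may be flattening the line is an interpretation, not something tested here.

```r
# Coefficients copied from the coef(net_fb_model) output
intercept_fb <- 189.96371821
slope_fb <- 0.08383939

# Extra Facebook users per 100,000 extra internet users
fb_per_unit <- slope_fb * 100000

# China's row (row 7), in hundred thousands
china_internet <- 5713.456
china_facebook <- 6.3330

# Fitted value and residual (observed minus fitted) for China
china_fitted <- intercept_fb + slope_fb * china_internet
china_residual <- china_facebook - china_fitted

fb_per_unit
china_fitted
china_residual
```

The model predicts about 8,384 more Facebook users per 100,000 additional internet users. China's observed count sits far below its fitted value, so it is plausible that this point pulls the line down at the right-hand end.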

 

Again we have three outliers in this second plot.

 

# What are the three extreme points (outliers) from the scatterplot?

## Extract the three outliers:

three_outliers_net_fb <- subset(internet_user_data, internet_users > 1500)

three_outliers_net_fb
##    pop_size internet_users facebook_users
## 7  13506.95       5713.456         6.3330
## 13 12366.87       1555.759       627.1368
## 31  3139.14       2543.495      1660.2924
# Indices of rows 7, 13 and 31 of outliers, extracting countries of outliers.

internet_data[c(7, 13, 31), "country"]
## [1] China         India         United States
## 32 Levels: Argentina Australia Belgium Brazil Canada Chile ... Venezuela, RB

 

The three outliers are from China, India and the United States once again.

 

Notes and Thoughts

 

I am aware that the linear model should pass through the origin: if there is no population, there cannot be any internet users or Facebook users, and no internet users means no Facebook users. However, at this time I do not know how to plot a linear model through the origin (an intercept of zero) using ggvis.
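One possible approach, offered as a sketch rather than a confirmed recipe: layer_model_predictions() has a formula argument, so a formula without an intercept can be passed to it instead of letting ggvis guess the formula. The toy_data values below are hypothetical and only stand in for internet_user_data.

```r
library(ggvis)
library(dplyr)

# Hypothetical toy data, standing in for internet_user_data
toy_data <- data.frame(
  internet_users = c(50, 120, 200, 310, 450, 600),
  facebook_users = c(20, 45, 90, 130, 170, 260)
)

# An explicit formula with "0 +" asks lm() to drop the intercept
origin_plot <- toy_data %>%
  ggvis(x = ~internet_users, y = ~facebook_users) %>%
  layer_points() %>%
  layer_model_predictions(
    model = "lm",
    formula = facebook_users ~ 0 + internet_users,
    stroke := "blue"
  )

# Print origin_plot in an interactive session to display it
```

By default the fitted line is drawn only over the range of the observed x values, so it may stop short of the origin on the plot even though the underlying model has a zero intercept.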

In R, one can fit a linear model through the origin by adding 0 + to the right-hand side of the formula. As an example we can have:

 

net_fb_model_origin <- lm(data = internet_user_data, facebook_users ~ 0 + internet_users)
coef(net_fb_model_origin)
## internet_users 
##      0.1636588

 

\[\text{Number of Facebook Users} = 0.1637 \times \text{Number of Internet Users}\]
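To see how much the two versions of the second model differ, the sketch below evaluates both at the median number of internet users. All numbers are copied from the output in this post; the choice of the median as the evaluation point is arbitrary.

```r
# Model with intercept (from coef(net_fb_model))
intercept_fb <- 189.96371821
slope_fb <- 0.08383939

# Model through the origin (from coef(net_fb_model_origin))
slope_origin <- 0.1636588

# Median of internet_users from the summary() output (hundred thousands)
median_internet <- 317.71

# Predicted Facebook users (hundred thousands) under each model
pred_with_intercept <- intercept_fb + slope_fb * median_internet
pred_through_origin <- slope_origin * median_internet

pred_with_intercept
pred_through_origin
```

At the median, the model with an intercept predicts roughly four times as many Facebook users as the model through the origin, which shows how much the intercept term drives predictions for smaller countries.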

 

References