This laboratory was inspired by section 3.6.2, Simple Linear Regression (page 110), of the book An Introduction to Statistical Learning, with Applications in R. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.
In this laboratory we will use the Boston data set, which comes with the MASS library.
It contains information about housing values in the suburbs of Boston.
Loading data set in R
In order to load the data set in R we can use the following commands:
> library(MASS)
> fix(Boston)
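Note that fix(Boston) opens the data in an interactive editor window, which may not be convenient in every environment (for example, in a non-interactive session). As a minimal sketch, assuming only that the MASS package is installed, the data set can also be inspected from the console with base R functions:

```r
library(MASS)  # provides the Boston data frame

# Number of rows (suburbs) and columns (variables)
dim(Boston)

# Compact overview of column names, types, and the first few values
str(Boston)

# First six rows printed to the console
head(Boston)
```

None of these calls modify the data, so they are safe to run repeatedly while exploring.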
Then we can use the summary function to learn more about the data:
> summary(Boston)
      crim                zn             indus            chas
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
      nox               rm             age              dis
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
      rad              tax           ptratio          black
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
     lstat            medv
 Min.   : 1.73   Min.   : 5.00
 1st Qu.: 6.95   1st Qu.:17.02
 Median :11.36   Median :21.20
 Mean   :12.65   Mean   :22.53
 3rd Qu.:16.95   3rd Qu.:25.00
 Max.   :37.97   Max.   :50.00
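The six values that summary reports for each column can also be computed one at a time. The following is a small sketch for a single column, medv (median home value), using base R functions; the variable name q is only illustrative:

```r
library(MASS)

# First quartile, median, and third quartile of medv
q <- quantile(Boston$medv, probs = c(0.25, 0.5, 0.75))
print(q)

# Minimum, mean, and maximum of the same column
c(min = min(Boston$medv), mean = mean(Boston$medv), max = max(Boston$medv))

# summary() on a single column returns all six values at once
summary(Boston$medv)
```

Keep in mind that summary rounds its printed output, so the values shown by quantile may carry more digits.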
Loading data set in Azure Machine Learning
The Boston data set is not available in the predefined set of Saved Datasets.
However, it can easily be loaded using the Execute R Script module, available under R Language Modules.
Drag this module to the experiment canvas and set the following script to be executed:
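# Load the MASS library, which provides the Boston data set
library(MASS)
# Copy the data set into a variable in the current workspace
# (named "dataset" to avoid shadowing the base function data.frame)
dataset <- Boston
# Select the data frame to be sent to the output Dataset port
maml.mapOutputPort("dataset")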
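The maml.mapOutputPort function is provided by the Execute R Script environment and is not available in a plain R session. If you want to try the script locally before pasting it into Azure, one option is a small stand-in function. The stub below is hypothetical and is not part of Azure Machine Learning; it only looks up the named variable and reports its size. The variable names dataset and result are illustrative.

```r
library(MASS)

# Hypothetical local stand-in: inside Azure the real maml.mapOutputPort
# is supplied by the Execute R Script module. This stub only checks that
# the named variable is a data frame and reports its dimensions.
if (!exists("maml.mapOutputPort")) {
  maml.mapOutputPort <- function(name) {
    value <- get(name, envir = parent.frame())
    stopifnot(is.data.frame(value))
    message("Would send '", name, "' to the output port: ",
            nrow(value), " rows, ", ncol(value), " columns")
    invisible(value)
  }
}

# Same steps as the script for the Execute R Script module
dataset <- Boston
result <- maml.mapOutputPort("dataset")
```

Because the stub is defined only when no function of that name exists, the same lines should not interfere with the real function when run inside the module.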
Your experiment canvas should look like this:
Your properties pane should look like this:
Once you save and run the experiment, you should be able to right-click on the output port and select Visualize:
This will open a new dialog with the basic information regarding the data:
Using Descriptive Statistics module
The default data set visualization in Azure does not show all the values that are printed by the summary function in R.
In particular, the first and third quartiles are missing.
To get their values, you can use the Descriptive Statistics module.
Drag it to the experiment canvas and connect it to the existing Execute R Script module.
Your experiment canvas should look like this:
Now, when you visualize the output port of the Descriptive Statistics module, you will see more statistics, including the quartiles that were missing previously.
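For comparison, a similar per-column table can be built in plain R. The sketch below is only a rough analogue: the choice of statistics and the column names (feature, count, missing, q1, q3, and so on) are assumptions made for illustration and do not reproduce the exact output of the Descriptive Statistics module.

```r
library(MASS)

# One row of statistics per column of Boston
stats <- data.frame(
  feature = names(Boston),
  count   = sapply(Boston, function(x) sum(!is.na(x))),
  missing = sapply(Boston, function(x) sum(is.na(x))),
  min     = sapply(Boston, min),
  q1      = sapply(Boston, quantile, probs = 0.25, names = FALSE),
  median  = sapply(Boston, median),
  q3      = sapply(Boston, quantile, probs = 0.75, names = FALSE),
  max     = sapply(Boston, max),
  sd      = sapply(Boston, sd),
  row.names = NULL
)
print(stats)
```

The q1 and q3 columns hold the first and third quartiles, which are the values the default visualization leaves out.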
In the next part
In the next part we will look into selecting data for the one-dimensional regression.
References
- Housing Values in Suburbs of Boston
- Microsoft Azure Machine Learning (Trial)
- Microsoft Machine Learning Blog
- Statistical Learning course at Stanford Online
- An Introduction to Statistical Learning with Applications in R (Springer, Amazon)
- The Comprehensive R Archive Network
- RStudio
This post and all the resources are available on GitHub:
https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master