Tuesday, 29 July 2014

3.6.2 Simple Linear Regression - loading data set

This laboratory was inspired by section 3.6.2, Simple Linear Regression (page 110), of the book An Introduction to Statistical Learning, with Applications in R. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.

In this laboratory we will use the Boston data set, which comes with the MASS library. It contains information about housing values in the suburbs of Boston.

Loading data set in R

In order to load the data set in R we can use the following commands:

> library(MASS)
> fix(Boston)
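
Note that fix(Boston) opens the data in an interactive editor window, which is not always convenient, for example when running a script. A minimal non-interactive sketch for a first look at the data, using only base R functions, could be:

```r
# Load the MASS package, which ships the Boston data set
library(MASS)

# Make the data set explicitly available in the current workspace
data(Boston)

# Number of rows and columns (506 suburbs, 14 variables)
print(dim(Boston))

# Column names, e.g. crim, zn, ..., lstat, medv
print(names(Boston))

# First three rows, printed to the console instead of an editor window
print(head(Boston, 3))
```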

Then we can use the summary function to learn more about the data:

> summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
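
To see where the numbers in this table come from, the sketch below recomputes the quartiles and the mean for a single column (medv) with quantile and mean. The variable names medv and quartiles are chosen here for illustration only. The values should agree with the medv column above, up to the rounding that summary applies for display.

```r
library(MASS)

# Median value of owner-occupied homes, one of the 14 columns
medv <- Boston$medv

# First quartile, median and third quartile (default quantile algorithm)
quartiles <- quantile(medv, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# Arithmetic mean, reported by summary as "Mean"
print(mean(medv))

# Minimum and maximum, reported as "Min." and "Max."
print(range(medv))
```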

Loading data set in Azure Machine Learning

The Boston data set is not available among the predefined Saved Datasets. However, it can easily be loaded with the Execute R Script module, found under R Language Modules. Drag this module onto the experiment canvas and set it to execute the following script:

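# Load the MASS library, which provides the Boston data set
library(MASS)

# Copy the data set into a variable in the current workspace
# (named "dataset" to avoid shadowing the base function data.frame)
dataset <- Boston

# Select the data frame, by variable name, to be sent to the output Dataset port
maml.mapOutputPort("dataset")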
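
The maml.mapOutputPort function is only available inside the Execute R Script module, so the script above fails in a local R session. If you want to try it locally first, one option is a small stand-in like the sketch below. The stand-in is purely hypothetical: it is not part of Azure Machine Learning, and it only checks that the named variable is a data frame and returns it.

```r
library(MASS)

# Hypothetical local stand-in, defined only when the real function is missing.
# Assumption: the real function expects the NAME of a data frame variable.
if (!exists("maml.mapOutputPort")) {
  maml.mapOutputPort <- function(name) {
    # Look the variable up in the environment the function was called from
    value <- get(name, envir = parent.frame())
    stopifnot(is.data.frame(value))
    invisible(value)
  }
}

# Same steps as in the Azure script
dataset <- Boston
result <- maml.mapOutputPort("dataset")

# Locally, inspect what would be sent to the output port
print(dim(result))
```

Inside Azure the real function is already defined, so the stand-in is skipped and the script behaves as before.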

Your experiment canvas should look like this:

Your properties pane should look like this:

Once you save and run the experiment, you should be able to right-click the output port and select Visualize:

This opens a new dialog with basic information about the data:

Using the Descriptive Statistics module

The default data set visualization in Azure does not show all the values printed by the summary function in R. In particular, the first and third quartiles are missing. To get them, use the Descriptive Statistics module: drag it onto the experiment canvas and connect it to the existing Execute R Script module.

Your experiment canvas should look like this:

Now, when you visualize the output port of the Descriptive Statistics module, you will see more statistics, including the quartiles that were missing previously.
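
For comparison, a similar per-column table can be built directly in R. This is only a sketch: the helper describe_column and the choice of statistics are assumptions made for illustration, and the result is not meant to reproduce the exact output of the Descriptive Statistics module.

```r
library(MASS)

# Hypothetical helper: a few descriptive statistics for one numeric column
describe_column <- function(x) {
  c(min     = min(x),
    q1      = unname(quantile(x, 0.25)),
    median  = median(x),
    mean    = mean(x),
    q3      = unname(quantile(x, 0.75)),
    max     = max(x),
    sd      = sd(x),
    missing = sum(is.na(x)))
}

# Apply the helper to every column; transpose so each row is one variable
stats <- t(sapply(Boston, describe_column))

print(round(stats, 3))
```

Each row of stats corresponds to one column of Boston, so the q1 and q3 entries can be checked against the 1st Qu. and 3rd Qu. lines of the summary output shown earlier.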

In the next part

In the next part we will look at selecting the data for a one-dimensional regression.

References


This post and all the resources are available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master
