Tuesday, 29 July 2014

3.6.2 Simple Linear Regression - loading data set

This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, on page 110. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.

In this laboratory we will use the Boston data set, which comes with the MASS library. It contains information about housing values in the suburbs of Boston.

Loading data set in R

In order to load the data set in R we can use the following commands:

> library(MASS)
> fix(Boston)
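
Note that fix() opens an interactive editor window, which is not always convenient. As a minimal sketch for a non-interactive session (assuming only that the MASS package is installed; the variable name boston_dim is mine, not from the book), the data set can also be inspected with base R functions:

```r
library(MASS)

# Boston becomes available once MASS is attached; dim() reports rows and columns
boston_dim <- dim(Boston)
print(boston_dim)

# Column names and the first few rows
print(names(Boston))
print(head(Boston, 3))
```

This prints the same columns that appear in the summary output, without opening an editor.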

Then we can use the summary function to learn more about the data:

> summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

Loading data set in Azure Machine Learning

The Boston data set is not available in the predefined set of Saved Datasets. However, it can easily be loaded using the Execute R Script module, available under R Language Modules. Drag this module to the experiment canvas and set the following script to be executed:

# Load MASS library
library(MASS);

# Assign data set to the current workspace
data.frame <- Boston;

# Select frame to be sent to the output Dataset port
maml.mapOutputPort("data.frame");
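
The maml.mapOutputPort function is only defined inside the Azure Machine Learning R environment, so the script above fails in a local R session. A minimal sketch for trying it locally, assuming a hypothetical stand-in that simply returns the named object (this stub is my own illustration, not part of Azure or MASS, and it only mimics the call for testing):

```r
library(MASS)

# Hypothetical local stand-in for the Azure-only function: it looks up the
# object by name in the calling environment and returns it.
if (!exists("maml.mapOutputPort")) {
  maml.mapOutputPort <- function(name) {
    get(name, envir = parent.frame())
  }
}

# Same steps as the Azure script
data.frame <- Boston
output <- maml.mapOutputPort("data.frame")

# Check that what would be sent to the output port is a data frame
print(is.data.frame(output))
print(dim(output))
```

The stub should be removed before pasting the script into the Execute R Script module, where the real function is provided by the service.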

Your experiment canvas should now contain a single Execute R Script module, and its properties pane should show the script above.

Once you save and run the experiment, you should be able to right-click on the output port and select Visualize. This will open a new dialog with basic information about the data.

Using Descriptive Statistics module

The default data set visualization in Azure does not show all the values that are printed by the summary function in R. In particular, the first and third quartiles are missing. To get their values, one can use the Descriptive Statistics module. Drag it to the experiment surface and connect it to the existing Execute R Script module.

Your experiment canvas should now show the Execute R Script module connected to the Descriptive Statistics module.

Now, when you visualize the output port of the Descriptive Statistics module, you will see more statistics, including the quartiles that were missing previously.
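
To cross-check the quartiles locally, here is a minimal sketch in R, assuming the MASS package is installed. It uses R's default quantile definition, which is also what summary uses; the Azure module may use a different definition, in which case its values could differ slightly. The variable name quartiles is my own.

```r
library(MASS)

# First and third quartiles of every column (R's default quantile type)
quartiles <- sapply(Boston, quantile, probs = c(0.25, 0.75))

# One row per variable, columns "25%" and "75%"
print(round(t(quartiles), 3))
```

The values should agree, up to rounding, with the 1st Qu. and 3rd Qu. rows of the summary output shown earlier.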

In the next part

In the next part we will look into selecting data for the one-dimensional regression.

References


This post and all the resources are available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master

Monday, 28 July 2014

Introduction to Statistical Learning with Azure Machine Learning

Recently Microsoft announced the release of a preview version of the Azure Machine Learning service. The announcement appeared around the same time on The Official Microsoft Blog and the Machine Learning Blog. Personally, I believe this is an important step forward because it fills the gap between data scientists capable of creating elaborate models and people who want to use these models in a production environment.

The service itself is described as:

The problem? Machine learning traditionally requires complex software, high-end computers, and seasoned data scientists who understand it all. For many startups and even large enterprises, it's simply too hard and expensive. Enter Azure Machine Learning, a fully-managed cloud service for predictive analytics. By leveraging the cloud, Azure Machine Learning makes machine learning more accessible to a much broader audience. Predicting future outcomes is now attainable.

Getting started

The service is available to everyone, either through an existing Azure subscription or a free trial account. There are some training materials, but they all focus on how to use the system and assume that the user is already familiar with the algorithms and models available in the platform. That will not always be the case: the number of different models and their parameters is high. It is therefore important to link them to the subject-matter literature and to show a path one can follow to master the platform. In the following posts I will try to do just that.

Introduction to Statistical Learning...

The learning path I would like to present was created by Trevor Hastie and Rob Tibshirani, two professors at Stanford University who have been teaching statistical learning for many years and recently created an online course at Stanford Online. I highly recommend registering for this course! Not only is it free, but students also get access to a PDF version of the textbook used in the course: An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013).

This book is a great starting point for learning about machine learning. At the end of each chapter there is a lab section which shows how a newly introduced model can be exercised in R. This hands-on experience is critical to understanding how the models behave and how to select the right values for their parameters to get the best results.

with Azure Machine Learning

All the labs in the book above are in R, but the models used are generally available, and most of them can also be found in Azure Machine Learning. This gave me an idea for a series of posts which will use the same data and models but a different environment. Because the examples I present are also covered in the book, it should be easier for the reader to follow along and to go back to specific sections for a deeper understanding of how specific models work.
