Sunday, September 7, 2014

3.6.2 Simple Linear Regression - running prediction

This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, on page 110. Please refer to this resource for a detailed explanation of the models and the nomenclature used in this post.

In the previous post we've seen how to train a linear regression model. This post explains how to use the model to make predictions on new data.

Running prediction in R

Once we have a trained model we can use the predict function to produce a prediction.

> predict(lm.fit, data.frame(lstat = c(5, 10, 15)))

       1        2        3 
29.80359 25.05335 20.30310 

Alternatively, for lm models we can set the interval parameter to compute prediction intervals.

> predict(lm.fit,
    data.frame(lstat = c(5, 10, 15)), interval = "prediction")

       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846

From these results we can read, for example, that the predicted value of medv for an lstat of 10 is 25.05335 and its 95% prediction interval is (12.827626, 37.27907).
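For comparison, the same call with interval = "confidence" produces the narrower 95% confidence intervals for the mean response. A minimal sketch, assuming the model was fitted as in the previous post and stored in a variable named lm.fit:

```r
library(MASS)  # provides the Boston data set

# Fit the simple linear regression from the previous post
lm.fit = lm(medv ~ lstat, data = Boston)

# 95% confidence intervals for the mean response at lstat = 5, 10, 15
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
#        fit      lwr      upr
# 1 29.80359 29.00741 30.59978
# 2 25.05335 24.47413 25.63256
# 3 20.30310 19.73159 20.87461
```

The confidence interval describes the uncertainty around the average medv for a given lstat, while the prediction interval also accounts for the variability of individual observations, which is why it is wider.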

Running prediction in Azure Machine Learning

The process of running prediction in Azure is slightly different because it is optimized for the web. Instead of calling a function locally we will publish the trained model as an Azure Web Service.

At first it may seem ridiculous to call a web service to run prediction on such a simple model, but on second thought, it is actually great. What we get is a fully operational, highly scalable service which we can use straight away. Thanks to the nicely factored API we can continue to improve the models without the need to update our clients. Finally, as we will see, the services are instrumented. Information about usage patterns will be saved and available through the Azure portal.

The process of publishing the model has been described well in the documentation.

The first step is to add a Score Model module to the experiment. It has two inputs, a trained model and a data set to score. This module has no configuration because it can infer it from the context.

Score Model

Next we need to click on its data set input and output and select "Set as Publish Input"

Set as Publish Input

and "Set as Publish Output" options accordingly.

Set as Publish Output

Once the experiment has all the inputs and outputs set, you can run it, and the "Publish Web Service" command will be enabled.

Publish as Web Service

The system will create a service and redirect you to its management site.

Service management

From there you can test it directly in the browser or select the API help page to see how to access it programmatically.

Service API

At the bottom there are samples in C#, Python and R! Let's copy the code into RStudio. In order to pass server authorization we need to replace the dummy API key with a genuine key from the service site. We will also set the value of lstat to 10.


library("RCurl")
library("RJSONIO")

# Accept SSL certificates issued by public Certificate Authorities
options(RCurlOptions = list(
    cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

h = basicTextGatherer()
req = list(Id = "score00001",
    "lstat" = "10",
    "medv" = "0")

body = toJSON(req)
api_key = "abc123" # Replace this with the API key for the web service
authz_hdr = paste('Bearer', api_key, sep = ' ')

h$reset()
curlPerform(
    url = "",  # Replace with the Request URI from the API help page
    httpheader = c(
        'Content-Type' = "application/json",
        'Authorization' = authz_hdr),
    postfields = body,
    writefunction = h$update,
    verbose = TRUE)

result = h$value()

This will produce the following output. Please notice that curlPerform is called with verbose = TRUE, so there will be a lot of diagnostic information. It can be very helpful during development, but you will most likely want to suppress it when you create a client library that makes use of the service.

* About to connect() to port 443 (#0)
*   Trying * connected
* Connected to (
    port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: C:/Users/stansw/Documents/R/win-library/3.1/RCurl/CurlSSL/
   cacert.pem CApath: none
* SSL connection using AES128-SHA
* Server certificate:
*      subject:
*      start date: 2014-07-01 19:23:34 GMT
*      expire date: 2016-06-30 19:23:34 GMT
*      subjectAltName: matched
*      issuer: C=US; ST=Washington; L=Redmond; O=Microsoft Corporation;
         OU=Microsoft IT; CN=Microsoft IT SSL SHA2
*      SSL certificate verify ok.
> POST /workspaces/fb65c4e602654cb6a9fe4aae12daf762/services/
    8a8527dd062548e5b600e6023c0a69a0/score HTTP/1.1
Accept: */*
Content-Type: application/json
Authorization: Bearer abc123
Content-Length: 116

< HTTP/1.1 200 OK
< Content-Length: 28
< Content-Type: application/json; charset=utf-8
< Server: Microsoft-HTTPAPI/2.0
< x-ms-request-id: 44bbb8b4-cf0d-4b70-8ca0-83326c5265f5
< Date: Mon, 08 Sep 2014 05:30:49 GMT
* Connection #0 to host left intact

[1] "[\"10\",\"0\",\"25.0533473418032\"]"

The last line is the most interesting bit. It tells us that for an lstat value of 10 the model predicts 25.0533473418032. As expected, this is precisely the value we received when we ran the model inside R.
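The response body is a JSON array of strings: the echoed inputs followed by the scored value. A minimal base-R sketch of extracting the prediction on the client (the Azure sample uses fromJSON from RJSONIO to build the request; the same package could parse the response, but for such a simple payload string functions suffice):

```r
# Raw response from the web service
result = "[\"10\",\"0\",\"25.0533473418032\"]"

# Strip the brackets and quotes, then split on commas
values = strsplit(gsub('[]["]', '', result), ",")[[1]]

# The scored value is the last element
prediction = as.numeric(values[length(values)])
print(prediction)  # 25.05335
```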


  • In this laboratory we saw how to run the prediction both in R and in Azure Machine Learning Studio.
  • Both models returned the same value.
  • When working in R it was very easy to get additional statistical information about the prediction, such as the 95% intervals.
  • By publishing our experiment we created a fully operational Web Service hosted in Azure.

In the next part

In the next part we will expand the feature space and train a multiple linear regression model.


This post and all the resources are available on GitHub:

Saturday, August 2, 2014

3.6.2 Simple Linear Regression - fitting the model

This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, on page 110. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.

Previously we've seen how to load the Boston data set from the MASS library. Now we will look into how to fit a linear regression model. We will try to predict the median value of owner-occupied homes in $1000s (medv) based on just a single predictor: the lower status of the population in percent (lstat).

Fitting linear regression model in R

In R one can fit a linear regression model using the lm() function. Its basic syntax is lm(y~x, data), where y is the response, x is the predictor and data is the data set.

In order to fit the model to Boston data we can call:

> lm.fit = lm(medv~lstat, data=Boston)

For basic information about the model we can type its name:


> lm.fit

Call:
lm(formula = medv ~ lstat, data = Boston)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  

It will print the function call used to create the model as well as the fitted coefficients.

In order to get more detailed information we can type:

> summary(lm.fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

This gives us information about residuals, p-values and standard errors for the coefficients, as well as statistics for the model.
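Individual pieces of this output can also be extracted programmatically, which will be convenient when comparing the R model with the Azure one. A short sketch, assuming the model is stored in lm.fit:

```r
library(MASS)  # provides the Boston data set

lm.fit = lm(medv ~ lstat, data = Boston)

# Fitted coefficients as a named vector
coef(lm.fit)
# (Intercept)       lstat 
#  34.5538409  -0.9500494 

# 95% confidence intervals for the coefficient estimates
confint(lm.fit)
#                 2.5 %     97.5 %
# (Intercept) 33.448457 35.6592247
# lstat       -1.026148 -0.8739505

# R-squared from the summary object
summary(lm.fit)$r.squared
# [1] 0.5441463
```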

Fitting linear regression model in Azure Machine Learning

In order to repeat the same experiment in Azure Machine Learning we will start with modules created last time.

In the first step we need to select the columns we want to work with. Drag one 'Project Columns' module (Data Transformation -> Manipulation) to the experiment canvas and connect it with the existing Execute R Script module:

In the properties pane click on the Launch column selector:

Select columns: medv and lstat.

With the right data we can proceed to fitting the model. Drag the Linear Regression module (Machine Learning -> Initialize Model -> Regression) to the experiment canvas. To train the model we will also need one Train Model (Machine Learning -> Train).

Connect all the modules. Select Train Model and in the properties pane click on Launch column selector to choose the response column. This time select only medv because that's the quantity we want to predict.

The complete model should look like this:

Run it to fit the model to the data.

You can visualize the output port of the Train Model module to see the result.

We can see that the coefficient values obtained from Azure Machine Learning are different from what we got in R. Instead of 34.55 for the intercept (bias) we have 25.80, and the coefficient for lstat changed from -0.95 to -11.43.

The reason for this discrepancy is that Azure Machine Learning uses a more advanced model with a learning rate and regularization, which we will get to in future laboratories when we reach chapter 6, Linear Model Selection and Regularization, of ISLR. For now we will disable these features to reach parity between the two models we've seen so far.
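To get a feeling for what regularization does to the coefficients, we can sketch an L2 (ridge) penalty in R with lm.ridge from the MASS package. This is only an illustration of the shrinkage effect, not the exact algorithm Azure uses:

```r
library(MASS)  # provides Boston and lm.ridge

# lambda = 0 reduces ridge regression to ordinary least squares,
# reproducing the coefficients 34.55 and -0.95 seen above
coef(lm.ridge(medv ~ lstat, data = Boston, lambda = 0))

# A large penalty shrinks the lstat coefficient toward zero
coef(lm.ridge(medv ~ lstat, data = Boston, lambda = 100))
```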

Select Linear Regression module, go to the properties pane and select the following configuration.

Rerun the model and visualize the result.

Now we can see that the coefficient values match what we got at the beginning. Just as in R, the model is described by its coefficients and we need to use other functions to get more information about its performance.

In the next part

In the next part we will look into evaluating the trained model.


This post and all the resources are available on GitHub:

Tuesday, July 29, 2014

3.6.2 Simple Linear Regression - loading data set

This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, on page 110. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.

In this laboratory we will use the Boston data set which comes with the MASS library. It contains information about housing values in the suburbs of Boston.

Loading data set in R

In order to load the data set in R we can use the following command:

> library(MASS)
Then, we can use the summary function to learn more about the data:

> summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
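A couple of quick checks confirm what was loaded; a short sketch:

```r
library(MASS)  # provides the Boston data set

# 506 observations of 14 variables
dim(Boston)
# [1] 506  14

# Names of the first few columns
head(names(Boston))
# [1] "crim"  "zn"    "indus" "chas"  "nox"   "rm"
```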

Loading data set in Azure Machine Learning

The Boston data set is not available in the predefined set of Saved Datasets. However, it can easily be loaded using Execute R Script available under R Language Modules. Drag this module to the experiment canvas and set the following script to be executed:

# Load MASS library
library(MASS);

# Assign data set to the current workspace
data.frame <- Boston;

# Select frame to be sent to the output Dataset port
maml.mapOutputPort("data.frame");
Your experiment canvas should look like this:

Your properties pane should look like this:

Once you save and run the experiment you should be able to right-click on the output port and select Visualize:

This will open a new dialog with the basic information regarding the data:

Using Descriptive Statistics module

The default data set visualization in Azure does not show all the values printed by the summary function in R. In particular, the first and third quartiles are missing. In order to get their values one can use the Descriptive Statistics module. Drag it to the experiment surface and connect it with the existing Execute R Script module.

Your experiment canvas should look like this:

Now when you visualize the output port of the Descriptive Statistics module you will see more statistics, including the previously missing quartiles.
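The quartiles reported by the Descriptive Statistics module can be cross-checked in R with the quantile function; a sketch for the lstat column:

```r
library(MASS)  # provides the Boston data set

# First and third quartiles of lstat, matching the summary() output above
quantile(Boston$lstat, probs = c(0.25, 0.75))
```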

In the next part

In the next part we will look into selecting data for the one-dimensional regression.


This post and all the resources are available on GitHub:

Monday, July 28, 2014

Introduction to Statistical Learning with Azure Machine Learning

Recently Microsoft announced the release of a preview version of the Azure Machine Learning service. The announcement appeared around the same time on The Official Microsoft Blog and the Machine Learning Blog. Personally, I believe this is an important step forward because it fills the gap between data scientists capable of creating elaborate models and people who want to use these models in a production environment.

The service itself is described as:

The problem? Machine learning traditionally requires complex software, high-end computers, and seasoned data scientists who understand it all. For many startups and even large enterprises, it's simply too hard and expensive. Enter Azure Machine Learning, a fully-managed cloud service for predictive analytics. By leveraging the cloud, Azure Machine Learning makes machine learning more accessible to a much broader audience. Predicting future outcomes is now attainable.

Getting started

The service is there; everyone can access it by either using an existing Azure subscription or creating a free trial account. There are some training materials, but they all focus on how to use the system. They assume that the user is already familiar with the algorithms and models available in the platform. However, that will not always be the case. The number of different models and their parameters is high. Therefore it is important to establish the link between them and the subject-matter literature and show a path one can follow to master the platform. In the following posts I will try to do just that.

Introduction to Statistical Learning...

The learning path I would like to present was created by Trevor Hastie and Rob Tibshirani, two professors at Stanford University, who have been teaching statistical learning for many years and recently created an online course at Stanford Online. I highly recommend registering for this course! Not only is it free, but students also get access to a pdf version of the textbook used in the course - An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013).

This book is a great starting point for learning about machine learning. At the end of each chapter there is a lab section which shows how a newly introduced model can be exercised in R. This hands-on experience is critical to understanding how the models behave and how to select the right values for their parameters to get the best results.

with Azure Machine Learning

All the labs in the book above are in R, but the models used are generally available and most of them can also be found in Azure Machine Learning. This gave me an idea for a series of posts which will use the same data and models but a different environment. Because the examples I present will also be covered in the book, it should be easier for the reader to follow and to get back to specific sections for a deeper understanding of how specific models work.


This post and all the resources are available on GitHub: