Sunday, 7 September 2014

3.6.2 Simple Linear Regression - running prediction

This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, on page 110. Please refer to this resource for a detailed explanation of the models and the nomenclature used in this post.

In the previous post we saw how to train a linear regression model. This post explains how to use that model to make predictions on new data.

Running prediction in R

Once we have a trained model, we can use the predict function to produce predictions for new values of lstat.

> predict(lm.fit, data.frame(lstat = c(5, 10, 15)))

       1        2        3 
29.80359 25.05335 20.30310 
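To make the call above reproducible on its own, here is a minimal sketch. It assumes that lm.fit from the previous post was fitted as medv ~ lstat on the Boston data set from the MASS package (the names new_data, predicted and by_hand are introduced here only for illustration). It also recomputes the predictions by hand from the fitted coefficients, which shows that predict simply evaluates intercept + slope * lstat.

```r
library(MASS)  # provides the Boston data set

# Assumption: this mirrors the model trained in the previous post.
lm.fit <- lm(medv ~ lstat, data = Boston)

# New observations must use the same column name as the predictor.
new_data <- data.frame(lstat = c(5, 10, 15))
predicted <- predict(lm.fit, new_data)

# The same numbers, computed by hand from the fitted coefficients.
beta <- coef(lm.fit)
by_hand <- beta[["(Intercept)"]] + beta[["lstat"]] * new_data$lstat

print(predicted)
print(by_hand)
```

Both print calls should show the same three values, one per row of new_data.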

Alternatively, for lm models we can set the interval parameter to compute prediction intervals.

> predict(
    lm.fit,
    data.frame(lstat = c(5, 10, 15)), interval = "prediction")

       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846

From these results we can read that, for example, the predicted value of medv for an lstat of 10 is 25.05335, and its 95% prediction interval is (12.827626, 37.27907).
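To see where the lwr and upr columns come from, the following sketch rebuilds the 95% prediction interval for lstat = 10 by hand. It assumes the same model as before (medv ~ lstat on MASS::Boston) and uses the standard formula for a linear model: fit ± t * sqrt(se.fit^2 + sigma^2), where sigma is the residual standard error. The variable names are illustrative.

```r
library(MASS)

# Assumption: same model as in the previous post.
lm.fit <- lm(medv ~ lstat, data = Boston)
new_data <- data.frame(lstat = 10)

# se.fit = TRUE also returns the standard error of the fitted mean,
# the residual degrees of freedom and the residual standard error.
p <- predict(lm.fit, new_data, se.fit = TRUE)

# Critical value of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df = p$df)

# A prediction interval adds the residual variance to the variance
# of the fitted mean, which is why it is wider than a confidence interval.
half_width <- t_crit * sqrt(p$se.fit^2 + p$residual.scale^2)
manual <- c(lwr = unname(p$fit - half_width),
            upr = unname(p$fit + half_width))

builtin <- predict(lm.fit, new_data, interval = "prediction")
print(manual)
print(builtin)
```

Passing interval = "confidence" instead would drop the residual variance term and give the narrower interval for the mean response.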

Running prediction in Azure Machine Learning

The process of running prediction in Azure is slightly different because it is optimized for the web. Instead of calling a function locally, we will publish the trained model as an Azure Web Service.

At first it may seem ridiculous to call a web service to run prediction on such a simple model, but on second thought it is actually great. What we get is a fully operational, very scalable service which we can use straight away. Thanks to the nicely factored API we can continue to improve our models without needing to update our clients. Finally, as we will see, the services are instrumented: information about usage patterns is saved and made available through the Azure portal.

The process of publishing the model has been described well in the documentation.

The first step is to add a Score Model module to the experiment. It has two inputs: a trained model and a data set to score. This module has no configuration because it can infer everything it needs from the context.

Score Model
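As a rough local analogy only (this is not how Azure implements the module), scoring can be sketched in R as a function that takes a trained model and a data set and returns the data set with the predictions appended. The function name score_model and the column name scored are hypothetical, and the model is assumed to be medv ~ lstat on MASS::Boston.

```r
library(MASS)

# Hypothetical helper: append the model's predictions to the input data.
score_model <- function(model, data) {
  data$scored <- unname(predict(model, newdata = data))
  data
}

# Assumption: same model as in the previous post.
trained <- lm(medv ~ lstat, data = Boston)

# medv is only a placeholder here; the prediction depends on lstat alone.
to_score <- data.frame(lstat = 10, medv = 0)
scored <- score_model(trained, to_score)
print(scored)
```

The result keeps the input columns and adds one column with the predicted value.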

Next we need to click on the data set input of the Score Model module and select "Set as Publish Input", then click on its output and select "Set as Publish Output".

Set as Publish Input

Set as Publish Output

Once the experiment has its input and output set, you can run it. After a successful run the "Publish Web Service" command will be enabled.

Publish as Web Service

The system will create a service and redirect you to its management site.

Service management

From there you can test it directly in the browser, or open the API help page to see how to access it programmatically.

Service API

At the bottom there are samples in C#, Python and R! Let's copy the R sample into RStudio. To pass server authorization, we need to replace the placeholder API key with the real key from the service site. We will also set the value of lstat to 10.

library("RCurl")
library("RJSONIO")

# Accept SSL certificates issued by public Certificate Authorities
options(RCurlOptions = list(
    cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

h = basicTextGatherer()
req = list(Id="score00001",
 Instance=list(FeatureVector=list(
    "lstat"= "10",
    "medv"= "0"
 ),GlobalParameters=fromJSON('{}')))

body = toJSON(req)
api_key = "abc123" # Replace this with the API key for the web service
authz_hdr = paste('Bearer', api_key, sep=' ')

h$reset()
curlPerform(
    url = "https://ussouthcentral.services.azureml.net/workspaces/...",
    httpheader=c(
        'Content-Type' = "application/json",
        'Authorization' = authz_hdr),
    postfields=body,
    writefunction = h$update,
    verbose = TRUE
    )

result = h$value()
print(result)

This will produce the following output. Note that curlPerform is called with verbose = TRUE, so the output includes a lot of diagnostic information. This is very helpful during development, but you will most likely want to turn it off when you build a client library on top of the service.

* About to connect() to ussouthcentral.services.azureml.net port 443 (#0)
*   Trying 191.238.226.212... * connected
* Connected to ussouthcentral.services.azureml.net (191.238.226.212)
    port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: C:/Users/stansw/Documents/R/win-library/3.1/RCurl/CurlSSL/
   cacert.pem CApath: none
* SSL connection using AES128-SHA
* Server certificate:
*      subject: CN=ussouthcentral.services.azureml.net
*      start date: 2014-07-01 19:23:34 GMT
*      expire date: 2016-06-30 19:23:34 GMT
*      subjectAltName: ussouthcentral.services.azureml.net matched
*      issuer: C=US; ST=Washington; L=Redmond; O=Microsoft Corporation;
         OU=Microsoft IT; CN=Microsoft IT SSL SHA2
*      SSL certificate verify ok.
> POST /workspaces/fb65c4e602654cb6a9fe4aae12daf762/services/
    8a8527dd062548e5b600e6023c0a69a0/score HTTP/1.1
Host: ussouthcentral.services.azureml.net
Accept: */*
Content-Type: application/json
Authorization: Bearer abc123
Content-Length: 116

< HTTP/1.1 200 OK
< Content-Length: 28
< Content-Type: application/json; charset=utf-8
< Server: Microsoft-HTTPAPI/2.0
< x-ms-request-id: 44bbb8b4-cf0d-4b70-8ca0-83326c5265f5
< Date: Mon, 08 Sep 2014 05:30:49 GMT
< 
* Connection #0 to host ussouthcentral.services.azureml.net left intact
OK 
 0 

[1] "[\"10\",\"0\",\"25.0533473418032\"]"

The last line is the most interesting bit. The response echoes the inputs (lstat of 10 and the placeholder medv of 0) followed by the model's prediction, 25.0533473418032. As expected, this matches the value we got when we ran the model inside R (25.05335 after rounding).
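A client will usually want the prediction as a number rather than as a raw string. The sketch below parses the response body shown in the output with RJSONIO, the same package used to build the request. It assumes the response stays a flat JSON array of strings with the prediction as its last element, which is what this particular service returned; other experiments may produce a different layout.

```r
library("RJSONIO")

# Raw response body, copied from the output of the request.
result <- "[\"10\",\"0\",\"25.0533473418032\"]"

# Assumption: a flat JSON array of strings, prediction last.
fields <- unlist(fromJSON(result))
prediction <- as.numeric(fields[length(fields)])

print(prediction)
```

In real client code, result would come from h$value() rather than from a hard-coded string.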

Summary

  • In this laboratory we saw how to run the prediction both in R and in Azure Machine Learning Studio.
  • Both models returned the same value.
  • When working in R it was very easy to get additional statistical information about the prediction, such as the 95% prediction intervals.
  • By publishing our experiment we created a fully operational Web Service hosted in Azure.

In the next part

In the next part we will expand the feature space and train a multiple linear regression model.

References


This post and all the resources are available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master