Saturday, August 2, 2014

3.6.2 Simple Linear Regression - fitting the model

This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, page 110. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.

Previously we've seen how to load the Boston data set from the MASS library. Now we will look into how we can fit a linear regression model. We will try to predict the median value of owner-occupied homes in $1000s (medv) based on just a single predictor, the lower status of the population in percent (lstat).
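Before fitting anything it can help to look at the relationship between the two variables. A quick scatter plot in base R (assuming the MASS library is already loaded as in the previous part, so Boston is available):

> plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv")

The downward trend visible in the plot is consistent with the negative lstat coefficient we will obtain below.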

Fitting a linear regression model in R

In R one can fit a linear regression model using the lm() function. Its basic syntax is lm(y~x, data), where y is the response, x is the predictor and data is the data set.

In order to fit the model to Boston data we can call:

> lm.fit = lm(medv~lstat, data=Boston)

For basic information about the model we can type:

> lm.fit

Call:
lm(formula = medv ~ lstat, data = Boston)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  

This prints the function call used to create the model as well as the fitted coefficients.
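Because the fitted model is a regular R object, the same information can also be pulled out programmatically with standard accessor functions:

> names(lm.fit)   # components stored in the fitted model object
> coef(lm.fit)    # just the coefficients printed above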

In order to get more detailed information we can type:

> summary(lm.fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

This gives us information about residuals, p-values and standard errors for the coefficients, as well as statistics for the model.
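If we also want confidence intervals for the coefficient estimates, the standard confint() function works directly on the fitted model:

> confint(lm.fit)                 # 95% intervals by default
> confint(lm.fit, level = 0.99)   # or pick a different level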

Fitting a linear regression model in Azure Machine Learning

In order to repeat the same experiment in Azure Machine Learning we will start with the modules created last time.

In the first step we need to select the columns we want to work with. Drag a Project Columns module (Data Transformation -> Manipulation) onto the experiment canvas and connect it to the existing Execute R Script module:

In the properties pane click on the Launch column selector:

Select columns: medv and lstat.
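For comparison, this step corresponds to a plain column selection in R; a minimal sketch, not part of the Azure ML experiment itself:

> head(Boston[, c("medv", "lstat")])   # keep only the response and the predictor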

With the right data we can proceed to fitting the model. Drag the Linear Regression module (Machine Learning -> Initialize Model -> Regression) onto the experiment canvas. To train the model we will also need one Train Model module (Machine Learning -> Train).

Connect all the modules. Select Train Model and in the properties pane click on Launch column selector to choose the response column. This time select only medv because that's the quantity we want to predict.

The complete experiment should look like this:

Run it to fit the model to the data.

You can visualize the output port of the Train Model module to see the result.

We can see that the coefficient values obtained from Azure Machine Learning are different from what we got in R. Instead of the value 34.55 for the intercept (bias) we have 25.80, and the coefficient for lstat changed from -0.95 to -11.43.

The reason for this discrepancy is that Azure Machine Learning uses a more advanced model with a learning rate and regularization, which we will get to in future laboratories when we reach chapter 6, Linear Model Selection and Regularization, of ISLR. For now we will disable these features to reach parity between the two models we've seen so far.
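To get a feeling for what regularization does to the coefficients, here is a minimal sketch using lm.ridge() from the MASS package. It is only an illustration of coefficient shrinkage, not the exact algorithm Azure Machine Learning uses, and the lambda values are arbitrary:

> library(MASS)
> coef(lm.ridge(medv ~ lstat, data = Boston, lambda = c(0, 10, 100)))   # larger penalties pull the lstat slope towards zero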

Select the Linear Regression module, go to the properties pane and apply the following configuration.

Rerun the model and visualize the result.

Now we can see that the coefficient values match what we got at the beginning. Just as with R, the model is described by its coefficients, and we need to use other functions to get more information about its performance.
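As a small preview of working with the fitted R model, predictions for new lstat values can be obtained with the standard predict() function (the inputs below are arbitrary example values):

> predict(lm.fit, data.frame(lstat = c(5, 10, 15)))
> predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")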

In the next part

In the next part we will look into evaluating the trained model.

References

An Introduction to Statistical Learning, with Applications in R; G. James, D. Witten, T. Hastie, R. Tibshirani; Springer.

This post and all the resources are available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master