This laboratory was inspired by the book An Introduction to Statistical Learning, with Applications in R, section 3.6.2 Simple Linear Regression, page 110. Please refer to it for a detailed explanation of the models and the nomenclature used in this post.
Previously we've seen how to load the Boston data set from the MASS library.
Now we will look into how we can fit a linear regression model.
We will try to predict the median value of owner-occupied homes in $1000s (medv) based on just a single predictor: the lower status of the population in percent (lstat).
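As a quick recap of the previous post, the data set can be loaded and the two columns inspected like this (a minimal sketch; the Boston data ships with the MASS library, so no extra download is needed):
> library(MASS)                        # the Boston data set lives in the MASS library
> head(Boston[, c("medv", "lstat")])   # peek at the response and the predictor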
Fitting a linear regression model in R
In R one can fit a linear regression model using the lm() function.
Its basic syntax is lm(y~x, data), where y is the response, x is the predictor and data is the data set.
In order to fit the model to the Boston data we can call:
> lm.fit = lm(medv~lstat, data=Boston)
For basic information about the model we can type:
> lm.fit
Call:
lm(formula = medv ~ lstat, data = Boston)
Coefficients:
(Intercept)        lstat
      34.55        -0.95
It will print the function call used to create the model as well as the fitted coefficients.
In order to get more detailed information we can type:
> summary(lm.fit)
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max
-15.168  -3.990  -1.318   2.034  24.500
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
This gives us information about the residuals, the standard errors and p-values for the coefficients, as well as overall statistics for the model such as R-squared and the F-statistic.
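Beyond summary(), a few other standard R functions let us inspect the fitted model. The calls below are a small sketch (the same accessors used in ISLR) and can be run on the lm.fit object created above:
> names(lm.fit)    # components stored in the fitted model object
> coef(lm.fit)     # the estimated coefficients
> confint(lm.fit)  # confidence intervals for the coefficients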
Fitting a linear regression model in Azure Machine Learning
In order to repeat the same experiment in Azure Machine Learning we will start with the modules created last time.
In the first step we need to select the columns we want to work with.
Drag one 'Project Columns' module (Data Transformation -> Manipulation) to the experiment canvas and connect it with the existing Execute R Script module:
In the properties pane click on the Launch column selector:
Select columns: medv and lstat.
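For readers following along in R, this column selection step corresponds roughly to subsetting the data frame. The snippet below is only an illustration (the Boston.small name is made up) and is not needed for the Azure experiment:
> Boston.small = Boston[, c("medv", "lstat")]  # keep only the response and the predictor
> head(Boston.small)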
With the right data we can proceed to fitting the model.
Drag the Linear Regression module (Machine Learning -> Initialize Model -> Regression) to the experiment canvas.
To train the model we will also need one Train Model module (Machine Learning -> Train).
Connect all the modules.
Select the Train Model module and in the properties pane click on the Launch column selector to choose the response column.
This time select only medv because that's the quantity we want to predict.
The complete experiment should look like this:
Run it to fit the model to the data.
You can visualize the output port of the Train Model module to see the result.
We can see that the coefficient values obtained from Azure Machine Learning are different from what we got in R.
Instead of 34.55 for the intercept (bias) we have 25.80, and the coefficient for lstat changed from -0.95 to -11.43.
The reason we observe this discrepancy is that Azure Machine Learning uses a more advanced model with a learning rate and regularization, which we will get to in future laboratories when we reach chapter 6, Linear Model Selection and Regularization, of ISLR.
For now we will disable these features to reach parity between the two models we've seen so far.
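As a side note, we can get an intuition for how regularization alone pulls the coefficients away from the least squares solution by fitting a ridge regression in R with lm.ridge from the MASS library. This is only an illustrative sketch: it is not the algorithm Azure Machine Learning uses, and the lambda value below is arbitrary.
> lm.ridge.fit = lm.ridge(medv~lstat, data=Boston, lambda=10)  # ridge penalty shrinks the slope towards zero
> coef(lm.ridge.fit)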
Select the Linear Regression module, go to the properties pane and select the following configuration.
Rerun the model and visualize the result.
Now we can see that the coefficient values match what we got at the beginning.
Just as with R, the model is described by its coefficients and we need to use other functions to get more information about its performance.
In the next part
In the next part we will look into evaluating the trained model.
References
- Housing Values in Suburbs of Boston
- Microsoft Azure Machine Learning (Trial)
- Microsoft Machine Learning Blog
- Statistical Learning course at Stanford Online
- An Introduction to Statistical Learning with Applications in R (Springer, Amazon)
- The Comprehensive R Archive Network
- RStudio
This post and all the resources are available on GitHub:
https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master