In this article, let’s learn the basics of forecasting and linear regression analysis, a basic statistical technique for modeling relationships between dependent and explanatory variables. We will also look at how R, a statistical programming language, implements linear regression, using a couple of scenarios.
Let’s start by considering the following scenarios.
Scenario 1: Every year, as part of an organization’s annual planning process, a requirement is to come up with a revenue target on which the budget of the rest of the organization is based. The revenue is a function of sales, and therefore the requirement is to approximately forecast the sales for the year. Depending on this forecast, the budget can be allocated within the organization. Looking at the organization’s history, we can assume that the number of sales is based on the number of salespeople and the level of promotional activity. How can we use these factors to forecast sales?
Scenario 2: An insurance company was facing heavy losses on vehicle insurance products. The company had data regarding the policy number, policy type, years of driving experience, age of the vehicle, usage of the vehicle, gender of the driver, marital status of the driver, type of fuel used in the vehicle and the capped losses for the policy. Could there be a relation between the driver’s profile, the vehicle’s profile, and the losses incurred on its insurance?
The first scenario demands a prediction of sales based on the number of salespeople and promotions. The second scenario asks for a relationship between a vehicle, its driver, and the losses incurred on the insurance policy that covers the vehicle. These are classic questions that linear regression can help answer.
What is linear regression?
Linear regression is a statistical technique for generating simple, interpretable relationships between a given factor of interest and the possible factors that influence it. The factor of interest is called the dependent variable, and the possible influencing factors are called explanatory (or independent) variables. Linear regression builds a model of the dependent variable as a function of the given explanatory variables. This model can then be used to forecast the values of the dependent variable, given new values of the explanatory variables.
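In symbols, the model described above is usually written as shown below. The notation is the standard textbook one and is an assumption here, since the article itself does not define symbols.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon
```

Here y is the dependent variable, x_1 to x_n are the explanatory variables, \beta_0 is the intercept, \beta_1 to \beta_n are the coefficients that the regression estimates, and \varepsilon is the error that the model does not explain.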
What are the use cases?
Determining relationships: Linear regression is extensively used to determine the relationship between the factor of interest and the factors that may influence it. Biology and the behavioral and social sciences use linear regression extensively to find relationships between measured factors. In healthcare, it has been used to study the causes of health and disease conditions in defined populations.
Forecasting: Linear regression can also be used to forecast trend lines, stock prices, GDP, income, expenditure, demands, risks, and many other factors.
What is the output?
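A linear regression quantifies the influence of each explanatory variable as a coefficient. A positive coefficient shows a positive influence, while a negative coefficient shows a negative influence on the dependent variable. The absolute value of the coefficient gives the magnitude of the influence: it is the change in the dependent variable for a one-unit change in that explanatory variable, with the other variables held constant. The greater the absolute value of the coefficient, the greater the influence per unit of that variable.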
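To make the sign and size of coefficients concrete, here is a minimal sketch in R. The variable names x1, x2 and y and all the numbers are made up for illustration. The dependent variable is built from a known rule with no noise, so the regression should recover that rule up to rounding.

```r
# Illustrative data only: these numbers are made up, not taken from the article.
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Build y from a known rule: intercept 3, positive influence of x1 (+2 per unit),
# negative influence of x2 (-0.5 per unit). No noise is added.
y <- 3 + 2 * x1 - 0.5 * x2

# Fit the linear model and read off the estimated coefficients.
fit <- lm(y ~ x1 + x2)
estimates <- coef(fit)
print(estimates)

# The sign of each estimate gives the direction of influence.
print(sign(estimates[c("x1", "x2")]))
```

With real, noisy data the estimates would only approximate the underlying relationship, which is why the confidence measures matter.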
The linear regression also gives measures of confidence in the relationships that it has determined, such as the statistical significance of each coefficient and the R-squared value of the model. The higher the confidence, the better the model for relationship determination. A regression with high confidence values can be used for more reliable forecasting.
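As a rough sketch of where such confidence measures live in R, the snippet below fits a model on cars, a small example dataset that ships with R, and pulls R-squared and the coefficient p-values out of the summary. Reading "confidence" as R-squared and p-values is an interpretation, and the dataset is a stand-in, not the article's data.

```r
# 'cars' is a small example dataset shipped with R (speed and stopping distance).
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Proportion of the variation in 'dist' explained by the model.
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# p-value of each coefficient: small values suggest the influence is unlikely
# to be a chance effect.
p_values <- coef(s)[, "Pr(>|t|)"]

print(r2)
print(adj_r2)
print(p_values)
```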
What are the limitations?
Linear regression is the simplest form of relationship model: it assumes that the relationship between the factor of interest and the factors affecting it is linear in nature. Therefore, it cannot be used for very complex analytics, but it provides a good starting point for analysis.
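The linearity assumption can be seen failing in a small, deliberately artificial sketch. The data below are made up so that y depends on x perfectly, but through a curve rather than a straight line.

```r
# Made-up data: y is completely determined by x, but the relationship is a
# parabola, not a straight line.
x <- -5:5
y <- x^2

# A straight-line fit finds essentially no relationship at all.
linear_fit <- lm(y ~ x)
slope <- coef(linear_fit)[["x"]]
r2_linear <- summary(linear_fit)$r.squared

print(slope)
print(r2_linear)
```

Both the slope and R-squared come out at approximately zero, even though y is fully predictable from x. A low R-squared from a linear model therefore does not prove that no relationship exists, only that no straight-line relationship was found.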
How to use linear regression?
Linear regression is natively supported in R, a statistical programming language. We’ll show how to run regression in R, and how to interpret its results. We’ll also show how to use it for forecasting.
For generating relationships, and the model:
Figure 1 shows the commands to execute a linear regression. Table 1 explains the contents of the numbered boxes. Figure 2 shows the summary of the regression results, obtained by calling the summary function on the output of lm, the linear regression function. Table 2 explains the various outputs seen in the summary.
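The steps described here (look at the data, summarize it, run lm, then summarize the model) might look roughly like the sketch below. The data frame name mill and every value in it are hypothetical placeholders with the same two column names as the article's data. The real file is not loaded here because its layout is not given in the text.

```r
# Hypothetical stand-in for the mill data: the values are made up for illustration.
mill <- data.frame(
  Production = c(46.8, 52.9, 50.6, 41.0, 55.2, 47.6, 60.1, 44.3),
  Cost       = c(92.6, 113.8, 98.2, 88.1, 118.4, 95.0, 126.7, 90.3)
)

# Look at the first rows of the input data.
print(head(mill))

# Minimum, quartiles, median, mean and maximum of each column.
print(summary(mill))

# Run the regression: Production is the factor of interest,
# Cost is the single influencing factor.
fit <- lm(Production ~ Cost, data = mill)

# Summarize the regression results (residuals, coefficients, R-squared).
print(summary(fit))
```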
For forecasting using the generated model:
The regression function returns a linear model, which is based on the input training data. This linear model can be used to perform prediction, as shown in Figure 3. As can be seen in the figure, the predict.lm function is used for predicting values of the factor of interest. The function takes two inputs: the model, as generated by the regression function lm, and the new values for the influencing factors.
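A minimal sketch of this forecasting step follows. The training values are made up and follow an exact straight line, so the forecasts are easy to check by hand. Calling the generic predict on an lm model dispatches to predict.lm, and the new values are passed as a data frame whose column name matches the influencing factor used in the formula.

```r
# Made-up training data following the exact rule Production = 5 + 2 * Cost.
train <- data.frame(
  Cost       = c(10, 20, 30, 40, 50),
  Production = c(25, 45, 65, 85, 105)
)

# Fit the model on the training data.
model <- lm(Production ~ Cost, data = train)

# New values of the influencing factor; the column name must match the formula.
new_values <- data.frame(Cost = c(60, 75))

# predict() on an lm object calls predict.lm.
forecast <- predict(model, newdata = new_values)
print(forecast)
```

Because the training data contain no noise, the forecasts here are 5 + 2 * 60 = 125 and 5 + 2 * 75 = 155, up to rounding. With real data, a forecast is only as reliable as the fitted model.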
Number | Explanation
1 | This box shows the sample input data. As we can see, there are two columns, Production and Cost. We have used the data for monthly production costs and output for a hosiery mill, which is available at http://www.stat.ufl.edu/~winner/data/millcost.dat.
2 | This box shows the summary of the data. The summary gives the minimum, 1st quartile (25th percentile), median (50th percentile), mean, 3rd quartile (75th percentile) and maximum values for the given data.
3 | This box shows the command to execute linear regression on the data. The function, lm, takes a formula as an input. The formula is of the form y ~ x1 + x2 + ... + xn, where y is the factor of interest, and x1, ..., xn are the possible influencing factors. In our case, Production is the factor of interest, and we have only one factor of influence, that is, Cost.
Figure 2: Interpreting the results of regression
Number | Explanation
4 | This box shows the summary of residuals. A residual is the difference between the actual value and the value calculated by the regression, that is, the error in the calculation. The residuals section of the summary shows the minimum, first quartile, median, third quartile and maximum of the residuals. Ideally, the residuals should follow a bell curve centered on zero, that is, there should be many residuals close to 0 and only a few residuals with large positive or negative values.
5 | The Estimate column gives the coefficient for each influencing factor, which shows the magnitude of the influence and whether the influence is positive or negative. The other columns give the standard error of each estimated coefficient, its t value, and the corresponding p-value.
6 | The stars mark the statistical significance of each coefficient. The more stars, the smaller the p-value, and the stronger the evidence that the factor really does influence the factor of interest.
7 | The R-squared values measure how well the regression fits the data, as the proportion of the variation in the factor of interest that the model explains. Multiple R-squared falls between zero and one, with one being a perfect fit and zero meaning that the model explains none of the variation.
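The pieces of the summary described in the table can also be pulled out individually. The sketch below uses cars, an example dataset that ships with R, as a stand-in for the article's data, so its numbers will not match Figure 2.

```r
# Stand-in data: 'cars' ships with R and is not the article's dataset.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Box 4: the five-number summary of the residuals (min, 1Q, median, 3Q, max).
res <- residuals(fit)
print(quantile(res))

# Box 5: the coefficient table with estimate, standard error, t value and p-value.
coef_table <- coef(s)
print(colnames(coef_table))

# Box 6: print the same table with significance stars next to the p-values.
printCoefmat(coef_table, signif.stars = TRUE)
```

With an intercept in the model, the residuals average out to zero, which is why the median of a well-behaved residual summary sits near zero.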
I believe we have understood the power of linear regression and how it can be used for specific use cases. If you have any comments or questions, do share them below.