Feature Selection for House Price Prediction

LASSO Regression for Data Science Project - spring 2025

Author

Mathari Pavani and Shalini Jonnadula

Published

April 28, 2025

Introduction

LASSO regression (Least Absolute Shrinkage and Selection Operator) is a well-known statistical regression technique for feature selection and regularization. LASSO adds an L1 penalty to regression coefficients, which pushes some coefficients towards zero, thus inducing sparsity in the model and retaining the most significant predictors. This property renders LASSO particularly useful in improving model interpretability and predictability, as it selects a subset of predictors that have an important impact on variation in the outcome variable. The principal benefit of LASSO is that it is capable of handling high-dimensional data, where other regression methods will struggle with overfitting due to the inclusion of numerous features(Tibshirani 1996).

In reality, house price forecasting is an important but challenging activity. With so many variables influencing the price of a house—location, size, amenities, and condition of the economy, to mention but a few—the sheer volume of potential predictors makes it hard to build a useful predictive model. One of the main challenges of house price forecasting is selecting the most informative features from a vast set of variables.

LASSO is a variation of linear regression, which has been augmented with the addition of a penalty term that sets some of the regression coefficients to zero, effectively removing irrelevant or redundant features. This is an extremely helpful operation in high-dimensional data, where traditional linear regression models overfit because there are so many predictor variables. In contrast, LASSO enhances the model’s generalization ability and therefore provides a more precise and stable estimate of house prices.

In this project, we will be working on different implementations of LASSO regression with a special focus on L1 regularization and its effect on feature selection. We will apply this technique to a house prices dataset, selecting the most predictive features and validating the accuracy of the model in predicting house prices based on the selected features.

Furthermore, we will also make a comparison between the efficiency of LASSO and other regularizing methods such as Ridge regression and Elastic Net, which are also attempting to prevent overfitting by incorporating different penalizing methods (Hastie, Tibshirani, and Wainwright 2017). From the comparison, we will outline the advantages and disadvantages of the use of LASSO to handle high-dimensional data and its applicability for house price estimation.

The ultimate goal of this project is to demonstrate the success of LASSO regression application for feature selection for predictive models targeting dimensionality reduction, model interpretability, and improved predictive ability. The application of this technique extends beyond real estate and hence has applications in a wide range of other fields such as finance, medicine, and more where large datasets with possibly redundant predictors exist. For example, in finance, LASSO is used to predict credit risks, and in medicine it is used to make patient risk predictions.

The objective of this project is to provide valuable information regarding the way LASSO regression works, how it differs from other forms of regularization, and how it should be used effectively in practical applications, for example, predicting house prices.

Literature Review

LASSO regression is one of the most important contributors to model sparsity and predictive accuracy, particularly when handling high-dimensional data. In Huan Xu, Constantine Caramanis, and Shie Mannor’s work, the authors describe the connection between robust regression and LASSO, showing that under certain uncertainty conditions, the robust regression problem is equivalent to LASSO. This connection is important as it shows how LASSO can offer robustness to outliers while preserving the sparsity property, which is very important for feature selection. This sparsity and robustness concept is highly applicable to our project, where LASSO will be used to choose the most significant features to use in predicting house prices. The idea that LASSO can handle noisy data and still select important predictors resonates with our goal of building a stable and accurate model of house prices.(Xu, Caramanis, and Mannor 2012)

A statistical method called LASSO regression is used to solve problems in predictive modelling, such as overfitting and optimism bias, particularly when working with big datasets and many of possible predictors. By essentially eliminating less significant variables, it reduces model complexity and enhances generalisation by lowering regression coefficients towards zero. The parameter (λ), which controls the amount of shrinkage, is usually chosen using k-fold cross-validation. (Jonas and Cook 2018)

In the paper by Carter Hill, the author explains how both LASSO and Ridge regression serve as alternatives to the traditional least squares regression, especially in the context of high-dimensional datasets. Hill discusses how both methods help mitigate overfitting, a common issue when working with a large number of predictors, and select the most important features for prediction. While Ridge regression penalizes the size of the coefficients, LASSO tends to shrink some coefficients to zero, effectively performing feature selection. Hill emphasizes the advantages of these methods in improving model interpretability and predictive accuracy. However, both methods have limitations in distinguishing correlation from causation, which is an important consideration in practical applications. This paper is particularly valuable for our project as it provides a foundation for understanding how LASSO helps in selecting the most predictive features for house price estimation by addressing overfitting and ensuring that only relevant variables are included.(Hill 2016)

The paper by Trevor Hastie, Robert Tibshirani, and Martin Wainwright provides a comprehensive review of LASSO in high-dimensional regression, covering its theoretical properties, including sparsity and oracle inequalities. It also explores practical aspects like coordinate descent algorithms and extensions such as Elastic Net, Group LASSO, and Adaptive LASSO. These extensions address challenges like multicollinearity and grouped variable selection. This paper’s insights on the theory and applications of LASSO, along with its comparisons to other regularization techniques, will help guide the application of LASSO in our house price prediction project by offering valuable methods for dealing with high-dimensional datasets.(Trevor Hastie 2015)

The paper by Ji Hyung Lee, Zhentao Shi, and Zhan Gao explores the use of LASSO in predictive regression when some regressors are heterogeneous time series. They discuss the limitations of Adaptive LASSO in such cases, where it cannot completely eliminate inactive cointegrating variables. To address this, the authors introduce Twin Adaptive LASSO (TAlasso), a two-step selection method. Their method improves consistency and enhances the prediction accuracy for datasets involving mixed-root regressors. This paper will help in understanding how LASSO-based methods can be extended for better accuracy in predictive modeling for house price prediction when handling complex relationships in the data.(Ji Hyung Lee 2020)

This paper’s main objective is to increase the accuracy of temperature forecasting in smart buildings by optimising HVAC (heating, ventilation, and air conditioning) systems through the use of model predictive controllers (MPC). This upgrade aims to keep building occupants comfortable while lowering energy use.(Spencer, Alfandi, and Al-Obeidat 2018)

The purpose of this work is to examine how to choose the most informative genetic markers and phenotypic characteristics associated with a variable of interest using Least Absolute Shrinkage and Selection Operator (LASSO) regression. The objective is to use both genetic and phenotypic data to enhance risk prediction algorithms.(Fontanarosa and Dai 2011)

The paper addresses the interference from suspended particles, particularly in high turbidity conditions, by proposing a unique online water quality measurement approach for wastewater treatment plants. Inaccurate detection results from traditional methods’ frequent disregard for the organic components in suspended particles. The study presents a combined method to address this issue, minimising chemical and physical interferences by using UV-vis spectroscopy to track substance alteration during the oxidative digestion process. A brand-new machine learning method called Bidirectional Dictionary LASSO Regression (BD-LASSO) is also created. By learning spectral and feature information, BD-LASSO enhances the extraction of pertinent data, decreases turbidity interference, and increases detection speed and accuracy speed and accuracy.(Geng et al. 2023)

Because traditional plant breeding is expensive and time-consuming, cutting-edge techniques like machine learning and remote sensing are required to increase wheat output. This work investigates the use of LASSO regression and Support Vector Regression (SVR) in conjunction with Sequential Forward Selection (SFS) to forecast wheat grain yield using vegetation indices based on Unmanned Aerial Vehicles (UAVs). The study derived vegetation indices (NDVI, EVI, and MTCI) from multispectral UAV photos at various development stages using data from 600 wheat plots in Norway (2018).The NDVI was the best indicator of grain yield, according to the results, and it got better as the plants grew older. Model accuracy was improved by include EVI and MTCI at earlier growth stages. Up to 90% of yield fluctuation was explained by models that included all indices and time periods. Collinearity was added by adding distinct spectral bands, but prediction accuracy was not increased. LASSO regression proved more effective and computationally less expensive, even though both machine learning techniques performed well.(Shafiee et al. 2021)

Together, these papers will guide our approach to using LASSO for feature selection, ensuring robustness, accuracy, and interpretability in our house price prediction model. Each study presents a facet of LASSO’s potential that can be leveraged to optimize our model and handle the complexity of high-dimensional housing data effectively.

Methods

Overview of Lasso Regression

Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that includes an L1 regularization term in its loss function. The regularization term penalizes the coefficients of less important features, shrinking them towards zero and effectively performing feature selection. This makes Lasso particularly useful when working with high-dimensional datasets that may have irrelevant or redundant features, as it helps to simplify the model and reduce overfitting.

Lasso regression modifies the standard linear regression objective function by adding the L1 penalty term: \[ Minimize:\sum_{i=1}^n (y_i - \hat{y}^i)^2 + \lambda \sum_{j=1}^p |\beta_j| \] Where:
$y_i$ = is the actual value (sale price).
$\hat{y}_i$ = is the predicted value.
$\beta_j$ = represents the coefficients of the model.
$n$ = is the number of observations.
$p$ = is the number of features (predictors).
$\lambda$ = is the regularization parameter that controls the strength of the penalty.

Formula Explanation

In traditional linear regression: \[ y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon_i \] Where:
$y_i$ = is the target variable (sale price).
$x_1, x_2, \ldots, x_p$ = are the predictor variables (features such as square footage, number of bathrooms, etc.).
$\beta_0, \beta_1, \ldots, \beta_p$ = are the coefficients (parameters) of the model.
$\epsilon_i$ = represents the error term.

Key Characteristics of Lasso Regression

1. Automatic Feature Selection: Lasso performs both regularization and feature selection, which is advantageous when there are many features, and some may be irrelevant to predicting the target variable.
2. Sparsity: The L1 penalty used in Lasso forces many coefficients to become exactly zero, resulting in a sparse model where only the most relevant features remain.
3. Regularization Effect: Lasso prevents overfitting by shrinking the coefficients, which is particularly useful when dealing with high-dimensional data.

Impact of Lambda on Coefficients

The regularization parameter $\lambda$ determines how aggressively the model penalizes coefficients. Higher values of $\lambda$ increase the penalty, pushing more coefficients to zero, resulting in a simpler model. Conversely, when $\lambda$ is small or zero, Lasso behaves like ordinary least squares regression, where no regularization occurs.

Graphical Interpretation:

Lasso Path: The plot below shows how the coefficients of the features change as λ increases. As λ becomes larger, more coefficients shrink towards zero, resulting in a simpler and more interpretable model.
This plot demonstrates how the coefficients are penalized as λ increases, leading to the elimination of less important features.
Cross-Validation for Lambda Selection: The optimal λ value can be selected using cross-validation, which tests different values of λ and evaluates the model’s performance. A cross-validation plot typically shows the model’s error for different values of λ, with the best λ corresponding to the point where the error is minimized. This plot helps identify the best value of λ, which ensures the model balances complexity and generalization.

Algorithm Used for Lasso Regression

1 Data Preparation: Feature Scaling: It is essential to scale the features before applying Lasso regression. Since Lasso involves an L1 penalty, features with larger magnitudes could disproportionately influence the regularization term. Standardizing the data (e.g., using z-scores) ensures that each feature contributes equally to the penalty.
2 Fitting the Model: The Lasso regression model is fitted using optimization techniques such as coordinate descent or Least Angle Regression (LARS). These methods iteratively minimize the objective function by adjusting the model’s coefficients.
3 Model Evaluation: After fitting the model, its performance is evaluated using metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or R-squared on the test data. These metrics help assess how well the model generalizes to unseen data.
4 Feature Selection: As Lasso regression penalizes less important features, the coefficients for irrelevant features will be driven to zero, effectively performing feature selection.

Justification for Using Lasso Regression

Feature Selection: In a high-dimensional dataset like the Ames Housing dataset, many features may be irrelevant or redundant. Lasso helps reduce the model’s complexity by automatically selecting the most important features.
Handling Multicollinearity: Lasso is robust to multicollinearity, where predictor variables are highly correlated. In such cases, Lasso assigns zero weights to some correlated features, ensuring that the model does not overfit to the noise caused by correlated predictors.
Improved Model Interpretability: By shrinking some coefficients to zero, Lasso creates a simpler and more interpretable model, making it easier to understand which features are most important for predicting house prices.
Prevents Overfitting: With the regularization term, Lasso reduces the risk of overfitting, especially when there is a large number of features relative to the number of observations. It ensures that the model remains generalizable.

Analysis

Data Exploration and Visualization

AMES HOUSING DATA: AMES Housing Dataset

The Ames Housing Dataset is a popular dataset that provides detailed information about residential properties in Ames, Iowa. It was originally used in a Kaggle competition and has become widely used for machine learning and data analysis practice, particularly in regression tasks.

Data Description

The dataset contains 2930 observations and 81 predictor variables. These variables include both numeric and categorical data that represent various characteristics of the properties, such as the size of the house, construction materials, condition, and various other factors influencing the sale price.

Summary of Variable Types: - Numerical Variables: 38 - Categorical Variables: 24 - Ordinal Variables: 19

# Loading packages

Code

library(ggplot2)
library(dplyr)
library(glmnet)
library(caret)
library(tidyverse)

# Data Loading and Initial Inspection

Code

ames_data <- read.csv("AmesHousing.csv")
cat("Number of observations:", nrow(ames_data), "\n")

Number of observations: 2930

Code

cat("Number of features:", ncol(ames_data), "\n")

Number of features: 82

# Check for missing values

Code

missing_data <- colSums(is.na(ames_data))
cat("Missing values per feature:\n")

Missing values per feature:

Code

print(missing_data[missing_data > 0])

  Lot.Frontage          Alley   Mas.Vnr.Area      Bsmt.Qual      Bsmt.Cond 
           490           2732             23             79             79 
 Bsmt.Exposure BsmtFin.Type.1   BsmtFin.SF.1 BsmtFin.Type.2   BsmtFin.SF.2 
            79             79              1             79              1 
   Bsmt.Unf.SF  Total.Bsmt.SF Bsmt.Full.Bath Bsmt.Half.Bath   Fireplace.Qu 
             1              1              2              2           1422 
   Garage.Type  Garage.Yr.Blt  Garage.Finish    Garage.Cars    Garage.Area 
           157            159            157              1              1 
   Garage.Qual    Garage.Cond        Pool.QC          Fence   Misc.Feature 
           158            158           2917           2358           2824

# Handling Missing values

Code

# Categorical variables
ames_data$Alley[is.na(ames_data$Alley)] <- "None"
ames_data$Pool.QC[is.na(ames_data$Pool.QC)] <- "None"
ames_data$Fence[is.na(ames_data$Fence)] <- "None"
ames_data$Misc.Feature[is.na(ames_data$Misc.Feature)] <- "None"

# Numerical variables (use median for continuous variables)
ames_data$Mas.Vnr.Area[is.na(ames_data$Mas.Vnr.Area)] <- median(ames_data$Mas.Vnr.Area, na.rm = TRUE)
# Impute missing Lot.Frontage with the median value
ames_data$Lot.Frontage[is.na(ames_data$Lot.Frontage)] <- median(ames_data$Lot.Frontage, na.rm = TRUE)


# basement-related values (assume 'None' for categorical and 0 for numerical)
ames_data$Bsmt.Qual[is.na(ames_data$Bsmt.Qual)] <- "None"
ames_data$Bsmt.Cond[is.na(ames_data$Bsmt.Cond)] <- "None"
ames_data$Bsmt.Exposure[is.na(ames_data$Bsmt.Exposure)] <- "None"
ames_data$BsmtFin.Type.1[is.na(ames_data$BsmtFin.Type.1)] <- "None"
ames_data$BsmtFin.SF.1[is.na(ames_data$BsmtFin.SF.1)] <- 0
ames_data$BsmtFin.Type.2[is.na(ames_data$BsmtFin.Type.2)] <- "None"
ames_data$BsmtFin.SF.2[is.na(ames_data$BsmtFin.SF.2)] <- 0
ames_data$Bsmt.Unf.SF[is.na(ames_data$Bsmt.Unf.SF)] <- 0
ames_data$Total.Bsmt.SF[is.na(ames_data$Total.Bsmt.SF)] <- 0
ames_data$Bsmt.Full.Bath[is.na(ames_data$Bsmt.Full.Bath)] <- 0
ames_data$Bsmt.Half.Bath[is.na(ames_data$Bsmt.Half.Bath)] <- 0

# Garage-related features (assume 'None' for categorical and 0 for numerical)
ames_data$Garage.Type[is.na(ames_data$Garage.Type)] <- "None"
ames_data$Garage.Yr.Blt[is.na(ames_data$Garage.Yr.Blt)] <- 0
ames_data$Garage.Finish[is.na(ames_data$Garage.Finish)] <- "None"
ames_data$Garage.Qual[is.na(ames_data$Garage.Qual)] <- "None"
ames_data$Garage.Cond[is.na(ames_data$Garage.Cond)] <- "None"
ames_data$Garage.Cars[is.na(ames_data$Garage.Cars)] <- 0
ames_data$Garage.Area[is.na(ames_data$Garage.Area)] <- 0

# Fireplaces
ames_data$Fireplace.Qu[is.na(ames_data$Fireplace.Qu)] <- "None"

# Check if all missing values are handled
cat("Missing values per feature after imputation:\n")

Missing values per feature after imputation:

Code

missing_data_after <- colSums(is.na(ames_data))
print(missing_data_after[missing_data_after > 0])

named numeric(0)

Key Features:

The dataset has been refined to focus on a subset of 11 key features, which are selected based on their relevance to predicting the SalePrice of properties. These features include a combination of both numerical and binary categorical data, representing essential characteristics of the properties, such as total living space, house age, garage size, and amenities. The selected features are

SalePrice: This is the target variable, representing the sale price of the house.
TotalBathrooms: A combined measure of the total number of bathrooms (full + half).
TotalSquareFootage: Total square footage combining living space and basement areas.
HouseAge: Age of the house at the time of sale.
TotalGarageSize: Combined size of the garage area and car capacity.
TotalPorchArea: Combined porch areas across multiple types of porches.
HasPool: Indicator of whether the property has a pool.
HasFence: Indicator of whether the property has a fence.
HasFireplace: Indicator of whether the property has a fireplace.
HasMiscFeature: Indicator of whether the property has a miscellaneous feature (e.g., shed or tennis court).
OverallQualityScore: A composite score of the house’s overall quality and condition.

These features were selected based on their importance in influencing the SalePrice and were combined from various original variables in the dataset.

# Create new features

Code

ames_data$TotalBathrooms <- ames_data$Full.Bath + (ames_data$Half.Bath * 0.5) + ames_data$Bsmt.Full.Bath + (ames_data$Bsmt.Half.Bath * 0.5)

ames_data$TotalSquareFootage <- ames_data$Gr.Liv.Area + ames_data$BsmtFin.SF.1 + ames_data$BsmtFin.SF.2 + ames_data$X1st.Flr.SF + ames_data$X2nd.Flr.SF

# Calculate House Age
ames_data$HouseAge <- ames_data$Yr.Sold - ames_data$Year.Built
# Calculate Garage Age (if Garage.Yr.Blt exists)
ames_data$GarageAge <- ifelse(!is.na(ames_data$Garage.Yr.Blt), ames_data$Yr.Sold - ames_data$Garage.Yr.Blt, NA)
# Calculate Remodel Age (if Year.Remod.Add exists)
ames_data$RemodelAge <- ifelse(!is.na(ames_data$Year.Remod.Add), ames_data$Yr.Sold - ames_data$Year.Remod.Add, NA)
# Combine all three into a new feature for total property age considering remodel/garage build
ames_data$TotalPropertyAge <- pmin(ames_data$HouseAge, ames_data$GarageAge, ames_data$RemodelAge, na.rm = TRUE)


ames_data$TotalGarageSize <- ames_data$Garage.Area + (ames_data$Garage.Cars * 200)

ames_data$TotalPorchArea <- ames_data$Wood.Deck.SF + ames_data$Open.Porch.SF + ames_data$Enclosed.Porch + ames_data$X3Ssn.Porch + ames_data$Screen.Porch

ames_data$HasPool <- ifelse(ames_data$Pool.Area > 0, 1, 0)
ames_data$HasFence <- ifelse(!is.na(ames_data$Fence), 1, 0)
ames_data$HasFireplace <- ifelse(ames_data$Fireplaces > 0, 1, 0)
ames_data$HasMiscFeature <- ifelse(!is.na(ames_data$Misc.Feature), 1, 0)
ames_data$OverallQualityScore <- ames_data$Overall.Qual * ames_data$Overall.Cond

# Check the first few rows of the new dataset

# Creating new dataset with requried features

Code

# Select the required columns for the new dataset
Selected_Ames_Data <- ames_data[c("SalePrice", "TotalBathrooms", "TotalSquareFootage", "HouseAge", "TotalGarageSize", "TotalPorchArea", "HasPool", "HasFence", "HasFireplace", "HasMiscFeature", "OverallQualityScore")]  
head(Selected_Ames_Data)

  SalePrice TotalBathrooms TotalSquareFootage HouseAge TotalGarageSize
1    215000            2.0               3951       50             928
2    105000            1.0               2404       49             930
3    172000            1.5               3581       52             512
4    244000            3.5               5285       42             922
5    189900            2.5               4049       13             882
6    195500            2.5               3810       12             870
  TotalPorchArea HasPool HasFence HasFireplace HasMiscFeature
1            272       0        1            1              1
2            260       0        1            0              1
3            429       0        1            0              1
4              0       0        1            1              1
5            246       0        1            1              1
6            396       0        1            1              1
  OverallQualityScore
1                  30
2                  30
3                  36
4                  35
5                  25
6                  36

# Data Spitting

Code

set.seed(123)
trainIndex <- createDataPartition(Selected_Ames_Data$SalePrice, p = 0.8, list = FALSE)
train <- Selected_Ames_Data[trainIndex, ]
test <- Selected_Ames_Data[-trainIndex, ]

# Standardize Numerical Features

Code

# Standardize numerical features
preProc <- preProcess(train[, -1], method = c("center", "scale"))
train_scaled <- predict(preProc, train)
test_scaled <- predict(preProc, test)

# Train the LASSO Model

Code

# Convert to matrix for LASSO
x_train <- model.matrix(SalePrice ~ ., train_scaled)[, -1]
y_train <- train_scaled$SalePrice
x_test <- model.matrix(SalePrice ~ ., test_scaled)[, -1]
y_test <- test_scaled$SalePrice

# Train LASSO model using cross-validation
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda <- cv_lasso$lambda.min

# Fit final model
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda)

# Model Evaluation

Code

# Predict on test data
y_pred <- predict(lasso_model, s = best_lambda, newx = x_test)

# Compute MSE and R-squared
mse <- mean((y_test - y_pred)^2)
r_squared <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)

cat("MSE:", mse, "\nR-squared:", r_squared)

MSE: 1282393660 
R-squared: 0.7947703

# Top Features (Model Interpretation)

Code

# Extract feature importance
lasso_coeffs <- coef(lasso_model, s = best_lambda)
feature_importance <- data.frame(Feature = rownames(lasso_coeffs), Coefficient = as.vector(lasso_coeffs))
feature_importance <- feature_importance[order(abs(feature_importance$Coefficient), decreasing = TRUE), ]

# Display top 5 important features
head(feature_importance, 5)

               Feature Coefficient
1          (Intercept)   180581.73
3   TotalSquareFootage    35081.10
4             HouseAge   -19539.64
11 OverallQualityScore    18144.33
5      TotalGarageSize    13584.84

# Cross validation

Code

# Perform 5-fold cross-validation to evaluate the model
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)

# Best lambda value
best_lambda_cv <- cv_lasso$lambda.min
cat("Best Lambda from Cross-Validation: ", best_lambda_cv, "\n")

Best Lambda from Cross-Validation:  2566.186

Code

# Predict on test data using the best lambda
lasso_cv_model <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_cv)
y_pred_cv <- predict(lasso_cv_model, s = best_lambda_cv, newx = x_test)

# Calculate MSE and R-squared for cross-validation
mse_cv <- mean((y_test - y_pred_cv)^2)
r_squared_cv <- 1 - sum((y_test - y_pred_cv)^2) / sum((y_test - mean(y_test))^2)

cat("MSE from Cross-Validation: ", mse_cv, "\n")

MSE from Cross-Validation:  1285840101

Code

cat("R-squared from Cross-Validation: ", r_squared_cv, "\n")

R-squared from Cross-Validation:  0.7942187

Visualization

Code

# Load necessary libraries
library(ggplot2)
library(scales)

# 1. Distribution of SalePrice
ggplot(ames_data, aes(x = SalePrice)) +
  geom_histogram(binwidth = 10000, fill = "skyblue", color = "black", alpha = 0.7) +
  scale_x_continuous(labels = label_comma())+
  labs(title = "Distribution of SalePrice", x = "SalePrice", y = "Frequency") +
  theme_minimal()

# 2. TotalBathrooms vs SalePrice
ggplot(ames_data, aes(x = TotalBathrooms, y = SalePrice)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "TotalBathrooms vs SalePrice", x = "TotalBathrooms", y = "SalePrice") +
  theme_minimal()

# 3. TotalSquareFootage vs SalePrice
ggplot(ames_data, aes(x = TotalSquareFootage, y = SalePrice)) +
  geom_point(color = "green", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  labs(title = "TotalSquareFootage vs SalePrice", x = "TotalSquareFootage", y = "SalePrice") +
  theme_minimal()

# 4. HouseAge vs SalePrice
ggplot(ames_data, aes(x = HouseAge, y = SalePrice)) +
  geom_point(color = "purple", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "HouseAge vs SalePrice", x = "HouseAge", y = "SalePrice") +
  theme_minimal()

# 5. TotalGarageSize vs SalePrice
ggplot(ames_data, aes(x = TotalGarageSize, y = SalePrice)) +
  geom_point(color = "orange", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "TotalGarageSize vs SalePrice", x = "TotalGarageSize", y = "SalePrice") +
  theme_minimal()

# 6. TotalPorchArea vs SalePrice
ggplot(ames_data, aes(x = TotalPorchArea, y = SalePrice)) +
  geom_point(color = "brown", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "TotalPorchArea vs SalePrice", x = "TotalPorchArea", y = "SalePrice") +
  theme_minimal()

# 7. HasPool vs SalePrice (Bar plot)
ggplot(ames_data, aes(x = factor(HasPool), y = SalePrice)) +
  geom_boxplot(fill = "lightblue") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "HasPool vs SalePrice", x = "HasPool", y = "SalePrice") +
  theme_minimal()

# 8. HasFence vs SalePrice (Bar plot)
ggplot(ames_data, aes(x = factor(HasFence), y = SalePrice)) +
  geom_boxplot(fill = "lightgreen") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "HasFence vs SalePrice", x = "HasFence", y = "SalePrice") +
  theme_minimal()

# 9. HasFireplace vs SalePrice (Bar plot)
ggplot(ames_data, aes(x = factor(HasFireplace), y = SalePrice)) +
  geom_boxplot(fill = "lightcoral") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "HasFireplace vs SalePrice", x = "HasFireplace", y = "SalePrice") +
  theme_minimal()

# 10. OverallQualityScore vs SalePrice
ggplot(ames_data, aes(x = OverallQualityScore, y = SalePrice)) +
  geom_point(color = "darkred", alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue", linetype = "dashed") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "OverallQualityScore vs SalePrice", x = "OverallQualityScore", y = "SalePrice") +
  theme_minimal()

# Residuals

Code

# Calculate residuals and clean up any non-finite values
residuals_cv_clean <- na.omit(y_test - y_pred_cv)

# Plot histogram of cleaned residuals without scientific notation
ggplot(data.frame(residuals_cv_clean), aes(x = residuals_cv_clean)) +
  geom_histogram(binwidth = 50000, fill = "lightblue", color = "black") +
  scale_x_continuous(labels = scales::label_number()) +  # Remove scientific notation on x-axis
  labs(title = "Residuals Distribution (Cleaned)", x = "Residuals", y = "Frequency") +
  theme_minimal()

# Cross-Validation Curve

Code

# Plot cross-validation curve
plot(cv_lasso)
title("Cross-validation Error vs Log(Lambda)", line = 2.5)

# Coefficient Path

Code

lasso_full <- glmnet(x_train, y_train, alpha = 1)

# Plot coefficient paths
plot(lasso_full, xvar = "lambda", label = TRUE)
title("LASSO Coefficient Path")

# Output Non-Zero Coefficients

Code

coef_lasso <- coef(lasso_model)
non_zero_coef <- coef_lasso[coef_lasso[, 1] != 0, , drop = FALSE]
# Create a tidy dataframe from non-zero coefficients
non_zero_df <- as.data.frame(as.matrix(non_zero_coef))
non_zero_df$Feature <- rownames(non_zero_df)
colnames(non_zero_df)[1] <- "Coefficient"
non_zero_df <- non_zero_df[non_zero_df$Feature != "(Intercept)", ]  # Remove intercept if present
print(non_zero_coef)

7 x 1 sparse Matrix of class "dgCMatrix"
                            s0
(Intercept)         180581.729
TotalSquareFootage   35081.097
HouseAge            -19539.639
TotalGarageSize      13584.843
TotalPorchArea        2288.810
HasFireplace          3980.436
OverallQualityScore  18144.329

Code

# Plot
ggplot(non_zero_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Selected Features by LASSO", x = "Feature", y = "Coefficient") +
  theme_minimal()

Modeling and Results

In the data preprocessing and cleaning phase, we began by handling missing data for both categorical and numerical variables. Categorical variables like Alley, Pool.QC, Fence, and Misc.Feature were imputed with the category “None,” assuming that missing values represent the absence of these features. For numerical variables such as Mas.Vnr.Area and Lot.Frontage, missing values were imputed with the median of the respective columns to avoid biasing the model. Features related to basements and garages were treated by setting missing values to “None” for categorical data and 0 for numerical data, indicating the absence of basements or garages in some cases. We also engineered new features like TotalBathrooms (combining full and half baths) and OverallQualityScore (combining overall quality and condition) to better capture the key attributes of the houses. Numerical variables were standardized to ensure all features contributed equally to the model, particularly in regularization techniques like LASSO regression.

The key findings from our model show that several features significantly influence the SalePrice. The most important predictor was TotalSquareFootage, with a coefficient of 35,081.10, indicating that larger homes are strongly correlated with higher sale prices. HouseAge had a negative relationship with sale price, with an estimated decrease of 19,539.64 in price per additional year of age. OverallQualityScore also emerged as a crucial factor, with a positive coefficient of 18,144.33, suggesting that homes in better overall condition are valued higher. Features like Garage Size and the presence of a Fireplace also contributed positively to the model, with coefficients of 13,584.84 and 3,980.44, respectively. In contrast, features such as Pool, Fence, and Misc.Feature had coefficients of 0, indicating they did not significantly affect the price prediction in this model.

To visually support these findings, scatter plots and histograms were created. The distribution of SalePrice revealed a right-skewed pattern, with most homes priced below $500,000 but a few high-value outliers. A scatter plot of TotalSquareFootage vs SalePrice clearly demonstrated a positive correlation, with larger homes commanding higher prices. Additionally, a boxplot comparing HasFireplace with SalePrice highlighted that homes with fireplaces tend to sell for more, reinforcing the positive relationship between specific features and house price. These visuals effectively supported the claim that the most significant predictors of home price are related to the size and quality of the home, as well as the presence of specific desirable features.

After analyzing the housing data, we found that certain factors significantly impact house prices. The most important ones are:

Total Square Footage: The larger the house, the higher the price. House Age: Older homes tend to have a lower price, likely due to wear and tear or outdated features. Overall Quality Score: Homes with higher quality ratings generally have higher prices. Garage Size: Bigger garages can increase a home’s value, reflecting the importance of storage and parking. Interestingly, some features, like the number of bathrooms or whether the house has a pool, did not significantly affect the house price in this model. These results suggest that while many factors affect housing prices, size, age, quality, and amenities like a garage are the most impactful.

By using LASSO regression, we were able to narrow down these key predictors, giving us a clear understanding of what drives house prices in this dataset. This insight can help buyers, sellers, and real estate agents make more informed decisions

Conclusion

Key Findings:

The most important predictors of sale price are TotalSquareFootage, HouseAge, OverallQualityScore, and TotalGarageSize. Homes with more space, higher quality, and larger garages tend to command higher prices.

Features such as Pool, Fence, and Misc.Feature were not significant in predicting SalePrice, likely due to either their low prevalence or limited impact.

Implications:

Homebuyers likely place the highest value on the size of the home (square footage) and its overall quality, with features such as a larger garage and fireplace adding value.

Features related to the condition of the property, like OverallQualityScore, are important in pricing, suggesting that buyers are willing to pay more for homes in better condition.

References

Fontanarosa, J. B., and Y. Dai. 2011. “Using LASSO Regression to Detect Predictive Aggregate Effects in Genetic Studies.” BMC Proceedings 5 (S9): S69. https://doi.org/10.1186/1753-6561-5-S9-S69.

Geng, Jingxuan, Chunhua Yang, Yonggang Li, Lijuan Lan, Fengxue Zhang, Jie Han, and Can Zhou. 2023. “A Bidirectional Dictionary LASSO Regression Method for Online Water Quality Detection in Wastewater Treatment Plants.” Chemometrics and Intelligent Laboratory Systems. https://www.sciencedirect.com/science/article/abs/pii/S0169743923000679.

Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. 2017. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press. https://www.crcpress.com/Statistical-Learning-with-Sparsity-The-Lasso-and-Generalizations/Hastie-Tibshirani-Wainwright/p/book/9781498762436.

Hill, Carter. 2016. “Explaining Ridge Regression and LASSO.” Journal of Applied Econometrics 31 (4): 716–31. https://onlinelibrary.wiley.com/doi/abs/10.1002/jae.2471.

Ji Hyung Lee, Zhan Gao, Zhentao Shi. 2020. “On LASSO for Predictive Regression.” Statistical Modelling 20: 213–34. https://journals.sagepub.com/doi/10.1177/1471082X20914168.

Jonas, R, and J Cook. 2018. “LASSO Regression.” British Journal of Surgery 105 (10). https://academic.oup.com/bjs/article-abstract/105/10/1348/6122951.

Shafiee, Sahameh, Lars Martin Lied, Ingunn Burud, Jon Arne Dieseth, Muath Alsheikh, and Morten Lillemo. 2021. “Sequential Forward Selection and Support Vector Regression in Comparison to LASSO Regression for Spring Wheat Yield Prediction Based on UAV Imagery.” Computers and Electronics in Agriculture. https://www.sciencedirect.com/science/article/pii/S0168169921000545 .

Spencer, Bruce, Omar Alfandi, and Feras Al-Obeidat. 2018. “A Refinement of Lasso Regression Applied to Temperature Forecasting.” Procedia Computer Science 130: 728–35. https://www.sciencedirect.com/science/article/pii/S1877050918304897.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58. https://www.jstor.org/stable/2986091.

Trevor Hastie, Martin Wainwright, Robert Tibshirani. 2015. “LASSO for High-Dimensional Regression: Methods, Theory, and Applications.” Statistical Science 30: 337–58. https://projecteuclid.org/euclid.ss/1437508694.

Xu, Huan, Constantine Caramanis, and Shie Mannor. 2012. “Robust Regression and Lasso.” IEEE Transactions on Information Theory 58: 5601–22. https://ieeexplore.ieee.org/document/6244789.