Machine Learning for Economics and Finance
Problem Set 2 - Solution
Ole Wilms
September 09, 2025
Important Instructions
- In this problem set you are asked to apply the machine learning techniques we covered in the past weeks.
- In case you struggle with some problems, please post your questions on the OpenOlat discussion board.
- We will discuss the solutions for the problem set on MONTH DAY.
Setup
Assume the same setup as in Problem Set 1, but now you try to improve the return predictions using the machine learning approaches we have discussed in class. For this you are asked to use the same training and test datasets we constructed in Problem Set 1.
Preliminary
Libraries and Dataset
import pyreadr
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score
from matplotlib.pyplot import subplots
from statsmodels.api import OLS
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from functools import partial
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_validate

# Default settings for the following matplotlib figures:
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'text.usetex': True,
    'font.family': 'serif',
    'font.size': 12,
    'xtick.color': 'black',
    'ytick.color': 'black',
    'axes.titlesize': 14,
    'axes.labelcolor': 'black',
    'axes.edgecolor': 'black',
    'axes.grid': False,
    'lines.linewidth': 2
})

# %% [2]
# Load and prepare data (your existing code)
df = pyreadr.read_r('stockmarketdata.rds')
df = df[None]
dfRAW = df.copy(deep=True)

# Lead returns (shift by -1) so that predictors at t are matched with the return at t+1
df['ret'] = df['ret'].shift(-1)

# Remove missing values
df = df.dropna()
Splitting the Dataset into Training and Test Data
# Split data into train and test sample
df['date'] = df['date'].astype(np.int64)
split_date = 19944
split_ind = df.index[df['date'] == split_date][0]

train_data = df.loc[:split_ind]
train_data = train_data.drop('date', axis=1)
test_data = df.loc[split_ind + 1:]
test_data = test_data.drop('date', axis=1)

# Prepare X and y
X_train = train_data.drop(columns=['ret'])
y_train = train_data['ret']
X_test = test_data.drop(columns=['ret'])
y_test = test_data['ret']
Extra: Data Exploration and Visualization
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 272 entries, 92 to 363
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    272 non-null    int64
 1   ret     272 non-null    float64
 2   DP      272 non-null    float64
 3   CS      272 non-null    float64
 4   ntis    272 non-null    float64
 5   cay     272 non-null    float64
 6   TS      272 non-null    float64
 7   svar    272 non-null    float64
dtypes: float64(7), int64(1)
memory usage: 27.2 KB
None
# Descriptive Statistics of the Data
print(df.describe().T)
      count          mean         std           min           25%           50%           75%           max
date  272.0  19857.500000  196.642857  19521.000000  19689.250000  19857.500000  20025.750000  20194.000000
ret   272.0      0.028608    0.078468     -0.252154     -0.016725      0.035821      0.076914      0.228211
DP    272.0     -3.553095    0.404929     -4.493159     -3.906620     -3.498891     -3.279654     -2.778536
CS    272.0      0.009002    0.003875      0.003243      0.006465      0.008044      0.010598      0.031668
ntis  272.0      0.011831    0.019359     -0.051831      0.001356      0.015415      0.025945      0.048391
cay   272.0      0.002189    0.022592     -0.047607     -0.016987      0.007717      0.018797      0.042897
TS    272.0      0.016710    0.013993     -0.035000      0.006725      0.016000      0.027325      0.045300
svar  272.0      0.005822    0.009721      0.000370      0.002142      0.003476      0.005951      0.114436
# Histogram of all columns except "date"
axes = df.loc[:, df.columns != "date"].hist(
    grid=False,
    edgecolor="black",
    figsize=(10, 8)
)

# Style all subplots (axes is a 2D array, so flatten it first)
for ax in np.ravel(axes):
    ax.tick_params(axis="x", colors="black")
    ax.tick_params(axis="y", colors="black")
    ax.title.set_color("black")
    ax.xaxis.label.set_color("black")
    ax.yaxis.label.set_color("black")
    ax.spines["bottom"].set_color("black")
    ax.spines["left"].set_color("black")
    ax.spines["top"].set_color("black")
    ax.spines["right"].set_color("black")

plt.tight_layout(rect=[0, 0, 1, 0.96])  # Leave some space for the title
# 'fname' is the output file path supplied externally (e.g. via the org-babel header)
plt.savefig(fname, transparent=True)
plt.close()
fname
Linear Model
# Train linear model for comparison (your existing code)
model_all = LinearRegression()
model_all.fit(X_train, y_train)
y_pred_train_lm = model_all.predict(X_train)
mse_train_lm = mean_squared_error(y_train, y_pred_train_lm)
y_pred_test_lm = model_all.predict(X_test)
mse_test_lm = mean_squared_error(y_test, y_pred_test_lm)

# print("LINEAR MODEL RESULTS:")
print(f"In-sample MSE: {mse_train_lm:.5f}")
print(f"Out-of-sample MSE: {mse_test_lm:.5f}")
In-sample MSE: 0.00505
Out-of-sample MSE: 0.00861
1. Question-Set-1: Shrinkage Methods
1.1. Ridge Regression
Fit a ridge regression using the training data. Determine the optimal penalty parameter \(\lambda\) using 5-fold cross validation (set the seed to \(2\) before you run the CV). Provide a plot of the cross-validation MSE as a function of log(\(\lambda\)) and interpret the outcome.
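For reference, the ridge estimator penalizes the residual sum of squares with the squared \(\ell_2\)-norm of the coefficients (standard notation as used in class, not reproduced from the problem set itself):

\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 .
\]

Larger values of \(\lambda\) shrink all coefficients toward zero but, unlike the Lasso, never set them exactly to zero.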
# Set up Ridge regression with cross-validation
np.random.seed(2)  # Set seed for reproducibility

# Define lambda grid (called alpha in sklearn)
lambda_grid = np.logspace(-6, 2, 100)  # 100 values from 10^-6 to 10^2

# Perform Ridge regression with 5-fold cross-validation
# Note: store_cv_results only works with cv=None, so we'll use manual CV for plotting
ridge_cv = RidgeCV(alphas=lambda_grid, cv=5, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train, y_train)

# Get best lambda (alpha in sklearn)
best_lambda_ridge = ridge_cv.alpha_
print(f"Best lambda (Ridge): {best_lambda_ridge:.6f}")
print(f"Log of best lambda: {np.log(best_lambda_ridge):.4f}")
Best lambda (Ridge): 0.040370
Log of best lambda: -3.2097
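Note that the code above fits the ridge regression on the raw predictors, whereas glmnet in R standardizes predictors by default, so the selected \(\lambda\) need not match an R solution exactly. A minimal sketch of how one could mimic that behaviour with the already imported StandardScaler and Pipeline (an optional variant under that assumption, not part of the original solution):

# Optional variant: standardize predictors before Ridge, as R's glmnet does by default
ridge_std = Pipeline([
    ('scale', StandardScaler()),
    ('ridge', RidgeCV(alphas=lambda_grid, cv=5, scoring='neg_mean_squared_error'))
])
ridge_std.fit(X_train, y_train)
print(f"Best lambda with standardized predictors: {ridge_std.named_steps['ridge'].alpha_:.6f}")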
from sklearn.model_selection import cross_validate

# Manual cross-validation to get detailed results for plotting (like R's cv.glmnet)
print("Performing detailed cross-validation for plotting...")

cv_mse_mean = []
cv_mse_std = []

for alpha in lambda_grid:
    ridge_temp = Ridge(alpha=alpha)
    cv_results = cross_validate(ridge_temp, X_train, y_train,
                                cv=5, scoring='neg_mean_squared_error',
                                return_train_score=False)
    cv_scores = -cv_results['test_score']  # Convert to positive MSE
    cv_mse_mean.append(cv_scores.mean())
    cv_mse_std.append(cv_scores.std())

cv_mse_mean = np.array(cv_mse_mean)
cv_mse_std = np.array(cv_mse_std)

# Find the index of best_lambda_ridge in the grid:
idx_best = np.where(lambda_grid == best_lambda_ridge)[0][0]
best_mse = cv_mse_mean[idx_best]

# Find the index of the 1SE lambda:
# threshold = best MSE + one standard error
threshold = best_mse + cv_mse_std[idx_best]

# Candidate lambdas whose CV-MSE is <= threshold
candidates = np.where(cv_mse_mean <= threshold)[0]

# Take the largest lambda (i.e. the simplest / most regularized candidate)
idx_1se = candidates[-1]
best_lambda_1se = lambda_grid[idx_1se]
best_mse_1se = cv_mse_mean[idx_1se]

# --- Secondary axis: compute the number of variables per lambda ---
n_nonzero = []
for alpha in lambda_grid:
    ridge_temp = Ridge(alpha=alpha)
    ridge_temp.fit(X_train, y_train)
    # Ridge keeps all variables, but we still display the number of features
    n_nonzero.append(X_train.shape[1])  # Always all features for Ridge
    # Alternative: show the "effective" number of parameters based on coefficient size
    # n_nonzero.append(np.sum(np.abs(ridge_temp.coef_) > np.max(np.abs(ridge_temp.coef_)) * 0.01))

plt.style.use('default')
# fig, ax1 so that a secondary x-axis can be added on top
fig, ax1 = plt.subplots(figsize=(10, 6))

ax1.errorbar(np.log(lambda_grid),
             cv_mse_mean,
             yerr=cv_mse_std,
             capsize=3,
             color="red",     # markers/line in red
             ecolor="grey",   # error bars in grey
             elinewidth=1,    # (optional) width of the error bars
             fmt='o',         # (optional) '-o' would draw a line through the points
             markersize=4,
             )

ax1.axvline(np.log(best_lambda_ridge),
            color='black',
            linestyle='--',
            linewidth=1,
            label=(f'Best log(λ) = {np.log(best_lambda_ridge):.3f}\n'
                   f'(Best λ = {best_lambda_ridge:.6f})\n'
                   f'CV-MSE = {best_mse:.5f}'
                   )
            )

ax1.axvline(np.log(best_lambda_1se),
            color='darkgrey',
            linestyle='--',
            linewidth=1,
            label=(f'1SE log(λ) = {np.log(best_lambda_1se):.3f}\n'
                   f'(1SE λ = {best_lambda_1se:.6f})\n'
                   f'CV-MSE = {best_mse_1se:.5f}'
                   )
            )

ax1.set_xlabel('log(λ)', fontsize=12)
ax1.set_ylabel('Cross-Validation MSE', fontsize=12)
# ax1.set_title('Ridge Regression: Cross-Validation MSE vs log(λ)')
ax1.grid(False)

# Legend below the plot, centered, entries side by side
ax1.legend(bbox_to_anchor=(0.5, -0.15),
           loc='upper center',
           ncol=3,
           frameon=True,
           framealpha=0  # 0 = invisible, 1 = fully visible
           )

# Secondary x-axis on top showing the number of variables
ax2 = ax1.twiny()
ax2.set_xlim(ax1.get_xlim())

# Set ticks and labels for the number of variables
log_lambdas = np.log(lambda_grid)
# Show at most 20 ticks to avoid overlap
n_ticks = min(20, len(lambda_grid))
tick_indices = np.linspace(0, len(lambda_grid)-1, n_ticks, dtype=int)

ax2.set_xticks(log_lambdas[tick_indices])
ax2.set_xticklabels([str(n_nonzero[i]) for i in tick_indices])
ax2.set_xlabel('Number of Variables', fontsize=12)

# Pointing the ticks inward (as in R) would be:
# ax2.tick_params(axis='x', direction='in', pad=-15)
ax2.tick_params(axis='x', direction='out')

plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
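The "1SE" line in the plot above follows the usual one-standard-error rule (standard definition, not taken from the problem set): among all \(\lambda\) whose cross-validation error lies within one standard error of the minimum, pick the largest, i.e. the most regularized model,

\[
\lambda_{1\text{SE}} = \max\Big\{\lambda : \text{CV}(\lambda) \le \text{CV}(\lambda_{\min}) + \text{SE}(\lambda_{\min})\Big\}.
\]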
1.2. Ridge Regression MSE
Prepare a slide with a table that reports training MSE and test MSE for different models. Fill in the MSE from the linear model using all features from Problem Set 1. Now compute the training and test MSE for the ridge regression with the optimal penalty parameter \(\lambda\) from Q1.1.
# Fit Ridge regression with optimal lambda
ridge_optimal = Ridge(alpha=best_lambda_ridge)
ridge_optimal.fit(X_train, y_train)

# Calculate training and test MSE
y_pred_ridge_train = ridge_optimal.predict(X_train)
mse_ridge_train = mean_squared_error(y_train, y_pred_ridge_train)

y_pred_ridge_test = ridge_optimal.predict(X_test)
mse_ridge_test = mean_squared_error(y_test, y_pred_ridge_test)

print(f"Ridge Train MSE: {mse_ridge_train:.5f}")
print(f"Ridge Test MSE: {mse_ridge_test:.5f}")

# Show coefficients
ridge_coefs = pd.DataFrame({
    'Variable': X_train.columns,
    'Ridge_Coefficient': ridge_optimal.coef_
})
print("\nRidge Coefficients:")
print(ridge_coefs)
Ridge Train MSE: 0.00511
Ridge Test MSE: 0.00878

Ridge Coefficients:
  Variable  Ridge_Coefficient
0       DP           0.086305
1       CS           0.064603
2     ntis          -0.260392
3      cay           0.389788
4       TS           0.327515
5     svar           0.156850
1.3. Lasso Regression
Redo the two tasks above using Lasso instead of Ridge. Again fix the seed to \(2\). Provide a plot of the cross-validation MSE as a function of log(\(\lambda\)) and interpret. Provide a table that shows the coefficients of the Lasso with the optimal penalty parameter \(\lambda\). Compute the training and test MSE of this Lasso model and add it to the table from Q1.2.
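The Lasso replaces the squared penalty of the ridge objective with an \(\ell_1\)-penalty (standard notation, not reproduced from the problem set), which is what allows individual coefficients to be set exactly to zero:

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j| .
\]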
# Lasso with cross-validation
lasso_cv = LassoCV(alphas=lambda_grid, cv=5, random_state=2, max_iter=10000)
lasso_cv.fit(X_train, y_train)

# Get best lambda
best_lambda_lasso = lasso_cv.alpha_
print(f"Best lambda (Lasso): {best_lambda_lasso:.6f}")
print(f"Log of best lambda: {np.log(best_lambda_lasso):.4f}")
Best lambda (Lasso): 0.000152
Log of best lambda: -8.7917
# Manual cross-validation for plotting (to match R's detailed CV output)
print("Performing detailed cross-validation for Lasso plotting...")
cv_mse_lasso = []
cv_mse_lasso_std = []

for alpha in lambda_grid:
    lasso_temp = Lasso(alpha=alpha, random_state=2, max_iter=10000)
    cv_results = cross_validate(lasso_temp, X_train, y_train,
                                cv=5, scoring='neg_mean_squared_error',
                                return_train_score=False)
    cv_scores = -cv_results['test_score']  # Convert to positive MSE
    cv_mse_lasso.append(cv_scores.mean())
    cv_mse_lasso_std.append(cv_scores.std())

cv_mse_lasso = np.array(cv_mse_lasso)
cv_mse_lasso_std = np.array(cv_mse_lasso_std)

# Find the index of best_lambda_lasso in the grid:
idx_best_lasso = np.where(lambda_grid == best_lambda_lasso)[0][0]
best_mse_lasso = cv_mse_lasso[idx_best_lasso]

# Find the index of the 1SE lambda:
# threshold = best MSE + one standard error
threshold_lasso = best_mse_lasso + cv_mse_lasso_std[idx_best_lasso]
# Candidate lambdas whose CV-MSE is <= threshold
candidates_lasso = np.where(cv_mse_lasso <= threshold_lasso)[0]
# Take the largest lambda (i.e. the simplest / most regularized candidate)
idx_1se_lasso = candidates_lasso[-1]
best_lambda_1se_lasso = lambda_grid[idx_1se_lasso]
best_mse_1se_lasso = cv_mse_lasso[idx_1se_lasso]

# --- Secondary axis: compute the number of variables per lambda ---
n_nonzero_lasso = []
for alpha in lambda_grid:
    lasso_temp = Lasso(alpha=alpha, random_state=2, max_iter=10000)
    lasso_temp.fit(X_train, y_train)
    # For Lasso: count the non-zero coefficients
    n_nonzero_lasso.append(np.sum(np.abs(lasso_temp.coef_) > 1e-10))

plt.style.use('default')
# fig, ax1 so that a secondary x-axis can be added on top
fig, ax1 = plt.subplots(figsize=(10, 6))

ax1.errorbar(np.log(lambda_grid),
             cv_mse_lasso,
             yerr=cv_mse_lasso_std,
             capsize=3,
             color="red",     # markers/line in red
             ecolor="grey",   # error bars in grey
             elinewidth=1,    # (optional) width of the error bars
             fmt='o',         # (optional) '-o' would draw a line through the points
             markersize=4,
             )

ax1.axvline(np.log(best_lambda_lasso),
            color='black',
            linestyle='--',
            linewidth=1,
            label=(f'Best log(λ) = {np.log(best_lambda_lasso):.3f}\n'
                   f'(Best λ = {best_lambda_lasso:.6f})\n'
                   f'CV-MSE = {best_mse_lasso:.5f}'
                   )
            )

ax1.axvline(np.log(best_lambda_1se_lasso),
            color='darkgrey',
            linestyle='--',
            linewidth=1,
            label=(f'1SE log(λ) = {np.log(best_lambda_1se_lasso):.3f}\n'
                   f'(1SE λ = {best_lambda_1se_lasso:.6f})\n'
                   f'CV-MSE = {best_mse_1se_lasso:.5f}'
                   )
            )

ax1.set_xlabel('log(λ)', fontsize=12)
ax1.set_ylabel('Cross-Validation MSE', fontsize=12)
# ax1.set_title('Lasso Regression: Cross-Validation MSE vs log(λ)')
ax1.grid(False)

# Legend below the plot, centered, entries side by side
ax1.legend(bbox_to_anchor=(0.5, -0.15),
           loc='upper center',
           ncol=3,
           frameon=True,
           framealpha=0  # 0 = invisible, 1 = fully visible
           )

# Secondary x-axis on top showing the number of variables
ax2 = ax1.twiny()
ax2.set_xlim(ax1.get_xlim())

# Set ticks and labels for the number of variables
log_lambdas_lasso = np.log(lambda_grid)
# Show at most 20 ticks to avoid overlap
n_ticks = min(20, len(lambda_grid))
tick_indices = np.linspace(0, len(lambda_grid)-1, n_ticks, dtype=int)

ax2.set_xticks(log_lambdas_lasso[tick_indices])
ax2.set_xticklabels([str(n_nonzero_lasso[i]) for i in tick_indices])
ax2.set_xlabel('Number of Variables', fontsize=12)

# Point the ticks outward (as in the Ridge plot)
ax2.tick_params(axis='x', direction='out')

plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
# Fit Lasso with optimal lambda
lasso_optimal = Lasso(alpha=best_lambda_lasso, random_state=2, max_iter=10000)
lasso_optimal.fit(X_train, y_train)

# Calculate training and test MSE
y_pred_lasso_train = lasso_optimal.predict(X_train)
mse_lasso_train = mean_squared_error(y_train, y_pred_lasso_train)

y_pred_lasso_test = lasso_optimal.predict(X_test)
mse_lasso_test = mean_squared_error(y_test, y_pred_lasso_test)

print(f"Lasso Train MSE: {mse_lasso_train:.5f}")
print(f"Lasso Test MSE: {mse_lasso_test:.5f}")
Lasso Train MSE: 0.00524
Lasso Test MSE: 0.00916
# Show coefficients
lasso_coefs = pd.DataFrame({
    'Variable': X_train.columns,
    'Lasso_Coefficient': lasso_optimal.coef_
})
print("Lasso Coefficients:")
print(lasso_coefs)

# Count non-zero coefficients
non_zero_coefs = np.sum(lasso_optimal.coef_ != 0)
print(f"\nNumber of non-zero coefficients: {non_zero_coefs}")
Lasso Coefficients:
  Variable  Lasso_Coefficient
0       DP           0.085394
1       CS           0.000000
2     ntis          -0.000000
3      cay           0.443235
4       TS           0.000000
5     svar           0.000000

Number of non-zero coefficients: 2
1.4. Sparse Lasso Regression (3 Variables)
Now suppose your boss tells you that he only trusts sparse models with few variables. Use the Lasso and choose the tuning parameter \(\lambda\) such that the model only considers \(3\) out of the \(6\) variables. Report the coefficients and compare them to the coefficients from the optimal model from Q1.3 and interpret. Compute the training and test MSE of this Lasso model and add it to the table from Q1.2. Interpret.
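Besides the grid search used in the solution code below, the active set can also be read directly off the Lasso regularization path. A minimal sketch, assuming X_train and y_train from above; sklearn.linear_model.lasso_path and the chosen number of grid points are illustrative, not part of the original solution:

import numpy as np
from sklearn.linear_model import lasso_path

# Coefficients along the Lasso path (alphas are returned in decreasing order)
alphas, coefs, _ = lasso_path(X_train, y_train, n_alphas=200)

# For each alpha, count how many coefficients are non-zero
n_active = (np.abs(coefs) > 1e-10).sum(axis=0)

# Candidate alphas where exactly 3 variables are active
mask = n_active == 3
if mask.any():
    lambda_3vars = alphas[mask][0]  # largest such alpha (strongest penalty)
    print(f"Largest lambda with 3 active variables: {lambda_3vars:.6f}")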
# Find lambda that gives exactly 3 non-zero coefficients
# We'll search through different lambda values
lambda_test_range = np.logspace(-4, -1, 1000)  # More focused range
n_features_list = []

for alpha in lambda_test_range:
    lasso_temp = Lasso(alpha=alpha, random_state=2, max_iter=10000)
    lasso_temp.fit(X_train, y_train)
    n_features = np.sum(lasso_temp.coef_ != 0)
    n_features_list.append(n_features)

# Find alpha that gives exactly 3 features
target_features = 3
suitable_alphas = [alpha for alpha, n_feat in zip(lambda_test_range, n_features_list)
                   if n_feat == target_features]

if suitable_alphas:
    # Use the middle value from suitable alphas
    sparse_lambda = suitable_alphas[len(suitable_alphas)//2]
    print(f"Lambda for 3 variables: {sparse_lambda:.6f}")

    # Fit sparse Lasso
    lasso_sparse = Lasso(alpha=sparse_lambda, random_state=2, max_iter=10000)
    lasso_sparse.fit(X_train, y_train)

    # Calculate MSE
    y_pred_lasso_sparse_train = lasso_sparse.predict(X_train)
    mse_lasso_sparse_train = mean_squared_error(y_train, y_pred_lasso_sparse_train)

    y_pred_lasso_sparse_test = lasso_sparse.predict(X_test)
    mse_lasso_sparse_test = mean_squared_error(y_test, y_pred_lasso_sparse_test)

    print(f"Sparse Lasso Train MSE: {mse_lasso_sparse_train:.5f}")
    print(f"Sparse Lasso Test MSE: {mse_lasso_sparse_test:.5f}")

    # Show coefficients
    sparse_lasso_coefs = pd.DataFrame({
        'Variable': X_train.columns,
        'Sparse_Lasso_Coefficient': lasso_sparse.coef_
    })
    print("\nSparse Lasso Coefficients (3 variables):")
    print(sparse_lasso_coefs)

    # Show which variables are selected
    selected_vars = X_train.columns[lasso_sparse.coef_ != 0].tolist()
    print(f"\nSelected variables: {selected_vars}")

else:
    print("Could not find lambda that gives exactly 3 variables")
    # Use a reasonable approximation
    sparse_lambda = 0.0125  # From R code
    lasso_sparse = Lasso(alpha=sparse_lambda, random_state=2, max_iter=10000)
    lasso_sparse.fit(X_train, y_train)

    y_pred_lasso_sparse_train = lasso_sparse.predict(X_train)
    mse_lasso_sparse_train = mean_squared_error(y_train, y_pred_lasso_sparse_train)

    y_pred_lasso_sparse_test = lasso_sparse.predict(X_test)
    mse_lasso_sparse_test = mean_squared_error(y_test, y_pred_lasso_sparse_test)

    print(f"Sparse Lasso Train MSE: {mse_lasso_sparse_train:.5f}")
    print(f"Sparse Lasso Test MSE: {mse_lasso_sparse_test:.5f}")

    n_selected = np.sum(lasso_sparse.coef_ != 0)
    print(f"Number of selected variables: {n_selected}")
Lambda for 3 variables: 0.000128
Sparse Lasso Train MSE: 0.00522
Sparse Lasso Test MSE: 0.00925

Sparse Lasso Coefficients (3 variables):
  Variable  Sparse_Lasso_Coefficient
0       DP                  0.086918
1       CS                  0.000000
2     ntis                 -0.000000
3      cay                  0.481694
4       TS                  0.023727
5     svar                  0.000000

Selected variables: ['DP', 'cay', 'TS']
# Comprehensive results table
results_dict = {
    'Model': ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Sparse Lasso (3 vars)'],
    'In-sample MSE': [mse_train_lm, mse_ridge_train, mse_lasso_train, mse_lasso_sparse_train],
    'Out-of-sample MSE': [mse_test_lm, mse_ridge_test, mse_lasso_test, mse_lasso_sparse_test]
}

results_df = pd.DataFrame(results_dict)
results_df = results_df.round(5)

print(results_df.to_string(index=False))
                Model  In-sample MSE  Out-of-sample MSE
    Linear Regression        0.00505            0.00861
     Ridge Regression        0.00511            0.00878
     Lasso Regression        0.00524            0.00916
Sparse Lasso (3 vars)        0.00522            0.00925
# Comparison table of all coefficients
coef_comparison = pd.DataFrame({
    'Variable': X_train.columns,
    'Linear': model_all.coef_,
    'Ridge': ridge_optimal.coef_,
    'Lasso_Optimal': lasso_optimal.coef_,
    'Lasso_Sparse': lasso_sparse.coef_
}).round(6)

print(coef_comparison.to_string(index=False))
Variable    Linear     Ridge  Lasso_Optimal  Lasso_Sparse
      DP  0.083873  0.086305       0.085394      0.086918
      CS  0.474950  0.064603       0.000000      0.000000
    ntis -0.394479 -0.260392      -0.000000     -0.000000
     cay  0.421456  0.389788       0.443235      0.481694
      TS  0.631040  0.327515       0.000000      0.023727
    svar  0.802650  0.156850       0.000000      0.000000
# Lasso coefficient path
lasso_path = Lasso(random_state=2, max_iter=10000)
lambda_path = np.logspace(-4, 0, 100)
coefs_lasso = []
n_nonzero_path_lasso = []  # track the number of non-zero coefficients

for alpha in lambda_path:
    lasso_path.set_params(alpha=alpha)
    lasso_path.fit(X_train, y_train)
    coefs_lasso.append(lasso_path.coef_.copy())
    # count the non-zero coefficients
    n_nonzero_path_lasso.append(np.sum(np.abs(lasso_path.coef_) > 1e-10))

coefs_lasso = np.array(coefs_lasso)

plt.style.use('default')
# fig, ax1 so that a secondary x-axis can be added on top
fig, ax1 = plt.subplots(figsize=(10, 6))

# Lasso path - plot the coefficient paths
for i, feature in enumerate(X_train.columns):
    ax1.plot(np.log(lambda_path), coefs_lasso[:, i], label=feature)

ax1.set_xlabel('log(λ)', fontsize=12)
ax1.set_ylabel('Coefficients', fontsize=12)
# ax1.set_title('Lasso Coefficient Paths', fontsize=14)
ax1.legend()
ax1.grid(False)

# Secondary x-axis on top showing the number of variables
ax2 = ax1.twiny()
ax2.set_xlim(ax1.get_xlim())

# Set ticks and labels for the number of variables
log_lambdas_path = np.log(lambda_path)
# Show at most 15 ticks to avoid overlap
n_ticks = min(15, len(lambda_path))
tick_indices = np.linspace(0, len(lambda_path)-1, n_ticks, dtype=int)

ax2.set_xticks(log_lambdas_path[tick_indices])
ax2.set_xticklabels([str(n_nonzero_path_lasso[i]) for i in tick_indices])
ax2.set_xlabel('Number of Variables', fontsize=12)

# Point the ticks outward
ax2.tick_params(axis='x', direction='out')

plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
# Ridge coefficient path
ridge_path = Ridge()
coefs_ridge = []
n_nonzero_path_ridge = []  # track the number of non-zero coefficients

for alpha in lambda_path:
    ridge_path.set_params(alpha=alpha)
    ridge_path.fit(X_train, y_train)
    coefs_ridge.append(ridge_path.coef_.copy())
    # count the non-zero coefficients
    n_nonzero_path_ridge.append(np.sum(np.abs(ridge_path.coef_) > 1e-10))

coefs_ridge = np.array(coefs_ridge)

plt.style.use('default')
# fig, ax1 so that a secondary x-axis can be added on top
fig, ax1 = plt.subplots(figsize=(10, 6))

# Ridge path - plot the coefficient paths
for i, feature in enumerate(X_train.columns):
    ax1.plot(np.log(lambda_path), coefs_ridge[:, i], label=feature)

ax1.set_xlabel('log(λ)', fontsize=12)
ax1.set_ylabel('Coefficients', fontsize=12)
# ax1.set_title('Ridge Coefficient Paths', fontsize=14)
ax1.legend()
ax1.grid(False)

# Secondary x-axis on top showing the number of variables
ax2 = ax1.twiny()
ax2.set_xlim(ax1.get_xlim())

# Set ticks and labels for the number of variables
log_lambdas_path = np.log(lambda_path)
# Show at most 15 ticks to avoid overlap
n_ticks = min(15, len(lambda_path))
tick_indices = np.linspace(0, len(lambda_path)-1, n_ticks, dtype=int)

ax2.set_xticks(log_lambdas_path[tick_indices])
ax2.set_xticklabels([str(n_nonzero_path_ridge[i]) for i in tick_indices])
ax2.set_xlabel('Number of Variables', fontsize=12)

# Point the ticks outward
ax2.tick_params(axis='x', direction='out')

plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
2. Question-Set-2: Tree-Based Methods
from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, validation_curve
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import seaborn as sns
2.1. Large Regression Tree
Fit a large regression tree using the training data. Report the number of terminal nodes as well as the most important variables for splitting the tree.
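For interpreting the importance values reported below: at every node the tree picks the predictor \(j\) and cutpoint \(s\) that give the largest reduction in the residual sum of squares (standard notation, not from the problem set),

\[
\min_{j,s}\;\Bigg[\sum_{i:\,x_i\in R_1(j,s)}\big(y_i-\hat{y}_{R_1}\big)^2 + \sum_{i:\,x_i\in R_2(j,s)}\big(y_i-\hat{y}_{R_2}\big)^2\Bigg],
\]

and sklearn's feature_importances_ aggregates these impurity reductions over all splits in which a variable is used.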
# Fit a large regression tree (minimal restrictions to allow a deep tree)
# In sklearn, we need to set parameters to allow a "large" tree similar to R's tree()
large_tree = DecisionTreeRegressor(
    random_state=2,
    min_samples_split=2,      # Minimum samples to split (very low to allow a deep tree)
    min_samples_leaf=1,       # Minimum samples in a leaf (very low)
    max_depth=None,           # No depth limit initially
    min_impurity_decrease=0   # No minimum impurity decrease required
)

large_tree.fit(X_train, y_train)

# Get tree information
n_nodes = large_tree.tree_.node_count
n_leaves = large_tree.get_n_leaves()
max_depth = large_tree.get_depth()

print(f"Number of nodes: {n_nodes}")
print(f"Number of terminal nodes (leaves): {n_leaves}")
print(f"Maximum depth: {max_depth}")
Number of nodes: 343
Number of terminal nodes (leaves): 172
Maximum depth: 16
# Feature importance (equivalent to R's splitting importance)
feature_importance = pd.DataFrame({
    'Variable': X_train.columns,
    'Importance': large_tree.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (most important variables for splitting):")
print(feature_importance)

# Most important variable
most_important_var = feature_importance.iloc[0]['Variable']
print(f"\nMost important variable for splitting: {most_important_var}")
Feature Importance (most important variables for splitting):
  Variable  Importance
4       TS    0.233574
0       DP    0.228339
5     svar    0.194723
1       CS    0.186572
3      cay    0.117219
2     ntis    0.039573

Most important variable for splitting: TS
# Visualize the tree (showing only top levels due to size)
plt.style.use('default')
plt.figure(figsize=(20, 12))
plot_tree(large_tree,
          feature_names=X_train.columns,
          filled=True,
          max_depth=3,   # Show only first 3 levels for readability
          fontsize=10)
# plt.title("Large Regression Tree (First 3 levels)")
plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
2.2. Train and Test MSE for the Large Tree
Compute the training and test MSE of the tree and add it to the table from Q1.2.
# Calculate training MSE
y_pred_large_tree_train = large_tree.predict(X_train)
mse_large_tree_train = mean_squared_error(y_train, y_pred_large_tree_train)

# Calculate test MSE
y_pred_large_tree_test = large_tree.predict(X_test)
mse_large_tree_test = mean_squared_error(y_test, y_pred_large_tree_test)

print(f"Large Tree Train MSE: {mse_large_tree_train:.6f}")
print(f"Large Tree Test MSE: {mse_large_tree_test:.6f}")
Large Tree Train MSE: 0.000000
Large Tree Test MSE: 0.045659
2.3. Cross-Validation for Optimal Tree Pruning
Again set the seed to \(2\) and use 5-fold cross validation to determine the optimal pruning parameter for the large tree. Provide a plot of the prediction error against the size of the tree. Report the optimal tree size and provide a plot of the pruned tree. Which variables are important for splitting the pruned tree?
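The sklearn solution below uses cost-complexity pruning, which penalizes the in-sample fit of a subtree \(T\) by its number of terminal nodes \(|T|\) (standard notation, not from the problem set):

\[
C_\alpha(T) = \sum_{m=1}^{|T|}\;\sum_{i:\,x_i\in R_m}\big(y_i-\hat{y}_{R_m}\big)^2 + \alpha\,|T|.
\]

Larger values of the penalty \(\alpha\) (ccp_alpha in sklearn) favor smaller trees; the cross-validation below selects \(\alpha\) and hence the tree size.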
# Set seed for reproducibility
np.random.seed(2)

# In sklearn, pruning is done by varying max_leaf_nodes or ccp_alpha
# We'll use cost complexity pruning (ccp_alpha) which is more similar to R's approach

# First, get the cost complexity pruning path
path = large_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Remove the last alpha value (it would result in a tree with only the root)
ccp_alphas = ccp_alphas[:-1]

# Perform cross-validation for different alpha values
cv_scores = []
tree_sizes = []

for ccp_alpha in ccp_alphas:
    tree_temp = DecisionTreeRegressor(random_state=2, ccp_alpha=ccp_alpha)
    tree_temp.fit(X_train, y_train)

    # Get tree size (number of leaves)
    tree_sizes.append(tree_temp.get_n_leaves())

    # 5-fold cross-validation
    scores = cross_val_score(tree_temp, X_train, y_train,
                             cv=5, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Convert back to positive MSE

cv_scores = np.array(cv_scores)
tree_sizes = np.array(tree_sizes)

# Find the optimal tree size (minimum CV error)
best_idx = np.argmin(cv_scores)
best_size = tree_sizes[best_idx]
best_alpha = ccp_alphas[best_idx]

print(f"Optimal tree size (number of leaves): {best_size}")
print(f"Optimal ccp_alpha: {best_alpha:.6f}")
Optimal tree size (number of leaves): 3
Optimal ccp_alpha: 0.000424
from matplotlib.ticker import MultipleLocator

# Plot prediction error against tree size
plt.style.use('default')
plt.figure(figsize=(10, 6))
plt.plot(tree_sizes, cv_scores, '-o', color='black', markersize=3)
plt.xlim(0, 20)
# x-ticks in steps of 2 (integers only)
plt.gca().xaxis.set_major_locator(MultipleLocator(2))
plt.axvline(x=best_size, color='red', linestyle='--',
            label=f'Optimal size = {best_size}')
plt.xlabel('Tree Size (Number of Leaves)')
plt.ylabel('Cross-Validation MSE')
# plt.title('Prediction Error vs Tree Size')
plt.legend()
plt.grid(False)
plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
# Fit the pruned tree with optimal alpha
pruned_tree = DecisionTreeRegressor(random_state=2, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)

print(f"Pruned tree statistics:")
print(f"  Number of leaves: {pruned_tree.get_n_leaves()}")
print(f"  Maximum depth: {pruned_tree.get_depth()}")

# Feature importance for pruned tree
pruned_importance = pd.DataFrame({
    'Variable': X_train.columns,
    'Importance': pruned_tree.feature_importances_
}).sort_values('Importance', ascending=False)
Pruned tree statistics:
  Number of leaves: 3
  Maximum depth: 2
print("Feature Importance for Pruned Tree:") +print(pruned_importance) + +# Variables used in splitting (non-zero importance) +important_vars = pruned_importance[pruned_importance['Importance'] > 0]['Variable'].tolist() +print(f"\nVariables important for splitting the pruned tree: {important_vars}") ++
Feature Importance for Pruned Tree:
  Variable  Importance
5     svar    0.511469
4       TS    0.488531
1       CS    0.000000
0       DP    0.000000
3      cay    0.000000
2     ntis    0.000000

Variables important for splitting the pruned tree: ['svar', 'TS']
# Visualize the pruned tree
plt.style.use('default')
plt.figure(figsize=(12, 8))
plot_tree(pruned_tree,
          feature_names=X_train.columns,
          filled=True,
          fontsize=12)
# plt.title("Pruned Regression Tree")
plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
2.4. Train and Test MSE for Pruned Tree
Compute the training and test MSE of the pruned tree and add it to the table from Q1.2.
# Calculate training MSE for pruned tree
y_pred_pruned_tree_train = pruned_tree.predict(X_train)
mse_pruned_tree_train = mean_squared_error(y_train, y_pred_pruned_tree_train)

# Calculate test MSE for pruned tree
y_pred_pruned_tree_test = pruned_tree.predict(X_test)
mse_pruned_tree_test = mean_squared_error(y_test, y_pred_pruned_tree_test)

print(f"Pruned Tree Train MSE: {mse_pruned_tree_train:.6f}")
print(f"Pruned Tree Test MSE: {mse_pruned_tree_test:.6f}")
Pruned Tree Train MSE: 0.004898
Pruned Tree Test MSE: 0.008131
2.5. Random Forest
Finally, use a random forest to improve the predictions. Motivate your choice for the tuning parameters. Report the training and test MSE and add it to the table from Q1.2. Which variables are most important in the random forest?
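One way to motivate the choice of mtry/max_features beyond the rules of thumb used below is to compare out-of-bag errors across candidate values. A minimal sketch, assuming X_train and y_train from above; the candidate grid and the number of trees are illustrative choices, not part of the original solution:

from sklearn.ensemble import RandomForestRegressor

# Compare OOB R^2 for different max_features values (rules of thumb: sqrt(p) or p/3)
for m in [1, 2, 3, 4, 6]:  # candidate values of max_features (p = 6 predictors here)
    rf_tmp = RandomForestRegressor(n_estimators=500, max_features=m,
                                   random_state=2, oob_score=True, n_jobs=-1)
    rf_tmp.fit(X_train, y_train)
    print(f"max_features = {m}: OOB R^2 = {rf_tmp.oob_score_:.4f}")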
# Random Forest with tuned parameters
rf_model = RandomForestRegressor(
    n_estimators=100,   # corresponds to ntree in R
    max_features=2,     # corresponds to mtry in R (number of features to consider at each split)
    random_state=2,
    n_jobs=-1,          # Use all available cores
    oob_score=True
)

rf_model.fit(X_train, y_train)

print(f"Random Forest Parameters:")
print(f"  Number of trees (n_estimators): {rf_model.n_estimators}")
print(f"  Features considered per split (max_features): {rf_model.max_features}")
print(f"  Out-of-bag score: {rf_model.oob_score_}")

# Parameter justification
print(f"\nParameter Justification:")
print(f"  - n_estimators=100: Sufficient trees for stable predictions without overfitting")
print(f"  - max_features=2: For regression, typically sqrt(p) or p/3, where p={len(X_train.columns)}")
print(f"    sqrt({len(X_train.columns)}) = {int(np.sqrt(len(X_train.columns)))}, so 2 is reasonable")
Random Forest Parameters:
  Number of trees (n_estimators): 100
  Features considered per split (max_features): 2
  Out-of-bag score: -0.009214729335368155

Parameter Justification:
  - n_estimators=100: Sufficient trees for stable predictions without overfitting
  - max_features=2: For regression, typically sqrt(p) or p/3, where p=6
    sqrt(6) = 2, so 2 is reasonable
# Plot OOB error evolution (if OOB scoring is enabled)
# Note: In "RandomForestRegressor", "oob_score_" is the OOB R^2 (coefficient of determination), not the OOB-MSE.
# Note: sklearn doesn't store the OOB evolution by default, so we'll create a simple version
oob_errors = []
for n_est in range(10, 101, 10):
    rf_temp = RandomForestRegressor(n_estimators=n_est, max_features=2,
                                    random_state=2, oob_score=True)
    rf_temp.fit(X_train, y_train)
    oob_errors.append(1 - rf_temp.oob_score_)  # Convert R^2 to "error": (1 - R^2)

best_idx = np.argmin(oob_errors)
best_size = list(range(10, 101, 10))[best_idx]
best_oob_error = oob_errors[best_idx]

plt.style.use('default')
plt.figure(figsize=(10, 6))
plt.plot(range(10, 101, 10), oob_errors, '-o', color='black', label="OOB Error")
plt.axvline(x=best_size, color='red', linestyle='--',
            label=f'Optimal size = {best_size}\nOOB Error = {best_oob_error:.4f}')

plt.xlabel('Number of Trees')
plt.ylabel('OOB Error')
# plt.title('Random Forest: OOB Error vs Number of Trees')
plt.grid(False)
plt.legend()
plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
# Calculate training and test MSE
y_pred_rf_train = rf_model.predict(X_train)
mse_rf_train = mean_squared_error(y_train, y_pred_rf_train)

y_pred_rf_test = rf_model.predict(X_test)
mse_rf_test = mean_squared_error(y_test, y_pred_rf_test)

print(f"Random Forest Train MSE: {mse_rf_train:.6f}")
print(f"Random Forest Test MSE: {mse_rf_test:.6f}")
Random Forest Train MSE: 0.000739
Random Forest Test MSE: 0.012266
# Feature importance
rf_importance = pd.DataFrame({
    'Variable': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print(rf_importance)

# Most important variable
most_important_rf = rf_importance.iloc[0]['Variable']
print(f"\nMost important variable in Random Forest: {most_important_rf}")
Feature Importance (Random Forest):
  Variable  Importance
0       DP    0.188921
3      cay    0.173580
4       TS    0.169850
1       CS    0.163270
5     svar    0.161956
2     ntis    0.142423

Most important variable in Random Forest: DP
# Variable importance plot
plt.style.use('default')
plt.figure(figsize=(10, 6))
plt.barh(range(len(rf_importance)), rf_importance['Importance'])
plt.yticks(range(len(rf_importance)), rf_importance['Variable'])
plt.xlabel('IncNodePurity')
# plt.title('Random Forest: Variable Importance Plot')
plt.gca().invert_yaxis()  # Highest importance at top
plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
# COMPREHENSIVE RESULTS TABLE
results_dict = {
    'Model': [
        'Linear Regression',
        'Ridge Regression',
        'Lasso Regression',
        'Sparse Lasso (3 vars)',
        'Large Regression Tree',
        'Pruned Regression Tree',
        'Random Forest'
    ],
    'In-sample MSE': [
        mse_train_lm,
        mse_ridge_train,
        mse_lasso_train,
        mse_lasso_sparse_train,
        mse_large_tree_train,
        mse_pruned_tree_train,
        mse_rf_train
    ],
    'Out-of-sample MSE': [
        mse_test_lm,
        mse_ridge_test,
        mse_lasso_test,
        mse_lasso_sparse_test,
        mse_large_tree_test,
        mse_pruned_tree_test,
        mse_rf_test
    ]
}

comprehensive_results = pd.DataFrame(results_dict).round(6)

# print(comprehensive_results.to_string(index=False))

# DataFrame -> list of lists -> org-mode table
table = [list(comprehensive_results.columns), None]
for row in comprehensive_results.itertuples(index=False):
    table.append(list(row))

table
| Model                  | In-sample MSE | Out-of-sample MSE |
|---|---|---|
| Linear Regression      | 0.005047 | 0.008608 |
| Ridge Regression       | 0.005111 | 0.008779 |
| Lasso Regression       | 0.005235 | 0.009159 |
| Sparse Lasso (3 vars)  | 0.005218 | 0.00925  |
| Large Regression Tree  | 0.0      | 0.045659 |
| Pruned Regression Tree | 0.004898 | 0.008131 |
| Random Forest          | 0.000739 | 0.012266 |
2.6. Model Selection Analysis
Suppose it is the beginning of \(2020\) and you have access to both the in-sample and out-of-sample errors for the different methods. Which model do you choose to predict stock markets in the future and why?
# Analysis of overfitting vs generalization
comprehensive_results['Overfitting_Gap'] = (comprehensive_results['Out-of-sample MSE'] -
                                            comprehensive_results['In-sample MSE'])

print("Overfitting Analysis (Out-of-sample MSE - In-sample MSE):")
print(comprehensive_results[['Model', 'Overfitting_Gap']].sort_values('Overfitting_Gap'))

# Find best model based on out-of-sample performance
best_model_idx = comprehensive_results['Out-of-sample MSE'].idxmin()
best_model = comprehensive_results.loc[best_model_idx, 'Model']
best_oos_mse = comprehensive_results.loc[best_model_idx, 'Out-of-sample MSE']

print(f"\nRecommended Model for Future Predictions:")
print(f"  Model: {best_model}")
print(f"  Out-of-sample MSE: {best_oos_mse:.6f}")

print(f"\nJustification:")
if best_model == 'Sparse Lasso (3 vars)':
    print("  - Lowest out-of-sample MSE indicates best generalization")
    print("  - Sparse model is interpretable and less prone to overfitting")
    print("  - Good balance between bias and variance")
elif best_model == 'Pruned Regression Tree':
    print("  - Good balance between complexity and generalization")
    print("  - Pruning reduces overfitting compared to large tree")
    print("  - Interpretable model structure")
else:
    print("  - Best out-of-sample performance")
    print("  - Shows good generalization to unseen data")
Overfitting Analysis (Out-of-sample MSE - In-sample MSE):
                    Model  Overfitting_Gap
5  Pruned Regression Tree         0.003233
0       Linear Regression         0.003561
1        Ridge Regression         0.003668
2        Lasso Regression         0.003924
3   Sparse Lasso (3 vars)         0.004032
6           Random Forest         0.011527
4   Large Regression Tree         0.045659

Recommended Model for Future Predictions:
  Model: Pruned Regression Tree
  Out-of-sample MSE: 0.008131

Justification:
  - Good balance between complexity and generalization
  - Pruning reduces overfitting compared to large tree
  - Interpretable model structure
# Visualization of model performance comparison
plt.style.use('default')
plt.figure(figsize=(12, 8))

models = comprehensive_results['Model']
train_mse = comprehensive_results['In-sample MSE']
test_mse = comprehensive_results['Out-of-sample MSE']

x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, train_mse, width, label='In-sample MSE', alpha=0.8)
plt.bar(x + width/2, test_mse, width, label='Out-of-sample MSE', alpha=0.8)

plt.xlabel('Model Types')
plt.ylabel('MSE')
# plt.title('Model Performance Comparison: In-sample vs Out-of-sample MSE')
plt.xticks(x, models, rotation=45, ha='right')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(fname, transparent=True)
plt.close()
fname
Appendix
The dataset contains the following variables:
- ret: the quarterly return of the US stock market (a value of 0.01 is a \(1\) \% return per quarter)
- date: the date in format yyyyq (\(19941\) means the first quarter of \(1994\))
- DP: the dividend-to-price ratio of the stock market (a valuation measure of whether prices are high or low relative to the dividends paid)
- CS: the credit spread, defined as the difference in yields between high-rated corporate bonds (safe investments) and low-rated corporate bonds (corporations that might go bankrupt). CS measures the additional return investors require to invest in risky firms compared to well-established firms with lower risks
- ntis: a measure of corporate issuing activity (IPOs, stock repurchases, …)
- cay: a measure of the consumption-to-wealth ratio (how much is consumed relative to total wealth)
- TS: the term spread, i.e. the difference between long-term yields on government bonds and short-term yields
- svar: a measure of the stock market variance
For a full description of the data, see Welch and Goyal (\(2007\)). Google is also very helpful if you are interested in obtaining more intuition about the variables.
References
Welch, I. and A. Goyal (2007). A Comprehensive Look at The Empirical Performance of Equity Premium Prediction. The Review of Financial Studies 21(4), 1455-1508.