major upload of (python) course material & solutions
File diff suppressed because one or more lines are too long
@@ -0,0 +1,196 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fb0c424f-1667-4fb2-baab-2d88d8abb387",
   "metadata": {},
   "source": [
    "# Preliminary setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de6396ca-e17d-4c95-8f96-1f78a09e9ce2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from ISLP import load_data\n",
    "from matplotlib.pyplot import subplots, show\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load and preprocess data\n",
    "Hitters = load_data('Hitters').dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87902d82-5336-456b-bec8-403530c75f00",
   "metadata": {
    "tags": [],
    "user_expressions": []
   },
   "source": [
    "# Task"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ce8adda-23e7-498f-9ff3-26c138903b88",
   "metadata": {},
   "source": [
    "1. Use the final model (tuning parameter) obtained from 10-fold CV, fit the model again on the full dataset, and display the corresponding coefficients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac884445-bc95-4659-b656-d9c5f821bf52",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05635216-4afb-4d0d-982a-a2af35d6bf3a",
   "metadata": {},
   "source": [
    "2. Multiply the feature Errors by $1/1000$ and fit the model from Task 1 again. Display the coefficients and interpret."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70bc0da8-6134-4d4d-ad1f-e43ea26fae3c",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6e19093-51bf-4e68-aba6-01c34905b5e4",
   "metadata": {},
   "source": [
    "3. Redo Task 2, but without standardizing the data. Refit the same model and display the coefficients. Interpret."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a38add3-642e-41a8-8b80-c3d01a63e538",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df85262d-8a38-4bf9-9dfa-0a001e117d33",
   "metadata": {},
   "source": [
    "4. Split the dataset into a training set containing $80\\%$ of the observations and a validation set containing the remaining observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0a152a8-395e-49e2-973d-252b88cd379c",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1e3e60e-0d5a-4340-ae29-9153ffdad7c8",
   "metadata": {},
   "source": [
    "5. Set up a grid for the tuning parameter $\\lambda$ and fit a Lasso regression for every value on the grid using the training data. Choose the minimum and maximum values of $\\lambda$ so that the grid lets you determine the optimal $\\lambda$ in the next task (you might need to experiment with the grid a bit)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5e0cff0-6782-40a3-8d7f-891c19bb5f4d",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21ba53c0-def1-4059-9872-27e6b437b8af",
   "metadata": {},
   "source": [
    "6. For each model (tuning parameter), compute the mean squared prediction error on the validation set. Plot the validation error as a function of $\\lambda$ and find the best model, i.e. the one that minimizes the validation error. Display the estimated coefficients of the best model and check whether some features are not selected in the final regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8323ce02-17fe-4f54-820d-030f198a34fe",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19f07912-bffd-4a19-9a92-aa1a2dc48c75",
   "metadata": {},
   "source": [
    "7. Finally, compare the best Lasso model obtained from the validation-set approach in Task 6 to the best Lasso model obtained by 5-fold cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0166113-9d31-4e42-a8df-69f2048b65af",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fd306c8-2247-4343-8c30-5dd99393c9d0",
   "metadata": {},
   "source": [
    "8. Compare the best model from Task 7 to the best ridge regression obtained from 5-fold cross-validation. How do the coefficients of the two models differ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c70e9bd-78d9-4a91-a28f-588fca65c616",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  }
 ],
 "metadata": {
  "date": " ",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  },
  "title": " ",
  "toc-autonumbering": false,
  "toc-showcode": false,
  "toc-showmarkdowntxt": false,
  "toc-showtags": false
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
File diff suppressed because one or more lines are too long
Binary file not shown.
@@ -0,0 +1,164 @@
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from ISLP import load_data

###
# Forward stepwise selection
###

# Load Hitters dataset from ISLP
Hitters = load_data('Hitters')

# Remove missing values
Hitters = Hitters.dropna()

# Create dummy variables for categorical columns
Hitters = pd.get_dummies(Hitters, drop_first=True)

# Separate response (target) and predictors
y = Hitters['Salary']
X = Hitters.drop(columns=['Salary'])

# Define the linear regression model
model = LinearRegression()

# Perform forward stepwise selection using SequentialFeatureSelector
# sfs = SequentialFeatureSelector(model, n_features_to_select=15, direction='forward', cv=5)
sfs = SequentialFeatureSelector(model, n_features_to_select=15, direction='forward')

# Fit the selector to the data
sfs.fit(X, y)

# Get the selected features
selected_features = X.columns[sfs.get_support()]

# Fit the model with the selected features
model.fit(X[selected_features], y)

# Coefficients of the selected features
coefficients = pd.DataFrame({
    'Feature': selected_features,
    'Coefficient': model.coef_
})

# Print a short summary: intercept, coefficients and R^2
print("\nIntercept:")
print(model.intercept_)

print("\nCoefficients:")
print(coefficients)

print("\nR-squared:")
print(model.score(X[selected_features], y))


###
# Validation errors for FSS
###
from sklearn.metrics import mean_squared_error as MSE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import statsmodels.api as sm

# Split the data into training and validation sets based on row indices
train_data = Hitters.iloc[:184]    # First 184 rows for training data
val_data = Hitters.iloc[184:263]   # Rows 185 to 263 for validation data

# Define X and y for both training and validation sets
X_train = train_data.drop(columns=['Salary'])
y_train = train_data['Salary']
X_val = val_data.drop(columns=['Salary'])
y_val = val_data['Salary']

# Ensure that all categorical variables are encoded as numeric
X_train = pd.get_dummies(X_train, drop_first=True).astype(float)
X_val = pd.get_dummies(X_val, drop_first=True).astype(float)

# Align columns of the validation set to match the training set
X_val = X_val.reindex(columns=X_train.columns, fill_value=0).astype(float)

# Add an intercept column to the validation data (for statsmodels)
val_data = sm.add_constant(X_val)

# Ensure the target variable is numeric
y_train_np = np.asarray(y_train).astype(float)
y_val_np = np.asarray(y_val).astype(float)


# Run forward stepwise selection using mlxtend's SequentialFeatureSelector
model2 = LinearRegression()

sfs2 = SFS(model2,
           k_features=15,
           forward=True,
           floating=False,
           scoring='neg_mean_squared_error',
           cv=0)  # No cross-validation

sfs2.fit(X_train, y_train)

# Selected feature subsets for each number of features (1 to 15)
selected_features = sfs2.subsets_

# Compute the validation mean squared error for each model
val_err = np.zeros(15)
for i in range(1, 16):
    # Get the selected feature names for this step
    feature_names = selected_features[i]['feature_names']

    # Select the corresponding features from X_train
    X_train_selected = X_train[list(feature_names)]

    # Add a constant (intercept) term
    X_train_selected = sm.add_constant(X_train_selected).astype(float)

    # Ensure the selected features are numeric
    X_train_selected_np = np.asarray(X_train_selected).astype(float)

    # Fit OLS model
    model = sm.OLS(y_train_np, X_train_selected_np).fit()

    # Predict on the validation set
    X_val_selected = val_data[list(feature_names)]
    X_val_selected_np = sm.add_constant(X_val_selected).astype(float)  # Add intercept, keep floats

    y_pred_val = model.predict(X_val_selected_np)

    # Compute the MSE on the validation set
    val_err[i - 1] = MSE(y_val_np, y_pred_val)

# Print validation errors for each model size
print("Validation Errors for each model size (1 to 15 features):")
print(val_err)

print("\nMin val_err: ", min(val_err))


##
# PLOT results
##
import matplotlib.pyplot as plt

# Find the model size with the minimum validation error
min_index = np.argmin(val_err) + 1  # +1 because indices start at 0 but model sizes start at 1

# Plot the validation errors
plt.figure(figsize=(8, 5))
plt.plot(range(1, 16), val_err, marker='o', linestyle='--', color='black')

# Highlight the minimum MSE with a red vertical line
plt.axvline(x=min_index, color='red', linestyle='-', linewidth=1.5)

# Label the axes
plt.xlabel("# Variables", fontsize=12)
plt.ylabel("Validation MSE", fontsize=12)

# Title for the plot
plt.title("Validation MSE vs Number of Variables", fontsize=14)

# Show the plot
plt.tight_layout()
plt.show()
@@ -0,0 +1,50 @@
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

# === Setup ===
# Load and preprocess data
Hitters = load_data('Hitters').dropna()
Hitters = pd.get_dummies(Hitters, drop_first=True)
y = Hitters['Salary']
X = Hitters.drop(columns=['Salary'])

# Standardize predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# === SLIDE 1: Ridge regression with fixed lambda ===
ridge_fixed = Ridge(alpha=100)
ridge_fixed.fit(X_scaled, y)
ridge_fixed_coeffs = ridge_fixed.coef_
ridge_fixed_preds = ridge_fixed.predict(X_scaled[:5])

# === SLIDE 2: Ridge regression with cross-validation to find the best lambda ===
lambdas = 10**np.linspace(10, -2, 100) * 0.5  # Log-spaced grid similar to the R lab's lambda grid
ridge_cv = RidgeCV(alphas=lambdas, scoring='neg_mean_squared_error', cv=10)
ridge_cv.fit(X_scaled, y)
best_lambda_ridge = ridge_cv.alpha_
ridge_cv_coeffs = ridge_cv.coef_
ridge_cv_preds = ridge_cv.predict(X_scaled[:5])

# === SLIDE 3: Lasso regression with cross-validation ===
lasso_cv = LassoCV(cv=10, max_iter=10000)
lasso_cv.fit(X_scaled, y)
best_lambda_lasso = lasso_cv.alpha_
lasso_cv_coeffs = lasso_cv.coef_
lasso_cv_preds = lasso_cv.predict(X_scaled[:5])

# === Create summary DataFrame ===
summary = pd.DataFrame({
    'Model': ['Ridge (lambda=100)', 'RidgeCV (best lambda)', 'LassoCV (best lambda)'],
    'Best Lambda': [100, best_lambda_ridge, best_lambda_lasso],
    'Non-zero Coefficients': [
        np.sum(ridge_fixed_coeffs != 0),
        np.sum(ridge_cv_coeffs != 0),
        np.sum(lasso_cv_coeffs != 0)
    ]
})

print(summary)