major upload of (python) course material & solutions
File diff suppressed because one or more lines are too long
@@ -0,0 +1,196 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fb0c424f-1667-4fb2-baab-2d88d8abb387",
   "metadata": {},
   "source": [
    "# Preliminary setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de6396ca-e17d-4c95-8f96-1f78a09e9ce2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from ISLP import load_data\n",
    "from matplotlib.pyplot import subplots, show\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load and preprocess data\n",
    "Hitters = load_data('Hitters').dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87902d82-5336-456b-bec8-403530c75f00",
   "metadata": {
    "tags": [],
    "user_expressions": []
   },
   "source": [
    "# Task"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ce8adda-23e7-498f-9ff3-26c138903b88",
   "metadata": {},
   "source": [
    "1. Use the final model (tuning parameter) obtained from 10-fold CV, fit the model again on the full dataset, and display the corresponding coefficients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac884445-bc95-4659-b656-d9c5f821bf52",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05635216-4afb-4d0d-982a-a2af35d6bf3a",
   "metadata": {},
   "source": [
    "2. Multiply the feature Errors by $1/1000$ and fit the model from Task 1 again. Display the coefficients and interpret."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70bc0da8-6134-4d4d-ad1f-e43ea26fae3c",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6e19093-51bf-4e68-aba6-01c34905b5e4",
   "metadata": {},
   "source": [
    "3. Redo Task 2, but without standardizing the data. Refit the same model and display the coefficients. Interpret."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a38add3-642e-41a8-8b80-c3d01a63e538",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df85262d-8a38-4bf9-9dfa-0a001e117d33",
   "metadata": {},
   "source": [
    "4. Split the dataset into a training set containing $80\\%$ of the observations and a validation set containing the remaining observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0a152a8-395e-49e2-973d-252b88cd379c",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1e3e60e-0d5a-4340-ae29-9153ffdad7c8",
   "metadata": {},
   "source": [
    "5. Set up a grid for the tuning parameter $\\lambda$ and fit a Lasso regression for every value on the grid using the training data. Choose the minimum and maximum values of $\\lambda$ so that the grid lets you determine the optimal $\\lambda$ in the next task (you might need to experiment with the grid a bit)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5e0cff0-6782-40a3-8d7f-891c19bb5f4d",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21ba53c0-def1-4059-9872-27e6b437b8af",
   "metadata": {},
   "source": [
    "6. For each model (tuning parameter), compute the mean squared prediction error on the validation set. Plot the validation error as a function of $\\lambda$ and find the best model, i.e. the one that minimizes the validation error. Display the estimated coefficients of the best model and check whether some features are not selected in the final regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8323ce02-17fe-4f54-820d-030f198a34fe",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19f07912-bffd-4a19-9a92-aa1a2dc48c75",
   "metadata": {},
   "source": [
    "7. Finally, compare the best Lasso model obtained from the validation-set approach in Task 6 to the best Lasso model obtained by 5-fold cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0166113-9d31-4e42-a8df-69f2048b65af",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fd306c8-2247-4343-8c30-5dd99393c9d0",
   "metadata": {},
   "source": [
    "8. Compare the best model from Task 7 to the best ridge regression obtained from 5-fold cross-validation. How do the coefficients of the two models differ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c70e9bd-78d9-4a91-a28f-588fca65c616",
   "metadata": {},
   "outputs": [],
   "source": [
   ]
  }
 ],
 "metadata": {
  "date": " ",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  },
  "title": " ",
  "toc-autonumbering": false,
  "toc-showcode": false,
  "toc-showmarkdowntxt": false,
  "toc-showtags": false
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
File diff suppressed because one or more lines are too long
Binary file not shown.
@@ -0,0 +1,164 @@
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from ISLP import load_data

###
# Forward stepwise selection
###

# Load Hitters dataset from ISLP
Hitters = load_data('Hitters')

# Remove missing values
Hitters = Hitters.dropna()

# Create dummy variables for categorical columns
Hitters = pd.get_dummies(Hitters, drop_first=True)

# Separate response (target) and predictors
y = Hitters['Salary']
X = Hitters.drop(columns=['Salary'])

# Define the linear regression model
model = LinearRegression()

# Perform forward stepwise selection using SequentialFeatureSelector
# sfs = SequentialFeatureSelector(model, n_features_to_select=15, direction='forward', cv=5)
sfs = SequentialFeatureSelector(model, n_features_to_select=15, direction='forward')

# Fit the selector to the data
sfs.fit(X, y)

# Get the selected features
selected_features = X.columns[sfs.get_support()]

# Fit the model with the selected features
model.fit(X[selected_features], y)

# Coefficients of the selected features
coefficients = pd.DataFrame({
    'Feature': selected_features,
    'Coefficient': model.coef_
})

# Print a short summary: intercept, coefficients and R^2
print("\nIntercept:")
print(model.intercept_)

print("\nCoefficients:")
print(coefficients)

print("\nR-squared:")
print(model.score(X[selected_features], y))


###
# Validation errors for FSS
###
from sklearn.metrics import mean_squared_error as MSE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import statsmodels.api as sm

# Split the data into training and validation sets based on row indices
train_data = Hitters.iloc[:184]    # First 184 rows for training data
val_data = Hitters.iloc[184:263]   # Rows 185 to 263 for validation data

# Define X and y for both training and validation sets
X_train = train_data.drop(columns=['Salary'])
y_train = train_data['Salary']
X_val = val_data.drop(columns=['Salary'])
y_val = val_data['Salary']

# Ensure that all categorical variables are encoded as numeric
X_train = pd.get_dummies(X_train, drop_first=True).astype(float)
X_val = pd.get_dummies(X_val, drop_first=True).astype(float)

# Align columns of the validation set to match the training set
X_val = X_val.reindex(columns=X_train.columns, fill_value=0).astype(float)

# Add an intercept column to the validation data (for statsmodels)
val_data = sm.add_constant(X_val)

# Ensure the target variable is numeric
y_train_np = np.asarray(y_train).astype(float)
y_val_np = np.asarray(y_val).astype(float)


# Run forward stepwise selection using mlxtend's SequentialFeatureSelector
model2 = LinearRegression()

sfs2 = SFS(model2,
           k_features=15,
           forward=True,
           floating=False,
           scoring='neg_mean_squared_error',
           cv=0)  # No cross-validation

sfs2.fit(X_train, y_train)

# Selected feature subsets for each number of features (1 to 15)
selected_features = sfs2.subsets_

# Compute the validation mean squared error for each model
val_err = np.zeros(15)
for i in range(1, 16):
    # Get the selected feature names for this step
    feature_names = selected_features[i]['feature_names']

    # Select the corresponding features from X_train
    X_train_selected = X_train[list(feature_names)]

    # Add a constant (intercept) term
    X_train_selected = sm.add_constant(X_train_selected).astype(float)

    # Ensure the selected features are numeric
    X_train_selected_np = np.asarray(X_train_selected).astype(float)

    # Fit OLS model
    model = sm.OLS(y_train_np, X_train_selected_np).fit()

    # Predict on the validation set
    X_val_selected = val_data[list(feature_names)]
    X_val_selected_np = sm.add_constant(X_val_selected).astype(float)  # Add intercept, keep floats

    y_pred_val = model.predict(X_val_selected_np)

    # Compute the MSE on the validation set
    val_err[i - 1] = MSE(y_val_np, y_pred_val)

# Print validation errors for each model size
print("Validation Errors for each model size (1 to 15 features):")
print(val_err)

print("\nMin val_err: ", min(val_err))


##
# PLOT results
##
import matplotlib.pyplot as plt

# Find the model size with the minimum validation error
min_index = np.argmin(val_err) + 1  # +1 because indices start at 0 but model sizes start at 1

# Plot the validation errors
plt.figure(figsize=(8, 5))
plt.plot(range(1, 16), val_err, marker='o', linestyle='--', color='black')

# Highlight the minimum MSE with a red vertical line
plt.axvline(x=min_index, color='red', linestyle='-', linewidth=1.5)

# Label the axes
plt.xlabel("# Variables", fontsize=12)
plt.ylabel("Validation MSE", fontsize=12)

# Title for the plot
plt.title("Validation MSE vs Number of Variables", fontsize=14)

# Show the plot
plt.tight_layout()
plt.show()
@@ -0,0 +1,50 @@
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

# === Setup ===
# Load and preprocess data
Hitters = load_data('Hitters').dropna()
Hitters = pd.get_dummies(Hitters, drop_first=True)
y = Hitters['Salary']
X = Hitters.drop(columns=['Salary'])

# Standardize predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# === SLIDE 1: Ridge regression with fixed lambda ===
ridge_fixed = Ridge(alpha=100)
ridge_fixed.fit(X_scaled, y)
ridge_fixed_coeffs = ridge_fixed.coef_
ridge_fixed_preds = ridge_fixed.predict(X_scaled[:5])

# === SLIDE 2: Ridge regression with cross-validation to find the best lambda ===
lambdas = 10**np.linspace(10, -2, 100) * 0.5  # Log-spaced grid similar to the R lab's lambda grid
ridge_cv = RidgeCV(alphas=lambdas, scoring='neg_mean_squared_error', cv=10)
ridge_cv.fit(X_scaled, y)
best_lambda_ridge = ridge_cv.alpha_
ridge_cv_coeffs = ridge_cv.coef_
ridge_cv_preds = ridge_cv.predict(X_scaled[:5])

# === SLIDE 3: Lasso regression with cross-validation ===
lasso_cv = LassoCV(cv=10, max_iter=10000)
lasso_cv.fit(X_scaled, y)
best_lambda_lasso = lasso_cv.alpha_
lasso_cv_coeffs = lasso_cv.coef_
lasso_cv_preds = lasso_cv.predict(X_scaled[:5])

# === Create summary DataFrame ===
summary = pd.DataFrame({
    'Model': ['Ridge (lambda=100)', 'RidgeCV (best lambda)', 'LassoCV (best lambda)'],
    'Best Lambda': [100, best_lambda_ridge, best_lambda_lasso],
    'Non-zero Coefficients': [
        np.sum(ridge_fixed_coeffs != 0),
        np.sum(ridge_cv_coeffs != 0),
        np.sum(lasso_cv_coeffs != 0)
    ]
})

print(summary)