major upload of (python) course material & solutions

2025-12-03 14:39:45 +01:00
parent 52552e20cb
commit e95a0b2ecc
39 changed files with 13598 additions and 0 deletions
--- a/2/ProblemSet2.ipynb
+++ b/2/ProblemSet2.ipynb
@@ -0,0 +1,354 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "id": "77f76980-cc4f-4837-867f-218c92a7deae",
+   "metadata": {},
+   "source": [
+    "\\vspace{-4cm}\n",
+    "\\begin{center}\n",
+    "  \\LARGE{Machine Learning for Economics and Finance}\\\\[0.5cm]\n",
+    "  \\Large{\\textbf{Problem Set 2}}\\\\[1.0cm]\n",
+    "  \\large{Ole Wilms}\\\\[0.5cm]\n",
+    "  \\large{July 29, 2024}\\\\\n",
+    "\\end{center}"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "id": "2c3a2d4e-1e5a-4fe3-88be-abd9b9152def",
+   "metadata": {},
+   "source": [
+    "\\setcounter{secnumdepth}{0}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "040dc2a4-910e-4cf5-9d1e-62fe7d0a8efd",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## Important Instructions\n",
+    "\n",
+    "- In this problem set you are asked to apply the machine learning techniques we covered in the past weeks.\n",
+    "- In case you struggle with some problems, please post your questions on the OpenOlat discussion board.\n",
+    "- We will discuss the solutions for the problem set on `MONTH DAY`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "baac6966-d67a-4a66-acec-8ef6411c4f66",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Assume the same setup as in *Problem Set 1*, but now try to improve the return predictions using\n",
+    "the machine learning approaches we have discussed in class. For this, use the same\n",
+    "training and test datasets we constructed in *Problem Set 1*."
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "id": "156ee566-f0eb-4206-a443-34a63bc6dbd8",
+   "metadata": {},
+   "source": [
+    "\\newpage"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87902d82-5336-456b-bec8-403530c75f00",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## Question 1: Shrinkage Methods\n",
+    "\n",
+    "1. Fit a ridge regression using the training data. Determine the optimal penalty parameter $\\lambda$ using $5$-fold cross-validation (set the seed to $2$ before you run the CV). Provide a plot of the cross-validation MSE as a function of log($\\lambda$) and interpret the outcome."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0770500d-74fe-48df-841c-20b9aef42883",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "73330b81-0e43-43ac-911f-4086a9f9788f",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "2. Prepare a slide with a table that reports training MSE and test MSE for different models. Fill in the MSE from the linear model using all features from Problem Set 1. Now compute the training and test MSE for the ridge regression with the optimal penalty parameter $\\lambda$ from *Q1.1*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f1b13abd-80b1-4805-b108-55d403b7ab5c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80e4160e-374a-43e1-a159-45077703658e",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "3. Redo the two tasks above using Lasso instead of Ridge. Again fix the seed to $2$. Provide a plot of the cross-validation MSE as a function of log($\\lambda$) and interpret it. Provide a table that shows the coefficients of the Lasso with the optimal penalty parameter $\\lambda$. Compute the training and test MSE of this Lasso model and add them to the table from *Q1.2*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a214f453-68d3-4b6f-bc36-dbabf5536fc3",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "03d19235-25ee-4c3b-b7bf-97cdf27d41b2",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "4. Now suppose your boss tells you that he only trusts sparse models with few variables. Use the Lasso and choose the tuning parameter $\\lambda$ such that the model uses only $3$ of the six variables. Report the coefficients, compare them to those of the optimal model from *Q1.3*, and interpret the differences. Compute the training and test MSE of this Lasso model and add them to the table from *Q1.2*. Interpret the results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e53d846-19a3-46d9-b103-f42e75a87c20",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e715dd42-7021-466d-a9c1-0c0b4efeee78",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## Question 2: Tree-Based Methods\n",
+    "\n",
+    "1. Fit a large regression tree using the training data. Report the number of terminal nodes as well as the most important variables for splitting the tree."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0207f3f9-c389-4e50-abeb-5316857ab2da",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3069027d-f53f-4348-8c0c-0885483dc8d9",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "2. Compute the training and test MSE of the tree and add it to the table from *Q1.2*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f65211c4-6864-4749-8b94-eaeea96c9cbf",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "581f7631-9c99-4143-b87e-11b43c243dd0",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "3. Again set the seed to $2$ and use $5$-fold cross-validation to determine the optimal pruning parameter for the large tree. Provide a plot of the prediction error against the size of the tree. Report the optimal tree size and provide a plot of the pruned tree. Which variables are important for splitting the pruned tree?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9801c9a3-85ba-4b70-82b6-a9bbbfcfaec4",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18a9a179-4226-4734-8bcf-554671ce85e9",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "4. Compute the training and test MSE of the pruned tree and add it to the table from *Q1.2*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b0272ea3-971d-4881-8308-9b41c38b05bd",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a7e1a79-340c-4b61-9e74-e06b4f455904",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "5. Finally, use a random forest to improve the predictions. Motivate your choice of the tuning parameters. Report the training and test MSE and add them to the table from *Q1.2*. Which variables are most important in the random forest?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e9731a27-c811-4cf2-a53d-7d49a48e1d5b",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ccecdd74-9faf-4b7a-bd23-9d3f81dcda60",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "6. Suppose it is the beginning of $2020$ and you have access to both the in-sample and out-of-sample errors for the different methods. Which model do you choose to predict stock markets in the future and why?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "151e7ae9-1f4d-47f9-87d1-9da0b030da50",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "raw",
+   "id": "2419d990-f478-4bda-8dbc-3144fbdfc917",
+   "metadata": {},
+   "source": [
+    "\\newpage"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "81cbfae3-7385-40a2-8d0d-d7db7ae9a9f5",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## Appendix\n",
+    "The dataset contains the following variables:\n",
+    "\n",
+    " - **ret**: the quarterly return of the US stock market (a value of 0.01 corresponds to a $1\\%$ return per quarter)\n",
+    " - **date**: the date in format $yyyyq$ ($19941$ means the first quarter of $1994$)\n",
+    " - **DP**: the dividend-to-price ratio of the stock market (a valuation measure of whether prices are high or low relative to the dividends paid)\n",
+    " - **CS**: the credit spread, defined as the difference in yields between high-rated corporate bonds (safe investments) and low-rated corporate bonds (corporations that might go bankrupt). CS measures the additional return investors require to invest in risky firms compared to well-established firms with lower risk\n",
+    " - **ntis**: a measure of corporate issuing activity (IPOs, stock repurchases, ...)\n",
+    " - **cay**: a measure of the wealth-to-consumption ratio (how much is consumed relative to total wealth)\n",
+    " - **TS**: the term spread, defined as the difference between the long-term yield on government bonds and short-term yields\n",
+    " - **svar**: a measure of the stock market variance\n",
+    "\n",
+    "For a full description of the data, see *Welch and Goyal* ($2007$). Google is also very helpful if you are interested in obtaining more intuition about the variables.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "db90f03c-18a4-4e7f-a31c-56f206baf5cc",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "## References\n",
+    "\n",
+    "Welch, I. and A. Goyal ($2007$, $03$). A Comprehensive Look at The Empirical Performance of Equity\n",
+    "Premium Prediction. *The Review of Financial Studies 21* ($4$), $1455$ – $1508$."
+   ]
+  }
+ ],
+ "metadata": {
+  "date": " ",
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.7"
+  },
+  "title": " ",
+  "toc-autonumbering": false,
+  "toc-showcode": false,
+  "toc-showmarkdowntxt": false,
+  "toc-showtags": false
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/2/ProblemSet2.pdf
+++ b/2/ProblemSet2.pdf
--- a/2/ProblemSet2_solution.ipynb
+++ b/2/ProblemSet2_solution.ipynb
--- a/2/ProblemSet2_solution.pdf
+++ b/2/ProblemSet2_solution.pdf