{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "43e49832-91e1-408b-98ec-9b6e86fefdce",
      "metadata": {
        "id": "43e49832-91e1-408b-98ec-9b6e86fefdce"
      },
      "source": [
        "## Computational - 35 points\n",
        "\n",
        "These problems have a coding solution. Code your solutions in Python in the space provided.\n",
        "\n",
        "Use of LLMs such as ChatGPT is not allowed for this computational task. Please do not use such applications; we check for this during grading."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c7c5994e-0d69-4b0a-b402-185eda5f4d44",
      "metadata": {
        "id": "c7c5994e-0d69-4b0a-b402-185eda5f4d44"
      },
      "source": [
        "### Problem 1 - Bingo [5 points]\n",
        "\n",
        "You are playing a game of Bingo with 5 balls numbered from 1 to 5. You take 20 balls out of a box, and the following table shows how many times each number was observed.\n",
        "\n",
        "**For background on the Multinomial distribution and its conjugate prior, the Dirichlet distribution, see https://mcrovella.github.io/DS122-Foundations-of-Data-Science-III/27-Conjugate-Priors.html#lions-and-tigers-and-bears**\n",
        "\n",
        "**Based on these observations, we'll determine the probability of drawing ball 3.**"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Import necessary libraries\n",
        "import pandas as pd  # Data manipulation and analysis\n",
        "import numpy as np  # Numerical operations\n",
        "\n",
        "from scipy.stats import dirichlet\n",
        "\n",
        "# Observed bingo ball frequencies from the 20 draws\n",
        "bingo_data = {\n",
        "    'numbers': [1, 2, 3, 4, 5],  # Bingo ball numbers\n",
        "    'frequency': [4, 5, 2, 6, 3]   # How often each number was drawn\n",
        "}\n",
        "\n",
        "# Create a pandas DataFrame from the data\n",
        "bingo = pd.DataFrame(bingo_data)\n",
        "\n",
        "# Display the DataFrame\n",
        "print(bingo)\n",
        "\n",
        "# Define the prior Dirichlet distribution. A uniform prior gives equal weight to each outcome.\n",
        "alpha_prior = np.array([1, 1, 1, 1, 1])\n",
        "\n",
        "# Update the parameter vector alpha by adding the observed frequencies to the prior.\n",
        "alpha_posterior = alpha_prior + bingo['frequency'].values\n",
        "\n",
        "# Define the posterior Dirichlet distribution, which captures our updated beliefs.\n",
        "posterior = dirichlet(alpha_posterior)\n",
        "\n",
        "# Predict the probability of the next ball being a 3 using the posterior mean.\n",
        "probability_of_3 = posterior.mean()[2]  # Index 2 corresponds to ball 3.\n",
        "\n",
        "# Print the result, formatted to four decimal places.\n",
        "print(f'Probability of ball 3: {probability_of_3:.4f}')\n",
        "\n",
        "# Visualize the posterior distribution for ball 3.\n",
        "import matplotlib.pyplot as plt  # Plotting library\n",
        "\n",
        "# Simulate draws from the posterior distribution. Increase the number of simulations for a smoother histogram.\n",
        "simulations = posterior.rvs(size=10000)\n",
        "plt.hist(simulations[:, 2], bins=50, density=True)  # Extract the simulation data for the third ball.\n",
        "plt.title('Simulated Posterior Distribution for Bingo Ball 3')\n",
        "plt.xlabel('Probability of Ball 3')\n",
        "plt.ylabel('Density')\n",
        "plt.axvline(probability_of_3, color='red', linestyle='dashed', linewidth=2)\n",
        "plt.show()\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 601
        },
        "id": "o830lXL_qxna",
        "outputId": "261fdf67-48d1-4337-b80c-ca0fa2b35cae"
      },
      "id": "o830lXL_qxna",
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "   numbers  frequency\n",
            "0        1          4\n",
            "1        2          5\n",
            "2        3          2\n",
            "3        4          6\n",
            "4        5          3\n",
            "Probability of ball 3: 0.1200\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 640x480 with 1 Axes>"
            ],
            "image/png": "\n"
          },
          "metadata": {}
        }
      ]
    },
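    {
      "cell_type": "markdown",
      "id": "p1-posterior-mean-hand-check",
      "metadata": {},
      "source": [
        "The posterior mean printed above can be checked by hand. The following is a worked sketch, assuming the uniform Dirichlet(1, 1, 1, 1, 1) prior from the code; the symbols $\\alpha_k$ (prior parameter for ball $k$) and $n_k$ (observed count for ball $k$) are notation introduced here for illustration.\n",
        "\n",
        "$$\\mathbb{E}[\\theta_3 \\mid \\text{data}] = \\frac{\\alpha_3 + n_3}{\\sum_{k=1}^{5} (\\alpha_k + n_k)} = \\frac{1 + 2}{5 + 20} = \\frac{3}{25} = 0.12$$\n",
        "\n",
        "This agrees with the printed value of 0.1200. The raw frequency alone would give 2/20 = 0.10; the uniform prior pulls the estimate slightly toward 1/5."
      ]
    },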
    {
      "cell_type": "markdown",
      "id": "1f1a961a-cccd-4a12-8087-6f19484eeb27",
      "metadata": {
        "id": "1f1a961a-cccd-4a12-8087-6f19484eeb27"
      },
      "source": [
        "### Problem 2 - New Beverage Taste Preference [10 points]\n",
        "\n",
        "A beverage company has developed a new flavor and claims that it will be preferred over the current leading brand by 70% of consumers. As a market analyst, you have been tasked with validating this claim. You organize a taste test with 100 participants and find that 65 of them express a preference for the new flavor over the leading brand.\n",
        "\n",
        "Prior to the taste test, you had some reservations about the claim due to the leading brand's strong market presence. Therefore, you choose a beta distribution with parameters \\(\\alpha = 50\\) and \\(\\beta = 50\\) as the prior for the probability that a consumer prefers the new flavor. The balanced parameters center the prior at 0.5, a neutral stance between the two brands.\n",
        "\n",
        "**2.1 Define a beta distribution as your prior using the given alpha and beta values. Plot the prior distribution. [2 points]**"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Import necessary libraries\n",
        "import numpy as np  # This library is essential for handling arrays and performing numerical computations.\n",
        "import matplotlib.pyplot as plt  # This library is used for creating static, interactive, and animated visualizations in Python.\n",
        "from scipy.stats import beta  # Import the beta distribution for Bayesian analysis.\n",
        "\n",
        "# Define the prior beta distribution parameters. These parameters represent our prior belief about the probability.\n",
        "alpha_prior = 50  # Represents the number of \"successes\" in our prior knowledge given in the question.\n",
        "beta_prior = 50  # Represents the number of \"failures\" in our prior knowledge given in the question.\n",
        "prior_dist = beta(alpha_prior, beta_prior)\n",
        "\n",
        "# Plotting the prior distribution. The plot represents the density of our prior belief across all possible probabilities.\n",
        "x = np.linspace(0, 1, 1000)  # Generate an array of probability values between 0 and 1.\n",
        "plt.plot(x, prior_dist.pdf(x), label='Prior Distribution')  # Plot the probability density function of the prior.\n",
        "plt.title('Prior Beta Distribution with alpha=50 and beta=50')  # Fill in with the corresponding alpha and beta values.\n",
        "plt.xlabel('Preference Probability')  # Label for the x-axis indicating what it represents.\n",
        "plt.ylabel('Density')  # Label for the y-axis indicating the density of the prior distribution.\n",
        "plt.legend()  # Display a legend to identify the plotted line.\n",
        "plt.show()  # This command will display the plot.\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 472
        },
        "id": "mVBEvEzPs7gQ",
        "outputId": "27ec4809-b9b5-49ce-d744-1f273a253e15"
      },
      "id": "mVBEvEzPs7gQ",
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 640x480 with 1 Axes>"
            ],
            "image/png": "\n"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ab74b592-227e-4f6c-88ab-427310a4fcb7",
      "metadata": {
        "id": "ab74b592-227e-4f6c-88ab-427310a4fcb7"
      },
      "source": [
        "**2.2 Using the data from the taste test, update your prior to obtain the posterior distribution. Calculate the posterior mean, and plot both the prior and posterior distributions on the same graph for comparison. [5 points]**\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "4a13210d-90ad-407d-acfd-f1e71177032f",
      "metadata": {
        "id": "4a13210d-90ad-407d-acfd-f1e71177032f",
        "outputId": "129712c7-7e93-4676-c7cc-89be444a0ea6",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 490
        }
      },
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 640x480 with 1 Axes>"
            ],
            "image/png": "\n"
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The posterior mean of the new flavor's preference rate is: 0.5750\n"
          ]
        }
      ],
      "source": [
        "# Define the observed data from the taste test\n",
        "preference_count = 65\n",
        "total_participants = 100\n",
        "non_preference_count = total_participants - preference_count  # Participants who did not prefer the new flavor\n",
        "\n",
        "# Update the prior with the observed data to form the posterior distribution\n",
        "alpha_posterior = alpha_prior + preference_count  # Posterior alpha: prior alpha plus observed preferences\n",
        "beta_posterior = beta_prior + non_preference_count  # Posterior beta: prior beta plus observed non-preferences\n",
        "posterior_dist = beta(alpha_posterior, beta_posterior)\n",
        "\n",
        "# Calculate the posterior mean\n",
        "posterior_mean = posterior_dist.mean()\n",
        "\n",
        "# Plot the prior and posterior distributions\n",
        "plt.plot(x, prior_dist.pdf(x), label='Prior Distribution')\n",
        "plt.plot(x, posterior_dist.pdf(x), label='Posterior Distribution', color='green')\n",
        "plt.title('Prior vs Posterior Beta Distribution')\n",
        "plt.xlabel('Preference Probability')\n",
        "plt.ylabel('Density')\n",
        "plt.legend()\n",
        "plt.show()\n",
        "\n",
        "# Print the posterior mean\n",
        "print(f\"The posterior mean of the new flavor's preference rate is: {posterior_mean:.4f}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f95f4f4b-e62a-428f-bcb1-96fa604e1d45",
      "metadata": {
        "id": "f95f4f4b-e62a-428f-bcb1-96fa604e1d45"
      },
      "source": [
        "\n",
        "In this problem, the prior distribution reflects a neutral stance on the new flavor's preference rate. The posterior distribution, updated with the taste-test data, combines that initial skepticism with the observed preferences. The posterior mean serves as a Bayesian point estimate of the true preference rate, which can then be used to inform the company's marketing strategy."
      ]
    },
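    {
      "cell_type": "markdown",
      "id": "p2-posterior-mean-hand-check",
      "metadata": {},
      "source": [
        "The posterior mean can be verified by hand from the conjugate update. This is a worked sketch using the values given in the problem; the symbols $k$ (participants preferring the new flavor) and $n$ (total participants) are notation introduced here for illustration.\n",
        "\n",
        "$$\\alpha_{\\text{post}} = \\alpha + k = 50 + 65 = 115, \\qquad \\beta_{\\text{post}} = \\beta + (n - k) = 50 + 35 = 85$$\n",
        "\n",
        "$$\\mathbb{E}[p \\mid \\text{data}] = \\frac{\\alpha_{\\text{post}}}{\\alpha_{\\text{post}} + \\beta_{\\text{post}}} = \\frac{115}{200} = 0.575$$\n",
        "\n",
        "Because the prior carries $\\alpha + \\beta = 100$ pseudo-observations and the taste test has 100 participants, the two sources get equal weight, and the posterior mean sits midway between the prior mean of 0.5 and the observed rate of 0.65."
      ]
    },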
    {
      "cell_type": "markdown",
      "id": "263cebbc-7acb-4786-b9b3-b4dc216d76f5",
      "metadata": {
        "id": "263cebbc-7acb-4786-b9b3-b4dc216d76f5"
      },
      "source": [
        "### Problem 3 - Evaluating Rookie Performance [20 points]\n",
        "\n",
        "As a data analyst for Boston University, you've been tasked with evaluating the performance of rookie players. The team is interested in determining which rookies show the most promise based on their batting averages in the minor leagues. You have access to the Baseball data, which includes statistics for rookies.\n",
        "\n",
        "Consider the dataframe `rookie_df` defined below, which contains the number of hits and the total number of at-bats for several rookie players.\n",
        "\n",
        "Adapted from David Robinson"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a0c7a4c6-0d99-4e7b-8da4-14e2124af012",
      "metadata": {
        "id": "a0c7a4c6-0d99-4e7b-8da4-14e2124af012"
      },
      "source": [
        "#### Loading data - 2 Points"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "ac05ba89-58b1-4d7d-9215-21e14b6d91dc",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ac05ba89-58b1-4d7d-9215-21e14b6d91dc",
        "outputId": "22a3fd2b-d211-4a01-a14e-e3413b74cf43"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "   success  total\n",
            "0        5     25\n",
            "1       23     75\n",
            "2        2     10\n",
            "3       10     40\n",
            "4       60    150\n"
          ]
        }
      ],
      "source": [
        "# Import necessary libraries\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "from scipy.stats import beta\n",
        "import numpy as np\n",
        "\n",
        "# Creating the dataframe for the baseball stats\n",
        "rookie_data = {'success': [5, 23, 2, 10, 60],\n",
        "               'total': [25, 75, 10, 40, 150]}\n",
        "\n",
        "rookie_df = pd.DataFrame(rookie_data) # Create a dataframe from the data\n",
        "print(rookie_df)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5481c6ca-3ec1-4874-8e80-c07b1d7d66fe",
      "metadata": {
        "id": "5481c6ca-3ec1-4874-8e80-c07b1d7d66fe"
      },
      "source": [
        "**3.1 Bad approach - Calculate the raw batting average for each player and note the shortcomings of this approach. [3 points]**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "fa4b11fa-42aa-446e-b5a9-b791d174b413",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fa4b11fa-42aa-446e-b5a9-b791d174b413",
        "outputId": "e7aa63ab-4615-44d3-b7d7-39ffd498eaad"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "   success  total  batting_average\n",
            "0        5     25         0.200000\n",
            "1       23     75         0.306667\n",
            "2        2     10         0.200000\n",
            "3       10     40         0.250000\n",
            "4       60    150         0.400000\n"
          ]
        }
      ],
      "source": [
        "# Calculating the raw batting average\n",
        "rookie_df['batting_average'] = rookie_df['success'] / rookie_df['total'] # Based on the success and the total\n",
        "print(rookie_df)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "18086f0c-0847-46fc-8cb7-8b89200ea624",
      "metadata": {
        "id": "18086f0c-0847-46fc-8cb7-8b89200ea624"
      },
      "source": [
        "**3.2 Apply empirical Bayes estimation to improve the estimate of each rookie player's batting average. [10 points]**\n",
        "\n",
        "**Loading Batting.csv and Pitching.csv datasets - 2 Points**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "7545b9b2-4905-4a7b-bca4-489e8e4f7c9f",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 170
        },
        "id": "7545b9b2-4905-4a7b-bca4-489e8e4f7c9f",
        "outputId": "7d72eada-8ba6-469d-b51d-f91600df6116"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Saving Pitching.csv to Pitching.csv\n",
            "Saving Batting.csv to Batting.csv\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "<ipython-input-6-863f93afae1e>:13: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n",
            "  batting_df = batting_df.groupby('playerID').sum().reset_index()\n"
          ]
        }
      ],
      "source": [
        "# Google Colab file uploads\n",
        "from google.colab import files\n",
        "uploaded = files.upload()  # Prompt for file upload and store the uploaded file references\n",
        "\n",
        "# Load Batting.csv dataset\n",
        "batting_df = pd.read_csv('Batting.csv')  # Read the Batting CSV into a DataFrame\n",
        "\n",
        "# Load Pitching.csv dataset\n",
        "pitching_df = pd.read_csv('Pitching.csv')  # Read the Pitching CSV into a DataFrame\n",
        "\n",
        "# Preprocess: keep rows with at least one at-bat, then aggregate each player's career totals\n",
        "batting_df = batting_df[batting_df['AB'] > 0]\n",
        "batting_df = batting_df.groupby('playerID').sum(numeric_only=True).reset_index()  # Sum numeric columns only\n",
        "batting_df['batting_avg'] = batting_df['H'] / batting_df['AB']  # Career batting average\n",
        "\n",
        "# Collect the IDs of every player who has pitched\n",
        "pitchers_list = pitching_df['playerID'].unique().tolist()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6d2c343e-2ea9-49df-86a9-1c66e497d54e",
      "metadata": {
        "id": "6d2c343e-2ea9-49df-86a9-1c66e497d54e"
      },
      "source": [
        "**3.2.1 Remove outliers by filtering out all players with fewer than 500 career at-bats from the batting dataset. [2 points]**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "cb008de8-23b8-4dce-9480-4a17e6aa3bb1",
      "metadata": {
        "id": "cb008de8-23b8-4dce-9480-4a17e6aa3bb1"
      },
      "outputs": [],
      "source": [
        "# Filter out players with fewer than 500 career at-bats\n",
        "filtered_batting_df = batting_df[batting_df['AB'] >= 500]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a4e1a643-f10e-47cd-8865-d4be0b089a3c",
      "metadata": {
        "id": "a4e1a643-f10e-47cd-8865-d4be0b089a3c"
      },
      "source": [
        "**3.2.2 Create a dataframe called `non_pitchers_df` that excludes players who have pitched. [3 points]**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "id": "90f56dd8-aca5-419e-9d0a-52ade89d74f3",
      "metadata": {
        "id": "90f56dd8-aca5-419e-9d0a-52ade89d74f3"
      },
      "outputs": [],
      "source": [
        "# Exclude players who have also pitched\n",
        "non_pitchers_df = filtered_batting_df[~filtered_batting_df['playerID'].isin(pitchers_list)]  # Keep rows whose playerID is not in pitchers_list"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "48355e48-36d9-4d5c-84a9-a9dc66f7c1c2",
      "metadata": {
        "id": "48355e48-36d9-4d5c-84a9-a9dc66f7c1c2"
      },
      "source": [
        "**3.2.3 Estimate the beta prior using `scipy.stats.beta.fit` and the `non_pitchers_df` data. [3 points]**"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Estimate the beta prior by maximum likelihood, fixing loc=0 and scale=1 so only the two shape parameters are fit\n",
        "alpha_prior, beta_prior, _, _ = beta.fit(non_pitchers_df['batting_avg'], floc=0, fscale=1)\n",
        "print(f'Alpha prior: {alpha_prior}, Beta prior: {beta_prior}')\n",
        "\n",
        "# Calculate the empirical Bayes estimate for the rookies\n",
        "rookie_df['eb_estimate'] = (rookie_df['success'] + alpha_prior) / (rookie_df['total'] + alpha_prior + beta_prior)  # Posterior mean: (hits + alpha) / (at-bats + alpha + beta)\n",
        "print(rookie_df[['success', 'total', 'batting_average', 'eb_estimate']])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lq7nfWCaoFuW",
        "outputId": "8b97abd3-7b44-4ad8-807c-e20bd6cead84"
      },
      "id": "lq7nfWCaoFuW",
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Alpha prior: 79.43156967970783, Beta prior: 228.0037390687307\n",
            "   success  total  batting_average  eb_estimate\n",
            "0        5     25         0.200000     0.253979\n",
            "1       23     75         0.306667     0.267840\n",
            "2        2     10         0.200000     0.256530\n",
            "3       10     40         0.250000     0.257405\n",
            "4       60    150         0.400000     0.304812\n"
          ]
        }
      ]
    },
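    {
      "cell_type": "markdown",
      "id": "p3-eb-shrinkage-hand-check",
      "metadata": {},
      "source": [
        "The empirical Bayes estimates above can be read as a weighted average of each rookie's raw average and the prior mean. This is a worked sketch using the fitted values printed above, rounded to two decimals; the symbols $H_i$ (hits), $N_i$ (at-bats), $\\alpha_0$ and $\\beta_0$ (fitted prior parameters), and $w_i$ (weight on the raw average) are notation introduced here for illustration.\n",
        "\n",
        "$$\\hat{p}_i = \\frac{H_i + \\alpha_0}{N_i + \\alpha_0 + \\beta_0} = w_i \\, \\frac{H_i}{N_i} + (1 - w_i) \\, \\frac{\\alpha_0}{\\alpha_0 + \\beta_0}, \\qquad w_i = \\frac{N_i}{N_i + \\alpha_0 + \\beta_0}$$\n",
        "\n",
        "With $\\alpha_0 \\approx 79.43$ and $\\beta_0 \\approx 228.00$, the prior mean is about 0.258. For the first rookie (5 hits in 25 at-bats):\n",
        "\n",
        "$$\\hat{p}_0 \\approx \\frac{5 + 79.43}{25 + 79.43 + 228.00} = \\frac{84.43}{332.43} \\approx 0.254$$\n",
        "\n",
        "The weight on the raw average is only $w_0 \\approx 25 / 332.43 \\approx 0.075$, so the estimate stays close to the prior mean. The last rookie, with 150 at-bats, has $w_4 \\approx 0.328$ and moves further toward the raw average of 0.400, giving about 0.305."
      ]
    },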
    {
      "cell_type": "markdown",
      "id": "79a74f4a-f4ae-4177-841d-ab8d349f1b2a",
      "metadata": {
        "id": "79a74f4a-f4ae-4177-841d-ab8d349f1b2a"
      },
      "source": [
        "**Visualizing the Results [5 Points]**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "id": "e10315be-effe-4a78-8989-68d001ce3091",
      "metadata": {
        "id": "e10315be-effe-4a78-8989-68d001ce3091",
        "outputId": "98165840-5212-4923-d396-3850230f98b8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 725
        }
      },
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 1000x800 with 2 Axes>"
            ],
            "image/png": "\n"
          },
          "metadata": {}
        }
      ],
      "source": [
        "# Plot the empirical Bayes estimates\n",
        "# Hints\n",
        "## You will need to make a scatter plot of the batting average vs eb estimate\n",
        "## Make sure to include correct labels for the x and y axis\n",
        "## You will need to make a diagonal line for reference\n",
        "## Make sure to limit the graph to 0,1 for the axis as shown in the example output\n",
        "## Finally show your output\n",
        "\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "\n",
        "# Seaborn styling and figure setup\n",
        "sns.set_style(\"whitegrid\")\n",
        "plt.figure(figsize=(10, 8))\n",
        "scatter = plt.scatter(rookie_df['batting_average'], rookie_df['eb_estimate'],\n",
        "                      alpha=0.7, c=rookie_df['total'], cmap='viridis', edgecolor='k')\n",
        "\n",
        "# Labels and title\n",
        "plt.xlabel('Raw Batting Average', fontsize=14)\n",
        "plt.ylabel('Empirical Bayes Estimate', fontsize=14)\n",
        "plt.title('Comparison of Raw Batting Average and Empirical Bayes Estimate', fontsize=16)\n",
        "\n",
        "# Diagonal line\n",
        "plt.plot([0, 1], [0, 1], 'k--', alpha=0.75)\n",
        "\n",
        "# Limit the axes to 0 and 1\n",
        "plt.xlim(0, 1)\n",
        "plt.ylim(0, 1)\n",
        "\n",
        "# Add a color bar\n",
        "plt.colorbar(scatter, label='Total At-Bats')\n",
        "\n",
        "# Show the plot\n",
        "plt.show()\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.13"
    },
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "accelerator": "GPU"
  },
  "nbformat": 4,
  "nbformat_minor": 5
}