{ "cells": [ { "cell_type": "markdown", "id": "2a5ff1be-2da3-4ce2-9a6c-9f171f1ff6c1", "metadata": {}, "source": [ "# DS 122 Homework 1\n", "\n", "### You can download the Python notebook for this PDF from https://drive.google.com/file/d/1T7EBmSNzn2JP99nqHMduJH-wiIEPXTtG/view?usp=sharing\n", "\n", "**Due Sep. 20th**\n", "\n", "**Full credit is 87 points (with Bonus Question: 92 points)**\n", "\n", "**Name:**\n", "\n", "**BUID:**\n", "\n", "Most homeworks will involve “analytical” questions, and many will involve “computational” questions. This homework involves both analytical and computational questions.\n", "\n", "**NOTE**\n", "\n", "- It is advised not to use ChatGPT or any other LLM to complete the homeworks; try the questions on your own unless otherwise stated in the question.\n", "\n", "- Try to answer the questions in detail. If you do not get the correct answer, we will take your steps (the process you follow to solve the question) into consideration, which will help you earn partial points.\n", "\n", "- Unless otherwise mentioned, leaving the answer in fractional format is completely alright. Do not worry about the accuracy of decimal answers.\n", "\n", "- Coding questions might seem a little daunting at first, but if you go through them you will notice that a lot of the answers are directly available in the notebook. They are meant more for your understanding than for testing. If you are unable to find a solution at first, please try reading the documentation and your class notes. We are always available during our office hours in case you have doubts regarding a topic.\n", "\n", "- To make things clear to the grader, you MUST draw a box around your answer. Questions whose answers are not boxed will lose points. To put a box around your answer in LaTeX, use \\fbox{} or \\boxed{}.\n", "\n", "**SUBMISSION GUIDELINES**\n", "\n", "- You are free to write your solutions to math problems on paper and upload scanned copies as PDF. 
If you wish to type your solutions, I would suggest using LaTeX to write mathematical equations; you can use https://www.overleaf.com/ to create free LaTeX documents.\n", "\n", "- For coding questions, please edit the Jupyter notebook itself in the space provided to input your answer. You can choose to create a new cell to enter your code so as not to lose the sample output. \n", "\n", "- Final submissions should contain both your code (Jupyter Notebook) and your mathematical files (scanned or typed PDF). You can select more than one file while uploading during submission. Please try to use the following naming convention for your submissions **{FirstName}\\_\\{LastName}\\_\\{BUID}\\_\\{analytical/computational}.zip**\n", "\n", "- Steps to Submit\n", "    - Write your answers to the mathematical questions on paper or in a LaTeX editor\n", "    - If you wrote them on paper, scan them as a PDF; otherwise, save the PDF from the LaTeX editor\n", "    - Download the Python notebook and complete the coding questions\n", "    - Copy the content from the question cell to a new cell and write your answer\n", "    - Press Shift + Enter to run the cell and check that the output matches the sample output\n", "    - Once you have completed all the coding questions, save the notebook with the prescribed name\n", "    - Submit both files on Gradescope (PDF and Jupyter Notebook)\n", "\n" ] }, { "cell_type": "markdown", "id": "e1af25b2-fa7b-4b1c-9ec2-dd56053dc4ff", "metadata": {}, "source": [ "## Analytical" ] }, { "cell_type": "markdown", "id": "3115967f-032b-446d-a293-97be8d40e862", "metadata": {}, "source": [ "**Problem A - 5 points**\n", "\n", "Consider a wooden cube with painted faces that is sawed up into 27 smaller equal-sized cubes. 
If one of these small cubes is chosen at random, what is the probability that it has exactly 3 painted faces?\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "a7433e86-0b4a-4007-a91a-d6f078204601", "metadata": {}, "source": [ "**Problem B - 5 points**\n", "\n", "In a penalty shootout, two footballers, Player X and Player Y, are known for their precision. Player X has an 85% probability of scoring a goal, while Player Y has an 80% probability. If both players take a shot one after the other, what is the probability that at least one of them scores? Assume the two shots are independent.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "71a93a8c-e82e-4ffa-937b-140d00de7761", "metadata": {}, "source": [ "**Problem C - 5 points**\n", "\n", "In a carnival game, players throw balls at a wall with five differently shaped targets. (Note that a throw will either hit one of the shapes or miss completely; there is no other possibility.) The probabilities of **not hitting** each of these shapes are:\n", "\n", "- Star: 0.68\n", "- Triangle: 0.95\n", "- Square: 0.90\n", "- Circle: 0.80\n", "- Pentagon: 0.81\n", "\n", "Given that the **outcomes for each shape are independent**, what is the probability that a player's shot does not hit any of the shapes?\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "47f557b0-f187-4847-8733-f48c37d30145", "metadata": {}, "source": [ "**Problem D - 30 points (35 with Bonus)**\n", "\n", "**Premise:**\n", "\n", "You and your friend are sitting in the CDS building waiting for your next class. 
Your friend is bored and suggests playing a game where you roll two six-sided dice. You win if the sum of the two dice equals 7.\n", "\n", "**Part a - 5 points** \n", "\n", "If you decide to play the game just once, what is the probability that you win?\n", "\n", "**Part b - 5 points**\n", "\n", "If you play the game three times in a row and don't remember your previous rolls, what is the probability you win at least once?\n", "\n", "**Part c - 5 points**\n", "\n", "If you play 4 games, what is the probability you win exactly twice, given the probability of winning a single game as found in Part a?\n", "\n", "**Part d - 5 points**\n", "\n", "What is the expected number of games you will play before you win?\n", "\n", "**Part e - 10 points**\n", "\n", "Your friend now proposes a twist. If either die shows a 1, you automatically lose, regardless of the sum. Calculate the probability of winning in this new scenario for a single game.\n", "\n", "**Part f - 5 points (Bonus Question - Optional)**\n", "\n", "You and your friend decide to further tweak the rules. Now, if the first die shows a 1, you lose, but if the second die shows a 6, you automatically win, regardless of the sum. Find the probability that you win the game in this new scenario.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "65b85b38-bda7-4876-b871-7dcceaf34b30", "metadata": {}, "source": [ "**Problem E - 10 points**\n", "\n", "Imagine you're Batman and the overall rate of individuals having committed a crime (let's denote this as event A) in Gotham is P(A).\n", "\n", "There's a certain marker, a unique tattoo (let's denote this as event B), which has been found common among many criminals. 
You observed that among the individuals with this tattoo, the probability of them having committed a crime is P(A | B), and this is higher than P(A).\n", "\n", "P(A) = 0.4\n", "\n", "P(B) = 0.3\n", "\n", "P(A | B) = 0.5\n", "\n", "Using the given probabilities and the definition of conditional probability, determine the exact value of P(B | A). Compare it to the value of P(B): is P(B | A) greater than, less than, or equal to P(B)?\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "8b612a28-f15f-48ff-9ceb-92c5eefe76dd", "metadata": {}, "source": [ "**Problem G - 5 points**\n", "\n", "Consider a radioactive source emitting alpha particles at\n", "an average rate of 4 particles per second. What is the probability that in a particular one-second interval, fewer than two particles are emitted?" ] }, { "cell_type": "markdown", "id": "dcb453bb-c759-4f1f-9469-ecebaddecd5a", "metadata": {}, "source": [ "**Problem H - 5 points**\n", "\n", "Consider the following joint probability table for two random variables, $X$ and $Y$:\n", "\n", "| | Y = 1 | Y = 2 |\n", "|---------|-------|-------|\n", "| X = 1 | 0.2 | 0.3 |\n", "| X = 2 | 0.1 | 0.4 |\n", "\n", "What is the marginal probability $P(Y = 1)$?" ] }, { "cell_type": "markdown", "id": "f039bb78-5200-4bce-bdd3-f04f68fe344e", "metadata": {}, "source": [ "## Computational\n", "\n", "- Add your answers in the same cell as the code, or add another cell by copying and pasting the existing cell\n", "- Outputs from the answer key have been left as they are for your reference. My personal suggestion would be to copy the code into a new cell and make sure that your output matches the sample output. " ] }, { "cell_type": "markdown", "id": "64b58f42-cc0e-4211-bf19-e23c9192b782", "metadata": {}, "source": [ "**Problem I - 0 points**\n", "\n", "Install Python 3 and the NumPy package on the computer you will use for this\n", " course. 
Read the document “Getting\n", "Started with Python” on Piazza.\n", "https://numpy.org/install/" ] }, { "cell_type": "markdown", "id": "6f910e5f-7d3e-4df7-8242-14c867597abb", "metadata": {}, "source": [ "**Problem J - 2 points**\n", "\n", "Verify that NumPy is installed correctly: \n", "\n", "Execute \n", "\n", "$$\\texttt{import numpy as np;}$$\n", "$$\\texttt{A = np.array([1, 2, 3]);}$$\n", "$$\\texttt{print(np.sum(A))}$$\n", "\n", "Copy and paste the input and output from your Python interpreter." ] }, { "cell_type": "markdown", "id": "40ea5244-9828-4e4a-8a7f-fe20de9d2829", "metadata": {}, "source": [ "**Problem K - 10 points**\n", "\n", "Now we will use some of the skills we have learned to examine the attributes of professional baseball players. Read in ’hwk01 mlb.csv’ (https://drive.google.com/file/d/1AioLAfFQF7cXig7MdlBZ_bnAbkjpwCcP/view?usp=sharing) and fill in the code to compute a CDF of the players' ages using no built-in functions of NumPy other than sort. Plot the resulting CDF.\n", "\n", "Complete the blanks wherever they are marked as TODO" ] }, { "cell_type": "code", "execution_count": null, "id": "ba588294-30a8-4af3-96ee-b964d59dcad2", "metadata": {}, "outputs": [], "source": [ "#importing libraries\n", "import _____ as pd #TODO Import pandas with the alias pd\n", "import numpy as np #needed for the np.sort calls in the cells below\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "id": "f07768af-897a-470b-90cf-c7f245c83fdf", "metadata": {}, "outputs": [], "source": [ "#reading the dataset\n", "df = pd.read_csv('_________') #TODO, read the dataset by completing the function read_csv" ] }, { "cell_type": "code", "execution_count": null, "id": "5e63b225-5e3e-4f49-a555-4f7acb800211", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", 
"</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>Team</th>\n", " <th>Position</th>\n", " <th>Height(inches)</th>\n", " <th>Weight(pounds)</th>\n", " <th>Age</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Adam_Donachie</td>\n", " <td>BAL</td>\n", " <td>Catcher</td>\n", " <td>74</td>\n", " <td>180.0</td>\n", " <td>22.99</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Paul_Bako</td>\n", " <td>BAL</td>\n", " <td>Catcher</td>\n", " <td>74</td>\n", " <td>215.0</td>\n", " <td>34.69</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Ramon_Hernandez</td>\n", " <td>BAL</td>\n", " <td>Catcher</td>\n", " <td>72</td>\n", " <td>210.0</td>\n", " <td>30.78</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Kevin_Millar</td>\n", " <td>BAL</td>\n", " <td>First_Baseman</td>\n", " <td>72</td>\n", " <td>210.0</td>\n", " <td>35.43</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Chris_Gomez</td>\n", " <td>BAL</td>\n", " <td>First_Baseman</td>\n", " <td>73</td>\n", " <td>188.0</td>\n", " <td>35.71</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Brian_Roberts</td>\n", " <td>BAL</td>\n", " <td>Second_Baseman</td>\n", " <td>69</td>\n", " <td>176.0</td>\n", " <td>29.39</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>Miguel_Tejada</td>\n", " <td>BAL</td>\n", " <td>Shortstop</td>\n", " <td>69</td>\n", " <td>209.0</td>\n", " <td>30.77</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>Melvin_Mora</td>\n", " <td>BAL</td>\n", " <td>Third_Baseman</td>\n", " <td>71</td>\n", " <td>200.0</td>\n", " <td>35.07</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>Aubrey_Huff</td>\n", " <td>BAL</td>\n", " <td>Third_Baseman</td>\n", " <td>76</td>\n", " <td>231.0</td>\n", " <td>30.19</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>Adam_Stern</td>\n", " <td>BAL</td>\n", " <td>Outfielder</td>\n", " <td>71</td>\n", " 
<td>180.0</td>\n", " <td>27.05</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name Team Position Height(inches) Weight(pounds) Age\n", "0 Adam_Donachie BAL Catcher 74 180.0 22.99\n", "1 Paul_Bako BAL Catcher 74 215.0 34.69\n", "2 Ramon_Hernandez BAL Catcher 72 210.0 30.78\n", "3 Kevin_Millar BAL First_Baseman 72 210.0 35.43\n", "4 Chris_Gomez BAL First_Baseman 73 188.0 35.71\n", "5 Brian_Roberts BAL Second_Baseman 69 176.0 29.39\n", "6 Miguel_Tejada BAL Shortstop 69 209.0 30.77\n", "7 Melvin_Mora BAL Third_Baseman 71 200.0 35.07\n", "8 Aubrey_Huff BAL Third_Baseman 76 231.0 30.19\n", "9 Adam_Stern BAL Outfielder 71 180.0 27.05" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Show the top 10 elements of the dataframe using the head function\n", "df.________ #TODO, Show the top 10 elements" ] }, { "cell_type": "code", "execution_count": 16, "id": "0d45bf3a-4aba-407c-8822-0667551f37d6", "metadata": {}, "outputs": [], "source": [ "#taking out the column 'Age' and converting it to a list\n", "#this is advisable for cleaner code\n", "#Convert the values from column Age to a list\n", "ages = df['Age'].to_list() #TODO, Store the values of Ages from the dataset in the variable ages in the form of a list\n", "\n", "#sort the list of ages\n", "sorted_ages = np.sort(ages) #TODO, Use np.sort to sort the list of ages" ] }, { "cell_type": "code", "execution_count": 17, "id": "11a2f325-3316-4cb7-9903-9cf5c49196b9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[22.99, 34.69, 30.78, 35.43, 35.71, 29.39, 30.77, 35.07, 30.19]\n", "[20.9 21.46 21.52 21.58 21.78 21.85 21.9 22.02 22.06]\n" ] } ], "source": [ "#printing the first 9 elements of the arrays (the slice 0:9 excludes index 9)\n", "#this step is not required; arrays printed for better understanding\n", "print(ages[0:9])\n", "print(sorted_ages[0:9])" ] }, { "cell_type": "code", "execution_count": null, "id": 
"6c448661-f3e9-40c3-9c61-292adbb0d071", "metadata": {}, "outputs": [], "source": [ "#defining a function for calculating the CDF\n", "#Complete the below code to calculate the CDF\n", "def calc_cdf(sorted_data):\n", "    #calculating proportional values\n", "    p = 1. * np.________(1,len(sorted_data)+1) / (len(sorted_data)) #TODO, calculate the proportional values from the data and return them\n", "    return ______" ] }, { "cell_type": "code", "execution_count": 19, "id": "08b4742b-a1d3-41bc-956c-6f05998a58f8", "metadata": {}, "outputs": [], "source": [ "#calling the function to calculate cdf of ages\n", "cdf_ages = calc_cdf(_________) #TODO, calculate cdf of ages" ] }, { "cell_type": "code", "execution_count": null, "id": "ea3bb4f3-3d9b-4c16-a52e-94378dd1b2c1", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#plotting the CDF\n", "#Complete the below code to plot the CDF\n", "plt._______(sorted_ages, cdf_ages) #TODO, Plot the cdf of ages\n", "plt.xlabel('Data')\n", "plt.ylabel('CDF')\n", "#Complete the below code to show the plot\n", "plt.__________ #TODO, Show the plot" ] }, { "cell_type": "code", "execution_count": 21, "id": "54ab587d-d0f3-4714-801a-873fc56381a0", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#NOTE - a neat trick to cross check your answer is to use a built-in function\n", "#each sorted point carries weight 1/n, so the running sum of those weights is the empirical CDF\n", "cdf_check_ages = np.cumsum(np.ones(len(sorted_ages))) / len(sorted_ages)\n", "#plotting the CDF\n", "plt.plot(sorted_ages, cdf_check_ages)\n", "plt.xlabel('Data')\n", "plt.ylabel('CDF')\n", "plt.show()\n", "#The above is only for your reference and need not be completed; your own user-defined function should produce the same curve" ] }, { "cell_type": "markdown", "id": "394dcbd3-83e5-4ae5-bb0c-eb6276825173", "metadata": {}, "source": [ 
"**Problem L - 10 points**\n", "\n", "Next we will consider the players' heights and weights, which taken together are bivariate data. First, plot a scatterplot of the two variables against each other. Second, plot the CDFs of the marginals." ] }, { "cell_type": "code", "execution_count": null, "id": "54015d43-014f-4663-bd1b-bb11705555e4", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#plotting scatterplot of heights and weights\n", "#Use the scatter function in plt to plot the scatter plot of Height and Weight\n", "plt.scatter(_________, _________) #TODO, complete the scatter function with the height and weight values from the dataset\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 24, "id": "4df17ca2-42b3-4f75-9828-da9e2dd0bcb2", "metadata": {}, "outputs": [], "source": [ "#Below we calculate the cdf_height and cdf_weight, enter the function name to complete the code\n", "cdf_height = calc_cdf(np.____(df['Height(inches)'])) #TODO, find the cdf of height\n", "cdf_weight = calc_cdf(np.sort(df[________________])) #TODO, find the cdf of weight" ] }, { "cell_type": "code", "execution_count": 25, "id": "a26216fd-0fdf-4032-a098-f271dd47db55", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#plotting the CDF\n", "sorted_height = np.sort(df['Height(inches)'])\n", "plt.____(___________, cdf_height) #TODO, plot the cdf\n", "plt.xlabel('Data')\n", "plt.ylabel('CDF')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "de73d57c-dbfd-4c38-bacb-18108edd7f35", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#plotting the CDF\n", "sorted_weight = 
np.____(df['Weight(pounds)']) #TODO, Store the sorted weights\n", "plt.plot(sorted_weight, ________) #TODO, Plot the CDF of weight\n", "plt.xlabel('Data')\n", "plt.ylabel('CDF')\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }