{ "cells": [ { "cell_type": "markdown", "id": "b6a8c8c9-9b86-487a-a1e3-ebc08f153714", "metadata": {}, "source": [ "## Computational\n", "\n", "- Add your answers in the same cell as the code or add another cell by copy pasting the existing cell\n", "- Outputs from the answer key have been left as they are for your reference. My personal suggestion would be to create a new cell with the same code copied and make sure that the output coming is the same. " ] }, { "cell_type": "markdown", "id": "751b7158-0c9d-4d31-b455-b5fb17f2aa26", "metadata": { "tags": [] }, "source": [ "### Bonferroni correction\n", "\n", "**Problem D - 10 points**\n", "\n", "The Bonferroni correction is a method used to control the familywise error rate when conducting multiple statistical tests. The familywise error rate (often denoted by α FW) is the probability of making one or more Type I errors (false positives) when conducting multiple tests.\n", "\n", "Advantages:\n", "\n", "- Simple and easy to understand.\n", "- Conservative, reducing the likelihood of Type I errors.\n", "\n", "Disadvantages:\n", "- Because it's conservative, it increases the risk of Type II errors (false negatives). This means you might miss genuine effects.\n", "- It assumes all tests are independent, which might not be the case.\n", "\n", "Read more about it: https://mathworld.wolfram.com/BonferroniCorrection.html\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "6a0d1011-92ca-435d-95e8-88bdc37ac9d9", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import ttest_ind\n", "\n", "# Generating synthetic data: 1000 data points for each group\n", "np.random.seed(0) # for reproducibility\n", "group_a = np.random.normal(loc=0.2, scale=0.1, size=1000) # Old layout, mean CTR = 0.2\n", "group_b = np.random.normal(loc=0.22, scale=0.1, size=1000) # New layout, mean CTR = 0.22\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "175a91d9-8c63-41ef-9031-aa21ad2e2c3e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "T-statistic: -5.9180099055395115\n", "P-value: 3.825134693693202e-09\n", "The difference in CTRs is statistically significant. Consider implementing the new layout.\n" ] } ], "source": [ "# Performing a two-sample t-test\n", "t_stat, p_value = ttest_ind(_____, group_b)\n", "\n", "print(\"T-statistic:\", t_stat)\n", "print(\"P-value:\", p_value)\n", "\n", "# Check if p-value is less than 0.05 for statistical significance\n", "if p_value < _____:\n", " print(\"The difference in CTRs is statistically significant. Consider implementing the new layout.\")\n", "else:\n", " print(\"The difference in CTRs is not statistically significant.\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "ac9e6ddb-c514-434d-bf0c-2e69a87b897b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Adjusted p-values with Bonferroni correction: [1.1475404081079607e-08, 0.008194610874169124, 0.00273609982720447]\n", "Reject Hypotheses: [True, True, True]\n" ] } ], "source": [ "# Generating more synthetic data\n", "group_c = np.random.normal(loc=0.2, scale=0.1, size=1000) # Old color scheme\n", "group_d = np.random.normal(loc=0.21, scale=0.1, size=1000) # New color scheme\n", "\n", "group_e = np.random.normal(loc=0.2, scale=0.1, size=1000) # Old button placement\n", "group_f = np.random.normal(loc=0.19, scale=0.1, size=1000) # New button placement\n", "\n", "# Perform t-tests\n", "_, p_value_1 = ttest_ind(group_a, ______) # TODO: Perform ttest on layout\n", "_, p_value_2 = ttest_ind(______, group_d) # TODO: Perform ttest on color scheme\n", "_, p_value_3 = ttest_ind(group_e, ______) # TODO: Perform ttest on button placement\n", "\n", "# Collect the p-values into a list\n", "p_values = [p_value_1, p_value_2, p_value_3]\n", "\n", "# Manually apply Bonferroni correction (alpha = 0.05)\n", "def bonferroni_correction(p_values, alpha=____):\n", " n = len(p_values)\n", " new_alpha = alpha / n\n", " adjusted_p_values = [min(1, p * n) for p in p_values]\n", " reject_hypothesis = [p < _______ for p in p_values] #TODO: Complete the code\n", " return adjusted_p_values, reject_hypothesis\n", "\n", "# Apply correction\n", "adjusted_p_values, reject_hypothesis = bonferroni_correction(p_values)\n", "print(\"Adjusted p-values with Bonferroni correction:\", adjusted_p_values)\n", "print(\"Reject Hypotheses:\", reject_hypothesis)" ] }, { "cell_type": "markdown", "id": "3d6edc31-48ca-4096-a03e-37a28496b0e9", "metadata": {}, "source": [ "# Revisiting Pokemons\n", "\n", "Remember the code that we used to sample pokemon data, we are going to take it one step further to figure out hypothesis testing." ] }, { "cell_type": "code", "execution_count": 4, "id": "35333b40-2f60-4f2a-be59-bfc04dc10e9e", "metadata": {}, "outputs": [], "source": [ "#importing libraries\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 5, "id": "5354c571-3b1a-4e0c-9842-67c987cd7d0f", "metadata": {}, "outputs": [], "source": [ "#loading the dataset into a pandas dataframe\n", "df = pd.read_csv('pokemon.csv') #TODO: Read CSV named pokemon.csv" ] }, { "cell_type": "code", "execution_count": 6, "id": "9b3546f0-0fc6-4799-82f0-3419562b6251", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>#</th>\n", " <th>Name</th>\n", " <th>Type 1</th>\n", " <th>Type 2</th>\n", " <th>Total</th>\n", " <th>HP</th>\n", " <th>Attack</th>\n", " <th>Defense</th>\n", " <th>Sp. Atk</th>\n", " <th>Sp. Def</th>\n", " <th>Speed</th>\n", " <th>Generation</th>\n", " <th>Legendary</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Bulbasaur</td>\n", " <td>Grass</td>\n", " <td>Poison</td>\n", " <td>318</td>\n", " <td>45</td>\n", " <td>49</td>\n", " <td>49</td>\n", " <td>65</td>\n", " <td>65</td>\n", " <td>45</td>\n", " <td>1</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Ivysaur</td>\n", " <td>Grass</td>\n", " <td>Poison</td>\n", " <td>405</td>\n", " <td>60</td>\n", " <td>62</td>\n", " <td>63</td>\n", " <td>80</td>\n", " <td>80</td>\n", " <td>60</td>\n", " <td>1</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Venusaur</td>\n", " <td>Grass</td>\n", " <td>Poison</td>\n", " <td>525</td>\n", " <td>80</td>\n", " <td>82</td>\n", " <td>83</td>\n", " <td>100</td>\n", " <td>100</td>\n", " <td>80</td>\n", " <td>1</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>VenusaurMega Venusaur</td>\n", " <td>Grass</td>\n", " <td>Poison</td>\n", " <td>625</td>\n", " <td>80</td>\n", " <td>100</td>\n", " <td>123</td>\n", " <td>122</td>\n", " <td>120</td>\n", " <td>80</td>\n", " <td>1</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>Charmander</td>\n", " <td>Fire</td>\n", " <td>NaN</td>\n", " <td>309</td>\n", " <td>39</td>\n", " <td>52</td>\n", " <td>43</td>\n", " <td>60</td>\n", " <td>50</td>\n", " <td>65</td>\n", " <td>1</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense \\\n", "0 1 Bulbasaur Grass Poison 318 45 49 49 \n", "1 2 Ivysaur Grass Poison 405 60 62 63 \n", "2 3 Venusaur Grass Poison 525 80 82 83 \n", "3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 \n", "4 4 Charmander Fire NaN 309 39 52 43 \n", "\n", " Sp. Atk Sp. Def Speed Generation Legendary \n", "0 65 65 45 1 False \n", "1 80 80 60 1 False \n", "2 100 100 80 1 False \n", "3 122 120 80 1 False \n", "4 60 50 65 1 False " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "79b05b2a-7fae-4c69-b25f-2522e3fab131", "metadata": {}, "outputs": [], "source": [ "def random_sampling(df,no_of_samples, sample_size):\n", " random_samples = [] \n", " for i in range(no_of_samples): # Enter the range for the no_of_samples\n", " # Randomly sampling without replacement\n", " sample = np.random.choice(df, size=sample_size, replace=False)\n", " random_samples.append(sample)\n", " random_samples = np.array(random_samples)\n", " return random_samples" ] }, { "cell_type": "code", "execution_count": 8, "id": "0bf9007d-d6f3-47f6-b7fc-51dcf2c7b646", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[600, 460, 495, ..., 300, 299, 340],\n", " [288, 490, 485, ..., 430, 480, 528],\n", " [600, 528, 425, ..., 318, 600, 290],\n", " ...,\n", " [505, 405, 309, ..., 285, 430, 456],\n", " [480, 405, 530, ..., 420, 567, 250],\n", " [470, 380, 495, ..., 579, 700, 455]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon_samples = random_sampling(df['Total'], 200, 50)\n", "\n", "pokemon_samples" ] }, { "cell_type": "markdown", "id": "97810014-0bfe-4d43-9708-82f742d28577", "metadata": {}, "source": [ "### Hypothesis testing\n", "\n", "**Problem E - 10 points**" ] }, { "cell_type": "markdown", "id": "2a3d01ed-cdd3-4490-bf73-7914874107b4", "metadata": {}, "source": [ "Consider the Attack points for different pokemons. Check whether the average attack is greater than 73 or not." ] }, { "cell_type": "code", "execution_count": 9, "id": "9a0c914f-1232-4bde-a914-1c4a231f4f3b", "metadata": {}, "outputs": [], "source": [ "#importing libraries\n", "from scipy.stats import ttest_1samp" ] }, { "cell_type": "markdown", "id": "2fd0ca0c-2231-404c-a6e1-9ecf993b9b08", "metadata": {}, "source": [ "**Part a - 1 point**\n", "\n", "Find out the mean of Attack." ] }, { "cell_type": "code", "execution_count": 10, "id": "f2aac601-5a27-4abe-9c1a-1063de288ad9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "79.00125\n" ] } ], "source": [ "#Write your code below\n", "attack_mean = np._____(df['Attack'])\n", "\n", "#printing the mean\n", "print(attack_mean)" ] }, { "cell_type": "markdown", "id": "3f184d3c-ee2a-4ffb-9491-64387111d317", "metadata": {}, "source": [ "**Part b - 4 points**\n", "\n", "Run a t-test to check whether the average attack is greater than 73 or not." ] }, { "cell_type": "code", "execution_count": 11, "id": "3c9ffa2c-748f-487e-94e3-71bc52d9e9cf", "metadata": {}, "outputs": [], "source": [ "#Write your code below (Check documentation for the ttest_1samp formula to know how to use it)\n", "test, pval = ttest_1samp(_____, ______) #TODO: Complete the function to run the t-test on df['Attack']" ] }, { "cell_type": "code", "execution_count": 12, "id": "4afbb5eb-de50-4613-8358-973969b883f3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.170592033069667e-07\n" ] } ], "source": [ "#printing pval\n", "print(pval)" ] }, { "cell_type": "code", "execution_count": 13, "id": "86f927fe-8fb0-498f-a3ff-d7400c6dfb26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Reject null hypothesis\n" ] } ], "source": [ "if pval < ______:\n", " print(\" Reject null hypothesis\")\n", "else:\n", " print(\"Fail to reject null hypothesis\")" ] }, { "cell_type": "markdown", "id": "a0d63238-9111-4ba8-94a7-032df654076d", "metadata": {}, "source": [ "**Part c - 5 points**\n", "\n", "Run a z-test to check whether the average defense is greater than 73 or not." ] }, { "cell_type": "code", "execution_count": 14, "id": "bcb59adc-8a48-4f1d-b4b9-caa134a2476d", "metadata": {}, "outputs": [], "source": [ "#importing libraries\n", "from scipy import stats\n", "from statsmodels.stats import weightstats as stests" ] }, { "cell_type": "code", "execution_count": 15, "id": "f7ee64b9-782b-477e-b788-849df7f53254", "metadata": {}, "outputs": [], "source": [ "# Write your code below (Check documentation to learn know to use the weightstats.ztest function to apply below)\n", "ztest ,pval = stests.ztest(_______, x2=None, value=______) #TODO: Complete the function to run the ztest" ] }, { "cell_type": "code", "execution_count": 16, "id": "70924482-c6fc-40d4-b69f-9f2e55f98869", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.4447658863352716\n" ] } ], "source": [ "print(float(pval))" ] }, { "cell_type": "code", "execution_count": 17, "id": "a4baca26-930b-40be-9755-f03958f44c87", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accept null hypothesis\n" ] } ], "source": [ "if pval<0.05:\n", " print(\"Reject null hypothesis\")\n", "else:\n", " print(\"Fail to reject null hypothesis\")" ] }, { "cell_type": "markdown", "id": "1893773f-6465-4be1-9427-68a5624ae8fc", "metadata": {}, "source": [ "# Benjamini-Hochberg correction (Bonus - 10 Points)\n", "\n", "The Benjamini-Hochberg (BH) procedure, also known as the False Discovery Rate (FDR) controlling procedure, is an approach designed to control the expected proportion of falsely rejected null hypotheses, or in other words, the expected proportion of false discoveries among the rejected hypotheses.\n", "\n", "**The problem:**\n", "\n", "When conducting multiple hypothesis tests, if you use the traditional significance level (like α=0.05) for each test, the likelihood of making one or more Type I errors (false positives) increases.\n", "\n", "**The solution:** \n", "\n", "Instead of controlling the familywise error rate (the probability of making one or more false discoveries) like the Bonferroni correction does, the Benjamini-Hochberg procedure controls the False Discovery Rate (FDR), which is the expected proportion of errors among the rejected hypotheses.\n", "\n", "**Procedure:**\n", "\n", "- Rank the p-values from your multiple tests in ascending order.\n", "- Compare each p-value to its Benjamini-Hochberg critical value, which is given by (i/m)×α, where i is the rank (from smallest to largest) and m is the total number of tests.\n", "- Find the largest p-value that is less than its critical value. Reject the null hypothesis for that p-value and all smaller p-values.\n", "\n", "Read more at: https://www.statisticshowto.com/benjamini-hochberg-procedure/\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "d68819f8-7f7a-4499-9688-f20c7a3efa2d", "metadata": {}, "outputs": [], "source": [ "from scipy.stats import ttest_1samp, norm" ] }, { "cell_type": "markdown", "id": "8b6e099e-1335-4cd7-aadb-7612146ce01f", "metadata": {}, "source": [ "**Complete the function to perform Benjamini-Hochberg Correction on a set of p_values, default alpha value should be 0.05 - 10 Points**\n", "\n", "###### Psst.. If you are stuck and finding this hard, feel free to use Chat GPT, but please mention if you have done so. Using LLM is not wrong but passing of its work as your own is**" ] }, { "cell_type": "code", "execution_count": 19, "id": "82a4bfdf-f5ae-4995-a4a1-43e6afd1d722", "metadata": {}, "outputs": [], "source": [ "# Benjamini-Hochberg correction function\n", "def benjamini_hochberg_correction(p_values, alpha=0.05):\n", " # Sort p-values\n", " sorted_p_values = sorted(p_values)\n", " \n", " n = len(p_values)\n", " # Write code below to implement the Benjamini-Hochberg correction function\n", " \n", " return adjusted_p_values, reject_hypothesis\n", "\n", "# Drawing 200 samples and performing t-tests\n", "p_values = []" ] }, { "cell_type": "code", "execution_count": 20, "id": "6b926794-5fe4-42e0-b4a0-b1074e54f9d0", "metadata": {}, "outputs": [], "source": [ "for i in range(200):\n", " sample = df['Defense'].sample(50, replace=True)\n", " t_stat, p_value = ttest_1samp(sample, 60)\n", " p_values.append(p_value)" ] }, { "cell_type": "code", "execution_count": 21, "id": "21403791-d6e5-4ef2-a4ab-60ba20dd07de", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 adjusted p-values: [7.70619974093318e-07, 5.441161523929422e-06, 6.399684184868066e-06, 8.528489906437646e-06, 1.3658163581517232e-05, 1.3896950170629584e-05, 1.6809087927777643e-05, 1.715555924702999e-05, 1.7826037027142184e-05, 2.342789315507416e-05]\n", "First 10 reject_hypothesis decisions: [True, True, True, True, True, True, True, True, True, True]\n" ] } ], "source": [ "# Applying Benjamini-Hochberg correction\n", "adjusted_p_values, reject_hypothesis = benjamini_hochberg_correction(p_values)\n", "\n", "# Displaying some results\n", "print(\"First 10 adjusted p-values:\", adjusted_p_values[:10])\n", "print(\"First 10 reject_hypothesis decisions:\", reject_hypothesis[:10])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }