# DS 122 Discussion 8

In this discussion, we will review the different concepts we have discussed in this course. This discussion contains 3 parts:
- EDA using pandas on a dataset
- Sampling using numpy
- Sampling and Hypothesis testing on a dataset

In [32]:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

## Part A: Review of pandas

In this section, we will analyze data about players from the UCL (UEFA Champions League), a football (soccer) competition based in Europe. Clubs (or teams) such as Liverpool, Manchester United, Real Madrid, and Bayern take part in it.

You have probably heard of players like Ronaldo or Messi. This dataset covers many more players.

The dataset is taken from Kaggle and can be found at the following link: https://www.kaggle.com/datasets/azminetoushikwasi/ucl-202122-uefa-champions-league?select=attacking.csv.

We will begin by working with the files `attacking.csv` and `goals.csv` to find players who are good at attacking.

### Loading the dataset

The first step of a complete data analysis project is to acquire the data. Let's do this by loading the CSV files given to us.

In [2]:
attack = pd.read_csv('/content/attacking.csv')
goals = pd.read_csv('/content/goals.csv')

Let us now view our datasets to get an overview of the data.

In [3]:
#view attack
attack.head()

Unnamed: 0,serial,player_name,club,position,assists,corner_taken,offsides,dribbles,match_played
0,1,Bruno Fernandes,Man. United,Midfielder,7,10,2,7,7
1,2,Vinícius Júnior,Real Madrid,Forward,6,3,4,83,13
2,2,Sané,Bayern,Midfielder,6,3,3,32,10
3,4,Antony,Ajax,Forward,5,3,4,28,7
4,5,Alexander-Arnold,Liverpool,Defender,4,36,0,9,9


In [4]:
#print attack's size
attack.shape

(176, 9)

In [5]:
#view goals
goals.head()

Unnamed: 0,serial,player_name,club,position,goals,right_foot,left_foot,headers,others,inside_area,outside_areas,penalties,match_played
0,1,Benzema,Real Madrid,Forward,15,11,1,3,0,13,2,3,12
1,2,Lewandowski,Bayern,Forward,13,8,3,1,1,13,0,3,10
2,3,Haller,Ajax,Forward,11,3,4,3,1,11,0,1,8
3,4,Salah,Liverpool,Forward,8,0,8,0,0,7,1,1,13
4,5,Nkunku,Leipzig,Midfielder,7,3,1,3,0,7,0,0,6


In [6]:
#print shape of goals
goals.shape

(183, 13)

### EDA

The next step of a data analysis pipeline is Exploratory Data Analysis (EDA). Our ultimate goal is a list of the top 10 players from an offensive point of view.

EDA includes cleaning the data. Think about the goal of the project: we are not interested in which foot a player used to score. That would be relevant if we were deciding which side of the field a player should play on, but here we only want to know who the good players are.

Keeping this in mind, let us first begin by cleaning the data.

In [7]:
#printing the names of the columns in all the dataframes
print(attack.columns)
print(goals.columns)

Index(['serial', 'player_name', 'club', 'position', 'assists', 'corner_taken',
 'offsides', 'dribbles', 'match_played'],
 dtype='object')
Index(['serial', 'player_name', 'club', 'position', 'goals', 'right_foot',
 'left_foot', 'headers', 'others', 'inside_area', 'outside_areas',
 'penalties', 'match_played'],
 dtype='object')


Player name, club, matches played, and position are common to both datasets, so let's keep them.

Let's first work with `attack`. Notice the different columns. They all provide some information on whether the player is good (assists, etc.) or bad (offsides, etc.), so let's preserve them and drop only `serial`.

We can delete columns like 'right_foot', 'left_foot', 'headers', 'others', 'inside_area', 'outside_areas', and 'serial' from `goals`.

In [8]:
#deleting 'serial' from attack
attack = attack.drop(columns = ['serial'])

In [9]:
attack.head()

Unnamed: 0,player_name,club,position,assists,corner_taken,offsides,dribbles,match_played
0,Bruno Fernandes,Man. United,Midfielder,7,10,2,7,7
1,Vinícius Júnior,Real Madrid,Forward,6,3,4,83,13
2,Sané,Bayern,Midfielder,6,3,3,32,10
3,Antony,Ajax,Forward,5,3,4,28,7
4,Alexander-Arnold,Liverpool,Defender,4,36,0,9,9


In [10]:
#deleting columns from goals
del_col = ['right_foot', 'left_foot', 'headers', 'others', 'inside_area', 'outside_areas', 'serial']
#delete columns here
goals = goals.drop(columns = del_col)

Let us now merge the dataframes.

In [11]:
#merge ucl here
ucl = attack.merge(goals, on= ['player_name', 'club', 'position', 'match_played'])
ucl.head()

Unnamed: 0,player_name,club,position,assists,corner_taken,offsides,dribbles,match_played,goals,penalties
0,Vinícius Júnior,Real Madrid,Forward,6,3,4,83,13,4,0
1,Sané,Bayern,Midfielder,6,3,3,32,10,6,0
2,Antony,Ajax,Forward,5,3,4,28,7,2,0
3,De Bruyne,Man. City,Midfielder,4,18,0,14,10,2,0
4,Mbappé,Paris,Forward,4,4,8,43,8,6,0


In [12]:
#print shape of ucl
ucl.shape

(82, 10)

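Many players were dropped because `merge` performs an inner join by default, which keeps only rows whose `player_name`, `club`, `position`, and `match_played` values match in both datasets. Let us continue with this smaller dataset for the purposes of this exercise.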
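To see which rows fail to match in a merge, one option is an outer merge with `indicator=True`. The sketch below uses two small toy frames with made-up values (`attack_toy` and `goals_toy` are hypothetical stand-ins, not the real CSVs).

```python
import pandas as pd

# Toy stand-ins for `attack` and `goals` (values are made up for illustration).
attack_toy = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "club": ["X", "Y", "Z"],
    "assists": [3, 1, 2],
})
goals_toy = pd.DataFrame({
    "player_name": ["B", "C", "D"],
    "club": ["Y", "Z", "W"],
    "goals": [4, 0, 5],
})

# An outer merge keeps every row; `_merge` records where each row came from.
merged = attack_toy.merge(goals_toy, on=["player_name", "club"],
                          how="outer", indicator=True)
print(merged["_merge"].value_counts())

# The default (inner) merge keeps only rows present in both frames.
inner = attack_toy.merge(goals_toy, on=["player_name", "club"])
print(inner.shape)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would drop.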

Now, we have a final dataframe called `ucl` to work with. Let us now make a list of the top 10 players.

Columns like `assists`, `corner_taken`, `dribbles` and `goals` give a positive score.

Columns like `offsides` and `penalties` give a negative score.

Let us first normalize each column by `match_played` (giving per-match rates), then add the positive scores, subtract the negative ones, and build a dataframe of the top 10 players.

Note that in more complex projects we would usually rescale every column to lie between 0 and 1. We do not do that here, so columns with large raw values, such as `dribbles` (83, 67, and so on), tend to dominate the score.
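For reference, rescaling columns to the range 0 to 1 is often done with min-max scaling. The following is a minimal sketch on hypothetical per-match rates (the frame `rates` and its values are made up), not part of the analysis above.

```python
import pandas as pd

# Hypothetical per-match rates, for illustration only.
rates = pd.DataFrame({
    "dribbles_per_match": [6.5, 2.0, 4.0],
    "goals_per_match": [0.2, 0.8, 0.5],
})

# Min-max scaling: (x - min) / (max - min), applied column by column.
scaled = (rates - rates.min()) / (rates.max() - rates.min())
print(scaled)
```

After scaling, each column contributes on the same 0 to 1 range. A column whose values are all identical would need special handling, since its denominator is zero.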

In [13]:
#normalizing the data and storing them in new columns
ucl['assists_normalized'] = ucl['assists']/ucl['match_played']
ucl['corner_taken_normalized'] = ucl['corner_taken']/ucl['match_played']
ucl['dribbles_normalized'] = ucl['dribbles']/ucl['match_played']
ucl['goals_normalized'] = ucl['goals']/ucl['match_played']
ucl['offsides_normalized'] = ucl['offsides']/ucl['match_played']
ucl['penalties_normalized'] = ucl['penalties']/ucl['match_played']

In [14]:
# Calculate overall score
ucl['score'] = ucl['assists_normalized'] + ucl['corner_taken_normalized'] + ucl['dribbles_normalized'] + ucl['goals_normalized'] - ucl['offsides_normalized'] - ucl['penalties_normalized']

In [15]:
#ranking the players
top_10_players = ucl.nlargest(10, 'score')
top_10_players.head(10)

Unnamed: 0,player_name,club,position,assists,corner_taken,offsides,dribbles,match_played,goals,penalties,assists_normalized,corner_taken_normalized,dribbles_normalized,goals_normalized,offsides_normalized,penalties_normalized,score
9,Coman,Bayern,Forward,3,4,4,59,9,2,0,0.333333,0.444444,6.555556,0.222222,0.444444,0.0,7.111111
0,Vinícius Júnior,Real Madrid,Forward,6,3,4,83,13,4,0,0.461538,0.230769,6.384615,0.307692,0.307692,0.0,7.076923
51,Moumi Ngamaleu,Young Boys,Midfielder,1,1,0,34,6,1,0,0.166667,0.166667,5.666667,0.166667,0.0,0.0,6.166667
4,Mbappé,Paris,Forward,4,4,8,43,8,6,0,0.5,0.5,5.375,0.75,1.0,0.0,6.125
15,Mahrez,Man. City,Midfielder,2,30,5,28,12,7,2,0.166667,2.5,2.333333,0.583333,0.416667,0.166667,5.0
2,Antony,Ajax,Forward,5,3,4,28,7,2,0,0.714286,0.428571,4.0,0.285714,0.571429,0.0,4.857143
11,Bellingham,Dortmund,Midfielder,3,1,1,24,6,1,0,0.5,0.166667,4.0,0.166667,0.166667,0.0,4.666667
1,Sané,Bayern,Midfielder,6,3,3,32,10,6,0,0.6,0.3,3.2,0.6,0.3,0.0,4.4
19,Pedro Gonçalves,Sporting CP,Midfielder,2,7,0,10,5,4,1,0.4,1.4,2.0,0.8,0.0,0.2,4.4
16,Ziyech,Chelsea,Midfielder,2,23,0,12,9,1,0,0.222222,2.555556,1.333333,0.111111,0.0,0.0,4.222222


In [16]:
#cleaning the column
print(top_10_players.columns)

Index(['player_name', 'club', 'position', 'assists', 'corner_taken',
 'offsides', 'dribbles', 'match_played', 'goals', 'penalties',
 'assists_normalized', 'corner_taken_normalized', 'dribbles_normalized',
 'goals_normalized', 'offsides_normalized', 'penalties_normalized',
 'score'],
 dtype='object')


In [17]:
#delete
del_col = ['assists', 'corner_taken',
 'offsides', 'dribbles', 'match_played', 'goals', 'penalties',
 'assists_normalized', 'corner_taken_normalized', 'dribbles_normalized',
 'goals_normalized', 'offsides_normalized', 'penalties_normalized']
top_10_players = top_10_players.drop(columns = del_col)

In [18]:
top_10_players.head(10)

Unnamed: 0,player_name,club,position,score
9,Coman,Bayern,Forward,7.111111
0,Vinícius Júnior,Real Madrid,Forward,7.076923
51,Moumi Ngamaleu,Young Boys,Midfielder,6.166667
4,Mbappé,Paris,Forward,6.125
15,Mahrez,Man. City,Midfielder,5.0
2,Antony,Ajax,Forward,4.857143
11,Bellingham,Dortmund,Midfielder,4.666667
1,Sané,Bayern,Midfielder,4.4
19,Pedro Gonçalves,Sporting CP,Midfielder,4.4
16,Ziyech,Chelsea,Midfielder,4.222222


## Part B: Review of NumPy

Let us now review numpy.

**1. Create a 2-dimensional NumPy array with 3 rows and 4 columns, filled with random integers between 0 and 100 (inclusive).**

In [19]:
array_2d = np.random.randint(0,101,(3,4))
print(array_2d)

[[ 23 24 38 91]
 [100 92 80 54]
 [ 76 98 64 31]]


**2. Generate an array of 20 random integers between 1 and 100 (inclusive). Calculate the mean, median, and mode of the generated array. Create a new array containing 100 random samples from a standard normal distribution (mean=0, std=1).**

In [20]:
random_integers = np.random.randint(1,101, 20)

mean_sample = np.mean(random_integers)
median_sample = np.median(random_integers)
unique_values, counts = np.unique(random_integers, return_counts = True)
mode_sample = unique_values[counts.argmax()]

standard_normal_samples = np.random.normal(0,1,100)

print(random_integers)
print("Mean:", mean_sample)
print("Median:", median_sample)
print("Mode:", mode_sample)
print(standard_normal_samples[:10])


[ 21 62 55 50 69 30 47 15 75 6 37 93 94 58 39 26 100 40
 92 50]
Mean: 52.95
Median: 50.0
Mode: 50
[ 0.04254721 -1.38797692 -0.23088044 0.27202176 0.93998854 1.43665346
 -0.47827033 -1.07138487 1.03319647 -0.78915187]
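One caveat with computing the mode through `argmax`: when several values tie for the highest count, only the smallest of them is reported. A small sketch with a hand-picked array (the values are chosen only to force a tie):

```python
import numpy as np

values = np.array([3, 7, 7, 2, 3, 9])  # 3 and 7 both appear twice

unique_values, counts = np.unique(values, return_counts=True)

# argmax returns the first maximum, so only the smallest tied value is reported.
single_mode = unique_values[counts.argmax()]

# Keeping every value whose count equals the maximum reports all modes.
all_modes = unique_values[counts == counts.max()]

print(single_mode)  # 3
print(all_modes)    # [3 7]
```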


**3. Create an array of 10 unique integers between 1 and 20 (inclusive). Sample 5 random integers from this array, allowing for replacement. Calculate the mean and standard deviation of the sampled integers.**

In [21]:
# Note: np.unique drops duplicate draws, so this array may hold fewer than 10 values
unique_integers = np.unique(np.random.randint(1,21,10))

sample_with_replacement = np.random.choice(unique_integers, 5, replace = True)

mean_sample_replacement = np.mean(sample_with_replacement)
std_dev_sample_replacement = np.std(sample_with_replacement)

print(unique_integers)
print(sample_with_replacement)
print("Mean:", mean_sample_replacement)
print("Standard Deviation:", std_dev_sample_replacement)

[ 1 3 4 5 6 13 14 16]
[14 6 3 4 3]
Mean: 6.0
Standard Deviation: 4.147288270665544
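Note that the output above contains only 8 values: drawing 10 integers and then removing duplicates does not guarantee 10 unique integers. One way to get exactly 10 distinct values is to sample without replacement. This is a sketch using NumPy's `default_rng` generator; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator; the seed is arbitrary

# Draw 10 distinct integers from 1..20 by sampling without replacement.
unique_integers = rng.choice(np.arange(1, 21), size=10, replace=False)

# Then sample 5 values from that pool, with replacement.
sample = rng.choice(unique_integers, size=5, replace=True)

print(unique_integers)
print(sample)
print("Mean:", sample.mean())
print("Standard Deviation:", sample.std())
```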


**4. Create an array of 15 random integers between 1 and 10 (inclusive). Sort the array in ascending order. Find the unique values in the sorted array.**

In [22]:
array_15_random = np.random.randint(1,11,15)

sorted_array = np.sort(array_15_random)

unique_sorted_values = np.unique(sorted_array)

print(array_15_random)
print(sorted_array)
print(unique_sorted_values)

[ 9 8 6 10 1 6 7 7 1 7 9 8 7 7 9]
[ 1 1 6 6 7 7 7 7 7 8 8 9 9 9 10]
[ 1 6 7 8 9 10]


## Part C: Sampling and Hypothesis Testing

In this part, we will work with the `tips` dataset that can be downloaded from the `seaborn` library.

The "tips" dataset contains information about restaurant tips, including total bills, tips, and other relevant details.

In [23]:
#downloading the dataset into a dataframe
tips = sns.load_dataset('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [24]:
print(tips.shape)

(244, 7)


**1. You want to analyze the tipping behavior of a sample of customers. Perform simple random sampling to select a sample of 30 customers from the "tips" dataset. Calculate the sample mean and sample variance of the "tip" column for this sample.**

In [25]:
sample_size = 30
sample = tips.sample(n = sample_size, random_state = 42)

In [26]:
sample_mean = sample['tip'].mean()
sample_variance = sample['tip'].var()

In [27]:
print("Sample Mean:", sample_mean)
print("Sample Variance:", sample_variance)

Sample Mean: 2.832333333333333
Sample Variance: 1.3560667816091954


**2. You suspect that the tipping behavior varies by the day of the week. Perform stratified sampling to select a sample of 20 customers from each day (Thursday, Friday, Saturday, and Sunday) from the "tips" dataset. Calculate the sample mean and sample variance of the "tip" column for each stratum.**

In [29]:
#initialize an empty list to store the sample means and variances for each stratum
stratum_sample_statistics = []

In [30]:
#sampling
days_of_week = ['Thur', 'Fri', 'Sat', 'Sun']
sample_size_per_stratum = 20

for day in days_of_week:
    stratum = tips[tips['day'] == day]
    # replace=True lets n exceed the stratum size (Fri has only 19 rows)
    stratum_sample = stratum.sample(n=sample_size_per_stratum, random_state = 42, replace = True)
    stratum_mean = stratum_sample['tip'].mean()
    stratum_variance = stratum_sample['tip'].var()
    stratum_sample_statistics.append({
        'Day': day,
        'Sample Mean': stratum_mean,
        'Sample Variance': stratum_variance
    })

In [31]:
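# Use a loop variable other than `stats` so the scipy.stats module imported as `stats` is not overwritten
for day_stats in stratum_sample_statistics:
    print(f"Day: {day_stats['Day']}, Sample Mean: {day_stats['Sample Mean']}, Sample Variance: {day_stats['Sample Variance']}")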

Day: Thur, Sample Mean: 2.894, Sample Variance: 1.6887410526315791
Day: Fri, Sample Mean: 2.7265, Sample Variance: 1.1247186842105263
Day: Sat, Sample Mean: 3.2269999999999994, Sample Variance: 3.2082957894736843
Day: Sun, Sample Mean: 2.9415, Sample Variance: 0.9116028947368421
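The same per-stratum sampling can also be written without an explicit loop by using `groupby(...).sample(...)`, available in pandas 1.1 and later. The sketch below uses a small toy frame with made-up values rather than `tips`.

```python
import pandas as pd

# Toy data standing in for `tips` (values are made up).
toy = pd.DataFrame({
    "day": ["Thur"] * 5 + ["Fri"] * 3 + ["Sat"] * 6,
    "tip": [2.0, 3.0, 2.5, 4.0, 1.5, 3.0, 2.0, 2.5,
            5.0, 3.5, 1.0, 2.0, 4.5, 3.0],
})

# Draw 4 rows per day. replace=True is needed because "Fri" has only 3 rows.
stratified = toy.groupby("day").sample(n=4, replace=True, random_state=42)

# Per-stratum mean and sample variance (ddof=1) in one call.
summary = stratified.groupby("day")["tip"].agg(["mean", "var"])
print(summary)
```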


**3. You want to test whether the average total bill in the dataset is significantly greater than $20. Perform a one-tailed t-test to answer this question.**

In [33]:
# Null hypothesis (H0): the mean total bill is equal to $20
# Alternative hypothesis (H1): the mean total bill is greater than $20

alpha = 0.05

# We will use a one-tail t-test because we are testing if the average is greater than $20.
t_statistic, p_value = stats.ttest_1samp(tips['total_bill'], popmean = 20, alternative = 'greater')

if p_value < alpha:
 print("Reject the null hypothesis. The average total bill is significantly greater than $20.")
else:
 print("Fail to reject the null hypothesis. There is no significant evidence that the average total bill is greater than $20.")

print("t-statistic:", t_statistic)
print("p-value:", p_value)

Fail to reject the null hypothesis. There is no significant evidence that the average total bill is greater than $20.
t-statistic: -0.37559294451919506
p-value: 0.646226403218664
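To see where these numbers come from, the t-statistic and the one-tailed p-value can be reproduced by hand. The sketch below uses a small made-up sample (not `tips`) and compares the manual result with `ttest_1samp`.

```python
import numpy as np
import scipy.stats as stats

# Made-up sample for illustration.
x = np.array([18.5, 22.1, 19.4, 25.0, 17.2, 21.3, 20.8, 16.9])
popmean = 20

n = len(x)
# t = (sample mean - hypothesized mean) / (sample std / sqrt(n))
t_manual = (x.mean() - popmean) / (x.std(ddof=1) / np.sqrt(n))

# For H1: mean > popmean, the p-value is the upper-tail area beyond t.
p_manual = stats.t.sf(t_manual, df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, popmean=popmean, alternative="greater")
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

When the sample mean falls below the hypothesized mean, the t-statistic is negative and the upper-tail p-value exceeds 0.5, which is the situation in the output above.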


**4. You suspect that there is a significant difference in the total bill between lunch and dinner. Perform a two-tail t-test to test this hypothesis.**

In [34]:
#Separate the total bill data into two groups: lunch and dinner
total_bill_lunch = tips[tips['time']=='Lunch']['total_bill']
total_bill_dinner = tips[tips['time']=='Dinner']['total_bill']

#Define the null and alternative hypotheses
# Null hypothesis (H0): the mean total bill at lunch equals the mean total bill at dinner
# Alternative hypothesis (H1): the mean total bills at lunch and dinner differ

alpha = 0.05

# equal_var=False runs Welch's t-test, which does not assume equal variances in the two groups
t_statistic, p_value = stats.ttest_ind(total_bill_lunch, total_bill_dinner, equal_var = False)

if p_value < alpha:
 print("Reject the null hypothesis. There is a significant difference in total bill between lunch and dinner.")
else:
 print("Fail to reject the null hypothesis. There is no significant difference in total bill between lunch and dinner.")

print("t-statistic:", t_statistic)
print("p-value:", p_value)

Reject the null hypothesis. There is a significant difference in total bill between lunch and dinner.
t-statistic: -3.122986183296264
p-value: 0.0021665735148038933
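With `equal_var=False`, the statistic is the difference in means divided by an unpooled standard error. The sketch below checks this on two made-up groups and, for comparison, also runs the pooled-variance test.

```python
import numpy as np
import scipy.stats as stats

# Made-up groups with different sizes and spreads.
a = np.array([12.0, 15.5, 14.2, 13.8, 16.1])
b = np.array([18.3, 25.7, 21.4, 30.2, 19.9, 27.5, 22.8])

# Welch's statistic: difference in means over the unpooled standard error.
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)

print(t_manual, t_welch, p_welch)
print(t_pooled, p_pooled)
```

A negative statistic simply means the first group's mean is lower than the second's.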


**5. You are interested in estimating the mean total bill amount for all the restaurant visits in the "tips" dataset. Create a 95% confidence interval for this estimate.**

a. Calculate the sample mean and sample standard deviation for the total bill amounts.

In [35]:
sample_mean = tips['total_bill'].mean()
sample_std = tips['total_bill'].std()

print("Sample Mean:", sample_mean)
print("Sample Standard Deviation:", sample_std)

Sample Mean: 19.78594262295082
Sample Standard Deviation: 8.902411954856856


b. Determine the margin of error for a 95% confidence interval.

The margin of error for a 95% confidence interval can be determined using the formula:

Margin of Error = (Critical Value) * (Standard Deviation / √Sample Size)

In [37]:
# Set the confidence level and find the critical value (z-value for a normal distribution)
confidence_level = 0.95
alpha = 1 - confidence_level
# ppf is the inverse CDF (quantile function); pdf would return the density, not the critical value
critical_value = stats.norm.ppf(1-alpha/2)

# Set the sample size
sample_size = len(tips)

# Calculate the margin of error
margin_of_error = critical_value * (sample_std/np.sqrt(sample_size))

print("Margin of Error: {:.4f}".format(margin_of_error))


Margin of Error: 1.1170


c. Calculate the lower and upper bounds of the confidence interval.

In [38]:
# Calculate the lower and upper bounds of the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print("Confidence Interval (95%): [{:.2f}, {:.2f}]".format(lower_bound, upper_bound))


Confidence Interval (95%): [18.67, 20.90]
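As a cross-check, a confidence interval for a mean can also be built from the t distribution with `stats.t.interval`. The sketch below uses synthetic data (not `tips`) and compares a z-based interval with a t-based one.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
# Synthetic data for illustration; not the tips dataset.
x = rng.normal(loc=20, scale=9, size=244)

mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(len(x))  # standard error of the mean

# z-based interval: critical value from the normal quantile function (ppf).
z = stats.norm.ppf(0.975)
z_interval = (mean - z * sem, mean + z * sem)

# t-based interval with n - 1 degrees of freedom.
t_interval = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=sem)

print(z_interval)
print(t_interval)
```

The t-based interval is slightly wider than the z-based one, and the gap shrinks as the sample size grows.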
