WNBA Statistics Project

Introduction

The goal of this project was to practice data modeling calculations on WNBA statistics and to gain experience with Excel parsing and data management.

Part 1 - Questions

Question 1: What is the average Player Efficiency Rating among WNBA players from 1997-2019?

  • Who (population): WNBA Players
  • What (subject, discipline): Basketball, WNBA Statistics
  • Where (location): WNBA League in North America
  • When (snapshot, longitudinal): 1997-2019 WNBA Seasons, N/A
  • How much data do you need to do the analysis/work: CSV with WNBA Player Statistics highlighting Player Efficiency Rating

Question 2: What is the average age of WNBA players from 1997-2019?

  • Who (population): WNBA Players
  • What (subject, discipline): Basketball, WNBA Statistics
  • Where (location): WNBA League in North America
  • When (snapshot, longitudinal): 1997-2019 WNBA Seasons, N/A
  • How much data do you need to do the analysis/work: CSV with WNBA Player Statistics highlighting age and season

Question 3: Who has the highest Player Efficiency Rating among WNBA players from 1997-2019?

  • Who (population): WNBA Players
  • What (subject, discipline): Basketball, WNBA Statistics
  • Where (location): WNBA League in North America
  • When (snapshot, longitudinal): 1997-2019 WNBA Seasons, N/A
  • How much data do you need to do the analysis/work: CSV with WNBA Player Statistics highlighting Player Name and Player Efficiency Rating

Who Might Collect Relevant Data / What Articles or Publications Cite a Relevant Data Set?

The WNBA likely already collects statistics on its players, so the WNBA website is a natural first place to look. Other websites also compile WNBA player data and may be useful sources for answering these questions.

Part 2 - Selecting a Data Set, Adding Documentation

  • Name / Title: wnba-player-stats
  • Link to Data: https://github.com/fivethirtyeight/WNBA-stats
  • Source / Origin:
  • Author or Creator: Neil Paine
  • Publication Date: May 25, 2020
  • Publisher: FiveThirtyEight
  • Version or Date Accessed: 2/3/2021
  • License: Creative Commons Attribution 4.0 International license
  • Format: .csv (comma separated values)
  • Size: 530 KB (543,266 bytes)
  • Number of Records: 3884

Sample of Data


import csv

with open('../data/raw/wnba-player-stats.csv', 'r') as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    next(reader, None)
    data_read = [row for row in reader]
    for i in range(0,3):
        print(data_read[i])

['montgre01w', 'Renee Montgomery', '2019', '32', 'ATL', '34', '-9.8', 'G', '34', '949', '69.5%', '11.1', '.520', '.727', '.176', '1.0', '4.1', '18.0', '1.7', '0.5', '16.8', '17.5', '0.4', '0.5', '0.9', '.039', '-2.4', '1.22']
['williel01w', 'Elizabeth Williams', '2019', '26', 'ATL', '34', '-9.8', 'C-F', '32', '909', '66.6%', '16.7', '.521', '.000', '.477', '11.4', '12.1', '7.8', '1.4', '4.7', '12.9', '15.9', '1.6', '1.0', '2.7', '.117', '+0.6', '2.51']
['sykesbr01w', 'Brittney Sykes', '2019', '25', 'ATL', '34', '-9.8', 'G', '34', '880', '64.5%', '11.3', '.445', '.308', '.259', '3.0', '9.1', '19.6', '1.2', '1.5', '14.8', '23.1', '-0.8', '0.8', '0.0', '-.001', '-3.4', '0.70']

                

Fields or Column Headers

  • Field/Column 1: Player Name (String)
  • Field/Column 2: Year (String)
  • Field/Column 3: Age (Integer)
  • Field/Column 4: Player Efficiency Rating (Float)
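
The four fields above sit at fixed positions in each CSV row. As a minimal sketch, the positions below are read off the sample rows shown earlier (name at index 1, year at 2, age at 3, Player Efficiency Rating at 11). The constant names and the `pick_fields` helper are hypothetical and not part of the original project.

```python
# Column positions inferred from the sample rows above (assumption: the
# layout is the same for every row in the file).
NAME_COL, YEAR_COL, AGE_COL, PER_COL = 1, 2, 3, 11

def pick_fields(row):
    """Return (name, year, age, per) from one raw CSV row, with types applied."""
    return (
        row[NAME_COL],        # string
        row[YEAR_COL],        # kept as a string: used as a category
        int(row[AGE_COL]),    # integer
        float(row[PER_COL]),  # float
    )

# First twelve values of the first sample row shown above.
sample = ['montgre01w', 'Renee Montgomery', '2019', '32', 'ATL', '34',
          '-9.8', 'G', '34', '949', '69.5%', '11.1']
print(pick_fields(sample))
```

Naming the positions once makes it harder to mix up index 3 (age) with index 11 (Player Efficiency Rating) later in the analysis.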

Part 3 - Extract / Transform

Process for extracting, transforming, cleaning incoming data:
I plan to use four columns from my dataset: Player Name, Year, Age, and Player Efficiency Rating. Player Name is already a string, which is what I need. Year is trickier: it looks numeric, but I keep it as a string because I treat it as a categorical value rather than a quantitative one. Age and Player Efficiency Rating need to be converted to Integer and Float types. Some numeric cells are empty, which in this dataset means that no statistics were recorded for that player. They are read in as empty strings, so I replace them with 0 before converting. For each column I use a generator that yields its values, then collect those values into a list that I can manipulate to answer my questions.


import csv

def read_rows():
    """Return every data row of the CSV as a list of strings (header skipped)."""
    with open('../data/raw/wnba-player-stats.csv', 'r', newline='') as f:
        reader = csv.reader(f, delimiter=",", quotechar='"')
        next(reader, None)  # skip the header row
        return list(reader)

def get_player_name():
    # Column 1: player name, already a string
    for row in read_rows():
        yield row[1]

def get_year():
    # Column 2: season year, kept as a string because it is categorical
    for row in read_rows():
        yield row[2]

def get_age():
    # Column 3: age; empty cells become 0 before the integer conversion
    for row in read_rows():
        yield int(row[3] or 0)

def get_PER():
    # Column 11: Player Efficiency Rating; empty cells become 0
    for row in read_rows():
        yield float(row[11] or 0)
        
player_names = list(get_player_name())
years = list(get_year())
ages = list(get_age())
PERS = list(get_PER())
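
The four generators each read the file separately. As an alternative sketch (not the approach used in this project), the same four lists can be collected in one pass over the rows. The names `to_number`, `load_columns`, and `make_row` are hypothetical, and the demo rows are made up so the example runs without the data file.

```python
def to_number(cell, cast, default=0):
    """Convert a CSV cell with `cast`; empty cells become `default`."""
    return cast(cell) if cell != '' else default

def load_columns(rows, name_col=1, year_col=2, age_col=3, per_col=11):
    """Collect the four columns of interest in a single pass over `rows`."""
    names, years, ages, pers = [], [], [], []
    for row in rows:
        names.append(row[name_col])
        years.append(row[year_col])  # stays a string (categorical)
        ages.append(to_number(row[age_col], int))
        pers.append(to_number(row[per_col], float))
    return names, years, ages, pers

def make_row(name, year, age, per):
    """Build a 12-cell stand-in row with only the cells used here filled in."""
    row = [''] * 12
    row[1], row[2], row[3], row[11] = name, year, age, per
    return row

# Made-up rows for illustration; the second one has empty numeric cells.
demo_rows = [
    make_row('Player One', '2019', '30', '12.5'),
    make_row('Player Two', '2018', '', ''),
]
names, seasons, ages_demo, pers_demo = load_columns(demo_rows)
print(names, seasons, ages_demo, pers_demo)
```

With the real file, `rows` would be a `csv.reader` over the CSV after skipping the header row with `next(reader, None)`.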
                

Part 4 - Visualizations


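import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from collections import Counter

# c is not defined in the code shown; it is assumed here to map each year
# to its number of player records, built from the years list in Part 3.
c = Counter(years)

# Split the counts into x (years) and y (players per year)
x = list(c.keys())
y = list(c.values())

# The rows appear to start with the most recent season, so reverse
# both lists to plot the oldest season first.
y.reverse()
x.reverse()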
    
figure(num=None, figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')
plt.bar(x, y, align='center', color='#aaddff')
plt.ylim(0, 300)
plt.ylabel('Number of Players')
plt.xlabel('Year')
plt.title('Number of Players Per Year')

plt.show()
                

This visualization shows the number of players per year in the WNBA from 1997-2019. The number of players stays within a range of roughly 150-250 per year, with the highest count in 2002, when there were 219 active players in the WNBA.


import csv
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# x (the list of years, oldest first) comes from the previous chart
avg_ages = []

with open('../data/raw/wnba-player-stats.csv', 'r') as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    next(reader, None)
    data_read = [row for row in reader]
    for year in x:
        age_sum = 0
        count = 0
        for i in range(len(data_read)):
            if data_read[i][2] == year:
                age_sum += int(data_read[i][3])
                count += 1
        avg_ages.append(age_sum/count)

figure(num=None, figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')
plt.bar(x, avg_ages, align='center', color='#aaddff')
plt.ylim(0, 30)
plt.ylabel('Average Age of Players')
plt.xlabel('Year')
plt.title('Average Age of Players Per Year')

plt.show()
                

This visualization shows the average age of players per year in the WNBA. The average age fluctuates very little, staying between 25 and 30 years old in every WNBA season. The highest average age was 27, in 2003.
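
The per-year averages in the chart can also be computed without rescanning every row for each year. The following is a minimal sketch, assuming the `years` and `ages` lists from Part 3 are aligned row by row. The function name `average_age_by_year` and the demo values are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict

def average_age_by_year(years, ages):
    """Group ages by year in one pass and return {year: average age}."""
    totals = defaultdict(int)  # sum of ages seen for each year
    counts = defaultdict(int)  # number of player records for each year
    for year, age in zip(years, ages):
        totals[year] += age
        counts[year] += 1
    # Sort the keys so the result runs from the oldest season to the newest
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Made-up values, for illustration only (not taken from the dataset)
demo_years = ['2019', '2019', '2018']
demo_ages = [30, 26, 25]
demo_avg = average_age_by_year(demo_years, demo_ages)
print(demo_avg)
```

The keys and values of the returned dictionary can be passed to `plt.bar` in place of `x` and `avg_ages`.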

Part 5 - Conclusion

In conclusion, I was able to answer 3/3 of my questions. For the first question, I passed PERS to the mean() function: mean(PERS) returned 11.900386299253112, the average Player Efficiency Rating in the WNBA from 1997-2019. For the second question, I did the same with mean(ages), which returned 26.29513262941025. For the last question, max(PERS) returned 78.9, the highest Player Efficiency Rating in the dataset. I also extended my second question by finding the average age of players per year in the WNBA from 1997-2019. The graph above shows that the average age always hovers around 26. Overall, my analysis returned some pretty interesting statistics for WNBA players from 1997-2019.
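
The three calculations can be sketched in one place. Note that `max(PERS)` returns the rating itself, so naming the player requires looking up the same position in the list of names. This sketch assumes `mean` comes from Python's `statistics` module (the write-up does not say which `mean` was used) and that the four lists from Part 3 are aligned row by row. The `summarize` helper and the demo values are hypothetical.

```python
from statistics import mean

def summarize(player_names, years, ages, pers):
    """Answer the three questions from aligned, equal-length lists."""
    # Index of the row with the highest PER (the first one wins on ties)
    best = max(range(len(pers)), key=lambda i: pers[i])
    return {
        'mean_per': mean(pers),
        'mean_age': mean(ages),
        'top_player': player_names[best],
        'top_year': years[best],
        'top_per': pers[best],
    }

# Made-up values, for illustration only (not taken from the dataset)
demo = summarize(
    ['Player A', 'Player B', 'Player C'],
    ['2019', '2018', '2018'],
    [30, 26, 25],
    [10.0, 20.0, 15.0],
)
print(demo)
```

With the real lists this would be called as `summarize(player_names, years, ages, PERS)`. Because empty cells were replaced with 0 in Part 3, any such zeros are included in both means.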

Github Repository
