Analyzing Player Attributes from Top 10 NBA First Round Draft Picks
Authors: Matthew Muccio, William Miller
Every June, the National Basketball Association (NBA) holds a draft in which each of the thirty teams has an opportunity to select two top prospects to join its organization. With only two rounds in the draft, and only two chances per team, it is crucial that a team does proper research, scouting, and analysis to ensure that its draft picks have a significant impact on its odds of winning a championship.
Using the NBA data API for all players currently in the league, we will examine top 10 draft picks and see how they compare to all players in the league. Then we will use various individual player attributes to determine which properties correlate with being a top NBA draft pick. After reading through our data analysis, we hope that you will understand the importance of the research and scouting that NBA teams undertake in selecting top draft picks.
Python Standard Library Modules:
import json
Third-Party Library Modules:
import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn
from sklearn import linear_model
from sklearn import model_selection
from statsmodels import api as sm
The dataset used in our analysis includes information on almost 500 current players in the NBA. The NBA releases an updated version of this data every day. It contains information such as player names, height, weight, college, draft number, and country of origin. We will be looking at information from only the players who were drafted in the top 10 of their class.
The dataset can be found here. It comes directly from the NBA website.
We load the JSON into a Pandas DataFrame and keep only the columns of data we will use for analysis. These include:
- First Name
- Last Name
- Position
- Height (Feet)
- Height (Inches)
- Weight (Pounds)
- Date of Birth (Year)
- Date of Birth (Month)
- Date of Birth (Day)
- NBA Debut Year
- Number of Years in NBA
- College
- Last affiliation (College or Location)
- Country
- Draft Round Number
- Draft Pick Number
- Draft Year
- Team
# Prepping player data from NBA API to be added to a Pandas DataFrame.
endpoint = "http://data.nba.net/10s/prod/v1/2018/players.json"
r = requests.get(endpoint)
raw_data = r.json()
player_data = raw_data["league"]["standard"]
players = []
for player in player_data:
    players.append(
        {
            "firstName": player["firstName"],
            "lastName": player["lastName"],
            "pos": player["pos"],
            "heightFeet": int(player["heightFeet"]) if player["heightFeet"] != "-" else "",
            "heightInches": int(player["heightInches"]) if player["heightInches"] != "-" else "",
            "heightTotal": player["heightFeet"] + " " + player["heightInches"],
            "weightPounds": int(player["weightPounds"]) if player["weightPounds"] != "" else "",
            "birthYear": int(player["dateOfBirthUTC"][:4]) if player["dateOfBirthUTC"][:4] != "" else "",
            "birthMonth": int(player["dateOfBirthUTC"][5:7]) if player["dateOfBirthUTC"][5:7] != "" else "",
            "birthDay": int(player["dateOfBirthUTC"][8:10]) if player["dateOfBirthUTC"][8:10] != "" else "",
            "nbaDebutYear": player["nbaDebutYear"],
            "yearsPro": player["yearsPro"],
            "collegeName": player["collegeName"],
            "lastAffiliation": player["lastAffiliation"],
            "country": player["country"],
            "roundNum": int(player["draft"]["roundNum"]) if player["draft"]["roundNum"] != "" else "",
            "pickNum": int(player["draft"]["pickNum"]) if player["draft"]["pickNum"] != "" else "",
            "draftYear": int(player["draft"]["seasonYear"]) if player["draft"]["seasonYear"] != "" else "",
            "teamId": player["draft"]["teamId"]
        })
df = pd.DataFrame(players)
df.head()
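The repeated conditional conversions in the loading step can be factored into a small helper. The following is a minimal sketch rather than part of the original notebook: `to_int_or_none` is a hypothetical helper name, and the sample records are made up to mimic the shape of the API fields. Returning `None` instead of an empty string lets pandas treat missing values as NaN.

```python
import pandas as pd


def to_int_or_none(value):
    """Convert an API string field to int; return None for blanks or "-"."""
    if value is None:
        return None
    text = str(value).strip()
    if text in ("", "-"):
        return None
    return int(text)


# Made-up records shaped like entries of raw_data["league"]["standard"].
sample_players = [
    {"firstName": "A", "heightFeet": "6", "heightInches": "7", "weightPounds": "220"},
    {"firstName": "B", "heightFeet": "-", "heightInches": "-", "weightPounds": ""},
]

rows = [
    {
        "firstName": p["firstName"],
        "heightFeet": to_int_or_none(p["heightFeet"]),
        "heightInches": to_int_or_none(p["heightInches"]),
        "weightPounds": to_int_or_none(p["weightPounds"]),
    }
    for p in sample_players
]
sample_df = pd.DataFrame(rows)
print(sample_df)
```

With missing values stored as NaN, numeric columns keep a numeric dtype, which makes later arithmetic and `describe()` output easier to interpret.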
We will use pandas describe() to show some summary statistics about the dataset. First, we can see there are 498 players currently in the NBA. There are some interesting results here. For example, the most common 'lastAffiliation' is Kentucky (University of Kentucky). We will see many of these results come into play later as we look deeper into the dataset. The summaries for the 'firstName' and 'lastName' columns are also worth a look.
df.describe()
Since team identification for each player is stored as a number instead of a team name, we need to convert the IDs to team names. To do this we will look at another page of data that the NBA provides. The data page can be found here and includes information on all teams associated with the NBA, including the team IDs. We can use this data to match each team ID with an actual team name in our DataFrame.
# Prepping team data from NBA API to be added to a teams dictionary
# to replace teamId with actual team names.
endpoint = "http://data.nba.net/"
r = requests.get(endpoint)
raw_data = r.json()
team_data = raw_data["sports_content"]["teams"]["team"]
teams = {}
for team in team_data:
    if team["is_nba_team"]:
        teamId = team["team_id"]
        teamName = team["team_name"] + " " + team["team_nickname"]
        teams[teamId] = teamName
# Converting teamId column to actual team name.
# Converting collegeName column to None if blank.
for i, row in df.iterrows():
    teamId = row["teamId"]
    if teamId:
        df.at[i, "teamId"] = teams[int(row["teamId"])]
    else:
        df.at[i, "teamId"] = "None"

    # Replace a blank value in the current row with the string "None".
    def convert_to_none(col):
        colName = row[col]
        if not colName:
            df.at[i, col] = "None"

    # Replace a blank value in the current row with 0.
    def convert_to_zero(col):
        colName = row[col]
        if not colName:
            df.at[i, col] = 0

    convert_to_none("collegeName")
    convert_to_none("draftYear")
    convert_to_none("nbaDebutYear")
    convert_to_zero("pickNum")
    convert_to_zero("roundNum")
df
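The row-by-row ID lookup can also be written as a column-wise `map`. This is a minimal sketch on made-up data, not the notebook's actual team list: `sample_teams`, `sample_df`, and `team_name` are hypothetical names, and the IDs are placeholders.

```python
import pandas as pd

# Hypothetical ID-to-name lookup, shaped like the `teams` dictionary.
sample_teams = {1: "Team One", 2: "Team Two"}

# Made-up rows; an empty string stands for a player with no draft team.
sample_df = pd.DataFrame({"lastName": ["X", "Y", "Z"], "teamId": ["1", "", "2"]})


def team_name(team_id):
    """Return the team name for an ID string, or "None" when blank or unknown."""
    if not team_id:
        return "None"
    return sample_teams.get(int(team_id), "None")


# Apply the lookup to every value in the column at once.
sample_df["teamName"] = sample_df["teamId"].map(team_name)
print(sample_df)
```

Using `dict.get` with a default also avoids a `KeyError` if an ID is missing from the lookup table.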
We now have a tidy DataFrame that is much easier to read and contains the pertinent information. The DataFrame also now accounts for missing information by putting 'None' in its place.
Since our project focuses on Top 10 draft picks, we will organize our data to isolate only players who were drafted within the top ten of their class.
top10_df = df.loc[(df["pickNum"] <= 10) & (df["roundNum"] == 1)]
top10_df
Now let's look at some trends within the dataset. A player has many important attributes, such as position, college, height, and weight. Many of these attributes factor into whether or not a player gets drafted in the top 10.
Expected results! Historically, 'Guard' is the position most often drafted near the top, due to the all-around playmaking ability a player needs to fill this role. Guards also cover multiple roles on the court, and many take up leadership roles on their teams.
# Unique positions of Top 10 Draft Picks
top10_df["pos"].unique()
Top 10 draft picks come from colleges all over, even overseas! We will look at colleges more in detail later.
# Unique college names of Top 10 Draft Picks
top10_df["collegeName"].unique()
Top 10 draft picks also come from all over the world! Even from the newest country in the world, South Sudan.
# Unique country of origin of Top 10 Draft Picks
top10_df["country"].unique()
A cool little data fact about the University of Maryland basketball program. Yes, there is a player active in the NBA who was a top 10 draft pick!
# Top 10 NBA Draft Picks from University of Maryland
maryland_df = top10_df.loc[top10_df["collegeName"] == "Maryland"]
maryland_df
Now, let's look a bit deeper into our dataset and specifically active top 10 draft picks. Below, according to the top 10 draft picks Pandas DataFrame created in part 1D, there are 124 current NBA players that were top 10 NBA Draft picks.
# Getting the amount of top 10 draft picks currently in the NBA, 124.
num_top10_draft_picks = top10_df.shape[0]
print("Number of current NBA players who are top 10 draft picks:")
num_top10_draft_picks
Now, let's look at how many active NBA players are not top 10 draft picks. According to the Pandas DataFrame that was created from the NBA API JSON data, there are 498 active NBA players and, out of those, 374 were not top 10 draft picks. This is a large majority: a little over 75% of active players were not top 10 draft picks. This shows that it is uncommon to be a top 10 draft pick.
# According to the Pandas DataFrame that was created based on the
# NBA API JSON data, there are 498 current NBA players.
num_nba_players = df.shape[0]
print("Number of current NBA players:")
num_nba_players
# Subtracting the number of top 10 draft picks from the total number of
# current NBA players gives the number of current players who were not
# top 10 draft picks (374).
print("Number of current NBA players who are not top 10 draft picks:")
num_not_top10_draft_picks = num_nba_players - num_top10_draft_picks
num_not_top10_draft_picks
# There are 374 current NBA players who were not top 10 draft picks, or
# approximately 75% of all NBA players.
percent_not_top10_draft_picks = num_not_top10_draft_picks / num_nba_players
print("Percentage of current NBA players who are not top 10 draft picks:")
percent_not_top10_draft_picks
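The same counts and share can be read off a boolean mask, since the mean of a boolean Series is the fraction of `True` values. This is a minimal sketch on made-up rows (the `sample_df` values are not from the NBA data); the last line rechecks the arithmetic quoted in the text.

```python
import pandas as pd

# Made-up draft data; 0 stands for an undrafted player.
sample_df = pd.DataFrame({
    "roundNum": [1, 1, 2, 0, 1],
    "pickNum": [3, 25, 40, 0, 10],
})

# True for first-round picks taken in the top 10.
is_top10 = (sample_df["roundNum"] == 1) & (sample_df["pickNum"] <= 10)

share_top10 = is_top10.mean()  # fraction of rows where the mask is True
print(f"Top 10 picks: {is_top10.sum()} of {len(sample_df)} ({share_top10:.1%})")

# Recheck of the figures quoted in the text: 374 of 498 players.
share_not_top10 = 374 / 498
print(round(share_not_top10, 3))
```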
# Plotting bar graph of total current players in NBA, current players
# who are top 10 NBA draft picks, and current players who are not
# top 10 NBA draft picks.
current_nba_data = [
{
"Group": "Total NBA Players",
"Count": num_nba_players
},
{
"Group": "Total Non-Top 10 Draft Picks",
"Count": num_not_top10_draft_picks
},
{
"Group": "Total Top 10 Draft Picks",
"Count": num_top10_draft_picks
}
]
current_nba_df = pd.DataFrame(current_nba_data)
plt.figure(figsize=(13,7))
plt.title("How Many Current NBA Players Are Top 10 Draft Picks?", fontsize=18)
seaborn.barplot(data=current_nba_df, x="Group", y="Count")
plt.show()
The chart below, which sorts top 10 draft picks by position, again shows that 'Guard' is the most common position among active top 10 draft picks in the league.
by_pos_df = top10_df.groupby(top10_df["pos"]).count().reset_index()
by_pos_df = by_pos_df[["pos", "pickNum"]]
by_pos_df = by_pos_df.sort_values("pickNum", ascending=False)
by_pos_df.columns = ["Position", "Count"]
by_pos_df
# Plotting top 10 draft picks by position.
plt.figure(figsize=(13,7))
plt.title("Top 10 NBA Draft Picks by Position", fontsize=18)
seaborn.barplot(data=by_pos_df, x="Position", y="Count")
plt.show()
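The group-and-count step used for the position table can also be expressed with `value_counts`, which counts and sorts in one call. This is a minimal sketch on made-up positions, not the notebook's data; `sample_top10` and `by_pos` are hypothetical names.

```python
import pandas as pd

# Made-up positions standing in for top10_df["pos"].
sample_top10 = pd.DataFrame({"pos": ["G", "F", "G", "C", "G-F", "F", "G"]})

# value_counts() returns counts sorted from most to least frequent.
by_pos = (
    sample_top10["pos"]
    .value_counts()
    .rename_axis("Position")
    .reset_index(name="Count")
)
print(by_pos)
```

The resulting two-column table has the same shape as the one passed to the bar plot.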
We can also look at the top 10 draft picks strictly by size, meaning height and weight. In the chart below there is a clear middle ground between 220 and 250 pounds, and the counts thin out as you move further from this range. We also see that few players weigh under 200 pounds. This seems at odds with the position breakdown above: guards have historically been the lightest players in the NBA, averaging around 190 pounds, yet 'Guard' is the most frequently occurring position.
# Top 10 Draft Picks by Weight
by_wt_df = top10_df.groupby(top10_df["weightPounds"]).count().reset_index()
by_wt_df = by_wt_df[["weightPounds", "pickNum"]]
by_wt_df = by_wt_df.sort_values("pickNum", ascending=False)
by_wt_df.columns = ["Weight", "Count"]
by_wt_df
As we can see in the bar graph, the weight varies quite a bit, but there is a small cluster between ~220 - 250 pounds.
# Graphing the top 10 NBA draft picks by weight.
plt.figure(figsize=(13,7))
plt.title("Top 10 NBA Draft Picks by Weight", fontsize=18)
seaborn.barplot(data=by_wt_df, x="Weight", y="Count")
plt.show()
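Because a bar per distinct weight can look scattered, grouping weights into ranges makes a cluster easier to see. This is a minimal sketch using `pd.cut` on made-up weights; the bin edges and labels are assumptions chosen for illustration, not values from the analysis.

```python
import pandas as pd

# Made-up weights in pounds, standing in for top10_df["weightPounds"].
sample_weights = pd.Series([185, 190, 205, 215, 220, 225, 230, 240, 245, 250, 265])

# Each bin includes its right edge, e.g. 221-250 covers (220, 250].
bins = [160, 190, 220, 250, 280]
labels = ["161-190", "191-220", "221-250", "251-280"]

weight_bins = pd.cut(sample_weights, bins=bins, labels=labels)
bin_counts = weight_bins.value_counts().sort_index()
print(bin_counts)
```

Plotting `bin_counts` instead of raw weights would give four bars, with the tallest marking the densest range.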
Height is also included in size. Based on just the table, we can already see top 10 NBA draft picks are quite tall. Height is another major attribute when drafting players. A very tall player with quality coordination can dominate opponents inside the paint.
# Top 10 Draft Picks by Height
by_ht_df = top10_df.groupby(top10_df["heightTotal"]).count().reset_index()
by_ht_df = by_ht_df[["heightTotal", "pickNum"]]
by_ht_df = by_ht_df.sort_values("pickNum", ascending=False)
by_ht_df.columns = ["Height", "Count"]
by_ht_df
Based on this bar graph, top 10 NBA Draft picks have a slight tendency to be taller on average. Unlike the previous graph, which was by weight, this graph's data is less sporadic. A majority of top 10 draft picks are 6' 7" or taller.
# Bar graph for top 10 NBA draft picks by height.
plt.figure(figsize=(13,7))
plt.title("Top 10 NBA Draft Picks by Height", fontsize=18)
seaborn.barplot(data=by_ht_df, x="Height", y="Count")
plt.show()
Another attribute to examine is place of origin. Many NBA rookies, and eventually NBA stars, are drafted from major college basketball programs. The more money a college basketball program has, the more NBA prodigies it can churn out. The results below are as expected. It is logical that the colleges with historic basketball programs, which tend to have the most success each year, are perennial championship contenders, and recruit the top prospects out of high school, are the most heavily represented in the Top 10 NBA Draft Pick data. Schools like Kentucky, Duke, Arizona, Indiana, Syracuse, and UNC are not only among the winningest college basketball teams, they also attract the most talented high school prospects. As a result, these schools also send a fair number of their most talented players to the NBA.
For example, the University of Kentucky is famous for its one-and-done basketball legacy. Top high school recruits who go to the University of Kentucky often spend only one year playing NCAA basketball and are then drafted near the top of their NBA draft class. This has been frowned upon in recent years because schools seem to focus more on what makes them money, sports, than on academics, especially for recruits who stay in the college system for only one year. Those players have little education to fall back on if their NBA career fails. However, the schools at the top of the list below are famous for winning seasons and for winning the National Championship in college basketball. As an NBA team, why wouldn't you want to draft from the teams that consistently win National Championships and produce the best basketball players?
# Top 10 Draft Picks by Place of Origin (College if attended, or city/country of residence).
by_origin_df = top10_df.groupby(top10_df["lastAffiliation"]).count().reset_index()
by_origin_df = by_origin_df[["lastAffiliation", "pickNum"]]
by_origin_df = by_origin_df.sort_values("pickNum", ascending=False)
by_origin_df.columns = ["Place of Origin", "Count"]
by_origin_df
Now that we are more comfortable with our dataset, we can continue to data analysis. First, we want to determine which attributes are important and which are not when selecting a top 10 NBA draft pick. The first cell below shows all of the attributes we currently hold.
# Viewing all the columns, or attributes, that we can use from the dataset.
attributes = top10_df.columns.tolist()
for a in attributes:
    print(a)
Not all of the player attributes captured from the NBA API are pertinent to our analysis. We can remove player information that does not factor into whether a player is a top 10 draft pick from the Top 10 Draft Picks DataFrame. We keep the following:
- First Name
- Last Name
- Position
- Height (Feet)
- Height (Inches)
- Weight (Pounds)
- Date of Birth (Year)
- Date of Birth (Month)
- Date of Birth (Day)
- College
- Country
- Draft Pick Number
- Draft Year
# Not all of the player attributes captured from the NBA API are needed
# for our analysis.
# Take a copy so that adding columns later does not modify a view of top10_df.
edited_top10_df = top10_df[["firstName", "lastName", "pickNum",
                            "birthDay", "birthMonth", "birthYear",
                            "collegeName", "country", "heightFeet",
                            "heightInches", "pos", "weightPounds",
                            "draftYear"]].copy()
edited_top10_df.head()
We must reduce our table to numerical values in order to run a regression on it. We will measure our players on objective numerical data, mainly values related to size and age (age, height, and weight), but also position and country of origin (encoded as numerical values). View the columns on the far right-hand side of the DataFrame below to see these new columns. The encoding follows the comments below.
# Converts position strings to corresponding numbers.
# Guard - 3
# Forward - 2
# Center - 1
def convert_position_to_num(pos):
    # Only the first letter is used, so a hybrid such as "G-F" counts as a Guard.
    pos_str = pos[0]
    if pos_str == "G":
        return 3
    elif pos_str == "F":
        return 2
    elif pos_str == "C":
        return 1

# Converts country strings to corresponding numbers.
# USA = 1
# Other country = 0
def convert_country_to_num(country):
    if country == "USA":
        return 1
    else:
        return 0
edited_top10_df.loc[:, "Name"] = edited_top10_df["firstName"] + " " + edited_top10_df["lastName"]
edited_top10_df.loc[:, "Draft Pick"] = edited_top10_df["pickNum"]
edited_top10_df.loc[:, "Age"] = edited_top10_df["draftYear"] - edited_top10_df["birthYear"]
edited_top10_df.loc[:, "Height"] = edited_top10_df["heightFeet"] * 12 + edited_top10_df["heightInches"]
edited_top10_df.loc[:, "Weight"] = edited_top10_df["weightPounds"]
edited_top10_df.loc[:, "Position"] = edited_top10_df.loc[:, "pos"].apply(convert_position_to_num)
edited_top10_df.loc[:, "Country"] = edited_top10_df.loc[:, "country"].apply(convert_country_to_num)
edited_top10_df
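As a quick check of the numeric encoding, the same conversions can be run on a handful of made-up players. This is a minimal sketch, not the notebook's code: `POSITION_CODES` and `sample` are hypothetical names. Total height in inches is feet times 12 plus the remaining inches, and hybrid positions are coded by their first letter.

```python
import pandas as pd

# Guard = 3, Forward = 2, Center = 1, matching the comments in the text.
POSITION_CODES = {"G": 3, "F": 2, "C": 1}

# Made-up players for illustration.
sample = pd.DataFrame({
    "pos": ["G", "F-C", "C", "G-F"],
    "heightFeet": [6, 6, 7, 6],
    "heightInches": [3, 10, 0, 7],
    "country": ["USA", "Spain", "USA", "Canada"],
})

# Total height in inches: 12 inches per foot plus the remainder.
sample["Height"] = sample["heightFeet"] * 12 + sample["heightInches"]

# Code each position by its first letter, so "F-C" counts as a Forward.
sample["Position"] = sample["pos"].str[0].map(POSITION_CODES)

# 1 for USA, 0 for any other country.
sample["Country"] = (sample["country"] == "USA").astype(int)
print(sample[["Height", "Position", "Country"]])
```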
Now that we have limited our attributes, we will create a new DataFrame containing only the attributes that are most important.
# Creating a new reduced dataframe.
reduced_top10_df = edited_top10_df[["Name", "Draft Pick", "Age", "Height", "Weight", "Position", "Country"]]
reduced_top10_df.head()
We further transform our DataFrame so that each value becomes a rating for its category on a 0-100 scale. For Age, Height, Weight, Position, and Country, we divide the value by the maximum value in its column and multiply by 100. Draft Pick is inverted so that earlier picks rate higher: with a maximum pick of 10, pick 1 rates 100 and pick 10 rates 10. For the Position rating, it is more favorable for a player to be a Guard than a Forward, and a Forward than a Center, in order to be selected higher, according to the data trends we saw above. (Thus, G = 100, F ≈ 66.7, C ≈ 33.3.) For the Country rating, it is more favorable for a player to be from the USA in order to be selected higher, according to the data trends we saw above. (Thus, USA = 100, any other country = 0.)
final_top10_df = reduced_top10_df.copy()
final_top10_df.loc[:, "Draft Pick"] = 100 * (1 - ((final_top10_df["Draft Pick"] - 1) / final_top10_df["Draft Pick"].max()))
final_top10_df.loc[:, "Age"] = 100 * (final_top10_df["Age"] / final_top10_df["Age"].max())
final_top10_df.loc[:, "Height"] = 100 * (final_top10_df["Height"] / final_top10_df["Height"].max())
final_top10_df.loc[:, "Weight"] = 100 * (final_top10_df["Weight"] / final_top10_df["Weight"].max())
final_top10_df.loc[:, "Position"] = 100 * (final_top10_df["Position"] / final_top10_df["Position"].max())
final_top10_df.loc[:, "Country"] = 100 * (final_top10_df["Country"] / final_top10_df["Country"].max())
final_top10_df.head()
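The two rating formulas can be checked on a few values by hand. This is a minimal sketch: `pick_rating` and `ratio_rating` are hypothetical helper names that mirror the column arithmetic, with the maximum pick assumed to be 10.

```python
def pick_rating(pick, max_pick=10):
    """Inverted 0-100 rating: earlier picks score higher."""
    return 100 * (1 - (pick - 1) / max_pick)


def ratio_rating(value, max_value):
    """0-100 rating as a share of the column maximum."""
    return 100 * value / max_value


# Draft pick ratings for the first, fifth, and tenth picks.
for pick in (1, 5, 10):
    print(pick, round(pick_rating(pick), 1))

# Position ratings for Guard (3), Forward (2), and Center (1).
for code in (3, 2, 1):
    print(code, round(ratio_rating(code, 3), 1))
```

Note that the lowest possible draft pick rating is 10 rather than 0, because the formula subtracts 1 from the pick before dividing by the maximum.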
We are trying to view the effects of five different player attributes (Age, Weight, Height, Position, and Country) on the overall player draft pick.
Null Hypothesis: None of the player attributes has a significant impact on the player draft pick. In order to test the null hypothesis, we are going to perform Multiple Linear Regression on the dataset using SciKit-Learn.
We will use our reduced Top 10 Draft Pick DataFrame with the numerical player attributes for the multiple regression model. We will create a new DataFrame for the features of the regression. We will also create a new DataFrame for the target of the regression. The features act as the independent variables, which include Age, Weight, Height, Position, and Country when drafted. The target acts as the dependent variable, which is the player's overall draft pick.
columns = ["Age", "Weight", "Height", "Position", "Country"]
features = final_top10_df[columns]
target = final_top10_df[["Draft Pick"]]
We define X and y for use with the LinearRegression() class from SciKit-Learn. We will then fit the linear regression model.
X, y = features, target["Draft Pick"]
lin_model = linear_model.LinearRegression()
model = lin_model.fit(X, y)
The R-squared score measures how much of the variance in the target is explained by the model. Values here range from 0 to 1, so a score of 0.066 means the model explains only about 6.6% of the variance in draft pick.
lin_model.score(X, y)
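To make the R-squared score concrete, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal NumPy sketch on made-up numbers; `r_squared` is a hypothetical helper, not a library function.

```python
import numpy as np


def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot


# Made-up draft pick ratings.
actual = [100, 90, 50, 20, 10]

# Perfect predictions explain all of the variance.
perfect = r_squared(actual, actual)

# Always predicting the mean explains none of it.
baseline = r_squared(actual, [np.mean(actual)] * len(actual))

print(perfect, baseline)
```

A score near 0, like the one above, means the model does little better than always predicting the average draft pick rating.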
We will now find the coefficients from the model in order to determine which attributes had the least or most significant impact overall. It seems as though weight has the largest impact overall, and not much else seems to have a correlation. However, in order to determine with more confidence which player attributes have the most impact on the overall draft pick, and to test the null hypothesis correctly, we must calculate the p-values using StatsModels.
sklearn_coefficients = lin_model.coef_.tolist()
for i in range(len(columns)):
    print("Player Attribute: {0}, Coefficient: {1}".format(columns[i], sklearn_coefficients[i]))
    print()
We can reuse the features (X) and target (y) variables that we created in the last step to create a regression with StatsModels. We simply must add a constant (the intercept term), which StatsModels does not include by default. The model uses the method of Ordinary Least Squares. Its objective is to minimize the sum of squared distances between the actual numerical values in the dataset and the predicted values generated by the regression.
# Add the intercept column and fit on sm_X so the constant is part of the model.
sm_X = sm.add_constant(X)
sm_y = y
ols_model = sm.OLS(sm_y.astype(float), sm_X.astype(float)).fit()
ols_model.summary()
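To illustrate why the constant matters in an Ordinary Least Squares fit, here is a NumPy-only sketch on synthetic data (not the NBA data). The feature and target are unrelated by construction. StatsModels reports a centered R-squared when a constant is included and an uncentered one when it is omitted; the two definitions are reproduced by hand below, and `fit_ols` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(80, 100, size=200)  # e.g. a rating on a 0-100 scale
y = rng.uniform(10, 100, size=200)  # target unrelated to x by construction


def fit_ols(design, target):
    """Least-squares fit; returns the fitted values."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


with_const = np.column_stack([np.ones_like(x), x])  # intercept column + feature
no_const = x.reshape(-1, 1)                         # feature only

resid_with = y - fit_ols(with_const, y)
resid_without = y - fit_ols(no_const, y)

# Centered R-squared: compares residuals to variation around the mean.
centered_r2 = 1 - np.sum(resid_with ** 2) / np.sum((y - y.mean()) ** 2)

# Uncentered R-squared: compares residuals to variation around zero.
uncentered_r2 = 1 - np.sum(resid_without ** 2) / np.sum(y ** 2)

print(round(centered_r2, 3), round(uncentered_r2, 3))
```

Even though the feature carries no information about the target, the uncentered figure comes out high, because the single slope mostly captures the target's average level. The two kinds of R-squared should not be compared directly.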
The only player attributes that seem to have a meaningful relationship with draft pick are weight and position. In the "P > |t|" column of the summary, weight (0.002) falls below the 5% significance level (p-value of 0.05) and position (0.083) comes close to it; no other attribute is near that threshold. We can therefore reject the null hypothesis for weight, while position is borderline. The R-squared value of 0.807 should be read with caution: it is far higher than the 0.066 from SciKit-Learn on the same data, and StatsModels reports an uncentered R-squared when a model is fit without a constant term, which is not comparable to the SciKit-Learn score and says nothing about overfitting. We will be able to learn more about how well the model predicts in the next step with training and testing.
We will be able to re-use the features (X) and target (y) from the SciKit-Learn regression model. We will split the dataset into training data and testing data for both of the variables. The variables training_X and training_y are used to train the regression model. Then, the testing_X data is passed to the model in order to predict overall draft picks, and the predictions are compared to the actual draft picks in testing_y. We decided on a 75%/25% split in favor of training. A majority of the dataset should be used for training the model, but since our dataset is not large, we decided on 75% over 80%, 90%, or higher so that enough players remain for testing. We will display the first 15 predicted draft picks.
training_X, testing_X, training_y, testing_y = model_selection.train_test_split(X, y, test_size=0.25)
lin_model = linear_model.LinearRegression()
model = lin_model.fit(training_X, training_y)
predictions = lin_model.predict(testing_X)
for p in predictions[0:15]:
    print(p)
We will then plot the predicted draft pick data from the linear regression model against the actual values from the dataset in the testing_y variable. We will add an identity line to show how close the predictions from the new linear regression model are to the actual player draft picks. If the predictions are accurate, the plotted points will follow the identity line.
plt.figure(figsize=(6, 4))
plt.title("Predicted Values vs. Actual Values for Top 10 Player Draft Picks", fontsize=15)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.scatter(testing_y, predictions)
plt.plot(testing_y, testing_y, color="Red")
plt.show()
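A scatter plot of predicted against actual values can be complemented by a single error figure. This is a minimal sketch on made-up ratings, not the model's actual output; `mean_absolute_error` and `root_mean_squared_error` are written by hand with NumPy here, and in practice `testing_y` and `predictions` would take the place of the sample arrays.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in rating points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def root_mean_squared_error(actual, predicted):
    """Like the mean absolute error, but penalizes large misses more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up draft pick ratings on the 0-100 scale.
sample_actual = np.array([100, 70, 40, 10])
sample_pred = np.array([60, 60, 50, 50])

mae = mean_absolute_error(sample_actual, sample_pred)
rmse = root_mean_squared_error(sample_actual, sample_pred)
print(mae, round(rmse, 2))
```

Since one draft slot corresponds to 10 rating points on this scale, dividing either figure by 10 gives a rough error in draft positions.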
As we can see, the plotted points do not follow the identity line very well. This again suggests that only a limited number of attributes have a relationship with draft pick.
From our models, it is clearly difficult to correlate player attributes with the overall draft pick. This only adds to the difficulty that coaches, general managers, and owners of NBA franchises face when NBA Draft season comes around each year.
The only attribute that truly seemed to correlate with a higher draft pick was the weight of the player. It seems as though more research must be done in order to determine which player attributes matter most in regard to drafting a top 10 pick. It is surprising that other attributes such as height, college, or place of origin did not correlate more with the data. These attributes were among the strongest when we explored different parts of the data.
There are many other attributes that could be added to or removed from this dataset. We did not touch any gameplay statistics from players. This could be the logical next step toward creating an accurate model. One, the dataset would be much larger, since many gameplay statistics are kept on each player. Two, we might find that strong gameplay statistics do a better job of separating players who were top 10 NBA draft picks from players who were not.
We now have a couple of conclusions from our models. One, we can see that these attributes, while they seem strong on the surface, do not correlate closely with whether a player is a top 10 draft pick. Two, these results give us a next step: looking at other player attributes, such as gameplay statistics from a top 10 draft pick before they were drafted, or their performance in the league after they were drafted. These could give insight into what makes a top 10 NBA draft pick, while also making the sample size larger.
Even though the results were weak in terms of correlation, this does bring up a few good points. For example, there are stereotypes like 'because a player played for the University of Kentucky, they will be a top 10 draft pick'. However, we can see that they almost do not correlate at all. From these results we can see that many aspects of the data can be improved and added to in order to make a more accurate model and truly find 'What Makes a Top 10 NBA Draft Pick?'.
Thank you very much for reading through our project and data analysis on Top 10 NBA Draft Picks. Please feel free to contact us with any comments, concerns, or feedback regarding the project, data analysis, or descriptions.
If you would like to learn more about similar topics and research about NBA statistics, please visit the link below:
The Length and Success of NBA Careers: Does College Production Predict Professional Outcomes?