College football has CFBD; college basketball has sportsdataverse. It's an open-source family of tools that bundles public, ESPN-derived data into clean tables you can load in one line — and, best of all for our purposes, it covers both the men's game (via hoopR) and the women's game (via wehoop) with an identical structure. That symmetry is the secret weapon: write your analysis once, run it on both. No API key required. The full script is in scripts/sportsdataverse-basketball-tutorial.py.
Step 1: Install
pip install sportsdataverse pyarrow
pyarrow lets the library read the cached data files efficiently.
sportsdataverse pulls pre-built season files (so you're not hammering any live API), then hands you a dataframe. The first load of a season downloads and caches it; after that it's instant.
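If you would rather control the on-disk copy yourself (say, to pin a snapshot for a write-up), a minimal sketch of an explicit cache might look like the following. The helper name, the `sdv_cache` folder, and the file naming are all hypothetical, not part of sportsdataverse; `fetch` is any callable that takes a season year and returns a pandas DataFrame.

```python
from pathlib import Path

import pandas as pd


def load_with_local_cache(fetch, season, cache_dir="sdv_cache"):
    """Return fetch(season), reusing a pickled copy on disk when one exists.

    Hypothetical helper: the name, folder, and file layout are illustrative.
    """
    path = Path(cache_dir) / f"team_box_{season}.pkl"
    if path.exists():
        return pd.read_pickle(path)  # cache hit: nothing is fetched
    df = fetch(season)               # cache miss: call the real loader once
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)
    return df
```

With the loaders introduced in Step 2, `fetch` could be something like `lambda s: mbb.load_mbb_team_boxscore(seasons=[s]).to_pandas()`.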
Step 2: Load a season of team box scores
The men's module is sportsdataverse.mbb, the women's is sportsdataverse.wbb. Their loaders mirror each other:
import sportsdataverse.mbb as mbb
import sportsdataverse.wbb as wbb
season = 2025 # the season-ending year
men = mbb.load_mbb_team_boxscore(seasons=[season]).to_pandas()
women = wbb.load_wbb_team_boxscore(seasons=[season]).to_pandas()
.to_pandas() converts the result to a familiar pandas DataFrame.
Each row is one team's stat line from one game: team_score, field_goals_attempted, offensive_rebounds, total_turnovers, free_throws_attempted, three_point_field_goals_attempted, and dozens more. Crucially, the column names are the same for men and women, which is why the next function works on either.
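Before leaning on that symmetry, it can be worth checking it. The sketch below uses empty toy frames in place of the real `men` and `women` dataframes, and the `missing_columns` helper is a hypothetical name; the column list is the set the Step 3 function relies on.

```python
import pandas as pd

# Columns the possession math in Step 3 reads.
REQUIRED = [
    "team_score",
    "field_goals_attempted",
    "offensive_rebounds",
    "total_turnovers",
    "free_throws_attempted",
    "three_point_field_goals_attempted",
]


def missing_columns(df, required=REQUIRED):
    """List the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy stand-ins, not real data: one complete schema, one missing a column.
toy_men = pd.DataFrame(columns=REQUIRED)
toy_women = pd.DataFrame(columns=REQUIRED[:-1])

print(missing_columns(toy_men))    # []
print(missing_columns(toy_women))  # ['three_point_field_goals_attempted']
```

On real data, an empty list for both leagues is the signal that one function can serve both.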
Step 3: Compute scoring and pace yourself
Let's turn raw box scores into the possession-based numbers from our tempo and efficiency guide. One function, reused for both leagues:
def league_averages(df, label):
    df = df[df["team_score"] > 0]  # drop blank rows
    pts = df["team_score"]
    poss = (df["field_goals_attempted"] - df["offensive_rebounds"]
            + df["total_turnovers"] + 0.475 * df["free_throws_attempted"])
    print(f"{label}: {len(df):,} team-games | "
          f"{pts.mean():.1f} pts/team | "
          f"{poss.mean():.1f} possessions | "
          f"{df['three_point_field_goals_attempted'].mean():.1f} 3PA")

league_averages(men, "Men's 2025")
league_averages(women, "Women's 2025")
The same code runs on both dataframes because the schemas match.
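To see what the possession estimate does on a single row, here is a worked example on one made-up team-game (the numbers are invented for the arithmetic, not taken from the data). The `estimate_possessions` name is hypothetical; the formula is the one used in the function above.

```python
import pandas as pd

# One invented team-game, purely to check the arithmetic by hand.
toy = pd.DataFrame({
    "field_goals_attempted": [60],
    "offensive_rebounds": [10],
    "total_turnovers": [12],
    "free_throws_attempted": [20],
})


def estimate_possessions(df):
    """FGA - ORB + TOV + 0.475 * FTA, computed row by row."""
    return (df["field_goals_attempted"] - df["offensive_rebounds"]
            + df["total_turnovers"] + 0.475 * df["free_throws_attempted"])


print(estimate_possessions(toy).iloc[0])  # 60 - 10 + 12 + 9.5 = 71.5
```

Shots and turnovers end possessions, offensive rebounds extend them, and free throws are weighted at 0.475 because a trip to the line usually involves more than one attempt.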
Step 4: Read the output
Run it and you get real, league-wide averages computed from every Division I game in the season:
Men's 2025: 12,572 team-games | 73.2 pts/team | 69.1 possessions | 22.9 3PA
Women's 2025: 11,252 team-games | 65.4 pts/team | 71.2 possessions | 19.7 3PA
Actual output, sportsdataverse data retrieved June 2026.
Look what fell out of four lines of analysis: the women's game is played at a higher pace (71.2 possessions to the men's 69.1) yet produces fewer points (65.4 to 73.2). The gap is the three-pointer — men attempt nearly 23 a game to the women's ~20, at higher accuracy. That's a genuine, sourced insight you generated yourself, and it's the backbone of our women's game analysis. This is the entire promise of the toolkit: real conclusions, from public data, in minutes.
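Dividing points by possessions puts the two leagues on the same footing. The sketch below uses only the four averages printed above; note that it is a ratio of league averages, which is a rough approximation and not the same as averaging each game's own efficiency.

```python
# League-wide averages copied from the printed output above.
averages = {
    "Men's 2025": {"pts": 73.2, "poss": 69.1},
    "Women's 2025": {"pts": 65.4, "poss": 71.2},
}


def points_per_100(pts, poss):
    """Points scored per 100 possessions."""
    return 100 * pts / poss


for label, avg in averages.items():
    rate = points_per_100(avg["pts"], avg["poss"])
    print(f"{label}: {rate:.1f} points per 100 possessions")
# Men's 2025: 105.9 points per 100 possessions
# Women's 2025: 91.9 points per 100 possessions
```

By this rough measure the scoring gap is about efficiency per possession, not about how many possessions each team gets.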
What else is in the box
The same modules expose much more than team box scores:
- load_mbb_player_boxscore() / load_wbb_player_boxscore(): player-level lines for leaderboards and usage analysis.
- load_mbb_schedule() / load_wbb_schedule(): full schedules and results, perfect for strength-of-schedule work.
- load_mbb_pbp() / load_wbb_pbp(): play-by-play, for possession-level and lineup analysis (these files are large).
Good habits
- Let the cache work. Load a season once; the library stores it locally. Don't re-download in a loop.
- Filter junk rows. Drop rows where team_score is zero or missing before averaging, as we did above.
- Mind the season convention. "2025" means the 2024-25 season (the ending year). Off-by-one here is the most common beginner mistake.
- Credit the source. sportsdataverse aggregates public data; cite it (and respect that some underlying providers have their own terms).
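Two of the habits above are easy to pin down in code. This is a small sketch on made-up rows; `season_label` is a hypothetical helper name, and the filter is the same comparison used in Step 3.

```python
import pandas as pd


def season_label(end_year):
    """Turn a season-ending year into a 'YYYY-YY' label."""
    return f"{end_year - 1}-{end_year % 100:02d}"


print(season_label(2025))  # 2024-25

# Invented rows: two real-looking scores, one zero, one missing.
toy = pd.DataFrame({"team_score": [71, 0, None, 64]})

# A missing score compares as False against 0, so `> 0` drops it too.
clean = toy[toy["team_score"] > 0]
print(len(clean))  # 2
```

Printing the label next to every result is a cheap guard against the off-by-one mistake.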
Where to go next
Try computing the same averages across several seasons to build a trend (that's exactly how we charted the women's game over time), or join the box scores to schedules to make your own opponent-adjusted ratings. Because the men's and women's data share a schema, every tool you build works on both halves of the sport for free — which, frankly, is how all of college basketball analysis should work.
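As a starting point for the multi-season idea, here is a minimal sketch that takes a mapping of season-ending year to team box score dataframe and returns one row of averages per season. The `season_trend` name and the toy frames are hypothetical; in practice each frame would come from the loaders in Step 2.

```python
import pandas as pd


def season_trend(frames):
    """Average points and 3PA per season.

    frames maps a season-ending year to that season's team box scores.
    """
    rows = []
    for season, df in sorted(frames.items()):
        played = df[df["team_score"] > 0]  # same junk-row filter as Step 3
        rows.append({
            "season": season,
            "pts_per_team": played["team_score"].mean(),
            "three_pa": played["three_point_field_goals_attempted"].mean(),
        })
    return pd.DataFrame(rows).set_index("season")


# Invented rows standing in for two real seasons.
toy_frames = {
    2024: pd.DataFrame({
        "team_score": [60, 70, 0],
        "three_point_field_goals_attempted": [18, 20, 0],
    }),
    2025: pd.DataFrame({
        "team_score": [66, 72],
        "three_point_field_goals_attempted": [21, 23],
    }),
}

trend = season_trend(toy_frames)
print(trend)
```

Because the function only reads column names shared by both leagues, the same call should work for a dict of women's seasons.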
Sources & further reading
- sportsdataverse — sportsdataverse.org (hoopR and wehoop)
- Companion code: scripts/sportsdataverse-basketball-tutorial.py
- Related: Adjusted tempo and efficiency · The women's game's boom