sportsdataverse Basketball Tutorial — CollegeAthleteInsider

College football has CFBD; college basketball has sportsdataverse. It's an open-source family of tools that bundles public, ESPN-derived data into clean tables you can load in one line — and, best of all for our purposes, it covers both the men's game (via hoopR) and the women's game (via wehoop) with an identical structure. That symmetry is the secret weapon: write your analysis once, run it on both. No API key required. The full script is in scripts/sportsdataverse-basketball-tutorial.py.

Step 1: Install

pip install sportsdataverse pyarrow

pyarrow lets the library read the cached data files efficiently.

sportsdataverse pulls pre-built season files (so you're not hammering any live API), then hands you a dataframe. The first load of a season downloads and caches it; after that it's instant.

Step 2: Load a season of team box scores

The men's module is sportsdataverse.mbb, the women's is sportsdataverse.wbb. Their loaders mirror each other:

import sportsdataverse.mbb as mbb
import sportsdataverse.wbb as wbb

season = 2025  # the season-ending year

men   = mbb.load_mbb_team_boxscore(seasons=[season]).to_pandas()
women = wbb.load_wbb_team_boxscore(seasons=[season]).to_pandas()

.to_pandas() converts the result to a familiar pandas DataFrame.

Each row is one team's stat line from one game: team_score, field_goals_attempted, offensive_rebounds, total_turnovers, free_throws_attempted, three_point_field_goals_attempted, and dozens more. Crucially, the column names are the same for men and women, which is why the next function works on either.
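Before relying on that symmetry, it can be worth confirming that the columns you plan to use are present in both tables. The sketch below is one way to do that; missing_columns and REQUIRED_COLUMNS are hypothetical names invented for this example, not part of sportsdataverse, and the toy lists stand in for men.columns and women.columns.

```python
# Hypothetical helper: these names are illustrative, not part of sportsdataverse.
REQUIRED_COLUMNS = [
    "team_score",
    "field_goals_attempted",
    "offensive_rebounds",
    "total_turnovers",
    "free_throws_attempted",
    "three_point_field_goals_attempted",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Toy column lists standing in for men.columns and women.columns.
toy_men = REQUIRED_COLUMNS + ["assists"]
toy_women = [c for c in REQUIRED_COLUMNS if c != "offensive_rebounds"]

print(missing_columns(toy_men))    # nothing missing
print(missing_columns(toy_women))  # one column missing
```

With the real data you would call missing_columns(men.columns) and missing_columns(women.columns); an empty list from both means the shared function in the next step has everything it needs.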

Step 3: Compute scoring and pace yourself

Let's turn raw box scores into the possession-based numbers from our tempo and efficiency guide. One function, reused for both leagues:

def league_averages(df, label):
    """Print league-wide per-team-game averages for one season of team box scores."""
    df = df[df["team_score"] > 0]   # drop blank rows (zero or missing scores)
    pts  = df["team_score"]
    # Estimated possessions: FGA - ORB + TOV + 0.475 * FTA
    poss = (df["field_goals_attempted"] - df["offensive_rebounds"]
            + df["total_turnovers"] + 0.475 * df["free_throws_attempted"])
    print(f"{label}: {len(df):,} team-games | "
          f"{pts.mean():.1f} pts/team | "
          f"{poss.mean():.1f} possessions | "
          f"{df['three_point_field_goals_attempted'].mean():.1f} 3PA")

league_averages(men,   "Men's 2025")
league_averages(women, "Women's 2025")

The same code runs on both dataframes because the schemas match.
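To see what the possession estimate does to a single stat line, here is a minimal sketch that applies the same formula to plain numbers. estimate_possessions is a hypothetical helper written for this example, and the stat line is made up for illustration, not taken from any real game.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Box-score possession estimate: FGA - ORB + TOV + ft_weight * FTA."""
    return fga - orb + tov + ft_weight * fta


# Made-up stat line: 60 shots, 10 offensive rebounds, 12 turnovers, 20 free throws.
poss = estimate_possessions(fga=60, orb=10, tov=12, fta=20)
print(f"{poss:.1f} possessions")  # 60 - 10 + 12 + 9.5 = 71.5
```

Offensive rebounds are subtracted because they extend a possession rather than start a new one, and the 0.475 weight reflects that not every free throw ends a possession.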

Step 4: Read the output

Run it and you get real, league-wide averages computed from every Division I game in the season:

Men's 2025:   12,572 team-games | 73.2 pts/team | 69.1 possessions | 22.9 3PA
Women's 2025: 11,252 team-games | 65.4 pts/team | 71.2 possessions | 19.7 3PA

Actual output, sportsdataverse data retrieved June 2026.

Look what fell out of a few lines of analysis: the women's game is played at a higher pace (71.2 possessions to the men's 69.1) yet produces fewer points (65.4 to 73.2). The gap is the three-pointer: men attempt nearly 23 a game to the women's ~20, and at higher accuracy (this output counts attempts only, so it does not show accuracy). That's a genuine, sourced insight you generated yourself, and it's the backbone of our women's game analysis. This is the entire promise of the toolkit: real conclusions, from public data, in minutes.
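The pace-versus-scoring contrast can be made explicit by turning the printed averages into points per 100 possessions. This is a rough sketch: it divides one league average by another, which is not identical to averaging each team-game's own efficiency, and points_per_100 is a hypothetical helper named for this example.

```python
# League averages copied from the printed output above.
averages = {
    "Men's 2025": {"pts": 73.2, "poss": 69.1},
    "Women's 2025": {"pts": 65.4, "poss": 71.2},
}


def points_per_100(pts, poss):
    """Scale points per possession to a per-100-possessions rate."""
    return 100 * pts / poss


for label, avg in averages.items():
    rate = points_per_100(avg["pts"], avg["poss"])
    print(f"{label}: {rate:.1f} points per 100 possessions")
```

On these inputs the men come out near 105.9 points per 100 possessions and the women near 91.9, so the scoring gap is an efficiency gap rather than a pace gap.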

What else is in the box

The same modules expose much more than team box scores:

load_mbb_player_boxscore() / load_wbb_player_boxscore() — player-level lines for leaderboards and usage analysis.
load_mbb_schedule() / load_wbb_schedule() — full schedules and results, perfect for strength-of-schedule work.
load_mbb_pbp() / load_wbb_pbp() — play-by-play, for possession-level and lineup analysis (these files are large).

Good habits

Let the cache work. Load a season once; the library stores it locally. Don't re-download in a loop.
Filter junk rows. Drop rows where team_score is zero or missing before averaging, as we did above.
Mind the season convention. "2025" means the 2024-25 season (the ending year). Off-by-one here is the most common beginner mistake.
Credit the source. sportsdataverse aggregates public data; cite it (and respect that some underlying providers have their own terms).
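For the season convention in particular, a tiny helper that prints the label alongside the number makes off-by-one mistakes easier to spot. season_label is a hypothetical function written for this example, not something the library provides.

```python
def season_label(end_year):
    """Turn a season-ending year such as 2025 into a label such as '2024-25'."""
    return f"{end_year - 1}-{end_year % 100:02d}"


print(season_label(2025))  # 2024-25
print(season_label(2000))  # 1999-00
```

Putting the label in chart titles and printouts keeps "2025" from being read as the 2025-26 season.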

Where to go next

Try computing the same averages across several seasons to build a trend (that's exactly how we charted the women's game over time), or join the box scores to schedules to make your own opponent-adjusted ratings. Because the men's and women's data share a schema, every tool you build works on both halves of the sport for free — which, frankly, is how all of college basketball analysis should work.
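As a starting point for the multi-season idea, the sketch below groups team-games by season and averages the same three quantities as Step 3. It assumes the loaded table carries a season column; that column name is an assumption to verify against your own data, season_trend is a hypothetical helper, and the toy rows are invented so the example runs without downloading anything.

```python
import pandas as pd


def season_trend(df, season_col="season"):
    """Average points, estimated possessions, and 3PA per team-game, by season."""
    df = df[df["team_score"] > 0].copy()  # same junk-row filter as Step 3
    df["possessions"] = (
        df["field_goals_attempted"] - df["offensive_rebounds"]
        + df["total_turnovers"] + 0.475 * df["free_throws_attempted"]
    )
    cols = ["team_score", "possessions", "three_point_field_goals_attempted"]
    return df.groupby(season_col)[cols].mean()


# Invented rows standing in for a multi-season load; the all-zero row is junk.
toy = pd.DataFrame({
    "season": [2024, 2024, 2025, 2025],
    "team_score": [70, 80, 0, 66],
    "field_goals_attempted": [60, 62, 0, 58],
    "offensive_rebounds": [10, 12, 0, 8],
    "total_turnovers": [12, 10, 0, 14],
    "free_throws_attempted": [20, 20, 0, 10],
    "three_point_field_goals_attempted": [20, 24, 0, 18],
})

print(season_trend(toy))
```

With real data you would pass several years to the loader's seasons list, convert with .to_pandas(), and hand the result to season_trend; the resulting table has one row per season, ready to plot.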

Sources & further reading

For the fundamentals, see Chapter 3: Python for Sports Analytics in DataField.dev’s free textbook library.
sportsdataverse — sportsdataverse.org (hoopR and wehoop)
Companion code: scripts/sportsdataverse-basketball-tutorial.py
Related: Adjusted tempo and efficiency · The women's game's boom

C. B. Zakarian

C. B. Zakarian is an independent analyst who writes about what he can measure: ball sports and the player-run economies inside Roblox. He builds every model, chart, and calculator here himself from public data, shows the working, and never invents a number. When the data can't answer a question, he says so. On CollegeAthleteInsider, that means college football and basketball by the numbers, plus a plain-English read on the NIL-era rules.
