Part 2: Benchmark Workouts

Part 3: Exercise Statistics

I've collected profile data for the 1,706,699 CrossFit athletes who have ever (presumably) participated in the CrossFit Open. Having all these statistics in one place is useful as a reference for comparison, and it's interesting to see how Open athletes fill in their profiles.

It is important to remember that these statistics are all optional and self-reported. Right off the bat, the biggest problem is that people will typically not report the numbers they aren't proud of. So it's important to take this analysis with a grain of salt, and not treat it like data collected to write a scientific paper. We can still draw some very important conclusions from the analysis, keeping in mind that the numbers are probably rounded, excluded, outdated or outright false, all in order to look good. I'll do my best to include the sample size and trends in the analyses.

📣 These posts used to have great interactive graphs with plotly express. But they were huge and unruly and weren't showing up, so I downgraded them to regular plots. Sorry!

Goal

The reason behind this analysis is, you guessed it, to optimize my training and work on my weaknesses. The sport of CrossFit is about being an all-round athlete, so if I see that I'm in the 15th percentile for one movement but the 85th in another, I'll want to work on the first one instead of the second. Of course, any coach worth their salt would be able to tell you this, but it's another thing to be able to put a number on it. Put succinctly:

To provide CrossFit athletes with a tool to see where their performance sits in regards to a small number of exercise standards and benchmarks. With this tool, athletes should be able to see:

What their strengths and weaknesses are compared to other CrossFit Open participants

Where they should focus their training for next season
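The percentile comparison described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code behind this analysis: `percentile_rank` and the back squat numbers are made up for the example, and "share of athletes at or below the score" is only one of several common percentile conventions (`scipy.stats.percentileofscore` offers others).

```python
from bisect import bisect_right


def percentile_rank(values, score):
    """Return the percentage of `values` that are less than or equal to `score`."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value to rank against")
    return 100 * bisect_right(ordered, score) / len(ordered)


# Hypothetical back squat numbers (lb), purely for illustration
back_squats = [185, 225, 245, 275, 315, 335, 365, 405]
my_percentile = percentile_rank(back_squats, 275)
print(f"A 275 lb back squat beats or ties {my_percentile:.0f}% of this sample")
```

For timed benchmarks, where lower is better, the comparison flips: 100 minus this value gives the share of athletes who are strictly slower.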

Methodology

I scraped these 1.7M+ athlete profiles from the CrossFit Games profile pages over the course of several days. A more thorough look at the methodology used to scrape, parse and load that data into a database will be the subject of another article.

Variables

We find a small number of athlete stats on their profile page:

Athlete Stats
- Age
- Sex
- Height
- Weight
Weightlifting
- Back Squat
- Clean and Jerk
- Snatch
- Deadlift
Benchmark Workouts
- Fight Gone Bad
- Fran
- Grace
- Helen
- Filthy 50
Bodyweight Exercises
- Max Pull-ups
- Sprint 400m
- Run 5k

As this list is quite extensive, we'll be splitting this post into the four sections above. Another reason for this is quite simple: the interactive visualization package I'm using to make the graphs makes for very large images, and I want to keep load times to a respectable minimum.

Inspiration

This post is largely inspired by Sam Swift's 2015 post called "What's normal (or top 5%) for a CrossFit athlete?", and I owe him a huge thank you for having created it in the first place. Be sure to check out his other posts on the sport; they're the best out there, bar none.

I won't be using his data directly, nor do I have the time to make comparisons across the four years separating the posts, but it'd certainly be interesting to see how the sport has evolved with time.

Pre-processing

Even though a lot of the pre-processing was done in the steps leading to this analysis (see upcoming post on scraping methodology), we still have to deal with assigning sexes to participants, and with the all-important question of how to deal with missing values.

In [5]:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px  # needed below for COLOUR_SCHEME and the px.* plots
from scipy import stats
import seaborn as sns
import sqlalchemy as sql

from IPython.display import display, display_html

Pull the data

In this case, our data was saved as a table in a PostgreSQL database running locally.

In [2]:

# SQLAlchemy 1.4+ rejects the old "postgres://" scheme; "postgresql://" works across versions
URI_DB = "postgresql://leblancfg@localhost:5432/cf_analysis"
db = sql.create_engine(URI_DB)

df = pd.read_sql("cf_athletes", db, index_col="id")
df.sample(5)

Out[2]:

	name	country	division	age	height	weight	affiliate	fran	helen	grace	filthy50	fgonebad	run400	run5k	candj	snatch	deadlift	backsq	pullups	modified_date
id
1267471	Emanuel Maftey	United States	Men	28	5.92	162.92	CrossFit 3-46 Grit	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.553219e+09
970640	Cody Jelinek	None	None	28	5.75	184.97	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.553156e+09
298034	Trent Pfeiffer	None	None	22	5.92	190.04	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	55.0	1.552960e+09
772332	James Eason	None	None	35	NaN	NaN	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.553109e+09
1237555	Lilian Parpinelli Lopes	None	None	31	NaN	NaN	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.553213e+09

As we can see, a lot of the athletes simply input their name, division and age; they either don't bother entering their stats, or don't know them well enough.
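To put a number on how sparsely the profiles are filled in, one option is to compute the share of usable values per column. The sketch below uses plain Python on a few made-up rows shaped like the sample above; `fill_rate` is a hypothetical helper, and on the real DataFrame `df.notna().mean()` should give the same per-column shares.

```python
import math


def fill_rate(rows, columns):
    """Return {column: share of rows with a usable value} for each column."""

    def is_filled(value):
        # Treat both None and float NaN as missing
        return value is not None and not (isinstance(value, float) and math.isnan(value))

    return {
        col: sum(is_filled(row.get(col)) for row in rows) / len(rows)
        for col in columns
    }


# Hypothetical rows, for illustration only
rows = [
    {"age": 28, "height": 5.92, "pullups": None},
    {"age": 22, "height": 5.92, "pullups": 55.0},
    {"age": 35, "height": float("nan"), "pullups": float("nan")},
    {"age": 31, "height": None, "pullups": None},
]
rates = fill_rate(rows, ["age", "height", "pullups"])
for col, rate in rates.items():
    print(f"{col}: {rate:.0%} filled in")
```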

Sex

First off, we know already that a sizeable fraction of the athletes disclose sex in their profile, in the form of the Division column.

In [3]:

F_STRINGS = ["women", "girls"]
COLOUR_SCHEME = px.colors.colorbrewer.Set1


def describe(df, col):
    """Takes a DataFrame and column name, prints a groupby-describe"""
    display(df.groupby("sex")[col].describe().round(1))


def parse_division(text):
    """Given a division title, e.g. 'Men', returns sex as 'F', 'M' or None"""
    if text is None:
        return None
    if any(word in text.lower() for word in F_STRINGS):
        return "F"
    return "M"


df["sex"] = df["division"].apply(parse_division)
# isna() flags the athletes for whom no sex could be derived from the division
missing_sex = df["sex"].isna().sum()

print(
    f"Sex is missing for {round(100 * missing_sex / len(df))}% of athletes"
)

Sex is missing for 79% of athletes
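One detail in `parse_division` is easy to miss: "men" is a substring of "women", so the function has to test for the female keywords and fall back to "M", not the other way around. Here is a quick check, with division labels invented for the illustration (the real values in the `division` column may be formatted differently):

```python
F_STRINGS = ["women", "girls"]


def parse_division(text):
    """Given a division title, e.g. 'Men', returns sex as 'F', 'M' or None"""
    if text is None:
        return None
    if any(word in text.lower() for word in F_STRINGS):
        return "F"
    return "M"


# Hypothetical division labels, for illustration only
samples = ["Men", "Women", "Men (35-39)", "Women (40-44)", "Girls (14-15)", None]
parsed = {label: parse_division(label) for label in samples}
for label, sex in parsed.items():
    print(f"{label!r:>18} -> {sex!r}")

# A naive substring check for "men" would also match every women's division
naive_match = "men" in "Women (40-44)".lower()
print(naive_match)
```

Note that anything that isn't recognized as a female division, including a misspelled or unexpected label, ends up as "M" with this approach.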

Body Measurements

Let's make a new DataFrame containing the data of the participants who've included weight, height and age stats. This will whittle down our number of usable participants by quite a bit, but this way we should get decent quality in our numbers.

We also see a not-insignificant number of bogus entries, with participants whose declared age is 125, weight is 20 lb, or height is over 10 feet. So we take care of that in one fell swoop.

In [5]:

# Make new DataFrame called `bm` with just the athletes that have a sex and plausible body measurements
bm = df[df["sex"].notna()].query("80 < weight < 400 and 4 < height < 7 and 13 < age < 81")

print(
    f"Body measurement statistics available for {round(100 * len(bm) / len(df))}%"
    f" of athletes out of a possible {len(df):,}."
)

Body measurement statistics available for 11% of athletes out of a possible 1,706,699.

Age

In [6]:

px.histogram(
    bm,
    "age",
    nbins=128,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Age Distribution % by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)

In [7]:

describe(bm, "age")

	count	mean	std	min	25%	50%	75%	max
sex
F	58233.0	34.5	9.5	14.0	28.0	33.0	40.0	77.0
M	121472.0	35.4	9.3	14.0	29.0	34.0	41.0	80.0

Height

In [8]:

height = bm.query("4.5 < height < 7")
px.histogram(
    height,
    "height",
    nbins=32,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Height Distribution % by Sex (in decimal feet)",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)

In [9]:

describe(bm, "height")

	count	mean	std	min	25%	50%	75%	max
sex
F	58233.0	5.4	0.2	4.1	5.2	5.4	5.6	6.8
M	121472.0	5.9	0.2	4.0	5.7	5.9	6.0	6.9

These distributions are very, very interesting. Obviously there's the gigantic notch at 6' for the men and 5'6" for the women — but we can probably ascribe that to the fact that it's self-reported data.

But that's OK! Looks like 5'10" is still the self-reported average for the men, and 5'5" for the women.
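Heights in the table are in decimal feet, which is awkward to compare with feet-and-inches figures. A small helper makes the conversion explicit. `feet_to_feet_inches` is a name made up for this sketch, and because the `describe` output is rounded to one decimal place, the converted means are only good to within an inch or so.

```python
def feet_to_feet_inches(decimal_feet):
    """Convert decimal feet (e.g. 5.75) to (feet, inches), rounded to the nearest inch."""
    total_inches = round(decimal_feet * 12)
    return divmod(total_inches, 12)


for value in (5.4, 5.75, 5.9, 6.0):
    feet, inches = feet_to_feet_inches(value)
    print(f"{value} ft is about {feet}'{inches}\"")
```

Rounding the total number of inches before splitting avoids results like 5'12" for values just under a whole foot.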

Weight

In [10]:

px.histogram(
    bm,
    "weight",
    nbins=128,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Weight Distribution % by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)

In [11]:

describe(bm, "weight")

	count	mean	std	min	25%	50%	75%	max
sex
F	58233.0	142.0	21.8	80.0	127.9	140.0	152.1	381.4
M	121472.0	187.6	26.6	80.0	170.0	185.0	201.9	399.9

Now when it comes to weight, though, the situation is a bit different.

For the men, you could almost say that the distribution is bimodal: one of the peaks is around 175, and there's another, much sharper peak at 185. The situation is reversed for the women, where the peak is around 130.

It's hard to say whether that's caused by athletes trying to maintain a certain weight, wishful thinking, or rounding to the nearest multiple of 5. In reality, the answer is probably some combination of those hypotheses.
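The rounding hypothesis, at least, can be checked roughly. The sketch below measures the share of weights that fall within half a pound of a multiple of 5. If weights were spread evenly, about 20% would land in that window (1 lb out of every 5), so a much higher share would point towards rounding. `share_near_multiple` and the sample weights are assumptions made for this illustration.

```python
def share_near_multiple(values, step=5, tolerance=0.5):
    """Share of values lying within `tolerance` of a multiple of `step`."""
    near = sum(abs(v - step * round(v / step)) <= tolerance for v in values)
    return near / len(values)


# Hypothetical weights (lb), for illustration only
weights = [185.0, 184.97, 190.04, 162.92, 175.0]
share = share_near_multiple(weights)
print(f"{share:.0%} of weights sit within 0.5 lb of a multiple of 5")
```

The tolerance is there because the stored weights are not always whole numbers, as the sample rows earlier show.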

Height / Mass Ratio

In [12]:

px.density_contour(
    bm,
    x="height",
    y="weight",
    marginal_x="box",
    marginal_y="box",
    color="sex",
    trendline="lowess",
    labels={"height": "Height (feet)", "weight": "Weight (lb)"},
    title="CrossFit Open 2019 Athlete Pages — Distribution of Height and Weight by Sex",
    color_discrete_sequence=COLOUR_SCHEME,
)

I wish I could have made the density contours a little less sharp, because we're seeing clumps around every inch. But still, at least we can see the overall shape, and even a nicely fitted trendline showing the typical weight for a given height.

We also find a large number of outliers in the weight distributions, and most of them are located near the top. It's probably safe to assume we're seeing traces of America's obesity problem right there, and good on these folks in particular for taking things into their own hands!
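One common way to collapse height and weight into a single number is the body mass index. The analysis above doesn't use it, so take this as an optional sketch: `bmi_imperial` is a hypothetical helper, 703 is the usual conversion factor when working in pounds and inches, and the example athlete is invented.

```python
def bmi_imperial(height_feet, weight_lb):
    """Body mass index from height in decimal feet and weight in pounds."""
    height_in = height_feet * 12
    return 703 * weight_lb / height_in ** 2


# Hypothetical athlete: 5.9 ft tall, 187.6 lb
example_bmi = bmi_imperial(5.9, 187.6)
print(round(example_bmi, 1))
```

Keep in mind that BMI doesn't distinguish muscle from fat, which matters for a population that trains with heavy barbells.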
