CrossFit Open 2019 Analysis - Exercise Statistics
Posted on June 30, 2019 in CrossFit, Python
I've collected profile data for the 1,706,699 CrossFit athletes who have (presumably) ever participated in the CrossFit Open. Having all these statistics in one place is useful as a reference for comparison, and it is interesting to see how Open athletes fill in their profiles.
In this article, we'll be taking a look at the exercise statistics self-reported by the athletes: bodyweight and weightlifting combined.
It is important to remember that these statistics are all optional and self-reported. Right off the bat, the biggest problem is that people will typically not report the numbers they aren't proud of. So it's important to take this analysis with a grain of salt, and not treat it like data collected to write a scientific paper. We can still draw some very important conclusions from the analysis, keeping in mind that the numbers are probably rounded, excluded, outdated or outright false, all in order to look good. I'll do my best to include the sample size and trends in the analyses.
Goal¶
The reason behind this analysis is, you guessed it, to optimize my training and work on my weaknesses. The sport of CrossFit is about being an all-round athlete, so if I see that I'm in the 15th percentile for one movement but the 85th for another, I'll want to work on the first one instead of the second. Of course, any coach worth their salt would be able to tell you this, but it's another thing to be able to put a number on it. Put succinctly:
To provide CrossFit athletes with a tool to see where their performance sits with regard to a small number of exercise standards and benchmarks. With this tool, athletes should be able to see:
- What their strengths and weaknesses are compared to other CrossFit Open participants
- Where they should focus their training for next season
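A minimal sketch of the comparison described above, using only the standard library. The numbers are made up for illustration, and `percentile_rank` is a hypothetical helper rather than code from the analysis. For timed events a lower value is better, so the helper flips the direction of the comparison.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(population, score, lower_is_better=False):
    """Share (0-100) of the population that `score` matches or beats."""
    ordered = sorted(population)
    if lower_is_better:
        # Timed events: the score matches or beats every value >= score.
        matched_or_beaten = len(ordered) - bisect_left(ordered, score)
    else:
        # Lifts and rep counts: the score matches or beats every value <= score.
        matched_or_beaten = bisect_right(ordered, score)
    return 100.0 * matched_or_beaten / len(ordered)


# Made-up reference values, NOT taken from the scraped data.
population = {
    "deadlift": [225, 275, 315, 365, 405],
    "run5k": [1200, 1320, 1500, 1680, 1800],  # seconds
}
timed_events = {"run5k"}
athlete = {"deadlift": 315, "run5k": 1680}

profile = {
    movement: percentile_rank(population[movement], value, movement in timed_events)
    for movement, value in athlete.items()
}
# The movement with the lowest percentile is the first candidate for extra work.
weakest = min(profile, key=profile.get)
print(profile, weakest)
```

With these toy numbers the 5 km run comes out as the weaker of the two movements, which is the kind of signal the tool is meant to surface.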
Methodology¶
I scraped these 1.7M+ athlete profiles from the CrossFit Games profile pages over the course of several days. A more thorough look at the methodology used to scrape, parse and load that data into a database will be the subject of another article.
Variables¶
Each athlete's profile page lists a small number of stats:
- Athlete Stats
    - Age
    - Sex
    - Height
    - Weight
- Weightlifting
    - Back Squat
    - Clean and Jerk
    - Snatch
    - Deadlift
- Benchmark Workouts
    - Fight Gone Bad
    - Fran
    - Grace
    - Helen
    - Filthy 50
- Bodyweight Exercises
    - Max Pull-ups
    - Sprint 400m
    - Run 5k
As this list is quite extensive, we'll be splitting this post into the four sections above. Another reason for this is quite simple: the interactive visualization package I'm using to make the graphs makes for very large images, and I want to keep load times to a respectable minimum.
Inspiration¶
This post is largely inspired by Sam Swift's 2015 post called "What's normal (or top 5%) for a CrossFit athlete?". A huge thank you to him for having created it in the first place. Be sure to check out his other posts on the sport; they're the best out there, bar none.
I won't be using his data directly, nor do I have the time to make comparisons across the four years separating the posts, but it'd certainly be interesting to see how the sport has evolved over time.
Pre-processing¶
Even though a lot of the pre-processing was done in the steps leading to this analysis (see upcoming post on scraping methodology), we still have to deal with assigning sexes to participants, and with the all-important question of how to deal with missing values.
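One possible way to handle the missing values, sketched on a toy frame. It assumes that a zero means "not reported", which is an assumption rather than something confirmed by the data source. The column names mirror the ones queried later in this post, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows, NOT taken from the scraped data.
toy = pd.DataFrame(
    {
        "sex": ["M", "F", "M", "F"],
        "deadlift": [405, 0, np.nan, 225],
        "pullups": [20, np.nan, 0, 0],
    }
)
stat_cols = ["deadlift", "pullups"]

# Assumption: a reported value of 0 is treated the same as a blank field.
cleaned = toy.copy()
cleaned[stat_cols] = cleaned[stat_cols].replace(0, np.nan)

# count() skips NaN, so this gives the usable sample size per sex and column.
sample_sizes = cleaned.groupby("sex")[stat_cols].count()
print(sample_sizes)
```

Keeping the missing values as `NaN` instead of dropping rows means an athlete who reported a deadlift but no pull-ups still counts towards the deadlift statistics.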
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly_express as px
from scipy import stats
import seaborn as sns
import sqlalchemy as sql
from IPython.display import display_html
Pull the data¶
In this case, our data was saved as a table in a PostgreSQL database running locally. This makes it easier to retrieve data and query it as required.
# SQLAlchemy 1.4+ only accepts the "postgresql://" scheme; "postgres://" was removed.
URI_DB = "postgresql://leblancfg@localhost:5432/cf_analysis"
db = sql.create_engine(URI_DB)
df = pd.read_sql("cf_athletes", db, index_col="id")
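The cells that follow reference `bm`, `COLOUR_SCHEME` and `describe`, none of which are defined in this section. `bm` is presumably a frame derived from `df` that holds the benchmark columns plus a `sex` column, but its construction is not shown here. The block below is a hypothetical stand-in for the other two names, so that the cells can be run. The palette values and the shape of the summary table are assumptions, not the author's originals.

```python
import pandas as pd

# Hypothetical palette: one colour per category of the "sex" column.
COLOUR_SCHEME = ["#1f77b4", "#e377c2"]


def describe(frame, column):
    """Hypothetical stand-in: summary statistics of `column`, one column per sex."""
    return frame.groupby("sex")[column].describe().T


# Made-up rows, NOT taken from the scraped data.
toy = pd.DataFrame(
    {
        "sex": ["M", "M", "F", "F"],
        "deadlift": [315, 405, 185, 225],
    }
)
summary = describe(toy, "deadlift")
print(summary)
```

The `count` row of the summary doubles as the sample size for each group.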
Deadlift¶
dl = bm.query("0 < deadlift < 800")
px.histogram(
    dl,
    "deadlift",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of Max Deadlift by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(dl, "deadlift")
From anecdotal evidence, DAAAANG people are strong! And a lot of people are submitting these, too, so it sure seems as though athletes can regularly crank these weights. I guess that means back to the barbell for this man... ha!
Back Squat¶
bs = bm.query("35 < backsq < 600")
px.histogram(
    bs,
    "backsq",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of Max Back Squat by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(bs, "backsq")
Clean & Jerk¶
candj = bm.query("20 < candj < 400")
px.histogram(
    candj,
    "candj",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of Max Clean & Jerk by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(candj, "candj")
Snatch¶
snatch = bm.query("5 < snatch < 350")
px.histogram(
    snatch,
    "snatch",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of Max Snatch by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(snatch, "snatch")
Max Pull-ups¶
pu = bm.query("0 < pullups < 101")
px.histogram(
    pu,
    "pullups",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of Pullups by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(pu, "pullups")
Run 400 m¶
run_400 = bm.query("44 < run400 < 200")
px.histogram(
    run_400,
    "run400",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of 400 m Run Times by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(run_400, "run400")
Run 5km¶
run_5k = bm.query("777 < run5k < 3000")
px.histogram(
    run_5k,
    "run5k",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages - Distribution % of 5 km Run Times by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
describe(run_5k, "run5k")
There are a lot of outliers here, but at least we know that the remaining athletes probably didn't beat the 777-second world-record time used as the lower cut-off.
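Every section above applies the same kind of plausibility filter before plotting. As a sketch, those bounds could be collected in one place. The bounds are copied from the queries in this post, while `BOUNDS` and `plausible` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# (low, high) bounds copied from the queries above; both ends are exclusive.
BOUNDS = {
    "deadlift": (0, 800),
    "backsq": (35, 600),
    "candj": (20, 400),
    "snatch": (5, 350),
    "pullups": (0, 101),
    "run400": (44, 200),
    "run5k": (777, 3000),
}


def plausible(frame, column):
    """Keep only the rows whose `column` lies strictly between its bounds."""
    low, high = BOUNDS[column]
    # Rows with a missing value fail both comparisons and are dropped as well.
    return frame.query(f"{low} < {column} < {high}")


# Made-up times in seconds, NOT taken from the scraped data.
toy = pd.DataFrame({"run5k": [700, 777, 1320, 2999, 3000, None]})
kept = plausible(toy, "run5k")["run5k"].tolist()
print(kept)
```

Having the bounds in a single dictionary makes it easier to see, and to question, every cut-off at a glance.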
Aside¶
Now, my PR is 22 minutes (or 1320 seconds), which I thought was respectable. Turns out:
men_5k = run_5k[run_5k["sex"] == "M"]["run5k"]
print(
    f"A 22 minute 5 km run for a man is in the {round(stats.percentileofscore(men_5k, 22 * 60), 1)}th percentile"
)
So... it's better than both the mean and the median, but the distribution is heavily skewed, with its mode sitting around 21 minutes. I've got some catching up to do!
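One thing to keep in mind when reading a percentile for a timed event: lower is better, and `stats.percentileofscore` counts the values at or below the score. The sketch below mirrors that idea with the standard library only. It corresponds to SciPy's `kind="weak"` definition, not the default `kind="rank"`, which averages ties. The times are made up.

```python
def percent_at_or_below(values, score):
    """Share (0-100) of `values` that are less than or equal to `score`."""
    values = list(values)
    return 100.0 * sum(v <= score for v in values) / len(values)


# Made-up 5 km times in seconds, NOT taken from the scraped data.
times = [1100, 1200, 1260, 1320, 1400, 1500, 1650, 1800]

# For a run, "at or below" means "as fast or faster".
as_fast_or_faster = percent_at_or_below(times, 22 * 60)
# The complement is the share of runners who are slower.
slower = 100.0 - as_fast_or_faster
print(as_fast_or_faster, slower)
```

So for run times, a low percentile is the good end of the scale, the opposite of the lifts and pull-ups.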