Benchmarks for writing pandas DataFrames to SQL Server (ODBC)
Posted on March 15, 2020 in Python, SQL, MS
Speed tests for writing to MSSQL from pandas (ODBC)¶
In this notebook, we'll be playing with various methods of writing data to a Microsoft SQL Server database. Compared to PostgreSQL, which I was using previously, the connection is approximately an order of magnitude slower. This is apparently a known issue. Why that should be boggles the mind, and I may not have dug deep enough to understand the reasons, but it's a constraint I can't get around, so it's something I have to work with.
Trying to transmit larger volumes of data can also result in errors: I get memory and connection timeout errors that I wasn't seeing previously.
According to some research, this is due to the ODBC driver. Whether that is true or not, I do not know. The naive call to the .to_sql() method, as given in the pandas docs, is:
df.to_sql(
    name,
    con,
    schema=None,
    if_exists='fail',
    index=True,
    index_label=None,
    chunksize=None,
    method=None,
)
Options¶
There are various ways known to speed up the process (a quick sketch of the first two follows this list):

1. Use the method='multi' parameter in the to_sql() method call
2. Add the fast_executemany=True parameter to the SQLAlchemy engine
3. Use both methods 1 and 2
4. Use sqlalchemy_turbodbc instead of vanilla SQLAlchemy
5. Use pymssql instead of SQLAlchemy
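To make the first two options concrete, here is a minimal sketch of what they look like. The connection string and table name are placeholders; the engines actually used for the benchmark are created in the Environment section below.

import pandas as pd
import sqlalchemy as sql

# Placeholder connection string; see the Environment section for the real engines
URI = "mssql+pyodbc://my-server/my-db?driver=ODBC Driver 17 for SQL Server"
df = pd.DataFrame({"a": [1, 2, 3]})

# Option 1: multi-row INSERT statements, requested on the to_sql() call itself
engine = sql.create_engine(URI)
df.to_sql("my_table", engine, index=False, if_exists="replace", method="multi")

# Option 2: pyodbc's fast_executemany, enabled on the engine instead
engine_fast = sql.create_engine(URI, fast_executemany=True)
df.to_sql("my_table", engine_fast, index=False, if_exists="replace")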
But so far I haven't seen these benchmarked against each other. In this article, we will compare these methods to find which offers the best performance for writing data from a pandas DataFrame to Microsoft SQL Server.
Failed implementations¶
I also tried the following methods, but there was some issue or other reason for not including them in the comparison:

- BULK INSERT. A workaround we see a lot of in StackOverflow answers is to write your DataFrame to CSV and load it directly with BULK INSERT. I don't have those permissions on this database, so I won't include it, but it definitely feels like something you should try if you can (a rough, untested sketch follows this list).
- turbodbc + fast_executemany, as this method is not implemented for that SQLAlchemy dialect.
- pymssql + fast_executemany, as this method is also not implemented for that SQLAlchemy dialect.
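For reference, here is a rough, untested sketch of what the BULK INSERT route could look like. It reuses the conn_sqlalchemy engine and TABLE name defined in the Environment section below; the CSV path is hypothetical, the SQL Server instance must be able to read it, and FORMAT = 'CSV' requires SQL Server 2017 or later.

# Untested sketch - assumes BULK INSERT permissions and a CSV path that the
# SQL Server instance itself can read (e.g. a network share)
csv_path = r"\\fileshare\tmp\perftest.csv"  # hypothetical, server-visible path

df.to_csv(csv_path, index=False, header=False)

with conn_sqlalchemy.begin() as con:
    # Create an empty table with the right columns, then bulk load into it
    df.head(0).to_sql(TABLE, con, index=False, if_exists="replace")
    con.execute(
        f"BULK INSERT {TABLE} FROM '{csv_path}' WITH (FORMAT = 'CSV', FIRSTROW = 1)"
    )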
Imports¶
%load_ext nb_black
import json
from os.path import expanduser
import math
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymssql as ms
import sqlalchemy as sql
import sqlalchemy_turbodbc as st
We also needed to pip install sqlalchemy_turbodbc and pymssql, which thankfully get picked up automatically by SQLAlchemy as dialects once they're installed.
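In a notebook cell that's just the usual pip invocation (the package is published on PyPI as sqlalchemy-turbodbc, with a hyphen, though pip accepts either spelling):

!pip install sqlalchemy-turbodbc pymssql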
Versions¶
print("SQLAlchemy version:", sql.__version__)
print("Pandas version", pd.__version__)
print("Numpy version:", np.__version__)
print("pymssql version:", ms.__version__)
try:
    print("sqlalchemy_turbodbc version:", st.__version__)
except AttributeError:
    print("No sqlalchemy_turbodbc version available")
Environment¶
HOME = expanduser("~")
# Enterprise DB to be used
DRIVER = "ODBC Driver 17 for SQL Server"
DSN = "tbsuatsql05"
DB = "DS_DEV"
TABLE = "perftest"
conn_sqlalchemy = sql.create_engine(f"mssql+pyodbc://{DSN}/{DB}?driver={DRIVER}")
conn_executemany = sql.create_engine(
f"mssql+pyodbc://{DSN}/{DB}?driver={DRIVER}", fast_executemany=True
)
conn_turbodbc = sql.create_engine(f"mssql+turbodbc://{DSN}/{DB}?driver={DRIVER}")
conn_pymssql = sql.create_engine(f"mssql+pymssql://{DSN}/{DB}")
Create dummy dataset¶
We'll create something here that is sufficiently large both to have a real impact on write times and to trigger the "too many arguments" error when trying to write with the 'multi' parameter.
df = pd.DataFrame(np.random.random((10 ** 4, 10)))
print("DataFrame contains", len(df), "rows by", len(df.columns), "columns")
df.head()
print("DataFrame is", round(sys.getsizeof(df) / 1024 ** 2, 1), "MB")
Results¶
Option 1 — Vanilla pandas
%%time
df.to_sql(TABLE, conn_sqlalchemy, index=False, if_exists='replace')
Option 2 — df.to_sql(..., method='multi')
This method hits the 2100-parameter limit that MS SQL imposes on a single statement, so we have to break the insert up into chunks that stay under that limit. Pandas has a chunksize parameter, but it counts rows rather than parameters, and in our testing the straight pandas call with chunksize=2100 gives database errors I can't troubleshoot from pandas alone.
Instead, we'll build a little helper function to do this for us:
def chunker(seq, size):
    # Yield successive slices of seq, each at most `size` rows long
    return (seq[pos : pos + size] for pos in range(0, len(seq), size))


def insert_with_progress(df, engine, table="", schema=""):
    # Replace the table if it already exists
    full_name = f"{schema}.{table}" if schema else table
    engine.execute(f"DROP TABLE IF EXISTS {full_name};")

    # SQL Server allows at most 2100 parameters per statement, so size each
    # chunk accordingly (e.g. 10 columns -> floor(2099 / 10) = 209 rows per chunk)
    SQL_SERVER_CHUNK_LIMIT = 2099
    chunksize = math.floor(SQL_SERVER_CHUNK_LIMIT / len(df.columns))

    # Insert chunk by chunk, re-using a single connection
    with engine.connect() as con:
        for chunk in chunker(df, chunksize):
            chunk.to_sql(
                name=table,
                con=con,
                if_exists="append",
                index=False,
                schema=schema,
                method="multi",
            )
%%time
# Doing it like this errors out. Can't seem to be able to debug the straight pandas call.
# df.to_sql(TABLE, conn_sqlalchemy, index=False, if_exists='replace', method='multi', chunksize=2100)
insert_with_progress(df, conn_sqlalchemy, table=TABLE)
Option 3 — fast_executemany=True
N.B. This is currently the recommended method in the pandas documentation.
%%time
df.to_sql(TABLE, conn_executemany, index=False, if_exists='replace')
For a 0.8 MB table this is still remarkably slow, but we're now ~50x faster than vanilla!
Option 4 — 'multi' + fast_executemany
%%time
insert_with_progress(df, conn_executemany, table=TABLE)
Oddly, these two speed-ups don't compound: we get basically the same time as with 'multi' alone.
Option 5 — sqlalchemy_turbodbc
%%time
df.to_sql(TABLE, conn_turbodbc, index=False, if_exists='replace')
So, over a 20x speedup from the vanilla settings, but still 2-3x slower than with the fast_executemany
option.
Option 6 — sqlalchemy_turbodbc + 'multi'
fast_executemany is not available as an option when creating the turbodbc engine, but we can still test with the method='multi' parameter.
%%time
insert_with_progress(df, conn_turbodbc, table=TABLE)
Option 7 — pymssql
Another SQL Server interface (based on FreeTDS rather than ODBC) that pops up in StackOverflow answers, this one doesn't quite seem to be up to snuff. It also doesn't have the fast_executemany option.
%%time
df.to_sql(TABLE, conn_pymssql, index=False, if_exists='replace')
Option 8 — pymssql + 'multi'
%%time
insert_with_progress(df, conn_pymssql, table=TABLE)
Analysis¶
times = {
    "vanilla": 23.5,
    "multi": 3.09,
    "fast_executemany": 0.5,
    "multi+fast": 3,
    "turbodbc": 1.17,
    "turbodbc+multi": 10.9,
    "pymssql": 31.7,
    "pymssql+multi": 6.42,
}
# We're done with the testing, so remove the testing table
conn_sqlalchemy.execute(f"DROP TABLE IF EXISTS {TABLE};")
# Then plot the results
df_times = pd.DataFrame.from_dict(times, orient="index", columns=["time (s)"])
df_times["time (s)"].sort_values().plot.bar(
title="SQL Server write times (log)\n(lower is better)", logy=True
)
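As a small addition to the analysis (not in the original timings), the same numbers can be re-expressed as speedups relative to the vanilla run, which makes the ~50x and ~20x figures quoted above easy to check:

# Speedup of each method relative to the vanilla to_sql() call
df_times["speedup vs vanilla"] = (times["vanilla"] / df_times["time (s)"]).round(1)
df_times.sort_values("time (s)")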
Conclusion¶
So, there we are! In our testing, we've seen that writing to SQL Server through the ODBC connection is substantially slower than writing to e.g. PostgreSQL. There are many ways to make the process faster, but here's the rule of thumb to reach for when faced with this situation.
Rule of thumb when writing to SQL Server from Pandas:

- If you can use BULK INSERT, do so
- Else, pass fast_executemany=True to your database engine object