Unhashable in Python - Getting the unique number of locations in a GeoDataFrame
Posted on September 14, 2018 in Python
I use geopandas on a regular basis, and one of its minor irritants is counting the unique geometries in a GeoDataFrame.
In this particular instance, I wanted to know how many duplicate locations I had in my dataset. The solution sidesteps the issue rather than solving it head-on, and along the way we'll look at the difference between mutable and immutable objects in Python and what that means for hashing.
Background
import geopandas as gpd
from shapely.geometry import Point
# Immutable objects are hashable, e.g. strings
print(set(['apple', 'orange']))
# But mutable ones are not, so this call raises a TypeError
print(set(['apple', ['orange', 'banana']]))
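The second call fails with TypeError: unhashable type: 'list'. That's because we're feeding set() a string and a list, the latter being a mutable object. Lists are meant to change readily (as opposed to tuples), so Python does not define a hash for them: a hash has to stay constant for the lifetime of an object, and a list's contents do not.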
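A quick way to check this directly is to call hash() on an object, which is what set() does under the hood. The sketch below uses only the standard library; the helper name is_hashable is made up for illustration.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('apple'))               # strings are immutable
print(is_hashable(('orange', 'banana')))  # so is a tuple of strings
print(is_hashable(['orange', 'banana']))  # lists are mutable

# Converting the inner list to a tuple lets the earlier set() call go through
fruits = set(['apple', tuple(['orange', 'banana'])])
print(len(fruits))
```

Swapping a mutable container for an immutable equivalent is the same trick used on the geometries further down.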
The problem
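Geopandas geometries are stored as shapely.geometry objects, which (in the Shapely 1.x releases this post was written against) have the interesting property of being unhashable in Python. Passing them to set() raises a TypeError, just like the list above.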
points = [Point([0, 0]), Point([1, 1])]
print(len(set(points)))
The same happens in Geopandas, whose geometry column holds shapely.geometry objects:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
print(len(world['geometry'].unique()))
This is because shapely objects are mutable: each object can be modified in place while keeping the same identity, so a hash computed from its contents would not stay valid.
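To see why mutability and hashing don't mix, here is a minimal sketch using a hypothetical MutablePoint class. It is not a shapely class, only a stand-in that is both mutable and hashed by value, which is the combination shapely chose to avoid.

```python
class MutablePoint:
    """Hypothetical stand-in (not part of shapely): mutable, hashed by value."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        if not isinstance(other, MutablePoint):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


p = MutablePoint(0, 0)
seen = {p}        # the set files p under hash((0, 0))
p.x = 5           # mutate the object after it has been stored

# The set still holds one element, but neither its old value
# nor its new value can be looked up any more.
print(len(seen))
print(MutablePoint(0, 0) in seen)
print(MutablePoint(5, 0) in seen)
```

The set ends up holding an element that no value-based lookup can find, which is why a type should be either immutable or unhashable.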
The solution
To get around this issue, you need to change how each geometry object is represented before passing your iterable to set() or GeoSeries.unique().
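As in the Background example, we'll convert them to str, which is immutable and therefore hashable. Each geometry's wkt attribute gives its Well-Known Text representation.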
geometries = world['geometry'].apply(lambda x: x.wkt).values
print(geometries[0][:100], '...')
len(set(geometries))
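Once the geometries are plain strings, counting the duplicates themselves (the original motivation) is a short step further. The following is a sketch only: duplicate_summary is a made-up helper name, and the hand-written WKT strings stand in for the geometries array above.

```python
from collections import Counter


def duplicate_summary(wkt_strings):
    """Return (n_unique, n_extra_copies, repeated) for an iterable of WKT strings."""
    counts = Counter(wkt_strings)
    # Keep only the geometries that occur more than once
    repeated = {wkt: n for wkt, n in counts.items() if n > 1}
    n_unique = len(counts)
    # Rows that would disappear if every geometry were kept only once
    n_extra_copies = sum(counts.values()) - n_unique
    return n_unique, n_extra_copies, repeated


# Hypothetical sample data in place of the real geometries array
sample = ['POINT (0 0)', 'POINT (1 1)', 'POINT (0 0)', 'POINT (2 2)', 'POINT (0 0)']
n_unique, n_extra_copies, repeated = duplicate_summary(sample)
print(n_unique, n_extra_copies, repeated)
```

Bear in mind that this comparison is purely textual: two geometries describing the same shape with a different vertex order or starting point produce different WKT and are counted as distinct.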
See also
See the discussion on the project's GitHub repo, where the owner of the package weighs in:
sgillies commented on Sep 6, 2015
Shapely's geometries are mutable, but we're providing a hash function. These two features are inconsistent. Rather than remove mutability (for now) we'll remove the hashability.
The issue is also discussed in geopandas issue 221. In some ways, this behaviour jars with the rest of pandas, which provides "view access" to the data it stores... immutably.
A word to the wise!