Unhashable in Python - Getting the number of unique locations in a GeoDataFrame

Posted on September 14, 2018 in Python

I use geopandas on a regular basis, and one of its minor irritants is counting the unique geometries in a GeoDataFrame.

In this particular instance, I wanted to know how many duplicate locations I had in my dataset. The solution sidesteps the issue rather than solving it head-on. Along the way, we'll look at the difference between mutable and immutable objects in Python and what that difference means for hashing.

Background

In [44]:
import geopandas as gpd
from shapely.geometry import Point
In [45]:
# Immutable objects are hashable, e.g. strings
print(set(['apple', 'orange']))
{'apple', 'orange'}
In [46]:
# But not mutable ones
print(set(['apple', ['orange', 'banana']]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-6bb9ecf39f11> in <module>()
      1 # But not mutable ones
----> 2 print(set(['apple', ['orange', 'banana']]))

TypeError: unhashable type: 'list'

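That's because we're feeding set() a string and a list, the latter being a mutable object. A set locates its members by their hash value, and that value has to stay the same for as long as the object lives. Lists are meant to change in place (as opposed to tuples), so Python does not define a hash for them and raises a TypeError when asked to compute one.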
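
As a minimal sketch of the same point, the snippet below probes a few built-in objects with hash(). The helper name is_hashable is made up for illustration. Note that a tuple is only hashable if everything inside it is hashable too.

```python
def is_hashable(obj):
    """Return True if Python can compute a hash for obj (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable('apple'))                   # str: True
print(is_hashable(('orange', 'banana')))      # tuple of str: True
print(is_hashable(['orange', 'banana']))      # list: False
print(is_hashable(('orange', ['banana'])))    # tuple holding a list: False

# Converting the inner list to a tuple makes the earlier set() call succeed.
fruits = set(['apple', tuple(['orange', 'banana'])])
print(len(fruits))                            # 2
```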

The problem

Geopandas geometries are stored as shapely.geometry objects, which have the interesting property of being unhashable in Python (at least in the Shapely 1.x releases that were current when this was written).

In [47]:
points = [Point([0, 0]), Point([1, 1])]
print(len(set(points)))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-47-bf8d7d6f42c6> in <module>()
      1 points = [Point([0, 0]), Point([1, 1])]
----> 2 print(len(set(points)))

TypeError: unhashable type: 'Point'

The same happens in Geopandas, whose geometry column holds shapely.geometry objects:

In [48]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
print(len(world['geometry'].unique()))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-168312a37ad7> in <module>()
      1 world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
----> 2 print(len(world['geometry'].unique()))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in unique(self)
   1491         Categories (3, object): [a < b < c]
   1492         """
-> 1493         result = super(Series, self).unique()
   1494 
   1495         if is_datetime64tz_dtype(self.dtype):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in unique(self)
   1047         else:
   1048             from pandas.core.algorithms import unique1d
-> 1049             result = unique1d(values)
   1050 
   1051         return result

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py in unique(values)
    366 
    367     table = htable(len(values))
--> 368     uniques = table.unique(values)
    369     uniques = _reconstruct_data(uniques, dtype, original)
    370 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.unique()

TypeError: unhashable type: 'Polygon'

The reason is that shapely objects are mutable: their coordinates can be changed in place, so a hash computed from those coordinates could go stale while the object sits in a set or a dictionary.
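
To see why mutability and hashing don't mix, here is a small sketch using a toy class. MutablePoint is hypothetical and is not part of shapely. It only imitates an object that is both mutable and hashed on its coordinates.

```python
class MutablePoint:
    """Toy class (not shapely): mutable coordinates plus a coordinate-based hash."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, MutablePoint) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


p = MutablePoint(0, 0)
seen = {p}
print(p in seen)                   # True: found through its hash

p.x = 5                            # mutate in place; the set still files p under the old hash
print(p in seen)                   # False: the new hash points elsewhere
print(MutablePoint(0, 0) in seen)  # False: old hash matches, but equality now fails
print(len(seen))                   # 1: the object is still in there, just unreachable
```

Once the object changes, the set can no longer find it, which is the kind of inconsistency that removing the hash function avoids.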

The solution

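To get around this issue, you need to convert each geometry to a hashable representation before passing your iterable to set() or GeoSeries.unique().

As with the strings in the Background section, we'll convert them to type str, which is immutable and hashable. Here the string is each geometry's WKT (well-known text) representation.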

In [49]:
geometries = world['geometry'].apply(lambda x: x.wkt).values
print(geometries[0][:100], '...')
POLYGON ((61.21081709172574 35.65007233330923, 62.23065148300589 35.27066396742229, 62.9846623065766 ...
In [50]:
len(set(geometries))
Out[50]:
177
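
To go from the number of unique geometries to the number of duplicates, one option is to count the WKT strings. The sketch below is only an illustration: count_duplicates is a hypothetical helper, and the SimpleNamespace objects are stand-ins exposing a .wkt attribute so the snippet runs without shapely installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_duplicates(geoms, key=lambda g: g.wkt):
    """Return (n_unique, n_duplicates) for an iterable of geometry-like objects.

    key maps each object to a hashable stand-in (its WKT string by default).
    """
    counts = Counter(key(g) for g in geoms)
    n_unique = len(counts)
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


# Stand-ins for shapely geometries: anything exposing a .wkt string works here.
fake_points = [
    SimpleNamespace(wkt="POINT (0 0)"),
    SimpleNamespace(wkt="POINT (1 1)"),
    SimpleNamespace(wkt="POINT (0 0)"),
]
print(count_duplicates(fake_points))  # (2, 1)
```

With real data you would pass world['geometry'] instead of the stand-ins. Keep in mind that the comparison is purely textual, so two geometries that trace the same shape with a different vertex order count as distinct.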

See also

See the discussion on the project's GitHub repo, where the owner of the package addresses the issue:

sgillies commented on Sep 6, 2015

Shapely's geometries are mutable, but we're providing a hash function. These two features are inconsistent. Rather than remove mutability (for now) we'll remove the hashability.

The issue is also discussed in geopandas issue 221. In some ways, this behaviour jars with the rest of pandas, which provides "view access" to the data it stores... immutably.

A word to the wise!