# Unhashable in Python - Getting the number of unique locations in a GeoDataFrame

Posted on September 14, 2018 in Python

I use geopandas on a regular basis, and one of its minor irritants is counting the unique geometries in a GeoDataFrame.

In this particular instance, I wanted to know how many duplicate locations I had in my dataset. The solution sidesteps the issue rather than solving it directly, and along the way we'll look at the difference between mutable and immutable objects in Python and why it matters for hashing.

## Background

In [44]:
import geopandas as gpd
from shapely.geometry import Point

In [45]:
# Immutable objects are hashable, e.g. strings
print(set(['apple', 'orange']))

{'apple', 'orange'}

In [46]:
# But not mutable ones
print(set(['apple', ['orange', 'banana']]))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-6bb9ecf39f11> in <module>()
1 # But not mutable ones
----> 2 print(set(['apple', ['orange', 'banana']]))

TypeError: unhashable type: 'list'

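That's because we're feeding set() a string and a list, and the list is a mutable object. A set can only hold hashable elements: each element needs a hash value that never changes while it is in the set. Lists are meant to change readily (as opposed to tuples), so Python does not define a hash for them and raises a TypeError instead.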
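To make the distinction concrete, here is a minimal sketch using only the standard library. The variable names are illustrative; the point is that a tuple with the same contents as the offending list is accepted by set().

```python
from collections.abc import Hashable

fruit_tuple = ('orange', 'banana')   # immutable
fruit_list = ['orange', 'banana']    # mutable

# A tuple of hashable items is hashable; a list never is.
tuple_is_hashable = isinstance(fruit_tuple, Hashable)
list_is_hashable = isinstance(fruit_list, Hashable)
print(tuple_is_hashable, list_is_hashable)

# Converting the inner list to a tuple lets set() accept it.
fruits = set(['apple', tuple(fruit_list)])
print(len(fruits))

# Hashing the list directly raises the same TypeError as above.
try:
    hash(fruit_list)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

The same idea drives the workaround later in this post: swap the mutable object for an immutable representation of it before hashing.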

## The problem

Geopandas geometries are stored as shapely.geometry objects, which, in the Shapely 1.x releases this post was written against, are unhashable in Python.

In [47]:
points = [Point([0, 0]), Point([1, 1])]
print(len(set(points)))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-47-bf8d7d6f42c6> in <module>()
1 points = [Point([0, 0]), Point([1, 1])]
----> 2 print(len(set(points)))

TypeError: unhashable type: 'Point'

The same happens with a GeoDataFrame, whose geometry column holds shapely.geometry objects:

In [48]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
print(len(world['geometry'].unique()))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 2 print(len(world['geometry'].unique()))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in unique(self)
1491         Categories (3, object): [a < b < c]
1492         """
-> 1493         result = super(Series, self).unique()
1494
1495         if is_datetime64tz_dtype(self.dtype):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in unique(self)
1047         else:
1048             from pandas.core.algorithms import unique1d
-> 1049             result = unique1d(values)
1050
1051         return result

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py in unique(values)
366
367     table = htable(len(values))
--> 368     uniques = table.unique(values)
369     uniques = _reconstruct_data(uniques, dtype, original)
370

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.unique()

TypeError: unhashable type: 'Polygon'

This happens because shapely objects are mutable: an object can be transformed and rewritten in place in memory under the same variable name, so a hash derived from its coordinates would not stay valid.
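The link between mutability and hashing can be illustrated with a small stand-in class. `MutablePoint` below is hypothetical and is not Shapely's implementation; it only shows the general Python rule that a class defining value-based equality (`__eq__`) without defining `__hash__` becomes unhashable.

```python
class MutablePoint:
    """Hypothetical stand-in for a mutable geometry (not Shapely's class)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Equality is based on coordinates, which can change after creation.
        return (isinstance(other, MutablePoint)
                and (self.x, self.y) == (other.x, other.y))


# Defining __eq__ without __hash__ makes Python set __hash__ to None.
hash_is_disabled = MutablePoint.__hash__ is None
print(hash_is_disabled)

try:
    set([MutablePoint(0, 0), MutablePoint(1, 1)])
    set_failed = False
except TypeError as err:
    set_failed = True
    print(err)
```

If such an object were hashable by its coordinates, changing `x` or `y` after adding it to a set would leave it filed under a stale hash, which is the situation Python is guarding against.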

## The solution

To get around this issue, you need to modify the way the geometry object is represented before passing your iterable to set() or GeoSeries.unique().

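As with the strings in the background example, we'll convert each geometry to a str, here its WKT (well-known text) representation. Strings are immutable and therefore hashable.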

In [49]:
geometries = world['geometry'].apply(lambda x: x.wkt).values
print(geometries[0][:100], '...')

POLYGON ((61.21081709172574 35.65007233330923, 62.23065148300589 35.27066396742229, 62.9846623065766 ...

In [50]:
len(set(geometries))

Out[50]:
177
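Once the geometries are strings, counting duplicates is a matter of comparing the total against the number of unique values. The sketch below uses a short, made-up list of WKT strings (`wkt_strings` is a hypothetical stand-in for the `geometries` array above) so that it runs without geopandas.

```python
from collections import Counter

# Hypothetical WKT strings standing in for the `geometries` array above.
wkt_strings = [
    'POINT (0 0)',
    'POINT (1 1)',
    'POINT (0 0)',
    'POINT (2 2)',
    'POINT (1 1)',
    'POINT (0 0)',
]

# Number of distinct locations, and how many rows repeat an earlier one.
n_unique = len(set(wkt_strings))
n_duplicates = len(wkt_strings) - n_unique
print(n_unique, n_duplicates)

# Which locations appear more than once, and how often.
counts = Counter(wkt_strings)
repeated = {wkt: n for wkt, n in counts.items() if n > 1}
print(repeated)
```

One caveat: this comparison is textual. Two geometries that describe the same shape but list their vertices in a different order, or differ in coordinate formatting, produce different WKT strings and are counted as distinct.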