Building a Global Shipping Density Map with H3 Hexagons

Published

I’ve always been interested in the problem of turning noisy, messy AIS data into something genuinely useful for maritime intelligence tooling. On paper it sounds straightforward: vessels emit positions, headings, speeds, and destinations... surely you can just stitch it together into something queryable. In practice, AIS data is chaotic. Signals are noisy, routes are inconsistent, destinations are free text, and vessel movement is anything but clean.

I’ve taken a few shots at solving this problem over the years, usually ending somewhere between "interesting experiment" and "too computationally expensive to be practical." This latest attempt has been the most promising so far, largely thanks to Uber’s H3 hexagonal geospatial library and a shift in thinking from exact vessel paths to probabilistic movement through space.

Here is the result - a global hexagonal grid showing shipping density, flow direction, and transition probabilities. Its possibly one of the more satisfying data engineering projects I’ve worked on recently.

Res3 H3 cells at global scale. Each hexagon covers roughly 200–500km. Colour gradient shows vessel-hour density.

I've provided a short run down on building the data output and why I chose h3 cells over other techniques.

What is AIS data, and why is it a mess?

AIS (Automatic Identification System) is the transponder system every vessel above a certain size is legally required to run. Ships broadcast their position, speed, heading, and a bunch of metadata every few seconds. Networks of shore stations and satellites receive these broadcasts and aggregate them into datasets you can buy through sources like MarineTraffic, Spire or sometimes find for free in limited amounts.

Sounds clean. It really isn't.

Here's the reality of what you're dealing with:

  • Duplicate pings. The same position broadcast gets picked up by multiple receivers and ends up in the dataset twice (or ten times).

  • Noisy data. Vessels continue to ping while they are drifting at anchor or in port. This turns in thousands of tiny movements that need to be normalised into a single event.

  • Position jumps. A vessel appears in the North Sea, then 200nm away three seconds later, then back again. GPS glitch, satellite handoff artefact, take your pick.

  • Speed outliers. Ships physically cannot exceed about 30 knots. The raw data often contains vessels doing >400 knots!

  • Missing voyages. Vessels go "dark". Either intentionally (spoofing) or because they left satellite coverage (ie while in the middle of the Pacific Ocean). You get gaps of hours or days.

  • Variable sampling rates. One vessel reports every 10 seconds. Another reports every 45 minutes. Both end up in the same dataset and need to be treated differently.

Before you can do anything interesting, you have to clean this all up.

Cleaning and segmenting the data

The cleaning pass is mostly filtering: deduplicating by quantising coordinates (ie rounding 34.59595959 down to 34.59 so pings from a barely-moving vessel don't get counted twice), dropping speeds above 30 knots, dropping coordinates outside valid WGS-84 ranges, dropping records missing origin/destination etc.

The filtering is pretty straightforward, but the interesting bit here is actually the voyage segmentation. Grouping position records into discrete journeys. I was lucky here: my dataset already had origin port, destination port, and departure time attached to each ping. That means segmentation is just a group-by rather than proximity detection.

df = df.with_columns(
pl.concat_str(
[pl.col("imo").cast(pl.String),
pl.col("origin_locode"),
pl.col("destination_locode"),
pl.col("atd").cast(pl.String)],
separator="|",
).alias("voyage_id")
)

Each unique `(imo, origin, destination, departure_time)` tuple becomes a voyage ID. Legs with fewer than 5 positions are dropped. They carry too little signal to be useful. This is important: you don't want a vessel that reported twice in the middle of the ocean distorting your density counts.

Why H3?

Once the data is clean, the next question is: what do you project it onto?
I chose Uber's H3 library, which gives you a hierarchical hexagonal grid covering the globe. A few reasons hexagons specifically:

  • Uniform adjacency. Every hex has exactly 6 neighbours, all at the same distance. Square grids have 8 neighbours but the diagonal ones are farther away, which creates subtle distortions in any algorithm that traverses a grid mapping over a sphere.

  • Clean hierarchical levels. H3 lets you pick a resolution from 0 (enormous cells) to 15 (tiny). Each level subdivides the previous one, so you can blend resolutions cleanly.

  • Trivial coordinate projection. Converting a lat/lng to an H3 cell is one function call.

andrewgreig.com

Its pretty easy to work with:

import h3
cell = h3.latlng_to_cell(lat, lon, resolution=3) # → '831b34fffffffff'

That's it. Every position in your dataset becomes a hex cell ID. The hard work is in how you aggregate once you're there.

I used two resolutions: res3 (~200–500km cells, good for open ocean corridors) and res4 (~50–150km cells, for denser coastal and port areas). The plan is to blend these on zoom in the frontend, res3 at global scale, res4 when you zoom into a region.

Here is the h3 cell output mapped using our AIS to show cells that have seen any sort of activity. You can already see shipping lanes starting to emerge before we even apply the heatmap coloring.

andrewgreig.com

Building the confidence map

Here's where it gets interesting. I don't just want to know where ships have been, I want to know how much, for how long, and in which direction they were heading.

The core metric is vessel-hours per cell: the sum of time each vessel spent in that cell, computed from the time delta between consecutive pings. This corrects for variable sampling rates. A vessel that reports every 10 seconds and a vessel that reports every 30 minutes both contribute proportionally to the time they actually spent in a cell.

1df = df.with_columns([
2 pl.col("h3_index").shift(-1).over("voyage_id").alias("next_h3_index"),
3 pl.col("reported_at").shift(-1).over("voyage_id").alias("next_reported_at"),
4])
5
6df = df.with_columns(
7 ((pl.col("next_reported_at") - pl.col("reported_at"))
8 .dt.total_seconds() / 3600.0)
9 .alias("time_delta_hours")
10)
11
12# Drop tracking gaps — not continuous travel
13df = df.filter(
14 (pl.col("time_delta_hours") > 0) & (pl.col("time_delta_hours") <= 3.0)
15)

The 3-hour gap threshold is a deliberate choice. If two consecutive pings from the same vessel are more than 3 hours apart, something interrupted tracking, ie. a dark period, a port stop, satellite dropout. You don't want to credit that gap to the cell as if the vessel was sitting there the whole time.

I also compute a flow vector per cell: the mean bearing and speed of all vessels passing through. Averaging angles directly is wrong (0° and 360° are the same heading, but their numeric mean is 180°), so I had to decompose into sin/cos components and use atan2 to recover the mean.

1df = df.with_columns([
2 (pl.col("course") * math.pi / 180.0).sin().alias("course_sin"),
3 (pl.col("course") * math.pi / 180.0).cos().alias("course_cos"),
4])
5
6cells = df.group_by("h3_index").agg([
7 pl.col("time_delta_hours").sum().alias("density_vessel_hours"),
8 pl.col("course_sin").mean().alias("_mean_sin"),
9 pl.col("course_cos").mean().alias("_mean_cos"),
10 pl.col("speed_kts").mean().alias("flow_speed_kts"),
11 pl.col("voyage_id").n_unique().alias("voyage_count"),
12])
13
14cells = cells.with_columns(
15 ((pl.arctan2(pl.col("_mean_sin"), pl.col("_mean_cos"))
16 * (180.0 / math.pi) % 360.0 + 360.0) % 360.0)
17 .alias("flow_bearing_deg")
18)

The output is two parquet files:

  • h3_cells.parquet. One row per H3 cell, with density, flow bearing, flow speed, voyage count.

  • h3_transitions.parquet. one row per (from_cell, to_cell) pair, with count and transition probability

That second file is where things start to look like a graph. More on that shortly.

But, before doing anything else with this data, I ran some sanity checks against known chokepoints. The numbers seemed to land about where I'd expect:

LocationRes3 Vessel-hoursVoyages
Strait of Malacca2,177441
Singapore Strait2,446317
English Channel930188
Strait of Gibraltar1,266220
Cape of Good Hope563139
Lombok Strait194

Lombok being sparse makes sense, it's a secondary passage around Bali, less commonly used than Malacca. Cape of Good Hope having fewer voyages but non-trivial vessel-hours suggests longer transits (bulk carriers, tankers on the Cape route rather than through Suez).

The dataset is also heavily Asia-Pacific weighted (due to the tracking data I had access to), which shows in the map. The East China Sea and Malacca cells are the densest in the world by a wide margin.

From parquet to map

The parquet files are the canonical store of the h3 data, but MapLibre GL can't read parquet, it needs GeoJSON. The export step converts each H3 cell ID into its actual hexagon polygon using h3.cell_to_boundary(), attaches the density and flow properties, and writes one GeoJSON FeatureCollection per resolution into the frontend's public directory.

boundary = h3.cell_to_boundary(row["h3_index"])
coordinates = [[lon, lat] for lat, lon in boundary] # h3 returns (lat, lon); GeoJSON wants [lon, lat]
coordinates.append(coordinates[0]) # close the ring

There's one non-obvious gotcha: cells that straddle the antimeridian (the 180° longitude line, out in the Pacific) need their vertices normalised so they don't wrap the wrong way around the globe. MapLibre handles coordinates outside [-180, 180] fine via wrapping, you just have to make sure each vertex is within 180° of its cell centroid rather than letting them drift to opposite sides of the line.

Once exported, the two GeoJSON files get served as static assets and loaded into MapLibre as fill layers, with density_vessel_hours mapped to a colour ramp. The multi-resolution blend uses res3 globally and switches to res4 on zoom using a MapLibre zoom-level style rule.

Res4 H3 cells (Singapore and the Malacca Strait). Each hexagon is roughly 50–150km.

What this grid makes possible

Congestion and density heatmaps. This is the most obvious use. The vessel-hours metric is already a congestion signal which means cells with very high density relative to their neighbours are likely chokepoints. With a live feed of AIS data, you could compute a rolling density that gives you a near-real-time congestion map. Suez gets blocked again, the Cape of Good Hope cells light up.

Shipping lane discovery. The transition graph (from_cellto_cell with probabilities) is already a directed weighted graph. Run a graph algorithm over it (ie A* or Dijkstra) and you get paths that naturally follow where ships actually go. You can extract lane centerlines without any manual cartographic input. The lanes are latent in the data; you just need to surface them.

Pathfinding. The transition graph is also a pathfinding graph. If you want to know the most likely route a vessel would take from Rotterdam to Singapore, you can query this graph directly. The challenge is that raw AIS transitions have some noise (land-crossing edges from position jumps, sparse coverage in remote ocean areas), but those are solvable problems.

Anomaly detection. A vessel appearing in a cell with near-zero historical density is unusual. A transition pair that's never appeared in the training data is unusual. With the baseline established, you can flag deviations, ie vessels taking unexpected routes, loitering in atypical areas, or transiting a corridor in the wrong direction.

Port pressure and dwell time. Cells adjacent to major ports accumulate vessel-hours from both transit traffic and waiting vessels. You can use the grid to reason about anchorage congestion, port approach bottlenecks, and how changes in trade flows affect port pressure over time.

Res4 H3 cells (Iceland, Greenland and Northern Europe). Each hexagon is roughly 50–150km.

The thing I found most satisfying about this project is how much structure was already latent in the raw data. You don't impose the shipping lanes, they naturally emerge from counting carefully and in a coordinated way. The H3 grid is just a sensible bucket to pour the counts into, and the transition pairs are just the natural next step after you've assigned each ping to a cell.

The hard parts were the boring ones: deduplication, gap handling, the circular mean trick for bearings. The interesting structure came for free once the data was clean.

H3 doesn't get talked about much outside of ride-sharing and delivery logistics contexts, but it's a genuinely useful primitive for maritime and geospatial work. If you're doing anything with vessel tracking, weather data, or any other phenomenon that needs a global spatial index, it's worth a look.

Backend code is written in Python, using Polars for the dataframe work and the h3-py bindings for cell operations. The frontend renders via MapLibre GL JS.