I was recently working on a project that required a GTFS ingestion step: loading GTFS files, unzipping them, and inserting their contents into a PostgreSQL database.
For those unfamiliar, GTFS (General Transit Feed Specification) is a standardized format used by public transit agencies to share schedule and geographic data—covering buses, trains, ferries, and more. Apps like Google Maps use this data to show transit options and schedules.
A typical GTFS dataset is a ZIP archive containing several CSV files, including:
```
gtfs-data.zip
├── agency.txt          (Transit agency information)
├── routes.txt          (Bus/train routes and lines)
├── trips.txt           (Individual scheduled trips)
├── stops.txt           (Stop/station locations)
├── stop_times.txt      (Arrival/departure times at stops)
├── calendar.txt        (Service dates - weekdays, weekends)
├── calendar_dates.txt  (Service exceptions - holidays, etc.)
├── fare_attributes.txt (Fare information - optional)
├── fare_rules.txt      (Fare rules - optional)
└── shapes.txt          (Route geometry - optional)

Relationships:
agency.txt → routes.txt → trips.txt → stop_times.txt ← stops.txt
                              ↑
                         calendar.txt
```
These files are relational: for example, agency.txt → routes.txt → trips.txt → stop_times.txt, which connects to stops.txt. They're designed to map out how services run, when, and where.
Our goal was to ingest feeds from multiple agencies into a single database so we could support real-time queries like “When does the next X Agency bus arrive at Stop Z?”
We wrote a set of insert functions, one per GTFS file we wanted to load, using the Go pgx driver's connection pool and a simple transaction helper to insert rows into the database:
```go
// InsertStops upserts rows parsed from stops.txt.
func (r *repo) InsertStops(ctx context.Context, stops []Stop, feedName string) error {
	columns := []string{
		"id",
		"stop_code",
		"stop_name",
		"stop_desc",
		"stop_lat",
		"stop_lon",
		"zone_id",
		"stop_url",
		"location_type",
		"wheelchair_boarding",
		"parent_station",
		"stop_timezone",
	}

	// Build an upsert (INSERT ... ON CONFLICT) statement for the stops table.
	sql := db.BuildUpsertQuery("stops", columns, "id")

	return db.Transaction(func(ctx context.Context, tx pgx.Tx) error {
		for _, stop := range stops {
			// One INSERT statement (and round-trip) per row.
			_, err := tx.Exec(ctx, sql,
				stop.ID,
				stop.Code,
				stop.Name,
				stop.Description,
				stop.Lat,
				stop.Lon,
				stop.ZoneID,
				stop.URL,
				stop.LocationType,
				stop.WheelchairBoard,
				stop.ParentStationID,
				stop.Timezone,
			)
			if err != nil {
				return err
			}
		}
		return nil
	})
}
```
```go
// Transaction begins a transaction, runs fn, and commits, rolling back
// automatically if fn returns an error.
func Transaction(fn func(ctx context.Context, tx pgx.Tx) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), defaultTxTimeout)
	defer cancel()

	tx, err := Pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx)

	if err := fn(ctx, tx); err != nil {
		return err
	}

	return tx.Commit(ctx)
}
```
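The db.BuildUpsertQuery helper isn't shown above; a minimal sketch, assuming it simply generates an INSERT ... ON CONFLICT ... DO UPDATE statement with positional placeholders, might look like this:

```go
package db

import (
	"fmt"
	"strings"
)

// BuildUpsertQuery is a sketch of the helper referenced above. It assumes a
// single conflict column and that every other column is overwritten on conflict.
func BuildUpsertQuery(table string, columns []string, conflictCol string) string {
	placeholders := make([]string, len(columns))
	updates := make([]string, 0, len(columns))

	for i, col := range columns {
		placeholders[i] = fmt.Sprintf("$%d", i+1)
		if col != conflictCol {
			updates = append(updates, fmt.Sprintf("%s = EXCLUDED.%s", col, col))
		}
	}

	return fmt.Sprintf(
		"INSERT INTO %s (%s) VALUES (%s) ON CONFLICT (%s) DO UPDATE SET %s",
		table,
		strings.Join(columns, ", "),
		strings.Join(placeholders, ", "),
		conflictCol,
		strings.Join(updates, ", "),
	)
}
```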
This approach worked well and scaled cleanly across various GTFS datasets. Most files were only a few megabytes and contained a few thousand rows each—nothing dramatic.
That was until we tried loading the New York Metro GTFS feed.
It started off smoothly... until it reached the stop_times.txt file. At that point, the process stalled for nearly 15 minutes before finally completing. This was on a MacBook Pro M4, so something was clearly off.
After digging in, we realized we'd hit a classic performance bottleneck: inserting one row at a time, with a separate INSERT statement and network round-trip for every row. While this is safe and reliable, it's also painfully slow when you're dealing with millions of rows; in this case, over 3 million rows in just one file.
That may be acceptable for a one-time load, but we needed a pipeline that could reliably handle large, complex GTFS feeds with tens of millions of rows. Our current method wouldn’t scale.
Evaluating Faster Insert Options
We narrowed it down to two better-performing alternatives:
1. Batch inserts (e.g., inserting 50,000 rows at a time)
2. PostgreSQL COPY FROM
We implemented both approaches to see which worked best for our circumstances: the non-trivial job of inserting 3+ million rows in one go, and doing so on a regular basis.
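For reference, here's a minimal sketch of the chunked approach, written against pgx's Batch API and reusing the db.Transaction and db.BuildUpsertQuery helpers from earlier (treat the exact shape as illustrative rather than our production code):

```go
// InsertStopsBatch is a sketch of the chunked-batch approach: queue one upsert
// per row, but send each 50,000-row chunk to PostgreSQL in a single round-trip.
func (r *repo) InsertStopsBatch(ctx context.Context, stops []Stop, feedName string) error {
	const chunkSize = 50000

	columns := []string{
		"id", "stop_code", "stop_name", "stop_desc", "stop_lat", "stop_lon",
		"zone_id", "stop_url", "location_type", "wheelchair_boarding",
		"parent_station", "stop_timezone",
	}
	sql := db.BuildUpsertQuery("stops", columns, "id")

	return db.Transaction(func(ctx context.Context, tx pgx.Tx) error {
		for start := 0; start < len(stops); start += chunkSize {
			end := start + chunkSize
			if end > len(stops) {
				end = len(stops)
			}

			batch := &pgx.Batch{}
			for _, stop := range stops[start:end] {
				batch.Queue(sql,
					stop.ID, stop.Code, stop.Name, stop.Description,
					stop.Lat, stop.Lon, stop.ZoneID, stop.URL,
					stop.LocationType, stop.WheelchairBoard,
					stop.ParentStationID, stop.Timezone,
				)
			}

			// Send the whole chunk at once and check each statement's result.
			br := tx.SendBatch(ctx, batch)
			for i := 0; i < batch.Len(); i++ {
				if _, err := br.Exec(); err != nil {
					br.Close()
					return err
				}
			}
			if err := br.Close(); err != nil {
				return err
			}
		}
		return nil
	})
}
```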
Here’s how they compared for inserting over 3 million rows:
| Approach | Time for 3M+ rows | Notes |
| --- | --- | --- |
| Row-by-row inserts | 15+ min | Too slow to be acceptable |
| Batch (chunked) | 60–90 sec | Much faster; supports conflict logic |
| COPY FROM | 5–10 sec | Fastest by far; no conflict support |
The difference was dramatic. COPY FROM blew the others out of the water in terms of raw speed. However, it's worth noting one key limitation: COPY FROM does not support ON CONFLICT logic, which is necessary for upserting.
Fortunately, our use case didn’t require upserts. We planned to clear out old GTFS data before each run, ensuring we avoided duplicates or stale records.
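As a rough sketch, assuming each GTFS table carries a feed_name column that scopes rows to a single agency (that column isn't shown in the snippets here), the cleanup step before a reload could look like this:

```go
// ClearFeed is a hypothetical sketch of the pre-load cleanup. It assumes each
// GTFS table has a feed_name column identifying the source agency's feed.
func (r *repo) ClearFeed(ctx context.Context, feedName string) error {
	// Delete child tables before parent tables to respect foreign keys.
	tables := []string{
		"stop_times", "trips", "calendar_dates", "calendar",
		"routes", "stops", "agency",
	}

	return db.Transaction(func(ctx context.Context, tx pgx.Tx) error {
		for _, table := range tables {
			query := fmt.Sprintf("DELETE FROM %s WHERE feed_name = $1", table)
			if _, err := tx.Exec(ctx, query, feedName); err != nil {
				return err
			}
		}
		return nil
	})
}
```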
COPY FROM in practice
Here's what using COPY FROM looked like in code:
```go
// InsertStopTimes bulk-loads stop_times rows with PostgreSQL COPY.
func (r *repo) InsertStopTimes(ctx context.Context, stopTimes []StopTime, feedName string) error {
	rows := make([][]interface{}, len(stopTimes))

	for i, stopTime := range stopTimes {
		rowID := generateScopedID(stopTime.TripID, stopTime.StopID)

		rows[i] = []interface{}{
			rowID,
			stopTime.TripID,
			stopTime.StopID,
			stopTime.ArrivalTime,
			stopTime.DepartureTime,
			stopTime.StopSequence,
			stopTime.StopHeadsign,
			stopTime.PickupType,
			stopTime.DropOffType,
		}
	}

	columns := []string{
		"id",
		"trip_id",
		"stop_id",
		"arrival_time",
		"departure_time",
		"stop_sequence",
		"stop_headsign",
		"pickup_type",
		"drop_off_type",
	}

	// Grab a dedicated connection from the pool and stream the rows in
	// with a single COPY operation.
	conn, err := db.Pool.Acquire(ctx)
	if err != nil {
		return err
	}
	defer conn.Release()

	_, err = conn.Conn().CopyFrom(
		ctx,
		pgx.Identifier{"stop_times"},
		columns,
		pgx.CopyFromRows(rows),
	)

	return err
}
```
PostgreSQL's COPY FROM is purpose-built for scenarios like this. It streams data from external sources (CSV, TSV, binary, etc.) directly into tables, bypassing the overhead of standard INSERT statements. This makes it perfect for high-volume, high-speed imports.
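One aside on the example above: pgx's CopyFrom accepts any CopyFromSource, so you don't have to materialize the entire [][]interface{} slice up front. A sketch of the same load using pgx.CopyFromSlice, which builds each row on demand as it streams to the server (the function name here is illustrative):

```go
func (r *repo) InsertStopTimesStreamed(ctx context.Context, stopTimes []StopTime, feedName string) error {
	columns := []string{
		"id", "trip_id", "stop_id", "arrival_time", "departure_time",
		"stop_sequence", "stop_headsign", "pickup_type", "drop_off_type",
	}

	conn, err := db.Pool.Acquire(ctx)
	if err != nil {
		return err
	}
	defer conn.Release()

	// CopyFromSlice asks for one row at a time, so rows are produced lazily
	// instead of being collected into a big slice first.
	_, err = conn.Conn().CopyFrom(
		ctx,
		pgx.Identifier{"stop_times"},
		columns,
		pgx.CopyFromSlice(len(stopTimes), func(i int) ([]interface{}, error) {
			st := stopTimes[i]
			return []interface{}{
				generateScopedID(st.TripID, st.StopID),
				st.TripID,
				st.StopID,
				st.ArrivalTime,
				st.DepartureTime,
				st.StopSequence,
				st.StopHeadsign,
				st.PickupType,
				st.DropOffType,
			}, nil
		}),
	)

	return err
}
```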
Final Thoughts
By replacing our row-by-row inserts with COPY FROM, we cut our GTFS ingestion time from 15 minutes to under 10 seconds in some cases. Batch inserts still have a place, especially when upserts or per-row validation are needed, but for clean bulk imports, COPY FROM is the clear winner.
If you're working with GTFS feeds (or any large dataset) and find yourself stuck on slow inserts, give COPY FROM a try. Just make sure you're okay with its limitations, and prepare for some seriously fast results.