I was recently working on a project that required a GTFS ingestion step: loading GTFS files, unzipping them, and inserting their contents into a PostgreSQL database.
For those unfamiliar, GTFS (General Transit Feed Specification) is a standardized format used by public transit agencies to share schedule and geographic data—covering buses, trains, ferries, and more. Apps like Google Maps use this data to show transit options and schedules.
A typical GTFS dataset is a ZIP archive containing several CSV files, including:
```
gtfs-data.zip
├── agency.txt          (Transit agency information)
├── routes.txt          (Bus/train routes and lines)
├── trips.txt           (Individual scheduled trips)
├── stops.txt           (Stop/station locations)
├── stop_times.txt      (Arrival/departure times at stops)
├── calendar.txt        (Service dates - weekdays, weekends)
├── calendar_dates.txt  (Service exceptions - holidays, etc.)
├── fare_attributes.txt (Fare information - optional)
├── fare_rules.txt      (Fare rules - optional)
└── shapes.txt          (Route geometry - optional)

Relationships:
agency.txt → routes.txt → trips.txt → stop_times.txt ← stops.txt
                              ↑
                         calendar.txt
```
These files are relational: for example, agency.txt → routes.txt → trips.txt → stop_times.txt, which connects to stops.txt. They're designed to map out how services run, when, and where.
Our goal was to ingest feeds from multiple agencies into a single database so we could support real-time queries like “When does the next X Agency bus arrive at Stop Z?”
We wrote a set of insert functions, one per GTFS file we wanted to load, using the Go pgx driver's connection pool and a simple transaction helper to insert rows into the database:
```go
// InsertStops upserts rows parsed from stops.txt.
func (r *repo) InsertStops(ctx context.Context, stops []Stop, feedName string) error {
	columns := []string{
		"id",
		"stop_code",
		"stop_name",
		"stop_desc",
		"stop_lat",
		"stop_lon",
		"zone_id",
		"stop_url",
		"location_type",
		"wheelchair_boarding",
		"parent_station",
		"stop_timezone",
	}

	// Build an upsert (INSERT ... ON CONFLICT) statement for the stops table.
	sql := db.BuildUpsertQuery("stops", columns, "id")

	return db.Transaction(func(ctx context.Context, tx pgx.Tx) error {
		for _, stop := range stops {
			// One INSERT statement (and round-trip) per row.
			_, err := tx.Exec(ctx, sql,
				stop.ID,
				stop.Code,
				stop.Name,
				stop.Description,
				stop.Lat,
				stop.Lon,
				stop.ZoneID,
				stop.URL,
				stop.LocationType,
				stop.WheelchairBoard,
				stop.ParentStationID,
				stop.Timezone,
			)
			if err != nil {
				return err
			}
		}
		return nil
	})
}
```
```go
// Transaction begins a transaction, runs fn, and commits, rolling back
// automatically if fn returns an error.
func Transaction(fn func(ctx context.Context, tx pgx.Tx) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), defaultTxTimeout)
	defer cancel()

	tx, err := Pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx)

	if err := fn(ctx, tx); err != nil {
		return err
	}

	return tx.Commit(ctx)
}
```
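The db.BuildUpsertQuery helper isn't shown above; a minimal sketch, assuming it simply generates an INSERT ... ON CONFLICT ... DO UPDATE statement with positional placeholders, might look like this:

```go
package db

import (
	"fmt"
	"strings"
)

// BuildUpsertQuery is a sketch of the helper referenced above. It assumes a
// single conflict column and that every other column is overwritten on conflict.
func BuildUpsertQuery(table string, columns []string, conflictCol string) string {
	placeholders := make([]string, len(columns))
	updates := make([]string, 0, len(columns))

	for i, col := range columns {
		placeholders[i] = fmt.Sprintf("$%d", i+1)
		if col != conflictCol {
			updates = append(updates, fmt.Sprintf("%s = EXCLUDED.%s", col, col))
		}
	}

	return fmt.Sprintf(
		"INSERT INTO %s (%s) VALUES (%s) ON CONFLICT (%s) DO UPDATE SET %s",
		table,
		strings.Join(columns, ", "),
		strings.Join(placeholders, ", "),
		conflictCol,
		strings.Join(updates, ", "),
	)
}
```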
This approach worked well and scaled cleanly across various GTFS datasets. Most files were only a few megabytes and contained a few thousand rows each—nothing dramatic.
That was until we tried loading the New York Metro GTFS feed.
It started off smoothly... until it reached the stop_times.txt file. At that point, the process stalled for nearly 15 minutes before finally completing. This was on a MacBook Pro M4, so something was clearly off.
After digging in, we realized we'd hit a classic performance bottleneck: inserting one row at a time, with a separate INSERT statement and network round-trip for every row. While this is safe and reliable, it's also painfully slow when you're dealing with millions of rows; in this case, over 3 million rows in just one file.
That may be acceptable for a one-time load, but we needed a pipeline that could reliably handle large, complex GTFS feeds with tens of millions of rows. Our current method wouldn’t scale.
Evaluating Faster Insert Options
We narrowed it down to two better-performing alternatives:
1. Batch inserts (e.g., inserting 50,000 rows at a time)
2. PostgreSQL COPY FROM
We implemented both approaches to see which worked best for our circumstances: the non-trivial job of inserting 3+ million rows in one go, and doing so on a regular basis.
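For reference, here's a minimal sketch of the chunked approach, written against pgx's Batch API and reusing the db.Transaction and db.BuildUpsertQuery helpers from earlier (treat the exact shape as illustrative rather than our production code):

```go
// InsertStopsBatch is a sketch of the chunked-batch approach: queue one upsert
// per row, but send each 50,000-row chunk to PostgreSQL in a single round-trip.
func (r *repo) InsertStopsBatch(ctx context.Context, stops []Stop, feedName string) error {
	const chunkSize = 50000

	columns := []string{
		"id", "stop_code", "stop_name", "stop_desc", "stop_lat", "stop_lon",
		"zone_id", "stop_url", "location_type", "wheelchair_boarding",
		"parent_station", "stop_timezone",
	}
	sql := db.BuildUpsertQuery("stops", columns, "id")

	return db.Transaction(func(ctx context.Context, tx pgx.Tx) error {
		for start := 0; start < len(stops); start += chunkSize {
			end := start + chunkSize
			if end > len(stops) {
				end = len(stops)
			}

			batch := &pgx.Batch{}
			for _, stop := range stops[start:end] {
				batch.Queue(sql,
					stop.ID, stop.Code, stop.Name, stop.Description,
					stop.Lat, stop.Lon, stop.ZoneID, stop.URL,
					stop.LocationType, stop.WheelchairBoard,
					stop.ParentStationID, stop.Timezone,
				)
			}

			// Send the whole chunk at once and check each statement's result.
			br := tx.SendBatch(ctx, batch)
			for i := 0; i < batch.Len(); i++ {
				if _, err := br.Exec(); err != nil {
					br.Close()
					return err
				}
			}
			if err := br.Close(); err != nil {
				return err
			}
		}
		return nil
	})
}
```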
Here’s how they compared for inserting over 3 million rows:
| Approach | Time for 3M+ rows | Notes |
| --- | --- | --- |
| Row-by-row inserts | 15+ min | Too slow to be acceptable |
| Batch (chunked) | 60–90 sec | Much faster; supports conflict logic |
| COPY FROM | 5–10 sec | Fastest by far; no conflict support |
The difference was dramatic. COPY FROM blew the others out of the water in terms of raw speed. However, it's worth noting one key limitation: COPY FROM does not support ON CONFLICT logic, which is necessary for upserting.
Fortunately, our use case didn’t require upserts. We planned to clear out old GTFS data before each run, ensuring we avoided duplicates or stale records.
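As a rough sketch, assuming each GTFS table carries a feed_name column that scopes rows to a single agency (that column isn't shown in the snippets here), the cleanup step before a reload could look like this:

```go
// ClearFeed is a hypothetical sketch of the pre-load cleanup. It assumes each
// GTFS table has a feed_name column identifying the source agency's feed.
func (r *repo) ClearFeed(ctx context.Context, feedName string) error {
	// Delete child tables before parent tables to respect foreign keys.
	tables := []string{
		"stop_times", "trips", "calendar_dates", "calendar",
		"routes", "stops", "agency",
	}

	return db.Transaction(func(ctx context.Context, tx pgx.Tx) error {
		for _, table := range tables {
			query := fmt.Sprintf("DELETE FROM %s WHERE feed_name = $1", table)
			if _, err := tx.Exec(ctx, query, feedName); err != nil {
				return err
			}
		}
		return nil
	})
}
```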
COPY FROM in practice
Here's what using COPY FROM looked like in code:
```go
// InsertStopTimes bulk-loads stop_times rows with PostgreSQL COPY.
func (r *repo) InsertStopTimes(ctx context.Context, stopTimes []StopTime, feedName string) error {
	rows := make([][]interface{}, len(stopTimes))

	for i, stopTime := range stopTimes {
		rowID := generateScopedID(stopTime.TripID, stopTime.StopID)

		rows[i] = []interface{}{
			rowID,
			stopTime.TripID,
			stopTime.StopID,
			stopTime.ArrivalTime,
			stopTime.DepartureTime,
			stopTime.StopSequence,
			stopTime.StopHeadsign,
			stopTime.PickupType,
			stopTime.DropOffType,
		}
	}

	columns := []string{
		"id",
		"trip_id",
		"stop_id",
		"arrival_time",
		"departure_time",
		"stop_sequence",
		"stop_headsign",
		"pickup_type",
		"drop_off_type",
	}

	// Grab a dedicated connection from the pool and stream the rows in
	// with a single COPY operation.
	conn, err := db.Pool.Acquire(ctx)
	if err != nil {
		return err
	}
	defer conn.Release()

	_, err = conn.Conn().CopyFrom(
		ctx,
		pgx.Identifier{"stop_times"},
		columns,
		pgx.CopyFromRows(rows),
	)

	return err
}
```
PostgreSQL's COPY FROM is purpose-built for scenarios like this. It streams data from external sources (CSV, TSV, binary, etc.) directly into tables, bypassing the overhead of standard INSERT statements. This makes it perfect for high-volume, high-speed imports.
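One aside on the example above: pgx's CopyFrom accepts any CopyFromSource, so you don't have to materialize the entire [][]interface{} slice up front. A sketch of the same load using pgx.CopyFromSlice, which builds each row on demand as it streams to the server (the function name here is illustrative):

```go
func (r *repo) InsertStopTimesStreamed(ctx context.Context, stopTimes []StopTime, feedName string) error {
	columns := []string{
		"id", "trip_id", "stop_id", "arrival_time", "departure_time",
		"stop_sequence", "stop_headsign", "pickup_type", "drop_off_type",
	}

	conn, err := db.Pool.Acquire(ctx)
	if err != nil {
		return err
	}
	defer conn.Release()

	// CopyFromSlice asks for one row at a time, so rows are produced lazily
	// instead of being collected into a big slice first.
	_, err = conn.Conn().CopyFrom(
		ctx,
		pgx.Identifier{"stop_times"},
		columns,
		pgx.CopyFromSlice(len(stopTimes), func(i int) ([]interface{}, error) {
			st := stopTimes[i]
			return []interface{}{
				generateScopedID(st.TripID, st.StopID),
				st.TripID,
				st.StopID,
				st.ArrivalTime,
				st.DepartureTime,
				st.StopSequence,
				st.StopHeadsign,
				st.PickupType,
				st.DropOffType,
			}, nil
		}),
	)

	return err
}
```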
Final Thoughts
By replacing our row-by-row inserts with COPY FROM, we cut our GTFS ingestion time from 15 minutes to under 10 seconds in some cases. Batch inserts still have a place, especially when upserts or per-row validation are needed, but for clean bulk imports, COPY FROM is the clear winner.
If you're working with GTFS feeds (or any large dataset) and find yourself stuck on slow inserts, give COPY FROM a try. Just make sure you're okay with its limitations, and prepare for some seriously fast results.