Using COPY in Postgres for Importing Large CSVs
August 29, 2018
It would have taken 16 hours. It only took four minutes.
Faced with importing a million-line, 750 MB CSV file into Postgres for a Rails app, Daniel Fone did what most Ruby developers would do in that situation and wrote a simple Rake task to parse the CSV file and import each row via ActiveRecord. However, even at a brisk 15 records per second, it would take a whopping 16 hours to complete. Here’s what he did instead.
A million records
As part of building an API, I had an endpoint that needed to filter and query a lot of data -- principally from a table with over a million records. While it’s tempting to throw some indexes at it and hope for the best, I didn’t really want to design the queries or make any performance commitments before I knew what the data actually looked like. Fortunately, I was able to get a production dataset in the form of a CSV. While this was going to make designing the endpoint far easier, I still had to get this data into my development database. The file itself was 750 MB.
Rake: the naïve approach
My first approach was a simple Rake task. There was a bit of work to do for the import: there were some associations to `find_or_create`, some datetime parsing to do, and the primary record itself had to be found or initialised so that the import was idempotent. An initial run looked like it was going to take 16 hours to finish. While I could have sped things up by removing validations and other ActiveRecord overhead, there was still enough unexpected data throughout the CSV that the task continually needed to be updated and restarted, promising to waste hours of import time.
A better option: A four-step Postgres COPY
My main concern with dropping Rails from the process was how to do the required transformations prior to inserting the records into the final tables. After exploring many options, by far the best approach seemed to be PostgreSQL’s COPY command. My key realisation was that I should simply import the CSV directly into a makeshift table that mirrored the CSV fields exactly, and worry about transformation and JOINs after the fact.
The first step is to create a table that resembles our CSV exactly. The simplest approach at this stage is just to use `varchar`s to represent the CSV data, so that we can deal with blank fields and cast to other types in later steps. You have to ensure there's a column in the table for every field in the CSV, or you'll get `ERROR: extra data after last expected column`.
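A minimal sketch of such an import table, assuming a hypothetical CSV of reports with project details (the table and column names here are illustrative, chosen to match the CSV headers):

```sql
-- One varchar column per CSV field; all real typing is deferred to later steps.
CREATE TABLE reports_import (
  project_ref  varchar,
  project_name varchar,
  reported_at  varchar,
  report_count varchar
);
```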
Time taken: 0.011s
In this step, we do the heavy lifting. We ingest the CSV with Postgres’ COPY command.*
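A sketch of the command, assuming the hypothetical `reports_import` table above and a CSV (with a header row) at a path the server can read. This uses the modern option-list syntax (Postgres 9.0+), so check it against the docs for your version:

```sql
-- Server-side bulk load: the file path is resolved on the database server.
COPY reports_import FROM '/path/to/reports.csv' WITH (FORMAT csv, HEADER true);
```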
As with much of Postgres, the documentation is thorough and fairly readable. The key thing to note here is that the file named in `FROM` is read by the server process, not by your client. As the documentation states: "Files named in a COPY command are read or written directly by the server, not by the client application. Therefore, they must reside on or be accessible to the database server machine, not the client."
This is no issue if your Postgres client is on the same host as the server, but if you need to transfer the file from a client to a server, the nearly identical psql `\copy` meta-command might be a better bet.
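With `\copy`, psql reads the file on the client and streams it to the server, so the path is local to wherever you run psql. A sketch, again assuming the hypothetical `reports_import` table (note that `\copy` must fit on a single line):

```sql
-- Client-side variant: psql reads 'reports.csv' and feeds it to the server.
\copy reports_import FROM 'reports.csv' WITH CSV HEADER
```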
Time taken: 39.0s
*It might be helpful to note that there’s been some churn in the syntax for this command recently, so if you’re pulling examples from Stack Overflow (or this post!), check the syntax against the docs for your version of Postgres.
With the heavy lifting out of the way, we can do the smart work in our last step. We have a large table of verbatim data from the CSV, and we want to transform and load it into the ‘real’ tables of our database. The data I was dealing with had a series of auxiliary relations, and these ‘parent’ records need to exist before we can populate our main table.
Imagine we’re importing reports. Each report might have a project it’s related to. Before we import the main ‘report’ record, we need to create or update the parent ‘project’ based on the data in the CSV. The hypothetical parent record could be imported like this:
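A sketch of that upsert, assuming hypothetical `projects` and `reports_import` tables and a unique constraint on `projects.ref` (required for `ON CONFLICT`, which itself needs Postgres 9.5+):

```sql
-- Create or update the distinct parent projects represented in the CSV.
INSERT INTO projects (ref, name, created_at, updated_at)
SELECT project_ref, project_name, now(), now()
FROM (
  -- Collapse a million CSV rows down to the distinct projects.
  SELECT DISTINCT project_ref, project_name
  FROM reports_import
) AS distinct_projects
ON CONFLICT (ref) DO UPDATE
  SET name       = excluded.name,       -- `excluded` is the row proposed for insertion
      updated_at = excluded.updated_at;
```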
There are a few important patterns to note:
- Our innermost `SELECT DISTINCT` query returns the attributes that make up the distinct projects represented across the entire CSV. Out of a million records, perhaps this is only a hundred or so distinct rows.
- We then `INSERT` these into our projects table.
- Finally, the `ON CONFLICT` clause handles the case where there is an existing row with the same reference (e.g. from a previous import). In this case, the existing row will be updated and we use Postgres’ special `excluded` table to reference the row that was proposed for insertion.
Now that we’ve set up the required relations, we can import the main records themselves. The query looks much like the one above, but we can join the import table against our other newly created relations.
This is also where we handle the typecasting. Depending on the nature of the CSV and whether it uses `,"",` to denote an empty field, your import table could have blank strings or `NULL`s. In each case, I was able to use the same pattern: normalise blanks to `NULL` with `NULLIF`, then cast to the target type.
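A sketch of the main load, continuing with the hypothetical `reports`, `projects`, and `reports_import` tables from the earlier steps. The join resolves each CSV row to its newly upserted parent, and the casts turn the raw `varchar` data into real types on the way through:

```sql
-- Transform and load the main records from the raw import table.
INSERT INTO reports (project_id, reported_at, count, created_at, updated_at)
SELECT projects.id,
       reports_import.reported_at::timestamptz,          -- parse the datetime
       NULLIF(reports_import.report_count, '')::integer, -- blank string -> NULL -> integer
       now(),
       now()
FROM reports_import
JOIN projects ON projects.ref = reports_import.project_ref;
```

An `ON CONFLICT` clause like the one in the previous step could make this insert idempotent too, given a suitable unique constraint on the reports table.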
Having loaded our raw CSV data into all the relevant tables, transforming it on the fly, our main work is done!
Time taken: 151.4s (4 inserts into 4 relations)
Our final step is clean up. We can drop the temporary CSV import table, and tell Postgres to do its housekeeping.
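Assuming the hypothetical `reports_import` table from the earlier steps, the cleanup might look like:

```sql
DROP TABLE reports_import;   -- the raw CSV data has served its purpose
VACUUM FULL ANALYZE;         -- reclaim disk space and refresh planner statistics
```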
Doing the full vacuum is probably overkill, but it allows Postgres to reclaim the disk space from the now deleted tuples, and it will update the query planner statistics with the newly imported data.
Time taken: 50.3s
A go-to trick for copying large amounts of data
Every time I invest a little effort into learning more about Postgres, I’m amazed at its flexibility and utility. Instead of waiting days for a cobbled-together Rake task to inefficiently churn through the import, we have all the data in its place in about the time it takes to make a coffee. Certainly a tactic worth remembering.