csv2table
28 August 2016
csv2table is a tool that I wrote. Over the years, I've found myself constantly interacting with delimited files and constantly frustrated by the lack of tooling to query such files. csv2table was born of that frustration.
If you spend a lot of time massaging data in CSV or Excel files, or frequently load delimited data into a database (I'm looking at you, data-science-types), then csv2table is a tool you should know and love.
The Basics
Before we get any further: csv2table is meant for use with postgres (redshift, as well). It utilizes the COPY command to perform the real heavy lifting. If another database implements this command then csv2table will work just dandy. Otherwise, you can still use most of csv2table's functionality minus the --copy-related bits.
My most common use is to create a table and load the data into it:
~$ cat /tmp/colors.csv
Id,Color,Color Name,Description,Hex #,Inventory,Add Date
1,red,Red,Having the color of blood,#f00,0.25,2014-10-16
2,green,Green,Having the color of growing grass,#0f0,10.18,2014-08-25
3,blue,Blue,Having the color of the clear sky,#00f,4.67,2014-09-17

~$ csv2table --file /tmp/colors.csv --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from '/tmp/colors.csv'
with csv header delimiter ',' quote '"';

~$ csv2table --file /tmp/colors.csv --copy | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 4.795 ms
COPY 3
Time: 1.211 ms

~$ pg test
(postgres@[local]:5432 21:19:54) [test]> \d colors
     Table "public.colors"
   Column    | Type | Modifiers
-------------+------+-----------
 Id          | text |
 Color       | text |
 Color Name  | text |
 Description | text |
 Hex #       | text |
 Inventory   | text |
 Add Date    | text |
The above command emitted a CREATE TABLE statement and a COPY statement and loaded all the data into a table with the same fields and structure as the supplied CSV. The name of the table was inferred from the name of the file. Every column was created with the type TEXT.
csv2table can attempt to automatically find types for your columns. Pass a flag for each type you'd like it to detect:
~$ csv2table --file /tmp/colors.csv --integer --timestamp
create table "colors" (
    "Id" integer,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" timestamp
);
As you can see, the Id column will be created as type INTEGER and the Add Date column will be created with type TIMESTAMP.
This is the full list of supported type detections:
- --integer
- --numeric
- --timestamp (also used with --tz for TIMESTAMPTZ)
- --date
- --array
- --json
- --jsonb
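Adding --numeric on top of the earlier flags should pick up the Inventory column as well. I haven't pasted an actual run here, so treat the output as a sketch of what to expect:

~$ csv2table --file /tmp/colors.csv --integer --numeric --timestamp
create table "colors" (
    "Id" integer,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" numeric,
    "Add Date" timestamp
);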
Another flag I almost always use is -1, aka --transaction. This wraps the whole thing in a transaction so that if your data load fails, the table and any other structures aren't created and you aren't left in a half-complete state.
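The effect is simply BEGIN/COMMIT around whatever would otherwise be emitted, so a failed COPY rolls back the CREATE TABLE along with it. Roughly (trimmed, and a sketch rather than captured output):

~$ csv2table --file /tmp/colors.csv --copy -1
begin;
create table "colors" ( ... );
copy "colors"(...) from '/tmp/colors.csv' with csv header delimiter ',' quote '"';
commit;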
d e l i m i t e d
There's a long-running joke at work about receiving "delimited" files. We've received files with many different delimiters as well as files masquerading as delimited files. The most popular delimiters are probably , and |:
~$ csv2csv --in-file /tmp/colors.csv --out-delimiter '|' --out-file /tmp/colors.psv
~$ csv2table --file /tmp/colors.psv --delimiter '|' --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from '/tmp/colors.psv'
with csv header delimiter '|' quote '"';
Parsing and COPY worked just fine (csv2csv is another tool I use semi-regularly). Tabs show up pretty frequently too. They're hard to type, so csv2table has some helper handling:
~$ csv2csv --in-file /tmp/colors.csv --out-delimiter '\t' --out-file /tmp/colors.tsv
~$ csv2table --file /tmp/colors.tsv --delimiter '\t' --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from '/tmp/colors.tsv'
with csv header delimiter '	' quote '"';
Sometimes, you run into files that are kinda CSV files. Delimited, sure, but without a quote character. I commonly see this with pipe-delimited files. For our data, I can usually get away with picking, say, @ as the quote character. This prevents a quote from ever being found. So if your data has something like this in it:
~$ cat /tmp/lol.csv
oh|hai|there
let's just sprinkle "|some "" characters|"around

~$ csv2table --file /tmp/lol.csv --delimiter '|' --copy | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 4.421 ms
ERROR:  22P04: missing data for column "hai"
CONTEXT:  COPY lol, line 2: "let's just sprinkle "|some "" characters|"around"
LOCATION:  NextCopyFrom, copy.c:2989
Time: 2.384 ms

~$ csv2table --file /tmp/lol.csv --delimiter '|' --copy --quote '@' | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 5.091 ms
COPY 1
Time: 1.432 ms
Kind of a hack, but I plan on someday adding a feature to handle this directly.
Pro tip: When you're working on a Mac and need to load an Excel file, save it as a CSV in the "Windows Comma Separated (.csv)" format, not in the "Comma Separated Values (.csv)" format. This will make the file loadable with csv2table. The default format on OS X uses just a carriage return to separate lines, which causes issues with many tools. The Windows format will also include a newline.
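If you already have one of those carriage-return-only files on disk, translating the line endings before loading usually does the trick. The file names here are just placeholders:

~$ tr '\r' '\n' < mac_export.csv > mac_export_fixed.csv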
Show Me Where The Bad File Touched You
The colors.csv file we've been working with isn't ideal. Those column names are AWFUL (as I'm sure you've noticed). csv2table has got your back here, too:
~$ csv2table --file /tmp/colors.csv --mogrify
create table "colors" (
    "Id" text,
    "Color" text,
    "Color_Name" text,
    "Description" text,
    "Hex_" text,
    "Inventory" text,
    "Add_Date" text
);
Better, but capitalization is really annoying too:
~$ csv2table --file /tmp/colors.csv --mogrify --lower
create table "colors" (
    "id" text,
    "color" text,
    "color_name" text,
    "description" text,
    "hex_" text,
    "inventory" text,
    "add_date" text
);
Ahh, much better. The same rules apply to the name of the generated table. So when Jimmy names your files:
~$ cp /tmp/colors.csv "/tmp/How To Name Files Like a Windows User (1).csv"
~$ csv2table --file "/tmp/How To Name Files Like a Windows User (1).csv" --mogrify --lower
create table "how_to_name_files_like_a_windows_user_1_" (
    "id" text,
    "color" text,
    "color_name" text,
    "description" text,
    "hex_" text,
    "inventory" text,
    "add_date" text
);
csv2table has your back there, too.
Sometimes, your file will have duplicate header names:
~$ cat /tmp/hehe.csv
1,1,1,1
wow,such,descriptive,names

~$ csv2table --file /tmp/hehe.csv --fix-duplicates
create table "hehe" (
    "1" text,
    "1_1" text,
    "1_2" text,
    "1_3" text
);
Not the greatest names in the world, but at least you'll be able to quickly load your data. I often find that columns with duplicate names are also ones I'd like to ignore, so this lets me get the data loaded and get on with my life.
Sometimes, Satan created your file. It has no headers. csv2table is hard at work sprinkling holy water on that dumpster fire, too:
~$ cat /tmp/fail.csv
there,are,no,names,at,all
only,works,with,two,lines,minimum

~$ csv2table --file /tmp/fail.csv --missing-headers
create table "fail" (
    "col_0" text,
    "col_1" text,
    "col_2" text,
    "col_3" text,
    "col_4" text,
    "col_5" text
);
Could be better but let's see you perform that exorcism.
🎶Cry For The Bad Data🎶
Not only will your files have bad formats, they'll have bad data, too. There's only so much juju that csv2table can bring to bear in this case, but there are a couple of flags and techniques that I find useful here.
Most issues come from type detection being based only on the first row of data. --keep-going will read more of the file and attempt to do a better job of things. Processing the whole file can be slow, however, so select a reasonably small number of lines.
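Assuming --keep-going takes the number of lines to scan (the exact syntax here is my assumption), usage looks roughly like:

~$ csv2table --file /tmp/colors.csv --integer --numeric --keep-going 100
# type detection now considers the first 100 rows instead of just the first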
--skip-parsing can be used to disable type detection for specific columns. In my line of work, there are many numeric identifiers. These are actually numeric strings and not numbers. I pretty much never want them to actually end up as numbers. --skip-parsing will let you name these columns and prevent them from being typed (actually, they end up as TEXT because being untyped is a weird gray area).
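Assuming the flag takes the names of the columns to leave alone (again, the exact syntax is my assumption), it looks something like this:

~$ csv2table --file /tmp/colors.csv --integer --skip-parsing Id
# "Id" stays text even though --integer would otherwise match it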
If your data can load as TEXT, you're probably best off trying to fix it in the database. In postgres, ALTER TABLE colors ALTER COLUMN "Id" TYPE INTEGER USING "Id"::INTEGER is one approach. USING can take an arbitrary expression, and since this command re-writes the entire table you end up with no wasted space. Of course, you can use UPDATE as well.
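When a column needs more than a straight cast, the USING expression can do cleanup on the way. The regexp here is just an illustration of the idea:

-- strip anything that isn't a digit or a dot, then convert the column in place
ALTER TABLE colors
    ALTER COLUMN "Inventory" TYPE NUMERIC
    USING regexp_replace("Inventory", '[^0-9.]', '', 'g')::NUMERIC;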
If you can't get the data loaded as all TEXT, you've got a real steaming turd file on your hands. If you care about the data a lot, you'll have to open it in an editor and futz around until you've got all those chunks of corn (read: bad data) sorted. When you aren't super concerned about your data, iconv can throw you a bone here. sed can probably help, too. If you're stuck, try one of these:
~$ iconv -c -f utf-8 -t ascii colors.csv
~$ iconv -f utf8 -t ascii//TRANSLIT colors.csv
~$ sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//'
The first command will kill anything that isn't ASCII, the second will try to remove accents and such, and the third should do basically the same as the first. There's more information on Stack Overflow.
g-g-g-g-zip - gzip support
csv2table supports gzip-compressed files:
~$ gzip /tmp/colors.csv
~$ csv2table --file /tmp/colors.csv.gz --gzip --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from program 'gunzip < /tmp/colors.csv.gz'
with csv header delimiter ',' quote '"';
Note that for this to work, you need postgres 9.3 or newer. The COPY command above utilizes the PROGRAM option to decompress the file.
Usage with Redshift
csv2table also works with redshift, which is a data warehousing solution provided by AWS and based on postgres 8.0 (with backported features and added enhancements). Working with redshift requires the file to be in Amazon's infrastructure (often in S3). csv2table supports reading environment variables (of the kind used by other Amazon tools) and config files:
~$ csv2table --file /tmp/colors.csv --redshift "<ENV>"
The above will read the credentials from AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET. If the credentials themselves are not found, but AWS_CREDENTIAL_FILE is found, then csv2table will read the file given by that variable.
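In practice that just means exporting the variables before invoking it; the values below are placeholders:

~$ export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
~$ export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
~$ export S3_BUCKET=my-transfer-bucket
~$ csv2table --file /tmp/colors.csv --redshift "<ENV>"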
~$ csv2table --file /tmp/colors.csv --redshift /tmp/config.ini
The above will read the given config file. The names are as follows:
| Value | Name Options |
|---|---|
| Access Key Id | AWSAccessKeyId, aws_access_key_id, access_key, s3.account_id |
| Secret Access Key | AWSSecretKey, aws_secret_access_key, secret_key, s3.private_key |
| S3 Bucket | s3.bucket, s3.transfer_bucket |
There are historical reasons for these names but the conventions are typical of those I've seen in the wild and those used in the codebase at work.
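For the curious, a config file using the first set of names could look something like this; whether the bucket line is spelled exactly this way in a credential-style file is my assumption:

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
s3.bucket=my-transfer-bucket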
The S3 bucket option is sort-of non-standard but necessary to know where to upload the file to or read the file from. csv2table will upload the file when used with the --redshift-upload option. Otherwise, it will assume that the file is already uploaded to the given bucket.
The gzip format is supported here as well, and I would recommend using it with large files to reduce upload times (I haven't measured load times with and without compression, but anecdotally it seems to be at least as fast when the files are compressed).
csv2table works without third-party libraries when used with postgres. With redshift, however, boto must be available for the --redshift-upload option to work. Sorry about that.
Some more advanced features
csv2table has many, many more options. The ones I often reach for can be grouped into three buckets: schema handling, ancillary data, and indexing. There's a combined example after the lists below.
Schema handling options let you control how the table is created:
- --drop. Drop the table first, if it exists.
- --truncate. Empty the table before copying into it.
- --no-create. Don't create the table. Useful when loading multiple files into a table.
- --temporary. Make the table a TEMPORARY table.
- --cine. CREATE IF NOT EXISTS.
- --create-schema. When using --schema, this option can be used to create the schema as well.
Ancillary data options allow you to add in some additional information to your table:
- --file-column. Add a column with the name of the file the data was loaded from.
- --serial-column. Add a SERIAL column with the given name.
Lastly and leastly, indexing:
- --primary-key. Make the named column(s) the primary key of the table.
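To tie a few of these together, here's roughly how I'd reload the colors file from scratch. Whether --serial-column and --primary-key take their arguments exactly like this is my assumption, so treat it as a sketch:

~$ csv2table --file /tmp/colors.csv --mogrify --lower --drop --copy \
      --serial-column row_id --primary-key row_id -1
# drops and recreates "colors", adds a serial "row_id" primary key,
# and runs the whole load inside a single transaction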
Conclusion
Put those CSV files into your database! Happy querying!