csv2table
28 August 2016
csv2table is a tool that I wrote. Over the years, I've found myself constantly interacting with delimited files and constantly frustrated by the lack of tooling to query such files. csv2table was born of that frustration.
If you spend a lot of time massaging data in CSV or Excel files, or frequently load delimited data into a database (I'm looking at you, data-science-types), then csv2table is a tool you should know and love.
The Basics
Before we get any further: csv2table is meant for use with postgres (redshift, as well). It utilizes the COPY command to perform the real heavy lifting. If another database implements this command then csv2table will work just dandy. Otherwise, you can still use most of csv2table's functionality minus the --copy-related bits.
My most common use is to create a table and load the data into it:
~$ cat /tmp/colors.csv
Id,Color,Color Name,Description,Hex #,Inventory,Add Date
1,red,Red,Having the color of blood,#f00,0.25,2014-10-16
2,green,Green,Having the color of growing grass,#0f0,10.18,2014-08-25
3,blue,Blue,Having the color of the clear sky,#00f,4.67,2014-09-17

~$ csv2table --file /tmp/colors.csv --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from '/tmp/colors.csv'
with csv header delimiter ',' quote '"';

~$ csv2table --file /tmp/colors.csv --copy | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 4.795 ms
COPY 3
Time: 1.211 ms

~$ pg test
(postgres@[local]:5432 21:19:54) [test]> \d colors
     Table "public.colors"
   Column    | Type | Modifiers
-------------+------+-----------
 Id          | text |
 Color       | text |
 Color Name  | text |
 Description | text |
 Hex #       | text |
 Inventory   | text |
 Add Date    | text |
The above command emitted a CREATE TABLE statement and a COPY statement and loaded all the data into a table with the same fields and structure as the supplied CSV. The name of the table was inferred from the name of the file. Every column was created with the type TEXT.
csv2table can attempt to automatically find types for your columns. Pass a flag for each type you'd like it to detect:
~$ csv2table --file /tmp/colors.csv --integer --timestamp
create table "colors" (
    "Id" integer,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" timestamp
);
As you can see, the Id column will be created as type INTEGER and the Add Date column will be created with type TIMESTAMP.
This is the full list of supported type detections:
- --integer
- --numeric
- --timestamp (also used with --tz for TIMESTAMPTZ)
- --date
- --array
- --json
- --jsonb
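Adding --numeric on top of the earlier flags should pick up the Inventory column as well. I haven't pasted an actual run here, so treat the output as a sketch of what to expect:

~$ csv2table --file /tmp/colors.csv --integer --numeric --timestamp
create table "colors" (
    "Id" integer,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" numeric,
    "Add Date" timestamp
);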
Another flag I almost always use is -1, aka --transaction. This wraps the whole thing in a transaction so that if your data load fails, the table and any other structures aren't created and you aren't left in a half-complete state.
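The effect is simply BEGIN/COMMIT around whatever would otherwise be emitted, so a failed COPY rolls back the CREATE TABLE along with it. Roughly (trimmed, and a sketch rather than captured output):

~$ csv2table --file /tmp/colors.csv --copy -1
begin;
create table "colors" ( ... );
copy "colors"(...) from '/tmp/colors.csv' with csv header delimiter ',' quote '"';
commit;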
d e l i m i t e d
There's a long-running joke at work about receiving "delimited" files. We've received files with many different delimiters as well as files masquerading as delimited files. The most popular delimiters are probably , and |:
~$ csv2csv --in-file /tmp/colors.csv --out-delimiter '|' --out-file /tmp/colors.psv
~$ csv2table --file /tmp/colors.psv --delimiter '|' --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from '/tmp/colors.psv'
with csv header delimiter '|' quote '"';
Parsing and COPY worked just fine (csv2csv is another tool I use semi-regularly). Tabs show up pretty frequently too. They're hard to type, so csv2table has some helper handling:
~$ csv2csv --in-file /tmp/colors.csv --out-delimiter '\t' --out-file /tmp/colors.tsv
~$ csv2table --file /tmp/colors.tsv --delimiter '\t' --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from '/tmp/colors.tsv'
with csv header delimiter '	' quote '"';
Sometimes, you run into files that are kinda CSV files. Delimited, sure, but without a quote character. I commonly see this with pipe-delimited files. For our data, I can usually get away with picking, say, @ as the quote character. This prevents a quote from ever being found. So if your data has something like this in it:
~$ cat /tmp/lol.csv
oh|hai|there
let's just sprinkle "|some "" characters|"around

~$ csv2table --file /tmp/lol.csv --delimiter '|' --copy | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 4.421 ms
ERROR:  22P04: missing data for column "hai"
CONTEXT:  COPY lol, line 2: "let's just sprinkle "|some "" characters|"around"
LOCATION:  NextCopyFrom, copy.c:2989
Time: 2.384 ms

~$ csv2table --file /tmp/lol.csv --delimiter '|' --copy --quote '@' | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 5.091 ms
COPY 1
Time: 1.432 ms
Kind of a hack, but I plan on someday adding a feature to handle this directly.
Pro tip: When you're working on a Mac and need to load an Excel file, save it as a CSV in the "Windows Comma Separated (.csv)" format, not in the "Comma Separated Values (.csv)" format. This will make the file loadable with csv2table. The default format on OS X uses just a carriage return to separate lines, which causes issues with many tools. The Windows format will also include a newline.
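If you already have one of those carriage-return-only files on disk, translating the line endings before loading usually does the trick. The file names here are just placeholders:

~$ tr '\r' '\n' < mac_export.csv > mac_export_fixed.csv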
Show Me Where The Bad File Touched You
The colors.csv file we've been working with isn't ideal. Those column names are AWFUL (as I'm sure you've noticed). csv2table has got your back here, too:
~$ csv2table --file /tmp/colors.csv --mogrify
create table "colors" (
    "Id" text,
    "Color" text,
    "Color_Name" text,
    "Description" text,
    "Hex_" text,
    "Inventory" text,
    "Add_Date" text
);
Better, but capitalization is really annoying too:
~$ csv2table --file /tmp/colors.csv --mogrify --lower
create table "colors" (
    "id" text,
    "color" text,
    "color_name" text,
    "description" text,
    "hex_" text,
    "inventory" text,
    "add_date" text
);
Ahh, much better. The same rules apply to the name of the generated table. So when Jimmy names your files:
~$ cp /tmp/colors.csv "/tmp/How To Name Files Like a Windows User (1).csv"
~$ csv2table --file "/tmp/How To Name Files Like a Windows User (1).csv" --mogrify --lower
create table "how_to_name_files_like_a_windows_user_1_" (
    "id" text,
    "color" text,
    "color_name" text,
    "description" text,
    "hex_" text,
    "inventory" text,
    "add_date" text
);
csv2table has your back there, too.
Sometimes, your file will have duplicate header names:
~$ cat /tmp/hehe.csv
1,1,1,1
wow,such,descriptive,names

~$ csv2table --file /tmp/hehe.csv --fix-duplicates
create table "hehe" (
    "1" text,
    "1_1" text,
    "1_2" text,
    "1_3" text
);
Not the greatest names in the world, but at least you'll be able to quickly load your data. I often find that columns with duplicate names are also ones I'd like to ignore, so this lets me get the data loaded and get on with my life.
Sometimes, Satan created your file. It has no headers. csv2table is hard at work sprinkling holy water on that dumpster fire, too:
~$ cat /tmp/fail.csv
there,are,no,names,at,all
only,works,with,two,lines,minimum

~$ csv2table --file /tmp/fail.csv --missing-headers
create table "fail" (
    "col_0" text,
    "col_1" text,
    "col_2" text,
    "col_3" text,
    "col_4" text,
    "col_5" text
);
Could be better but let's see you perform that exorcism.
🎶Cry For The Bad Data🎶
Not only will your files have bad formats, they'll have bad data, too. There's only so much juju that csv2table can bring to bear in this case, but there are a couple of flags and techniques that I find useful here.
Most issues come from type detection being based only on the first row of data. --keep-going will read more of the file and attempt to do a better job of things. Processing the whole file can be slow, however, so select a reasonably small number of lines.
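Assuming --keep-going takes the number of lines to scan (the exact syntax here is my assumption), usage looks roughly like:

~$ csv2table --file /tmp/colors.csv --integer --numeric --keep-going 100
# type detection now considers the first 100 rows instead of just the first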
--skip-parsing can be used to disable type detection for specific columns. In my line of work, there are many numeric identifiers. These are actually numeric strings and not numbers. I pretty much never want them to actually end up as numbers. --skip-parsing will let you name these columns and prevent them from being typed (actually, they end up as TEXT because being untyped is a weird gray area).
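Assuming the flag takes the names of the columns to leave alone (again, the exact syntax is my assumption), it looks something like this:

~$ csv2table --file /tmp/colors.csv --integer --skip-parsing Id
# "Id" stays text even though --integer would otherwise match it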
If your data can load as TEXT, you're probably best off trying to fix it in the database. In postgres, ALTER TABLE colors ALTER COLUMN "Id" TYPE INTEGER USING "Id"::INTEGER is one approach. USING can take an arbitrary expression, and since this command re-writes the entire table you end up with no wasted space. Of course, you can use UPDATE as well.
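When a column needs more than a straight cast, the USING expression can do cleanup on the way. The regexp here is just an illustration of the idea:

-- strip anything that isn't a digit or a dot, then convert the column in place
ALTER TABLE colors
    ALTER COLUMN "Inventory" TYPE NUMERIC
    USING regexp_replace("Inventory", '[^0-9.]', '', 'g')::NUMERIC;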
If you can't get the data loaded as all TEXT, you've got a real steaming turd file on your hands. If you care about the data a lot, you'll have to open it in an editor and futz around until you've got all those chunks of corn (read: bad data) sorted. When you aren't super concerned about your data, iconv can throw you a bone here. sed can probably help, too. If you're stuck, try one of these:
~$ iconv -c -f utf-8 -t ascii colors.csv
~$ iconv -f utf8 -t ascii//TRANSLIT colors.csv
~$ sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//'
The first command will kill anything that isn't ASCII, the second will try to remove accents and such, and the third should do basically the same as the first. There's more information on Stack Overflow.
g-g-g-g-zip - gzip support
csv2table supports gzip-compressed files:
~$ gzip /tmp/colors.csv
~$ csv2table --file /tmp/colors.csv.gz --gzip --copy
create table "colors" (
    "Id" text,
    "Color" text,
    "Color Name" text,
    "Description" text,
    "Hex #" text,
    "Inventory" text,
    "Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date")
from program 'gunzip < /tmp/colors.csv.gz'
with csv header delimiter ',' quote '"';
Note that for this to work, you need postgres 9.3 or newer. The COPY command above utilizes the PROGRAM option to decompress the file.
Usage with Redshift
csv2table also works with redshift, which is a data warehousing solution provided by AWS and based on postgres 8.0 (with backported features and added enhancements). Working with redshift requires the file to be in Amazon's infrastructure (often in S3). csv2table supports reading environment variables (of the kind used by other Amazon tools) and config files:
~$ csv2table --file /tmp/colors.csv --redshift "<ENV>"
The above will read the credentials from AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET. If the credentials themselves are not found, but AWS_CREDENTIAL_FILE is found, then csv2table will read the file given by that variable.
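In practice that just means exporting the variables before invoking it; the values below are placeholders:

~$ export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
~$ export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
~$ export S3_BUCKET=my-transfer-bucket
~$ csv2table --file /tmp/colors.csv --redshift "<ENV>"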
~$ csv2table --file /tmp/colors.csv --redshift /tmp/config.ini
The above will read the given config file. The names are as follows:
| Value | Name Options |
|---|---|
| Access Key Id | AWSAccessKeyId, aws_access_key_id, access_key, s3.account_id |
| Secret Access Key | AWSSecretKey, aws_secret_access_key, secret_key, s3.private_key |
| S3 Bucket | s3.bucket, s3.transfer_bucket |
There are historical reasons for these names but the conventions are typical of those I've seen in the wild and those used in the codebase at work.
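For the curious, a config file using the first set of names could look something like this; whether the bucket line is spelled exactly this way in a credential-style file is my assumption:

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
s3.bucket=my-transfer-bucket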
The S3 bucket option is sort-of non-standard but necessary to know where to upload the file to or read the file from. csv2table will upload the file when used with the --redshift-upload option. Otherwise, it will assume that the file is already uploaded to the given bucket.
The gzip format is supported here as well, and I would recommend using it with large files to reduce upload times (I haven't measured load times with and without compression, but anecdotally it seems to be at least as fast when the files are compressed).
csv2table works without third-party libraries when used with postgres. With redshift, however, boto must be available for the --redshift-upload option to work. Sorry about that.
Some more advanced features
csv2table has many, many more options. The ones I often reach for can be grouped into three buckets: schema handling, ancillary data, and indexing. There's a combined example after the lists below.
Schema handling options let you control how the table is created:
- --drop. Drop the table first, if it exists.
- --truncate. Empty the table before copying into it.
- --no-create. Don't create the table. Useful when loading multiple files into a table.
- --temporary. Make the table a TEMPORARY table.
- --cine. CREATE IF NOT EXISTS.
- --create-schema. When using --schema, this option can be used to create the schema as well.
Ancillary data options allow you to add in some additional information to your table:
- --file-column. Add a column with the name of the file the data was loaded from.
- --serial-column. Add a SERIAL column with the given name.
Lastly and leastly, indexing:
- --primary-key. Make the named column(s) the primary key of the table.
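To tie a few of these together, here's roughly how I'd reload the colors file from scratch. Whether --serial-column and --primary-key take their arguments exactly like this is my assumption, so treat it as a sketch:

~$ csv2table --file /tmp/colors.csv --mogrify --lower --drop --copy \
      --serial-column row_id --primary-key row_id -1
# drops and recreates "colors", adds a serial "row_id" primary key,
# and runs the whole load inside a single transaction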
Conclusion
Put those CSV files into your database! Happy querying!