csv2table
28 August 2016
csv2table
is a tool that I wrote. Over the years, I've found myself
constantly interacting with delimited files and constantly
frustrated by the lack of tooling to query such files.
csv2table was born of that frustration.
If you spend a lot of time massaging data in CSV or Excel
files or frequently load delimited data into a database
(I'm looking at you, data-science-types) then
csv2table is a tool you should know and love.
The Basics
Before we get any further: csv2table is meant
for use with postgres (redshift, as well). It utilizes the
COPY command to perform the real heavy
lifting. If another database implements this command then
csv2table will work just dandy. Otherwise, you
can still use most of csv2table's
functionality minus the --copy-related bits.
My most common use is to create a table and load the data into it:
~$ cat /tmp/colors.csv
Id,Color,Color Name,Description,Hex #,Inventory,Add Date
1,red,Red,Having the color of blood,#f00,0.25,2014-10-16
2,green,Green,Having the color of growing grass,#0f0,10.18,2014-08-25
3,blue,Blue,Having the color of the clear sky,#00f,4.67,2014-09-17
~$ csv2table --file /tmp/colors.csv --copy
create table "colors" (
"Id" text,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" text,
"Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date") from '/tmp/colors.csv' with csv header delimiter ',' quote '"';
~$ csv2table --file /tmp/colors.csv --copy | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 4.795 ms
COPY 3
Time: 1.211 ms
~$ pg test
(postgres@[local]:5432 21:19:54) [test]> \d colors
Table "public.colors"
Column | Type | Modifiers
-------------+------+-----------
Id | text |
Color | text |
Color Name | text |
Description | text |
Hex # | text |
Inventory | text |
Add Date | text |
The above command emitted a CREATE TABLE
statement and a COPY statement and loaded all
the data into a table with the same fields and structure as the
supplied CSV. The name of the table was inferred from the
name of the file.
Every column was created with the type TEXT.
csv2table can attempt to automatically find
types for your columns. Pass each type you'd like to
detect:
~$ csv2table --file /tmp/colors.csv --integer --timestamp
create table "colors" (
"Id" integer,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" text,
"Add Date" timestamp
);
As you can see, the Id column will be created
as type INTEGER and the Add Date
column will be created with type TIMESTAMP.
This is the full list of supported type detections:
- --integer
- --numeric
- --timestamp (also used with --tz for TIMESTAMPTZ)
- --date
- --array
- --json
- --jsonb
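For instance, if I also ask for numeric detection, the Inventory column gets picked up as well. The output below is what I'd expect rather than a captured run, but it only differs from the earlier example in the two detected columns:
~$ csv2table --file /tmp/colors.csv --integer --numeric --timestamp
create table "colors" (
"Id" integer,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" numeric,
"Add Date" timestamp
);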
Another flag I almost always use is -1 aka
--transaction. This wraps the whole thing in a
transaction so that if your data load fails, the table and
any other structures won't be created and left behind in a
half-complete state.
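I'd expect the emitted SQL to look roughly like this; the output is illustrative, and the only difference from the earlier example is the begin/commit wrapper:
~$ csv2table --file /tmp/colors.csv --copy -1
begin;
create table "colors" (
"Id" text,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" text,
"Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date") from '/tmp/colors.csv' with csv header delimiter ',' quote '"';
commit;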
d e l i m i t e d
There's a long-running joke at work about receiving "delimited"
files. We've received files with many different
delimiters as well as files masquerading as delimited
files. The most popular delimiters are probably
',' and '|':
~$ csv2csv --in-file /tmp/colors.csv --out-delimiter '|' --out-file /tmp/colors.psv
~$ csv2table --file /tmp/colors.psv --delimiter '|' --copy
create table "colors" (
"Id" text,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" text,
"Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date") from '/tmp/colors.psv' with csv header delimiter '|' quote '"';
Parsing and COPY work just fine
(csv2csv
is another tool I use semi-regularly).
Tabs show up pretty frequently too. They're hard to type, so
csv2table has some helper handling:
~$ csv2csv --in-file /tmp/colors.csv --out-delimiter '\t' --out-file /tmp/colors.tsv
~$ csv2table --file /tmp/colors.tsv --delimiter '\t' --copy
create table "colors" (
"Id" text,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" text,
"Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date") from '/tmp/colors.tsv' with csv header delimiter ' ' quote '"';
Sometimes, you run into files that are kinda CSV files. Delimited, sure, but without a quote character. I commonly see this with pipe-delimited files. For our data, I can usually get away with picking, say, @ as the quote character, which prevents a quote from ever being found. So if your data has something like this in it:
~$ cat /tmp/lol.csv
oh|hai|there
let's just sprinkle "|some "" characters|"around
~$ csv2table --file /tmp/lol.csv --delimiter '|' --copy | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 4.421 ms
ERROR:  22P04: missing data for column "hai"
CONTEXT:  COPY lol, line 2: "let's just sprinkle "|some "" characters|"around"
LOCATION:  NextCopyFrom, copy.c:2989
Time: 2.384 ms
~$ csv2table --file /tmp/lol.csv --delimiter '|' --copy --quote '@' | pg test
SET
SET
Null display is "<NÜLLZØR>".
Timing is on.
CREATE TABLE
Time: 5.091 ms
COPY 1
Time: 1.432 ms
Kind of a hack but I plan on some day adding a feature to handle this directly.
Pro tip: When you're working on a Mac and need to load an
Excel file, save it as a CSV in the "Windows Comma
Separated (.csv)" format, not in the "Comma Separated Values
(.csv)" format. This will make the file loadable with
csv2table. The default format on OS X uses
just a carriage return to separate lines, which causes issues
with many tools. The Windows format includes a newline
as well.
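If you're already stuck with one of those CR-only files, standard tools can usually spot and fix it. The file names here are hypothetical:
~$ file /tmp/colors_mac.csv  # hypothetical CR-only export
/tmp/colors_mac.csv: ASCII text, with CR line terminators
~$ tr '\r' '\n' < /tmp/colors_mac.csv > /tmp/colors_fixed.csv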
Show Me Where The Bad File Touched You
The colors.csv file we've been working with
isn't ideal. Those column names are AWFUL (as I'm sure
you've noticed). csv2table has got your back
here, too:
~$ csv2table --file /tmp/colors.csv --mogrify
create table "colors" (
"Id" text,
"Color" text,
"Color_Name" text,
"Description" text,
"Hex_" text,
"Inventory" text,
"Add_Date" text
);
Better, but capitalization is really annoying too:
~$ csv2table --file /tmp/colors.csv --mogrify --lower
create table "colors" (
"id" text,
"color" text,
"color_name" text,
"description" text,
"hex_" text,
"inventory" text,
"add_date" text
);
Ahh, much better. Same rules apply to the name of the generated table. So when Jimmy names your files:
~$ cp /tmp/colors.csv "/tmp/How To Name Files Like a Windows User (1).csv"
~$ csv2table --file "/tmp/How To Name Files Like a Windows User (1).csv" --mogrify --lower
create table "how_to_name_files_like_a_windows_user_1_" (
"id" text,
"color" text,
"color_name" text,
"description" text,
"hex_" text,
"inventory" text,
"add_date" text
);
csv2table has your back there, too.
Sometimes, your file will have duplicate header names:
~$ cat /tmp/hehe.csv
1,1,1,1
wow,such,descriptive,names
~$ csv2table --file /tmp/hehe.csv --fix-duplicates
create table "hehe" (
"1" text,
"1_1" text,
"1_2" text,
"1_3" text
);
Not the greatest names in the world, but at least you'll be able to quickly load your data. I often find that columns with duplicate names are also ones I'd like to ignore, so this lets me get the data loaded and get on with my life.
Sometimes, Satan created your file. It has no headers.
csv2table is hard at work sprinkling holy
water on that dumpster fire, too:
~$ cat /tmp/fail.csv
there,are,no,names,at,all
only,works,with,two,lines,minimum
~$ csv2table --file /tmp/fail.csv --missing-headers
create table "fail" (
"col_0" text,
"col_1" text,
"col_2" text,
"col_3" text,
"col_4" text,
"col_5" text
);
Could be better but let's see you perform that exorcism.
🎶Cry For The Bad Data🎶
Not only will your files have bad formats, they'll have bad
data, too. There's only so much juju that
csv2table can bring to bear in this case, but
there are a couple of flags and techniques that I find useful
here.
Most issues come from the type detection being based on the
first row of data. --keep-going will read
more of the file and attempt to do a better job of things.
Processing the whole file can be slow, however, so
select a reasonably small number of lines.
--skip-parsing can be used to disable type
detection for specific columns. In my line of work, there
are many numeric identifiers. These are actually numeric
strings and not numbers. I pretty much never want them to
actually end up as numbers. --skip-parsing
will let you name these columns and prevent them from being
typed (actually, they end up as TEXT because
being untyped is a weird gray area).
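Putting those together looks something like this. I'm assuming --keep-going takes a line count and --skip-parsing takes column names, based on the descriptions above:
~$ csv2table --file /tmp/colors.csv --integer --numeric --keep-going 100 --skip-parsing Id  # flag arguments assumed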
If your data can load as TEXT you're probably
best off trying to fix it in the database. In postgres,
ALTER TABLE colors ALTER COLUMN "Id" TYPE INTEGER USING "Id"::INTEGER
is one approach. USING can take an arbitrary
expression and since this command re-writes the entire
table you end up with no wasted space. Of course, you can
use UPDATE as well.
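If the column needs a little cleanup on the way, the USING expression can carry that too. A sketch, assuming empty strings and stray whitespace are the only problems in the Inventory column:
-- hypothetical cleanup: treat empty strings as NULL; everything else must cast cleanly
ALTER TABLE colors
ALTER COLUMN "Inventory" TYPE NUMERIC
USING NULLIF(trim("Inventory"), '')::NUMERIC;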
If you can't get the data loaded as all TEXT,
you've got a real steaming turd file on your hands. If you
care about the data a lot, you'll have to open it in an
editor and futz around until you've got all those chunks of
corn bad data sorted. When you aren't
super concerned about your data iconv can
throw you a bone here. sed can probably help,
too. If you're stuck, try one of these:
~$ iconv -c -f utf-8 -t ascii colors.csv
~$ iconv -f utf8 -t ascii//TRANSLIT colors.csv
~$ sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//'
The first command will kill anything that isn't ASCII, the second will try to remove accents and such, and the third should do basically the same as the first. There's more information on Stack Overflow.
g-g-g-g-zip
csv2table supports gzip compressed files:
~$ gzip /tmp/colors.csv
~$ csv2table --file /tmp/colors.csv.gz --gzip --copy
create table "colors" (
"Id" text,
"Color" text,
"Color Name" text,
"Description" text,
"Hex #" text,
"Inventory" text,
"Add Date" text
);
copy "colors"("Id", "Color", "Color Name", "Description", "Hex #", "Inventory", "Add Date") from program 'gunzip < /tmp/colors.csv.gz' with csv header delimiter ',' quote '"';
Note that for this to work, you need postgres 9.3 or newer. The
COPY command above uses the
PROGRAM option to decompress the file.
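If you're not sure which version you're running, you can ask the server directly (using the test database from earlier):
~$ psql -c 'show server_version;' test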
Usage with Redshift
csv2table also works with redshift, which is a
data warehousing solution provided by AWS and based off of
postgres 8.0 (with backported features and added
enhancements). Working with redshift requires the file to
be in Amazon's infrastructure (often in S3). You
csv2table supports reading environment
variables (of the kind used by other Amazon tools) and
config files:
~$ csv2table --file /tmp/colors.csv --redshift "<ENV>"
The above will read the credentials from
AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, and
S3_BUCKET. If the credentials themselves are
not found, but AWS_CREDENTIAL_FILE is found,
then csv2table will read the file given by
that variable.
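So a session using environment variables might look like this (the values are obviously placeholders):
~$ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID  # placeholder values
~$ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
~$ export S3_BUCKET=my-transfer-bucket
~$ csv2table --file /tmp/colors.csv --redshift "<ENV>"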
~$ csv2table --file /tmp/colors.csv --redshift /tmp/config.ini
The above will read the given config file. The names are as follows:
| Value | Name Options |
|---|---|
| Access Key Id | AWSAccessKeyId, aws_access_key_id, access_key, s3.account_id |
| Secret Access Key | AWSSecretKey, aws_secret_access_key, secret_key, s3.private_key |
| S3 Bucket | s3.bucket, s3.transfer_bucket |
There are historical reasons for these names but the conventions are typical of those I've seen in the wild and those used in the codebase at work.
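As an illustration, a config.ini using the s3.* flavor of those names might look like this. I'm assuming the dotted prefix maps to an INI section; adjust to match whatever your config actually uses:
; hypothetical layout with placeholder values
[s3]
account_id = YOUR_ACCESS_KEY_ID
private_key = YOUR_SECRET_ACCESS_KEY
bucket = my-transfer-bucket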
The S3 bucket option is sort-of non-standard, but it's needed
so csv2table knows where to upload the file to or read it from.
csv2table will upload the file for you when used
with the --redshift-upload option. Otherwise,
it will assume that the file has already been uploaded to the
given bucket.
The gzip format is supported here as well and I would recommend using it with large files to reduce upload times (I haven't measured load times with and without compression, but anecdotally it seems to be at least as fast when the files are compressed).
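Putting the pieces together, a redshift load of a compressed file might look something like this (illustrative):
~$ gzip /tmp/colors.csv
~$ csv2table --file /tmp/colors.csv.gz --gzip --redshift /tmp/config.ini --redshift-upload --copy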
csv2table works without third party libraries
when used with postgres. With redshift, however,
boto must be available for the
--redshift-upload option to work. Sorry about
that.
Some more advanced features
csv2table has many, many more options. The
ones I often reach for can be grouped into three buckets:
schema handling, ancillary data, and indexing.
Schema handling options let you control how the table is created:
- --drop. Drop the table first, if it exists.
- --truncate. Empty the table before copying into it.
- --no-create. Don't create the table. Useful when loading multiple files into a table.
- --temporary. Make the table a TEMPORARY table.
- --cine. CREATE IF NOT EXISTS.
- --create-schema. When using --schema, this option can be used to create the schema as well.
Ancillary data options allow you to add in some additional information to your table:
- --file-column. Add a column with the name of the file the data was loaded from.
- --serial-column. Add a SERIAL column with the given name.
Lastly and leastly, indexing:
- --primary-key. Make the named column(s) the primary key of the table.
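As a final example, here's the kind of invocation I end up with when combining a few of these. The flag arguments are assumed from the descriptions above, and row_id is just a name I picked:
~$ csv2table --file /tmp/colors.csv --mogrify --lower --cine --serial-column row_id --primary-key row_id --copy -1  # row_id is a made-up column name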
Conclusion
Put those CSV files into your database! Happy querying!