Wednesday, February 5, 2014

Clever Combining CSV Files to Cure KickStarter Conundrums

I recently ran a successful KickStarter for a Bubblegum Crisis Ultimate Edition (you can still sign up here), and in the aftermath ran into an annoying problem.

KickStarter lets you run a Backer Survey after the project is funded, and that's how backers tell you things like their names, addresses and so on. A typical KickStarter will have many different "support levels" providing different benefits.

Annoyingly, KickStarter forces you to design a survey format for each individual support level, with no way of cloning surveys. But what really got me steamed was the discovery that when you export the survey results, you get a CSV file for each survey, and the format of that CSV file depends on the items in each individual survey.

That makes it a royal pain in the ass to import the data into your own database -- I mean, I had 38 CSV files, each with its own ordering of fields. Plus given that people take their time filling out the surveys (2 months later and they're still trickling in), I was going to be doing the import multiple times.

To solve this, I wrote a quick little python script, combine-csv.py, that reads in an entire directory of CSV files, extracts all the unique field names from them, and then outputs a single file containing all the rows of data, with a column for each unique field. In cases where a row does not have an entry for a particular field, it gets a blank entry. Also, if a file contains multiple columns that have the same name, the data in them is combined into a single column, separated by the | delimiter.

Added: optional count field name; if present, each unique line is emitted only once, with the last field being a count of the number of occurrences. If not present, the source file name is appended.

You can find it here: combine-csv.py.zip

You use it like this:

python combine-csv.py [source directory] [destination csv file] {optional count field name}

Enjoy!

Update: I also whipped up a standalone AllCSV app that does the same thing; available for both Mac and Windows. It does not currently do the occurrence counting.

14 comments:

Andy said...

Hi, just wanted to say thanks for this - I have had the same problem with multiple CSV files with differing fields and without enough Python knowledge have given up multiple times already, and this did the job first time. Amazing. Thank you!

MadOverlord said...

Glad it was helpful -- the whole point of the post was to help out TNPB (The Next Poor Bastard).

I just updated the post with links to Mac and Windows standalone apps that do the same thing.

Unknown said...

Hi,
Thank you. your blog is super useful. I am just running into a small problem. After combining the number of files that I have the number of rows exceed 1048576 (the max possible in excel), and cause of this my code stops running beyond that point. Any idea how can i modify your code to solve this problem?
Thanks again!!

MadOverlord said...

I don't think that's a problem with my code, you're just generating a file that is too big for excel.

You could use a text editor to split the file up into segments, each with less than the max number of rows. You would have to copy the first line into each subfile so that each would have the same header.

However, I don't think that this will help you because the total number of lines will be more than excel can handle. For data sets this large, Excel is perhaps not the best tool.

One possibility is if your data has many duplicate lines with the same data, it might be possible to tweak the program so that it combined identical lines and added a "count" column. You would have to take the count into account when doing your analysis, of course.

Unknown said...

I agree, there is no problem with your code. It works perfectly fine with smaller data set.
I was perhaps under the impression that your code is generating a .csv file with excel under consideration and hence if that could be tweaked, but thanks for the clarification.

I just want to generate a huge .csv and use it for further analyzes. Hence, doesn't matter if its excel or something else.

Nonetheless, thank you for your suggestions. Will try something like that. Will also see if some bash scripting does the trick.

I'll post something I get my way around this thing, for others reading your blog.

MadOverlord said...

Email me at rjwoodhead@gmail.com with more details of your situation and perhaps I can tweak the python script to be more helpful.

MadOverlord said...

Updated combine-csv.py to optionally output only the unique lines along with a count field; only lightly tested, so use with care.

Format is: python combine-csv.py {source folder} {output file} [{optional count field name}]

Read the python script for more details.

Anonymous said...

Hello I beginner for using python by the way, i confuse about your code, how to input parameter in your code?

MadOverlord said...

It gets its parameters from the command line; the example shows it being executed from a command shell (Terminal app on Mac, and Command.com on Windows IIRC)

Labinov said...

This worked like a charm! Amazing python script! As a "next poor bastard", I really appreciate what you've done here. I'm new to Python and your was the first script I ran on my machine to much success. I did want to ask how to handle the following two scenarios:

1) multiple excel files with multiple tabs - I just want the data in one tab and not the others
2) The data in each tab doesn't have a date in it and the date is only included in the name of the excel file. This is a bit challenging. I *could* parse the name in SQL but the way your code puts in the full name of the file = includes the folder. If it just included the name of the files itself, I could parse out the date. Conversely, the file title is as follows "file #2 - DRAFT- String String String - Month Year" (spaces included) - do you have any python documentation for pulling the Month Year out after the delimiter?

I know I have a lot of questions but please do point me in the right direction or provide insight per your availability. Thank you!

Labinov said...

This worked like a charm! Amazing python script! As a "next poor bastard", I really appreciate what you've done here. I'm new to Python and your was the first script I ran on my machine to much success. I did want to ask how to handle the following two scenarios:

1) multiple excel files with multiple tabs - I just want the data in one tab and not the others
2) The data in each tab doesn't have a date in it and the date is only included in the name of the excel file. This is a bit challenging. I *could* parse the name in SQL but the way your code puts in the full name of the file = includes the folder. If it just included the name of the files itself, I could parse out the date. Conversely, the file title is as follows "file #2 - DRAFT- String String String - Month Year" (spaces included) - do you have any python documentation for pulling the Month Year out after the delimiter?

I know I have a lot of questions but please do point me in the right direction or provide insight per your availability. Thank you!

MadOverlord said...

1) I think you will need to export individual tabs as CSV files. I have no idea if Excel will let you do that. You may be able to just copy/paste the entire tab into a new document and then just save as CSV.

2) The variable "filename" contains the full filepath, and is added to the rows in the line that reads newRow.append(filename). Replacing it with something like newRow.append(os.path.split(filename)[-1]) would probably do the trick.

When looking for ways to do things, googling " how to " usually works and will almost always return a StackOverflow.com page

MadOverlord said...

Er, used brackets in that reply... googling "(language name) how to (do whatever you want)" is what you want to do. So "python how to extract filename from filepath", etc.

Labinov said...

Thank you! Also, yes, Google is in fact the best place for documentation. I just figured that before I go fishing I ask someone who has clearly done some of their own about this subject.