Data Conversions

dataConversionThe data that I work usually doesn’t arrive in the format that I need for cleaning or analyzing. After writing individual scripts each time I needed to do this conversion, I decided to take the time to create scripts that I can use repeatedly.

Initially I created these data conversion scripts in PowerShell since that’s what I was working with at the time, and later switched to Python so I can use them in any environment. They generally work the same, so use whichever is easiest for you.

My scripts accept csv, json or tsv as inputs, and can output to csv, sqlite or xml. The format to use them is filename.csv, which converts your data and outputs filename.db (filename & format dependent, of course).

Here’s a list of the conversions available:

  • csv to json
  • csv to sqlite
  • json to csv
  • json to sqlite
  • json to xml
  • tsv to csv

One quick note on converting from json to x. My scripts are only designed to work with flat json files (no nested objects or arrays). If your input json isn’t flat, my script will produce columns of nested json. If that doesn’t work for your needs, there are a couple of options. You can flatten it via a script like so (found on stackoverflow):

def flatten(structure, key="", path="", flattened=None):
    if flattened is None:
        flattened = {}
    if type(structure) not in(dict, list):
        flattened[((path + ".") if path else "") + key] = structure
    elif isinstance(structure, list):
        for i, item in enumerate(structure):
            flatten(item, "%d" % i, ".".join(filter(None,[path,key])), flattened)
        for new_key, value in structure.items():
            flatten(value, new_key, ".".join(filter(None,[path,key])), flattened)
    return flattened

Or you can hard code the conversion yourself. Here’s an example (also from stackoverflow):

for item in data:

Use, share or modify these conversion scripts as you care to.