Dealing with Very Big Files using Pandas DataFrame

While recently working on a prototype of a demand prediction solution, I faced a couple of problems.

  • Very big files. Such files cannot be loaded into memory since they are bigger than the available RAM. 
  • Files may have invalid lines located at the very end. This breaks parsing and, in some cases, prevents setting types for columns. 

I found that pandas.read_csv(…) offers options that solve both problems. Below I give a little more detail on each.

Invalid lines in the end of file

I have 10+ files of 1.2 GB each. I was trying to load them into a pandas.DataFrame and set column types with dtype={}, but I failed and got an error.

data = pd.read_csv(path, error_bad_lines=False, dtype={'ID':int})

ValueError: invalid literal for int() with base 10: '(7588832 row(s) affected)'

It turned out that an additional line had been appended while the data extract was being executed. This one line caused me a big problem because it could not be easily removed: the files are too big for editing in a text editor, and Notepad++ cannot load 1.2 GB files. I was saved because the data extract also included small files which I could investigate. A note for the future: the data extract should consist of small files, suitable for text editors. Alternatively, you can use one of those editors which allow editing of huge files (read more here). I tried UltraEdit and it looks nice, but sadly it is not free (find more).
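Even without a special editor you can peek at the end of a huge file programmatically. The helper below is a minimal sketch (the function name is mine, not from any library): it seeks near the end of the file and decodes only the last few lines, so a multi-gigabyte file is never loaded into memory.

```python
import os

def tail_lines(path, num_lines=5, block_size=4096):
    """Return the last few lines of a file without reading the whole file."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read at most block_size bytes from the end of the file
        f.seek(max(0, size - block_size))
        data = f.read()
    lines = data.splitlines()
    return [line.decode('utf-8', errors='replace') for line in lines[-num_lines:]]
```

Calling tail_lines(path, 3) on one of my extracts would have revealed the stray "(… row(s) affected)" line immediately.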

If you want to automate this, you may want to use two additional parameters while reading the CSV: skipfooter and engine.

data = pd.read_csv(path, error_bad_lines=False, dtype={'ID':int}, skipfooter=1, engine='python')

These parameters let you skip lines at the bottom of a file. In order to use skipfooter, the engine must be set to 'python' (read more here).
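Here is a small self-contained illustration of the same fix on an in-memory CSV (the column names and the junk line are made up to mimic my extracts):

```python
import io

import pandas as pd

# A tiny CSV with the kind of trailing junk line the extract produced
raw = "ID;Amount\n1;10\n2;20\n(2 row(s) affected)\n"

# skipfooter=1 drops the last physical line before parsing, so
# dtype={'ID': int} no longer chokes on the junk; the python engine
# is required for skipfooter to work.
data = pd.read_csv(io.StringIO(raw), sep=';', skipfooter=1,
                   engine='python', dtype={'ID': int})
```

After this, data has exactly two rows and the ID column really is an integer column.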

Very big files

I've got about 20 GB of data in multiple files. The data contains information about sales by location: it represents the sales history for all locations, sliced by quarters. I need to transform it into a history per location. The problem is that the required amount of memory is bigger than what is available in my system, so I cannot load everything into a pandas.DataFrame and query it. More about reading in chunks here.
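The general pattern for querying data that doesn't fit in memory is to aggregate each chunk and then combine the partial results. A minimal sketch, assuming hypothetical 'LocationID' and 'Amount' columns and a ';' separator:

```python
import pandas as pd

def sales_by_location(path, chunksize=10000):
    """Sum 'Amount' per location without loading the whole file.

    'Amount' is an assumed column name for this sketch; adjust it
    to match the columns of the real extract.
    """
    totals = None
    for chunk in pd.read_csv(path, sep=';', chunksize=chunksize):
        partial = chunk.groupby('LocationID')['Amount'].sum()
        # Combine partial sums; fill_value=0 handles locations
        # that appear in only some of the chunks
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals
```

Peak memory usage is bounded by chunksize rows plus the running totals, regardless of the file size.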

Pandas allows reading a CSV in chunks, loading only a portion of the file into memory at a time. To do this, call pandas.read_csv(…, chunksize=N). Each chunk is a DataFrame, and you can group and extract the data you need as usual. The working code looks as below.

    import os

    import pandas as pd

    # Output files are written next to this script
    script_dir = os.path.dirname(os.path.abspath(__file__))

    # raw_files is the list of extract CSV paths collected earlier
    for raw_file in raw_files:
        print("Reading", raw_file)

        reader = pd.read_csv(raw_file, sep=';', error_bad_lines=False, chunksize=10000)
        chunk_count = 0
        for chunk in reader:
            grouped = chunk.groupby(['LocationID'])
            chunk_count = chunk_count + 1
            print("Processing next chunk...", chunk_count)
            for name, group in grouped:
                # I use the assumption that the customer name is in the file name
                customer_name = 'Cust1' if 'Cust1' in raw_file else 'Cust2'
                destination_abs_file_path = os.path.join(
                    script_dir, '{}_location_{}.csv'.format(customer_name, name))
                if os.path.exists(destination_abs_file_path):
                    # Append without a header if the per-location file already exists
                    group.to_csv(destination_abs_file_path, sep=';', index=False,
                                 mode='a', header=False)
                else:
                    # First write for this location creates the file with a header
                    group.to_csv(destination_abs_file_path, sep=';', index=False)

Further Research

The process itself looks like an ETL process, and I had an idea that there are some well-known packages which provide a more robust approach to the same task I implemented manually. Someone from the data science community suggested I use Luigi (git repo).

Also, I found an article about the Dask Python library, which allows working on large datasets. You can find more here, and there is another article about it here.
