Dealing with Very Big Files using Pandas DataFrame

While recently prototyping a demand prediction solution, I ran into a couple of problems.

  • Very big files. Such files cannot be loaded into memory because they are larger than the available RAM. 
  • Files may contain invalid lines at the very end. These lines break parsing and, in some cases, prevent setting types for columns. 

I found that pandas.read_csv(…) can solve both problems. Below I give a little more detail on each.

Invalid lines at the end of a file

I have 10+ files of 1.2 GB each. I tried to load them into a pandas.DataFrame and set column types with dtype={}, but it failed with an error.
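A minimal sketch of the failure mode, using a small in-memory stand-in for the real extract (the column names and the footer text here are hypothetical, not taken from the actual files):

```python
import io

import pandas as pd

# A tiny stand-in for the real extract: two proper data rows plus a stray
# footer line appended by the extraction job.
csv_data = "location_id,sales\n101,5\n102,7\n(3 rows affected)\n"

try:
    # Asking pandas to enforce integer types fails because the footer line
    # cannot be converted to int.
    df = pd.read_csv(io.StringIO(csv_data),
                     dtype={"location_id": int, "sales": int})
except ValueError as e:
    print("parsing failed:", e)
```

Without the dtype argument the file loads, but the affected columns silently end up as object/float instead of the intended integer types.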

It turned out that an extra line had been appended during the data extract. This one line caused a big problem because it could not be easily removed: the files are too big to edit in a text editor, and Notepad++ cannot load 1.2 GB files. I was saved because the data extract also included small files that I could investigate. A note for the future: a data extract should consist of small files suitable for text editors. Alternatively, you can use one of the editors that can handle huge files (read more here). I tried UltraEdit and it looks nice, but sadly it is not free (find more).

If you want to automate this, use two additional parameters when reading the CSV: skipfooter and engine.

These parameters let you skip lines at the bottom of a file. To use skipfooter, the engine must be set to "python" (read more here).
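A short sketch of the same read with the footer skipped (again using hypothetical column names and footer text):

```python
import io

import pandas as pd

csv_data = "location_id,sales\n101,5\n102,7\n(3 rows affected)\n"

# skipfooter=1 drops the last line of the file; it is only supported by the
# slower python parsing engine, hence engine="python".
df = pd.read_csv(
    io.StringIO(csv_data),
    skipfooter=1,
    engine="python",
    dtype={"location_id": int, "sales": int},
)
print(df)  # two clean rows, both columns parsed as integers
```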

Very big files

I’ve got about 20 GB of data in multiple files containing information about sales by location. The data represents sales history for all locations, sliced by quarters, and I need to transform it into history by location. The problem is that the required amount of memory is bigger than what is available on my system, so I cannot load everything into a pandas.DataFrame and query it. More about reading in chunks here.

Pandas can read a CSV in chunks, loading only a portion of the file into memory at a time. To do this, call pandas.read_csv(…, chunksize=N). Each chunk is a DataFrame, so you can group and extract the data you need as usual.
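A minimal sketch of the chunked aggregation, with a small in-memory sample standing in for the real 20 GB extract (the column names are hypothetical):

```python
import io

import pandas as pd

# Stand-in for a large file: sales per location, sliced by quarter.
csv_data = (
    "location_id,quarter,sales\n"
    "101,Q1,5\n"
    "101,Q2,7\n"
    "102,Q1,3\n"
    "102,Q2,4\n"
)

total = None
# chunksize=2 yields DataFrames of at most 2 rows each, so only a small
# portion of the file is in memory at any moment.
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    # Aggregate within the chunk, then merge into the running totals.
    part = chunk.groupby("location_id")["sales"].sum()
    total = part if total is None else total.add(part, fill_value=0)

print(total)  # sales per location across all chunks
```

The key point is that the per-chunk aggregates are small, so combining them with Series.add(…, fill_value=0) keeps memory usage bounded regardless of the input size.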

Further Research

The process itself looks like an ETL pipeline, and I suspected there are well-known packages that provide a more robust approach to the task I implemented manually. Someone from the data science community suggested I use Luigi (git repo).

I also found an article about Dask, a Python library that allows working on large datasets. You can find more here. There is also an article about it on Kaggle (here).



