Timothy Helton

Learn Something New Every Day

pandas Best Practices

Python pandas Logo

June 6, 2017

"I think my code is well written, but this is taking forever to run!"

This article is going to display that sometimes processing speedups come in strange places. Hopefully, you will learn a trick or two and be able to write some better code on your project.


Background

I have been fortunate enough to attend two different training courses put on by Enthought Scientific Computing Solutions. The instructors were Mike McKerns and Alex Chabot-Leclerc who were highly knowledgeable professionals. The most recent class was pandas Mastery Workshop, which I would highly recommend. During the course, Alexandre made numerous comments about performance differences, and I wanted to document some surprising methods to speed up pandas .

Notebook

The notebook with all the code to recreate this study is located here .

Versions

The following versions of packages were used to measure the executions times.

NOTE: The timeit.autorange method was introduced in Python 3.6 and is used to automatically determine how many times to call timeit in the provided notebook.

Methodology

I'll use the following methods on various operations defined in the table.

Operation Application Method
pandas Series Addition
  1. Pure NumPy
  2. Pure pandas
  3. Wrap a pandas object in a NumPy method
  4. Conversion
    1. Convert a pandas object to a NumPy object
    2. Perform the calculation in NumPy
    3. Create a new pandas object from the NumPy results
pandas Series Multiplication
pandas DataFrame Addition Down Columns
pandas DataFrame Multiplication Down Columns
pandas DataFrame Addition Across Rows
pandas DataFrame Multiplication Across Rows
pandas DataFrame Element-Wise Applymap
  1. Pure NumPy
  2. Pure pandas
  3. Conversion
pandas Date String Format
  1. ISO 8601 Format YYYY-MM-DD
  2. MM/DD/YYYY Format

pandas Overhead Costs

pandas has a multitude of well thought out quality methods, but there is a price to pay to get them. A younger version of myself used to think, "Straight up NumPy is good enough for me." I refused to even consider a number of great packages. Now that I've been reformed, one of the key ideas I learned is to use the packages that are out there. When something doesn't seem quite right, it's time to dig a little deeper.

If equivalent operations are performed in NumPy and pandas, NumPy is clearly faster. The sunburst graphics below were generated using SNAKEVIZ , which works with cProfiler to display the calls made to the interpreter when code is executed.

                            arr = np.arange(1e6)
                        
NumPy Array Profile
                            ser = pd.Series(np.arange(1e6))
                        
pandas Series Profile
                            df = pd.DataFrame(np.arange(1e6).reshape(1000, 1000))
                        
pandas DataFrame Profile

pandas Series Numerical Operations

For numerical operations, one would expect NumPy to be faster than pandas and that is the case. In this section, mathematical addition and multiplication examples employing the optimized methods for each package were used. Later on we'll take a look at an example of when the pandas mapping function is utilized. The surprising discovery from this round of profiling is that wrapping a pandas object with a NumPy method provides essentially no performance benefit. It's also interesting to note that the multiplication operator is slightly faster than the addition operator. And yet a bigger surprise still is that converting a pandas object to a NumPy object, performing a calculation, and then creating a new pandas object takes less than half the time to just do the calculation in pandas!

Series Addition Profile
Series Multiplication Profile

pandas DataFrame Numerical Operations

The results for pandas DataFrames are similar to the Series. Again, multiplication is just a bit quicker than addition and the conversion technique provides great improvements in execution time.

DataFrame Addition Columns Profile
DataFrame Multiplication Columns Profile
DataFrame Addition Rows Profile
DataFrame Multiplication Rows Profile

Aggregate Methods

pandas does have a couple methods built to perform element-wise operations. For Series, there is map and DataFrames use apply and applymap . With the current implementation of these methods, pandas ends up calling a for loop in Python for map and apply, and a nested for loop in Python for applymap. This is where using the conversion technique could be a game changer. Once in NumPy, the operations are performed in a vectorized manner using optimized C routines. The performance gains will be relative to your array size, and in this case for one million elements there was an increase in speed by more than 400 times!

Date String Formats

If you read the documentation carefully for most of the methods related to importing time in pandas, you will see there is a fast path for strings formatted in ISO 8601.

                    iso_8601_format = 'YYYY-MM-DD'
                

The current profiling for this aspect of the code yielded a speed increase of four times just by altering the string format. There are two takeaways here:

  1. If you are able to control the gathering of date, chose to use the ISO 8601 format.
  2. If you have a large amount of times to convert in pandas, it may be worth pre-processing them into the ISO 8601 format prior to loading into pandas.

Date Format Profile

Conclusions

The key points to take away from this article are:


So on your next project, remember these simple tips and you'll be able to change the world a little faster!