Making numpy string processing faster.

Or tuples for the win!

Jun 10, 2023

NumPy is a Python library that provides a high-performance multidimensional array class with broadcasting functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. These features make it one of the most useful and widely used Python libraries for data science.

But let's be honest: developers love strings and always find ways to abuse them. As such, numpy comes with some string utility functions. Before you use them, make sure you actually need the guarantees they provide, because those guarantees come at a very high cost. For large-ish strings they can easily be more than 25X slower.

While at it, I’ve decided to apply knowledge from

First, switching from lists to tuples improved module initialization time by ~25%.
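The exact change is not reproduced here, so the following is only a sketch under the assumption that the table is assembled from slices of a 256-element sequence of one-character strings. The function names `build_with_lists` and `build_with_tuples` are hypothetical; they exist only so the two variants can be timed side by side.

```python
import timeit


def build_with_lists():
    """Build the lowercase table from list slices."""
    all_chars = [chr(i) for i in range(256)]
    ascii_lower = all_chars[97:97 + 26]
    # Identity up to 'A', then 'a'-'z' in place of 'A'-'Z', then identity.
    return "".join(all_chars[:65] + ascii_lower + all_chars[65 + 26:])


def build_with_tuples():
    """Same table, but every intermediate sequence is a tuple."""
    all_chars = tuple(map(chr, range(256)))
    ascii_lower = all_chars[97:97 + 26]
    return "".join(all_chars[:65] + ascii_lower + all_chars[65 + 26:])


if __name__ == "__main__":
    # Print raw timings; results vary by interpreter version and machine.
    for fn in (build_with_lists, build_with_tuples):
        print(fn.__name__, timeit.timeit(fn, number=10_000))
```

Both functions produce the same 256-character string, so any difference in the timings comes from how the intermediate sequences are built and concatenated.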

But since this turned _all_chars into a tuple, the string join that creates LOWER_TABLE seemed unnecessary. What if we get rid of the join?

Turns out that dropping the join reduced table creation overhead by more than half and, as a bonus, also made english_lower ~20% faster.
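Dropping the join works because `str.translate` only needs a table that supports indexing by code point, so a tuple of one-character strings is accepted as-is. The snippet below is an assumed reconstruction, not the code as merged; only the names `_all_chars`, `LOWER_TABLE`, and `english_lower` come from the text above.

```python
# Assumed reconstruction: the table stays a tuple of one-character strings
# instead of being joined into a single 256-character string.
_all_chars = tuple(map(chr, range(256)))
_ascii_lower = _all_chars[97:97 + 26]
LOWER_TABLE = _all_chars[:65] + _ascii_lower + _all_chars[65 + 26:]


def english_lower(s):
    """Lowercase ASCII letters only, using the tuple as the lookup table."""
    # Code points >= 256 raise IndexError inside translate, which it treats
    # as "no mapping", so those characters pass through unchanged.
    return s.translate(LOWER_TABLE)
```

A call such as `english_lower("NumPy")` returns `"numpy"`, and non-ASCII characters come back untouched.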

Note that tuples bring more than better performance. Their immutability is also a significant maintainability advantage, and it is one reason why they are hashable.
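A small illustration of both properties, using made-up data (`ASCII_VOWELS`) and hypothetical helper names. Keep in mind that a tuple is hashable only when all of its elements are hashable too.

```python
# Hypothetical example data, not taken from NumPy.
ASCII_VOWELS = ("a", "e", "i", "o", "u")


def try_to_mutate(seq):
    """Return True if item assignment succeeds, False if it is rejected."""
    try:
        seq[0] = "A"
    except TypeError:
        return False
    return True


def is_hashable(obj):
    """Return True if obj can be hashed (and so used as a dict key)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True
```

Here `try_to_mutate(ASCII_VOWELS)` returns `False` while the same call on `list(ASCII_VOWELS)` returns `True`, and `is_hashable` gives the opposite answers for the tuple and the list. A module-level constant stored as a tuple therefore cannot be modified in place by accident, and it can serve as a dictionary key or set member.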

So don’t miss out on tuple benefits and consider spreading your knowledge across open source projects we all use and love.

Software Bits Newsletter
