Every now and then I run into a piece of Python code that just doesn't run as fast as I would like. I use 3 different Python tools to find and fix Python performance problems.
cProfile:
cProfile is a module that is included in the Python Standard Library. It logs function calls and execution times. There is also a pure-python version named 'profile'.
KCachegrind:
KCachegrind was written to visualize the output generated by Callgrind (a C profiler), but the function call logs from cProfile can be converted to the KCachegrind format with this script. KCachegrind uses the KDE framework. On Linux boxes, KCachegrind is usually bundled in a package with other development tools written for KDE. The package is named 'kdesdk' or something similar.
timeit:
timeit is a module included in the Python Standard Library that is used for measuring the execution time of arbitrary pieces of code.
Here is the code that needs to be improved:
class LameStringBuilder(object):
def __init__(self, cols=40, rows=10000):
self.cols = cols
self.rows = rows
def build(self, val):
built = ''
for i in range(self.rows):
built += self.build_row(val) + "\n"
return built
def build_row(self, val):
built = ''
for i in range(self.cols - 1):
built += val + ','
built += val
return built
Here is the code to execute a profile test and format the results into something that KCachegrind can understand:
import cProfile
# Use this module to convert profile data
# into the KCacheGrind format.
#
# The module is available here:
# http://www.gnome.org/~johan/lsprofcalltree.py
import lsprofcalltree
import lame_string_builder
def test():
"""Code to profile."""
l = lame_string_builder.LameStringBuilder(40, 10000)
l.build('foo')
if __name__ == "__main__":
# Profile function
p = cProfile.Profile()
p.run('test()');
# Get profile log in KCacheGrind format
# and dump to file.
k = lsprofcalltree.KCacheGrind(p)
out = open('lame_profile.kgrind', 'w')
k.output(out)
out.close()
After running the test, launch KCachegrid and open the profile file to view the results.
The 'Flat Profile' and 'Callee Map' are the 2 most useful displays for finding the bottle necks in your code.
The 'Flat Profile' describes each function called in the test with the following columns:
- Incl. - The total percentage of time spent within a function.
- Self - The percentage of time spent within a function NOT including inner function calls.
- Called - The total number of times the function was called during the test.
- Function - The name of the function being described.
The 'Callee Map' represents the different functions called during the test. Each function is drawn as a rectangle. Inner function calls are drawn on top of their parent function's rectangle. The area used to draw each function is proportional to the execution time for each function call relative to the total execution time of the parent (also printed as a percentage in the graph).
In the example above it is easy to pick out where the bottle necks are by looking at the 'Callee Map'. Most of the test time was spent within the 'LameStringBuilder.build_row' method, so it is the main bottle neck. Other significant sources of time include calls to 'LameStringBuilder.build' and the built-in function 'range'.
After identifying the bottlenecks, I created a better version of the 'LameStringBuilder' class:
import lame_string_builder
class LessLameStringBuilder(lame_string_builder.LameStringBuilder):
def build(self, val):
return "\n".join([self.build_row(val) for i in xrange(self.rows)])
def build_row(self, val):
return ",".join([val for i in xrange(self.cols)])
After making changes to the code, Python's built-in 'timeit' module can be used to simply and accurately measure the differences in execution speed between the old and the new versions:
import timeit
import lame_string_builder
import less_lame_string_builder
def test_lame(val, cols, rows):
l = lame_string_builder.LameStringBuilder(cols, rows)
l.build(val)
def test_less_lame(val, cols, rows):
l = less_lame_string_builder.LessLameStringBuilder(cols, rows)
l.build(val)
def test_generic(repeat, name, args):
print "%s (x%i):" % (name, repeat)
t = timeit.Timer("%s(%s)" % (name, args), "from __main__ import %s" % name)
print t.timeit(repeat)
def test_all(repeat, val, cols, rows):
args = "'%s', %i, %i" % (val, cols, rows)
for name in ("test_lame", "test_less_lame"):
test_generic(repeat, name, args)
if __name__ == "__main__":
test_all(100, 'foo', 40, 10000)
The new code is significantly faster!