
Memory Leak When Using Strings < 128KB in Python?

Original title: Memory leak opening files < 128KB in Python?

Original question: I see what I think is a memory leak when running my Python script. Here is my script: import sys i

Solution 1:

You might simply be hitting the default behaviour of the Linux memory allocator.

Basically, Linux has two allocation strategies: sbrk() for small blocks of memory and mmap() for larger ones. sbrk()-allocated memory blocks cannot easily be returned to the system, while mmap()-based ones can (just unmap the pages).

So if you allocate a memory block larger than the threshold at which the malloc() allocator in your libc switches from sbrk() to mmap(), you see this effect. See the mallopt() call, especially MMAP_THRESHOLD (http://man7.org/linux/man-pages/man3/mallopt.3.html).
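As a rough way to watch this in action, here is a minimal sketch, assuming Linux and a glibc malloc with the default 128 KB MMAP_THRESHOLD; the rss_kb() helper is made up for this illustration and reads VmRSS from /proc/self/status:

def rss_kb():
    # Current resident set size in KB, read from /proc (Linux-only).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

print 'start:', rss_kb(), 'KB'

big = [' ' * (256 * 1024) for x in xrange(100)]    # 100 x 256 KB, above the threshold
del big
print 'after freeing 256 KB blocks:', rss_kb(), 'KB'

small = [' ' * (64 * 1024) for x in xrange(100)]   # 100 x 64 KB, below the threshold
del small
print 'after freeing 64 KB blocks:', rss_kb(), 'KB'

Freeing the 256 KB strings should visibly shrink the RSS, since those blocks are mmap()-backed and unmapped on free; freeing the 64 KB ones often will not, because the sbrk() heap cannot shrink past live allocations.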

Update: To answer your extra question: yes, it is expected that you leak memory that way if the memory allocator works like the libc one on Linux. If you used the Windows Low Fragmentation Heap instead, it would probably not leak; similarly on AIX, depending on which malloc is configured. Maybe one of the other allocators (tcmalloc etc.) also avoids such issues. sbrk() is blazingly fast but has issues with memory fragmentation. CPython cannot do much about this, as it does not have a compacting garbage collector, only simple reference counting.

Python offers a few ways to reduce buffer allocations; see for example this blog post on the buffer protocol and memoryviews: http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews/
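As a quick illustration of the idea in that post (not part of the original answer): slicing a str copies bytes, while slicing a memoryview of the same buffer does not allocate a new block:

data = ' ' * (256 * 1024)

chunk_copy = data[:128 * 1024]       # allocates a new 128 KB string
view = memoryview(data)
chunk_view = view[:128 * 1024]       # a view into the same buffer, no copy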

Solution 2:

I would look into garbage collection. It may be that larger files trigger garbage collection more frequently, while the small files are freed but collectively stay below some threshold. Specifically, call gc.collect() and then call gc.get_referrers() on the object to hopefully reveal what is keeping an instance around. See the Python docs here:

http://docs.python.org/2/library/gc.html?highlight=gc#gc.get_referrers
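A minimal sketch of that approach; the holder dict here is a hypothetical stand-in for whatever might be keeping your string alive:

import gc

mystr = ' ' * 1024
holder = {0: mystr}            # hypothetical lingering reference

gc.collect()                   # collect anything that is genuine garbage
for ref in gc.get_referrers(mystr):
    print type(ref)            # expect the holder dict and the module globals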

Update:

The issue relates to garbage collection, namespaces, and reference counting. The bash script you posted gives a fairly narrow view of the garbage collector's behaviour. Try a larger range and you will see patterns in how much memory certain ranges take. For example, change the bash for loop to cover a larger range, something like seq 0 16 2056.
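A hedged guess at the shape of that loop (the original bash script isn't shown here, so mem_test.py stands in for whatever it runs per size):

for kb in $(seq 0 16 2056); do
    python mem_test.py $kb
done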

You noticed the memory usage was reduced when you del mystr because you removed the remaining references to it. Similar results would likely happen if you limited the mystr variable to its own function, like so:

def loopy():
    mylist = []
    for x in xrange(100):
        mystr = ' ' * int(size_kb) * 1024   # size_kb comes from the enclosing script
        mydict = {x: mystr}
        mylist.append(mydict)
    return mylist

Rather than using bash scripts, I think you could get more useful information using a memory profiler. Here are a couple of examples using Pympler. This first version is similar to your code from Update 3:

import gc
import sys
import time
from pympler import tracker

tr = tracker.SummaryTracker()
print 'begin:'
tr.print_diff()

size_kb = sys.argv[1]

mylist = []
mydict = {}

print 'empty list & dict:'
tr.print_diff()

# Build 100 dicts, each holding a string of size_kb KB.
for x in xrange(100):
    mystr = ' ' * int(size_kb) * 1024
    mydict = {x: mystr}
    mylist.append(mydict)

print 'after for loop:'
tr.print_diff()

# Drop every remaining reference explicitly.
del mystr
del mydict
del mylist

print 'after deleting stuff:'
tr.print_diff()

collected = gc.collect()
print 'after garbage collection (collected: %d):' % collected
tr.print_diff()

time.sleep(2)
print 'took a short nap after all that work:'
tr.print_diff()

mylist = []
print 'create an empty list for some reason:'
tr.print_diff()

And the output:

$ python mem_test.py 256
begin:
                  types |   # objects |    total size
======================= | =========== | =============
                   list |         957 |      97.44 KB
                    str |         951 |      53.65 KB
                    int |         118 |       2.77 KB
     wrapper_descriptor |           8 |     640     B
                weakref |           3 |     264     B
      member_descriptor |           2 |     144     B
      getset_descriptor |           2 |     144     B
  function (store_info) |           1 |     120     B
                   cell |           2 |     112     B
         instancemethod |          -1 |     -80     B
       _sre.SRE_Pattern |          -2 |    -176     B
                  tuple |          -1 |    -216     B
                   dict |           2 |   -1744     B
empty list & dict:
  types |   # objects |   total size
======= | =========== | ============
   list |           2 |    168     B
    str |           2 |     97     B
    int |           1 |     24     B
after for loop:
  types |   # objects |   total size
======= | =========== | ============
    str |           1 |    256.04 KB
   list |           0 |    848     B
after deleting stuff:
  types |   # objects |      total size
======= | =========== | ===============
   list |          -1 |      -920     B
    str |          -1 |   -262181     B
after garbage collection (collected: 0):
  types |   # objects |   total size
======= | =========== | ============
took a short nap after all that work:
  types |   # objects |   total size
======= | =========== | ============
create an empty list for some reason:
  types |   # objects |   total size
======= | =========== | ============
   list |           1 |     72     B

Notice that after the for loop the total size for the str class is 256 KB, essentially the same as the argument I passed in. After explicitly removing the reference with del mystr, that memory is freed. After this, the garbage has already been picked up, so there's no further reduction from gc.collect().

The next version uses a function to create a different namespace for the string.

import gc
import sys
import time
from pympler import tracker

def loopy():
    # mystr is local here, so no module-level name keeps the last string alive.
    mylist = []
    for x in xrange(100):
        mystr = ' ' * int(size_kb) * 1024
        mydict = {x: mystr}
        mylist.append(mydict)
    return mylist


tr = tracker.SummaryTracker()
print 'begin:'
tr.print_diff()

size_kb = sys.argv[1]

mylist = loopy()

print 'after for loop:'
tr.print_diff()

del mylist

print 'after deleting stuff:'
tr.print_diff()

collected = gc.collect()
print 'after garbage collection (collected: %d):' % collected
tr.print_diff()

time.sleep(2)
print 'took a short nap after all that work:'
tr.print_diff()

mylist = []
print 'create an empty list for some reason:'
tr.print_diff()

And finally the output from this version:

$ python mem_test_2.py 256
begin:
                  types |   # objects |    total size
======================= | =========== | =============
                   list |         958 |      97.53 KB
                    str |         952 |      53.70 KB
                    int |         118 |       2.77 KB
     wrapper_descriptor |           8 |     640     B
                weakref |           3 |     264     B
      member_descriptor |           2 |     144     B
      getset_descriptor |           2 |     144     B
  function (store_info) |           1 |     120     B
                   cell |           2 |     112     B
         instancemethod |          -1 |     -80     B
       _sre.SRE_Pattern |          -2 |    -176     B
                  tuple |          -1 |    -216     B
                   dict |           2 |   -1744     B
after for loop:
  types |   # objects |   total size
======= | =========== | ============
   list |           2 |   1016     B
    str |           2 |     97     B
    int |           1 |     24     B
after deleting stuff:
  types |   # objects |   total size
======= | =========== | ============
   list |          -1 |   -920     B
after garbage collection (collected: 0):
  types |   # objects |   total size
======= | =========== | ============
took a short nap after all that work:
  types |   # objects |   total size
======= | =========== | ============
create an empty list for some reason:
  types |   # objects |   total size
======= | =========== | ============
   list |           1 |     72     B

Now we don't have to clean up the str, and I think this example shows why using functions is a good idea. Writing code where there's one big chunk in one namespace really prevents the garbage collector from doing its job. It will not come into your house and start assuming things are trash :) It has to know that things are safe to collect.

That Evan Jones link is very interesting btw.
