Diagnosing and Fixing Memory Leaks in Python
Published: 2019-05-11



Fugue uses Python extensively throughout the product and in our support tools, due to its ease of use, extensive package library, and powerful language tools. One thing we’ve learned from building complex software for the cloud is that a language is only as good as its debugging and profiling tools. Logic errors, CPU spikes, and memory leaks are inevitable, but a good debugger, CPU profiler, and memory profiler can make finding these errors significantly easier and faster, letting our developers get back to creating Fugue’s dynamic cloud orchestration and enforcement system. Let’s look at a case in point.


In the fall, our metrics reported that a Python component of Fugue called the reflector was experiencing random restarts and instability after a few days of uptime. Looking at memory usage showed that the reflector’s memory footprint increased monotonically and continuously, indicating a memory leak. tracemalloc, a powerful memory tracking tool in the Python standard library, made it possible to quickly diagnose and fix the leak. We discovered that the memory leak was related to our use of requests, a popular third-party Python HTTP library. Rewriting the component to use urllib from the Python standard library eliminated the memory leak. In this blog, we’ll explore the details.


Metrics show the problem: Percentage of total system memory used by the reflector, using the requests library.


Memory Allocation in Python

In most scenarios, there’s no need to understand memory management in Python beyond knowing that the interpreter manages memory for you. However, when writing large, complex Python programs with high stability requirements, it’s useful to peek behind the curtain to understand how to write code that interacts well with Python’s memory management algorithms. Notably, Python uses reference counting and garbage collection to free memory blocks, and only frees memory to the system when certain internal requirements are met. A pure Python script will never have direct control over memory allocation in the interpreter. If direct control over memory allocation is desired, the interpreter’s memory allocation can be bypassed by writing or using an extension. For example, NumPy manages memory for large data arrays using its own memory allocator.


Fundamentally, Python is a garbage-collected language that uses reference counting. The interpreter automatically allocates memory for objects as they are created and tracks the number of references to those objects in a data structure associated with the object itself. This memory is freed when the reference count for an object reaches zero. In addition, garbage collection detects cycles and removes objects that are only referenced in cycles. Between these two mechanisms, every byte of memory allocated within the Python interpreter can be freed, but no such guarantee applies to memory allocated in extensions.

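A minimal illustration of these two mechanisms (the Node class below is invented purely for demonstration):

import gc
import sys

class Node:
    def __init__(self):
        self.other = None

a = Node()
b = Node()
print(sys.getrefcount(a))   # 2: our name 'a' plus the temporary argument reference

# Create a reference cycle: a -> b -> a
a.other = b
b.other = a

# Dropping both names makes the cycle unreachable, but reference counting
# alone cannot free it; the cycle collector finds and frees it instead.
del a, b
print(gc.collect())         # number of unreachable objects found in this pass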

Python manages its own heap, separate from the system heap. Memory is allocated in the Python interpreter by different methods according to the type of the object to be created. Scalar types, such as integers and floats, use different memory allocation methods than composite types, such as lists, tuples, and dictionaries. In general, memory is allocated on the Python heap in fixed-size blocks, depending on the type. These blocks are organized into pools, which are further organized into arenas. Memory is pre-allocated using arenas, pools, and blocks, which are then used to store data as needed over the course of the program’s execution. Since these blocks, pools, and arenas are kept in Python’s own heap, freeing a memory block merely marks it as available for future use in the interpreter. Freeing memory in Python does not immediately free the memory at the system level. When an entire arena is marked as free, its memory is released by the Python interpreter and returned to the system. However, this may occur infrequently due to memory fragmentation.

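For the curious, CPython ships an undocumented, implementation-specific helper that dumps the state of these arenas, pools, and blocks. A rough sketch of how one might peek at it (CPython only; the workload is arbitrary):

import sys

# Allocate a pile of small objects so there is something to look at.
words = [str(i) for i in range(100000)]

# Undocumented and CPython-specific: prints pymalloc's arena/pool/block
# usage, broken down by size class, to stderr.
sys._debugmallocstats()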

Due to these abstractions, memory usage in Python often exhibits high-water-mark behavior, where peak memory usage determines the memory usage for the remainder of execution, regardless of whether that memory is actively being used. Furthermore, the relationship between memory being “freed” in code and being returned to the system is vague and difficult to predict. These behaviors make completely understanding the memory usage of complex Python programs notoriously difficult.

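A rough, Linux-only sketch of this high-water-mark effect (the rss_kib helper and the numbers it prints are only for this sketch; exact behavior varies with interpreter version and allocation pattern):

import gc
import os

def rss_kib():
    # Linux-only: resident set size read from /proc, in KiB
    with open('/proc/self/statm') as f:
        return int(f.read().split()[1]) * os.sysconf('SC_PAGE_SIZE') // 1024

print('baseline RSS:', rss_kib())
big = [bytearray(1024) for _ in range(100000)]   # roughly 100 MiB of small blocks
print('after allocating:', rss_kib())

del big
gc.collect()
# The blocks are free as far as Python is concerned, but the process may keep
# much of the memory, since arenas are only returned when completely empty.
print('after freeing:', rss_kib())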

Memory Profiling Using tracemalloc

tracemalloc is a package included in the Python standard library (as of version 3.4). It provides detailed, block-level traces of memory allocation, including the full traceback to the line where the memory allocation occurred, and statistics for the overall memory behavior of a program. The official documentation provides a good introduction to its capabilities, and PEP 454, which introduced the module, offers some insight into its design.

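As a quick orientation before the detailed walkthrough, a minimal, self-contained usage sketch (the workload here is invented purely for illustration):

import tracemalloc

tracemalloc.start()

# ... the code being profiled; here, an arbitrary throwaway workload ...
data = [{'x': i, 'y': str(i)} for i in range(100000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)          # e.g. "example.py:8: size=... KiB, count=..., average=..."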

tracemalloc can be used to locate high-memory-usage areas of code in two ways:


  • looking at cumulative statistics on memory use to identify which object allocations are using the most memory, and
  • tracing execution frames to identify where those objects are allocated in the code.

Module-level Memory Usage

We start by tracing the memory usage of the entire program, so we can identify, at a high level, which objects are using the most memory. This will hopefully provide us with enough insight to know where and how to look more deeply. The following wrapper starts tracing and prints statistics when Ctrl-C is hit:


import tracemalloc

tracemalloc.start(10)

try:
    run_reflector()
except:
    snapshot = tracemalloc.take_snapshot()
    top_n(25, snapshot, trace_type='filename')

tracemalloc.start(10) starts memory tracing, while saving 10 frames of traceback for each entry. The default is 1, but saving more traceback frames is useful if you plan on using tracebacks to locate memory leaks, which will be discussed later. tracemalloc.take_snapshot() takes a snapshot of currently allocated memory in the Python heap. It stores the number of allocated blocks, their size, and tracebacks to identify which lines of code allocated which blocks of memory. Once a snapshot is created, we can compute statistics on memory use, compare snapshots, or save them to analyze later. top_n is a helper function I wrote to pretty print the output from tracemalloc. Here, I ask for the top 25 memory allocations in the snapshot, grouped by filename. After running for a few minutes, the output looks like this:

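The top_n helper itself isn't shown in the original post; a plausible reconstruction, assuming it simply groups a snapshot's statistics by the requested key and pretty-prints the largest entries, might look like this:

def top_n(n, snapshot, trace_type='filename'):
    # Hypothetical reconstruction of the author's helper: group the snapshot's
    # statistics by the given key and print the n largest entries.
    for index, stat in enumerate(snapshot.statistics(trace_type)[:n], 1):
        print("#{}: {:.1f} KiB in {} blocks".format(index, stat.size / 1024, stat.count))
        for line in stat.traceback.format():
            print("    " + line)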

This shows the cumulative amount of memory allocated by the component over the entire runtime, grouped by filename. At this level of granularity, it’s hard to make sense of the results. For instance, the first line shows us that 17 MB of collections objects are created, but this view doesn’t provide enough detail for us to know which objects, or where they’re being used. A different approach is needed to isolate the problem.


Understanding tracemalloc Output

tracemalloc shows the net memory usage at the time a memory snapshot is taken. When comparing two snapshots, it shows the net memory usage between the two snapshots. If memory is allocated and freed between snapshots, it won’t be shown in the output. Therefore, if snapshots are created at the same point in a loop, any memory allocations visible in the differences between two snapshots are contributing to the long-term total amount of memory used, rather than being a temporary allocation made in the course of execution.


In the case of reference cycles that require garbage collection, uncollected cycles are recorded in the output, while collected cycles are not. Any blocks freed by the garbage collector in the time covered by a snapshot will be recorded as freed memory. Therefore, forcing garbage collection with gc.collect() before taking a snapshot will reduce noise in the output.

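In practice that just means collecting immediately before snapshotting:

import gc
import tracemalloc

gc.collect()                             # free collectable cycles first
snapshot = tracemalloc.take_snapshot()   # the snapshot now contains less noise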

Per-Iteration Memory Usage

Since we’re looking for a memory leak, it’s useful to understand how the memory usage of our program changes over time. We can instrument the main loop of the component, to see how much memory is allocated in each iteration, by calling the following method from the main loop:


def collect_stats(self):
    self.snapshots.append(tracemalloc.take_snapshot())
    if len(self.snapshots) > 1:
        stats = self.snapshots[-1].filter_traces(filters).compare_to(self.snapshots[-2], 'filename')
        for stat in stats[:10]:
            print("{} new KiB {} total KiB {} new {} total memory blocks: ".format(stat.size_diff/1024, stat.size / 1024, stat.count_diff, stat.count))
            for line in stat.traceback.format():
                print(line)

This code takes a memory snapshot and saves it, then uses snapshot.compare_to(other_snapshot, group_by='filename') to compare the newest snapshot with the previous snapshot, with results grouped by filename. After a few iterations to warm up memory, the output looks like this:


The linecache (1) and tracemalloc (2) allocations are part of the instrumentation, but we can also see some memory allocations made by the requests HTTP package (3) that warrant further investigation. Recall that tracemalloc tracks net memory usage, so these memory allocations are accumulating on each iteration. Although the individual allocations are small and don’t jump out as problematic, the memory leak only becomes apparent over the course of a few days, so it’s likely to be a case of small losses adding up.


Filtering Snapshots

Now that we have an idea of where to look, we can use tracemalloc’s filtering capabilities to show only memory allocations related to the requests package:


from tracemalloc import Filter

filters = [Filter(inclusive=True, filename_pattern="*requests*")]
filtered_stats = snapshot.filter_traces(filters).compare_to(old_snapshot.filter_traces(filters), 'traceback')
for stat in filtered_stats[:10]:
    print("{} new KiB {} total KiB {} new {} total memory blocks: ".format(stat.size_diff/1024, stat.size / 1024, stat.count_diff, stat.count))
    for line in stat.traceback.format():
        print(line)

snapshot.filter_traces() takes a list of Filters to apply to the snapshot. Here, we create a Filter in inclusive mode, so it includes only traces that match the filename_pattern. When inclusive is False, the filter excludes traces that match the filename_pattern. The filename_pattern uses UNIX-style wildcards to match filenames in the traceback. In this example, the wildcards in "*requests*" match any occurrence of "requests" in the middle of a path, such as "/Users/mike/.pyenv/versions/venv/lib/python3.4/site-packages/requests/sessions.py".

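Exclusive filters work the same way in reverse. For example, one common pattern is to hide allocations made by the tracing machinery itself; a short sketch, where snapshot is assumed to be a snapshot taken as above:

import tracemalloc
from tracemalloc import Filter

noise_filters = [
    Filter(False, "<frozen importlib._bootstrap>"),   # import machinery
    Filter(False, tracemalloc.__file__),              # tracemalloc itself
    Filter(False, "*/linecache.py"),                  # traceback rendering
]
clean_snapshot = snapshot.filter_traces(noise_filters)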

We then use compare_to() to compare the results to the previous snapshot. The filtered output is below:


With the Filter in place, we can clearly see how requests is using memory. Line (4) shows that roughly 50 KiB of memory is lost in requests on each iteration of the main loop. Note that negative memory allocations, such as (5), are visible in this output. These allocations are freeing memory allocated in previous loop iterations.


Tracking Down Memory Allocations

To determine which uses of requests are leaking memory, we can take a detailed look at where problematic memory allocations occur by calling compare_to() with traceback instead of filename, while using a Filter to narrow down the output:


stats = snapshot.filter_traces(filters).compare_to(old_snapshot.filter_traces(filters), 'traceback')

This prints 10 frames of traceback (since we started tracing with tracemalloc.start(10)) for each entry in the output, a truncated example of which is below:


The full traceback gives us the ability to trace backwards from memory allocations to the lines in our project code that generate them. In the case of this component, our uses of requests came from an internal storage library that used an HTTP API. Rewriting the library to use urllib directly eliminated the memory leak.

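The internal storage library isn't shown here, but the shape of the replacement call was roughly the following (the function name and JSON handling are illustrative, not the actual Fugue code):

import json
import urllib.request

def fetch_json(url):
    # Illustrative only: a single request whose response is fully read and
    # closed before returning, with no long-lived session or pooled connections.
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))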

Metrics indicate the problem is solved: Percentage of total system memory used by the reflector, after removing requests and switching to urllib.


Memory Profiling: Art or Science?

tracemalloc is a powerful tool for understanding the memory usage of Python programs. It helped us understand module-level memory usage, find out which objects are being allocated the most, and it demonstrated how the reflector’s memory usage changed on a per-iteration basis. It comes with useful filtering tools and gives us the ability to see the full traceback for any memory allocation. Despite all of its features, however, finding memory leaks in Python can still feel like more of an art than a science. Memory profilers give us the ability to see how memory is being used, but oftentimes it’s difficult to find the exact memory allocation that is causing problems. It’s up to us to synthesize the information we get from our tools into a conclusion about the memory behavior of the program, then make a decision about what actions to take from there.


