Data Scientists: Start Using Profilers
Data scientists often need to write a lot of complex, slow, CPU- and I/O-heavy code — whether you’re working with large matrices, millions of rows of data, reading in data files, or web-scraping.
Wouldn’t you hate to waste your time refactoring one section of your code, trying to wring out every last ounce of performance, when a few simple changes to another section could speed up your code tenfold?
If you’re looking for a way to speed up your code, a profiler can show you exactly which parts are taking the most time, allowing you to see which sections would benefit most from optimization.
A profiler measures the time or space complexity of a program. There’s certainly value in theorizing about the big-O complexity of an algorithm, but it can be equally valuable to examine its real, measured complexity.
Where is the biggest slowdown in your code? Is your code I/O bound or CPU bound? Which specific lines are causing the slowdowns?
Once you’ve answered those questions you’ll A) have a better understanding of your code and B) know where to target your optimization efforts in order to get the biggest boon with the least effort.
Let’s dive into some quick examples using Python.
The Basics
You might already be familiar with a few methods of timing your code. You could check the time before and after a line executes like this:
In [1]: import time
...: start_time = time.time()
...: a_function() # Function you want to measure
...: end_time = time.time()
...: time_to_complete = end_time - start_time
...: time_to_complete
Out[1]: 1.0110783576965332
Or, if you’re in a Jupyter Notebook, you could use the magic %time command to time the execution of a statement, like this:
In [2]: %time a_function()
CPU times: user 14.2 ms, sys: 41 μs, total: 14.2 ms
Wall time: 1.01 s
Or, you could use the other magic command %timeit, which gets a more accurate measurement by running the command multiple times, like this:
In [3]: %timeit a_function()
1.01 s ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
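If you’re not in a notebook, the standard library’s timeit module provides the same repeated-run measurement. Here’s a minimal sketch, with a_function defined purely as a stand-in for the code you’d actually measure:

```python
import time
import timeit

def a_function():
    time.sleep(0.01)  # stand-in for the code you want to measure

# Run the function 5 times per trial, repeat 3 trials, and keep the best
# (minimum) trial time, which is least affected by background noise.
best_trial = min(timeit.repeat(a_function, number=5, repeat=3))
print(f"best of 3 trials: {best_trial / 5:.4f} s per call")
```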
Alternatively, if you want to time your whole script, you can use the bash command time, like so:
$ time python my_script.py
real 0m1.041s
user 0m0.040s
sys 0m0.000s
These techniques are great if you want to get a quick sense of how long a script or a section of code takes to run, but they’re less useful when you want a more comprehensive picture. It would be a nightmare to wrap each line in time.time() checks. In the next section, we’ll look at how to use Python’s built-in profiler.
Diving Deeper with cProfile
When you’re trying to get a better understanding of how your code is running, the first place to start is cProfile, Python’s built-in profiler. cProfile will keep track of how often and for how long parts of your program were executed.
Just keep in mind that cProfile shouldn’t be used to benchmark your code. It’s written in C, which makes it fast, but it still introduces some overhead that could throw off your times.
There are multiple ways to use cProfile but one simple way is from the command line.
Before we demo cProfile, let’s start by looking at a basic sample program that will download some text files, count the words in each one, and then save the top 10 words from each to a file. That said, what the code does isn’t too important; we’ll just be using it to show how the profiler works.
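The script itself isn’t reproduced in this copy of the article, but here is a rough sketch of what it could look like. The function names match the profiler output below; the URL, regex, and exact logic are assumptions of mine, not the original code:

```python
# script.py -- illustrative sketch of the demo program
import re
import urllib.request
from collections import Counter

BOOK_URLS = [
    # e.g. a plain-text book from Project Gutenberg (illustrative URL)
    "https://www.gutenberg.org/files/11/11-0.txt",
]

def get_book(url):
    # I/O-bound step: download the book's text over HTTP
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="ignore")

def split_words(text):
    # CPU-bound step: split the text into lowercase words
    return re.split(r"[^A-Za-z']+", text.lower())

def count_words(words):
    # tally word frequencies, skipping empty strings from the split
    return Counter(w for w in words if w)

def read_books():
    results = {}
    for url in BOOK_URLS:
        words = split_words(get_book(url))
        results[url] = count_words(words).most_common(10)
    return results

def save_results(results, path="top_words.txt"):
    with open(path, "w") as f:
        for url, top in results.items():
            f.write(f"{url}: {top}\n")

if __name__ == "__main__":
    save_results(read_books())
```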
Now, with the following command, we’ll profile our script.
$ python -m cProfile -o profile.stat script.py
The -o flag specifies an output file for cProfile to save the profiling statistics.
Next, we can fire up python to examine the results using the pstats module (also part of the standard library).
In [1]: import pstats
...: p = pstats.Stats("profile.stat")
...: p.sort_stats(
...:     "cumulative" # sort by cumulative time spent
...: ).print_stats(
...:     "script.py" # only show fn calls in script.py
...: )
Fri Aug 07 08:12:06 2020    profile.stat

46338 function calls (45576 primitive calls) in 6.548 seconds

Ordered by: cumulative time
List reduced from 793 to 6 due to restriction <'script.py'>

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.008 0.008 5.521 5.521 script.py:1(<module>)
1 0.012 0.012 5.468 5.468 script.py:19(read_books)
5 0.000 0.000 4.848 0.970 script.py:5(get_book)
5 0.000 0.000 0.460 0.092 script.py:11(split_words)
5 0.000 0.000 0.112 0.022 script.py:15(count_words)
1 0.000 0.000 0.000 0.000 script.py:32(save_results)
Wow! Look at all that useful info!
For each function called, we’re seeing the following information:
ncalls: number of times the function was called
tottime: total time spent in the given function (excluding calls to sub-functions)
percall: tottime divided by ncalls
cumtime: total time spent in this function and all sub-functions
percall: (again) cumtime divided by ncalls
filename:lineno(function): the file name, line number, and function name
When reading this output, note that we’re hiding a lot of data; in fact, we’re only seeing 6 out of 793 rows. Those hidden rows are all the sub-functions being called from within functions like urllib.request.urlopen or re.split. Also, note that the <module> row corresponds to the code in script.py that isn’t inside a function.
Now let’s look back at the results, sorted by cumulative duration.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.008 0.008 5.521 5.521 script.py:1(<module>)
1 0.012 0.012 5.468 5.468 script.py:19(read_books)
5 0.000 0.000 4.848 0.970 script.py:5(get_book)
5 0.000 0.000 0.460 0.092 script.py:11(split_words)
5 0.000 0.000 0.112 0.022 script.py:15(count_words)
1 0.000 0.000 0.000 0.000 script.py:32(save_results)
Keep in mind the hierarchy of function calls. The top level, <module>, calls read_books and save_results. read_books calls get_book, split_words, and count_words. By comparing cumulative times, we see that most of <module>’s time is spent in read_books and most of read_books’s time is spent in get_book, where we make our HTTP request, making this script (unsurprisingly) I/O bound.
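You can also drive pstats entirely from code; sorting by tottime instead surfaces functions that are themselves expensive rather than merely calling expensive children. Here’s a self-contained sketch that profiles a toy function (not the article’s script) and reads back the stats file, mirroring what the -o flag produced:

```python
import cProfile
import io
import pstats

def busy():
    # toy CPU-bound work to stand in for a real workload
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()
profiler.dump_stats("toy.stat")  # same file format as `python -m cProfile -o`

stream = io.StringIO()
stats = pstats.Stats("toy.stat", stream=stream)
stats.sort_stats("tottime").print_stats(5)  # top 5 rows by own-time
report = stream.getvalue()
print(report)
```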
Next, let’s take a look at how we can be even more granular by profiling our code line-by-line.
Profiling Line-by-Line
Once we’ve used cProfile to get a sense of what function calls are taking the most time, we can examine those functions line-by-line to get an even clearer picture of where our time is being spent.
For this, we’ll need to install the line-profiler library with the following command:
$ pip install line-profiler
Once installed, we just need to add the @profile decorator to the function we want to profile. Here’s the updated snippet from our script:
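The updated snippet isn’t shown in this copy of the article, but it amounts to decorating the hot function. The sketch below adds a small no-op fallback (my addition, not part of the original) so the file also runs under plain python, where kernprof hasn’t injected profile:

```python
import builtins

# Under `kernprof -l`, `profile` is injected as a builtin. This no-op
# fallback keeps the script runnable when executed normally.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func

@profile
def read_books():
    # ...same body as before: download each book, split and count words...
    pass
```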
Note that we don’t need to import the profile decorator function; it will be injected by line-profiler.
Now, to profile our function, we can run the following:
$ kernprof -l -v script-prof.py
kernprof is installed along with line-profiler. The -l flag tells line-profiler to go line-by-line, and the -v flag tells it to print the result to the terminal rather than save it to a file.
The result for our script would look something like this:
The key column to focus on here is % Time. As you can see, 89.5% of our time parsing each book is spent in the get_book function (making the HTTP request), further validating that our program is I/O bound rather than CPU bound.
Now, with this new info in mind, if we wanted to speed up our code we wouldn’t want to waste our time trying to make our word counter more efficient. It takes only a fraction of the time compared to the HTTP request. Instead, we’d focus on speeding up our requests, possibly by making them asynchronous.
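As a hedged illustration of that idea, here’s a sketch of overlapping the downloads with a thread pool. time.sleep stands in for the network call so the example stays self-contained; in the real script you’d submit get_book instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_get_book(url):
    time.sleep(0.2)  # stand-in for an HTTP request
    return f"text of {url}"

urls = [f"book-{i}" for i in range(5)]

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    # All five "downloads" wait on I/O concurrently, so the whole batch
    # takes roughly 0.2 s instead of the ~1 s a sequential loop would.
    books = list(pool.map(fake_get_book, urls))
elapsed = time.time() - start
```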
Here, the results are hardly surprising, but on a larger and more complicated program, line-profiler is an invaluable tool in our programming tool belt, allowing us to peer under the hood of our program and find the computational bottlenecks.
Profiling Memory
In addition to profiling the time-complexity of our program, we can also profile its memory-complexity.
In order to do line-by-line memory profiling, we’ll need to install the memory-profiler library, which also uses the same @profile decorator to determine which function to profile.
$ pip install memory-profiler
$ python -m memory_profiler script.py
The result of running memory-profiler on our same script should look something like the following:
There are currently some issues with the accuracy of the “Increment” column, so just focus on the “Mem usage” column for now.
Our script hit peak memory usage on line 28, when we split the books up into words.
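If that split is the memory peak, one common remedy (a sketch of the general idea, not the article’s code) is to stream matches with a generator instead of materializing the full word list, so only the counter of unique words is ever held in memory:

```python
import re
from collections import Counter

def count_words_lazy(text):
    # re.finditer yields one match at a time, so we never build a list
    # containing every word in the book -- only the Counter of unique words.
    return Counter(m.group(0).lower() for m in re.finditer(r"[A-Za-z']+", text))
```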
Conclusion
Hopefully, now you’ll have a few new tools in your programming tool belt to help you write more efficient code and quickly determine how to best spend your optimization-time.
You can read more in the documentation for cProfile, line-profiler, and memory-profiler. I also highly recommend the book High Performance Python, by Micha Gorelick and Ian Ozsvald [1].
Thanks for reading! I’d love to hear your thoughts on profilers or data science or anything else. Comment below or reach out on LinkedIn or Twitter!
Source: https://towardsdatascience.com/data-scientists-start-using-profilers-4d2e08e7aec0