從Jupyter Notebook切換到腳本的5個理由

意見 (Opinion)

動機 (Motivation)

Like most people, the first tool I used when started learning data science is Jupyter Notebook. Most of the online data science courses use Jupyter Notebook as a medium to teach. This makes sense because it is easier for beginners to start writing code in Jupyter Notebook’s cells than writing a script with classes and functions.

與大多數人一樣,我開始學習數據科學時使用的第一個工具是Jupyter Notebook。 大多數在線數據科學課程都使用Jupyter Notebook作為教學手段。 這是有道理的,因為對于初學者來說,在Jupyter Notebook的單元格中開始編寫代碼比編寫具有類和函數的腳本要容易得多。

Another reason why Jupyter Notebook is such a common tool in data science is that Jupyter Notebook makes it easy to explore and plot the data. When we type ‘Shift + Enter’, we will immediately see the results of the code, which makes it easy for us to identify whether our code works or not.

Jupyter Notebook之所以成為數據科學中如此普遍的工具的另一個原因是,Jupyter Notebook使其易于瀏覽和繪制數據。 當我們鍵入“ Shift + Enter”時,我們將立即看到代碼的結果,這使我們很容易確定我們的代碼是否有效。

However, I realized several fallbacks of Jupyter Notebook as I work with more data science projects:

但是,當我處理更多數據科學項目時,我意識到了Jupyter Notebook的一些后備功能:

  • Unorganized: As my code gets bigger, it becomes increasingly difficult for me to keep track of what I write. No matter how many markdowns I use to separate the notebook into different sections, the disconnected cells make it difficult for me to concentrate on what the code does.

    雜亂無章 :隨著我的代碼變得越來越大,對我而言,跟蹤自己的編寫變得越來越困難。 無論我使用多少次降價將筆記本分成不同的部分,斷開的單元格都使我難以集中精力執行代碼。

  • Difficult to experiment: You may want to test with different methods of processing your data, choose different parameters for your machine learning algorithm to see if the accuracy increases. But every time you experiment with new methods, you need to rerun the entire notebook. This is time-consuming, especially when the processing procedure or the training takes a long time to run.

    難以實驗: 可能想用不同的數據處理方法進行測試,為機器學習算法選擇不同的參數以查看準確性是否提高。 但是,每次嘗試新方法時,都需要重新運行整個筆記本。 這非常耗時,尤其是在處理過程或培訓需要很長時間才能運行時。

  • Not ideal for reproducibility: If you want to use new data with a slightly different structure, it would be difficult to identify the source of error in your notebook.

    對于重現性而言并不理想:如果要使用結構略有不同的新數據,則很難在筆記本中識別錯誤源。

  • Difficult to debug: When you get an error in your code, it is difficult to know whether the reason for the error is the code or the change in data. If the error is in the code, which part of the code is causing the problem?

    難以調試:當您得到 代碼中的錯誤,很難知道錯誤的原因是代碼還是數據更改。 如果錯誤出在代碼中,則代碼的哪一部分導致了問題?

  • Not ideal for production: Jupyter Notebook does not play very well with other tools. It is not easy to run the code from Jupyter Notebook while using other tools.

    對于生產而言并不理想: Jupyter Notebook在與其他工具配合使用時效果不佳。 使用其他工具時,從Jupyter Notebook運行代碼并不容易。

I knew there must be a better way to handle my code so I decided to give scripts a try. These are the benefits I found when using scripts:

我知道必須有一種更好的方式來處理我的代碼,所以我決定嘗試一下腳本。 這些是我在使用腳本時發現的好處:

有組織的 (Organized)

The cells in Jupyter Notebook make it difficult to organize the code into different parts. With a script, we could create several small functions with each function specifies what the code does like this

Jupyter Notebook中的單元格使得很難將代碼組織成不同的部分。 使用腳本,我們可以創建幾個小函數,每個函數指定代碼的功能,如下所示

Image for post

Better yet, if these functions could be categorized in the same category such as functions to process the data, we could put them in the same class!

更好的是,如果可以將這些函數歸為同一類,例如處理數據的函數,我們可以將它們歸為同一類!

Image for post

Whenever we want to process our data, we know the functions in the class Preprocess can be used for this purpose.

每當我們要處理數據時,我們都知道Preprocess類中的函數可用于此目的。

鼓勵實驗 (Encourage Experiment)

When we want to experiment with a different approach to preprocess data, we could just add or remove a function by commenting out like this without being afraid to break the code! Even if we happen to break the code, we know exactly where to fix it.

當我們想嘗試另一種預處理數據的方法時,我們可以通過注釋掉這樣的方式來添加或刪除函數,而不必擔心破壞代碼! 即使我們碰巧破壞了代碼,我們也知道在哪里修復它。

Image for post

We could also experiment with different parameters by changing the input of the functions. For example, if we want to see how different methods of resampling my Pandas series affect my results, we could just switch from method_of_resample='sum’ to method_of_resample= 'average'. How neat!

我們還可以通過更改函數的輸入來試驗不同的參數。 例如,如果要查看對熊貓系列進行重采樣的不同方法如何影響我的結果,可以將其從method_of_resample='sum'切換到method_of_resample= 'average' 。 多么整潔!

Image for post

You can still use functions in a notebook, but when your number of functions gets really big, you might want to split the functions in different notebooks. Importing functions across different notebook is not easy.

您仍然可以在筆記本中使用功能,但是當功能數量真的很大時,您可能希望將功能拆分到不同的筆記本中。 跨不同筆記本導入功能并不容易。

重現性的理想選擇 (Ideal for Reproducibility)

With classes and functions, we could make the code general enough so that it will be able to work with other data.

使用類和函數,我們可以使代碼足夠通用,以便能夠與其他數據一起使用。

For example, if we want to drop different columns in my new data, we just need to change columns_to_drop to a list of columns, we want to drop and the code will run smoothly!

例如,如果我們想在新數據中刪除不同的列,我們只需要將columns_to_drop更改為列的列表,我們就可以刪除并且代碼將平穩運行!

columns_to_drop = config.columns.to_dropdatetime_column = config.columns.datetime.sentimentdropna_columns = config.columns.drop_naprocessor = Preprocess(columns_to_drop, datetime_column, dropna_columns)

I can also create a pipeline that specifies steps to process and train the data! Once I have a pipeline, all I need to do is to use

我還可以創建一個管道來指定處理和訓練數據的步驟! 一旦有了管道,我要做的就是使用

pipline.fit_transform(data)

to apply the same processing to both the train and test data.

對火車和測試數據進行相同的處理。

易于調試 (Easy to Debug)

With functions, it is easier to test whether that function produces the output we expect. We can quickly spot out where in the code we should change to produce the output we want

使用函數,可以更輕松地測試該函數是否產生我們期望的輸出。 我們可以快速找出應該在代碼中更改的位置以產生所需的輸出

def extract_date_hour_minute(string: str):'''Extract data hour and minute from datetime string'''try:return string[:16]except TypeError:return np.nandef test_extract_date_hour_minute():'''Test whether the function extract date, hour, and minute '''string = '2020-07-30T23:25:31.036+03:00'assert extract_date_hour_minute(string) == '2020-07-30T23:25'

If all of the tests pass but there is still an error in running our code, we know the data is where we should look next.

如果所有測試都通過了,但是在運行我們的代碼時仍然存在錯誤,那么我們知道數據是我們下一步應該去的地方。

For example, after passing the test above, I still have a TypeError when running the script, which gives me the idea that my data has null values. I just need to take care of that to run the code smoothly.

例如,通過上述測試后,運行腳本時我仍然遇到TypeError,這使我想到了我的數據具有空值。 我只需要注意這一點即可順利運行代碼。

生產的理想選擇 (Ideal for Production)

We can use different functions in multiple scripts on top of something else like this

我們可以在類似這樣的其他東西的多個腳本中使用不同的功能

from preprocess import preprocess
from model import run_model
from predict import predictdef main(config):df = preprocess(config)df = run_model(config)df, df_scale, min_day, max_day, accuracy = predict(df, config)

or to add a config file to control the values of the variables. This prevents us from wasting time tracking down a specific variable in the code just to change its value.

或添加配置文件以控制變量的值。 這樣可以避免我們浪費時間跟蹤代碼中的特定變量以更改其值。

columns:to_drop:#- keywords#- entities- code- error- warningsbinary_columns: - sentiment - Diffdatetime:time: Date sentiment: crawleddrop_na: - sentiment- usage- crawled- emotionto_predict: sentiment

We could also easily add tools to track the experiment such as MLFlow or tools to handle configuration such as Hydra.cc!

我們還可以很容易地添加工具來跟蹤實驗,如MLFlow或工具來處理配置,如Hydra.cc !

我不喜歡使用Jupyter Notebook的想法,直到我將自己推出舒適區 (I didn’t like the Idea of Using Jupyter Notebook until I Pushed myself out of my Comfort Zone)

I used to use Jupyter Notebook all the time. When some data scientists advise me to switch from Jupyter Notebook to script to prevent some problems listed above, I didn’t understand and felt resistant to do so. I didn’t like the uncertainty of not being able to see the outcome when I run the cell.

我曾經一直使用Jupyter Notebook。 當一些數據科學家建議我從Jupyter Notebook切換到腳本以防止上面列出的某些問題時,我并不理解,并且對此感到抵觸。 我不喜歡在運行單元時無法看到結果的不確定性。

But the disadvantage of Jupyter Notebook grew as I started my first real data science project in my new company so I decided to push myself out of my comfort zone and experiment with scripts.

但是Jupyter Notebook的劣勢隨著我在新公司中開始第一個真實數據科學項目而變得越來越嚴重,因此我決定將自己從舒適的領域中脫身出來,并嘗試使用腳本。

In the beginning, I felt uncomfortable but started to notice the benefits of using scripts. I started to feel more organized when my code is organized into different functions, classes, and into multiple scripts with each script serving different purposes such as preprocessing, training, and testing.

一開始,我感到不舒服,但是開始注意到使用腳本的好處。 當我的代碼被組織成不同的函數,類和多個腳本,并且每個腳本具有不同的目的(例如預處理,培訓和測試)時,我開始變得井井有條。

所以,您是否建議我停止使用Jupyter Notebook? (So are you Suggesting me to Stop Using Jupyter Notebook?)

Don’t get me wrong. I still use Jupyter Notebook if my code is small and if I don’t plan to put my code into production. I use Jupyter Notebook when I want to explore and visualize the data. I also use it to explain how to use some python libraries. For example, I write use mostly Jupyter Notebooks in this repository as the medium to explain the code mentioned in all of my articles.

不要誤會我的意思。 如果我的代碼很小并且我不打算將代碼投入生產,我仍然會使用Jupyter Notebook。 當我想瀏覽和可視化數據時,我使用Jupyter Notebook。 我也用它來解釋如何使用一些python庫。 例如,我在這個存儲庫中主要使用Jupyter Notebooks作為媒介來解釋我所有文章中提到的代碼。

If you don’t feel comfortable with coding everything in scripts, you could use both scripts and Jupyter Notebook for different purposes. For example, you could create classes and functions in scripts then import them in the notebook so that the notebook is less messy.

如果您不滿意用腳本編寫所有代碼,則可以將腳本和Jupyter Notebook都用于不同的目的。 例如,您可以在腳本中創建類和函數,然后將其導入筆記本中,以使筆記本不那么混亂。

Another alternative is to turn the notebook into the script after writing the notebook. I personally don't prefer this approach because it often takes me longer to organize the code in my notebook such as put them into functions and classes and write test functions.

另一種選擇是在編寫筆記本后將筆記本變成腳本。 我個人不喜歡這種方法,因為通常需要我花費更長的時間在筆記本中組織代碼,例如將它們放入函數和類中以及編寫測試函數。

I find writing a small function then writing a small test function is faster and safer. If I happen to want to speeds up my code with the new Python library, I could use the test function I already wrote to make sure it still works as I expected.

我發現編寫一個小的函數然后編寫一個小的測試函數會更快,更安全。 如果我碰巧想用新的Python庫加速代碼,則可以使用已經編寫的測試函數來確保它仍然可以按預期工作。

With that being said, I believe there are more ways to solve the disadvantage of Jupyter Notebook than what I mentioned here such as how Netflix uses put the notebook into production and schedule the notebook to run at a certain time.

話雖這么說,我相信比我在這里提到的解決Jupyter Notebook的缺點還有更多的方法,例如Netflix如何使用Netflix將筆記本電腦投入生產并安排筆記本電腦在特定時間運行 。

結論 (Conclusion)

Everybody has their own way to make their workflow more efficient and to me, it is to leverage the utility of scripts. If you have just switched from Jupyter Notebook to script, it might not be intuitive to write code in scripts, but trust me, you will get used to using scripts eventually.

每個人都有自己的方法來提高工作流程的效率,對我來說,這是利用腳本的實用程序。 如果您剛剛從Jupyter Notebook切換到腳本,那么用腳本編寫代碼可能并不直觀,但是請相信我,您最終將習慣于使用腳本。

Once that happens, you will start to realize many benefits of the scripts over the messy Jupyter Notebook and want to write most of your code in scripts.

一旦發生這種情況,相對于凌亂的Jupyter Notebook,您將開始意識到腳本的許多優點,并希望將大多數代碼編寫在腳本中。

If you don’t feel comfortable with the big change, start small.

如果您對較大的變化不滿意,請從小處著手。

Big changes start with small steps

大變化始于小步

I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.

我喜歡寫有關基本數據科學概念的文章,并喜歡使用不同的算法和數據科學工具。 您可以在LinkedIn和Twitter上與我聯系。

Star this repo if you want to check out the codes for all of the articles I have written. Follow me on Medium to stay informed with my latest data science articles like these

如果您想查看我編寫的所有文章的代碼,請給此回購加注星號。 在Medium上關注我,以了解有關這些最新數據科學文章的最新信息

翻譯自: https://towardsdatascience.com/5-reasons-why-you-should-switch-from-jupyter-notebook-to-scripts-cb3535ba9c95

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/390563.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/390563.shtml
英文地址,請注明出處:http://en.pswp.cn/news/390563.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

leetcode 1833. 雪糕的最大數量

夏日炎炎,小男孩 Tony 想買一些雪糕消消暑。 商店中新到 n 支雪糕,用長度為 n 的數組 costs 表示雪糕的定價,其中 costs[i] 表示第 i 支雪糕的現金價格。Tony 一共有 coins 現金可以用于消費,他想要買盡可能多的雪糕。 給你價格…

MVC架構 -- 初學試水選課管理系統

項目文件網站地址:http://www.gegecool.cn:90/ 第一次對MVC 進行轉載于:https://www.cnblogs.com/wtusoso/p/8032120.html

rest api 示例2_REST API教程– REST Client,REST Service和API調用通過代碼示例進行了解釋

rest api 示例2Ever wondered how login/signup on a website works on the back-end? Or how when you search for "cute kitties" on YouTube, you get a bunch of results and are able to stream off of a remote machine?有沒有想過網站上的登錄/注冊在后端如…

win10子系統linux編譯ffmpeg

android-ndk-r14b(linux版) ffmpeg-4.0 開啟win10子系統(控制面板-》程序和功能-》啟用或關閉Windows功能 然后在 適用與 Linux 的 Windows 子系統前面打勾) 然后點擊確定,等待安裝,電腦會重啟 然后在win10應用商店 搜索ubuntu安裝…

ip登錄打印機怎么打印_不要打印,登錄。

ip登錄打印機怎么打印Often on Python, especially as a beginner, you might print( ) a variable in order to see what is happening in your program. It is possible if you rely on too many print statements throughout your program you will face the nightmare of h…

leetcode 451. 根據字符出現頻率排序

給定一個字符串,請將字符串里的字符按照出現的頻率降序排列。 示例 1:輸入: "tree"輸出: "eert"解釋: e出現兩次,r和t都只出現一次。 因此e必須出現在r和t之前。此外,"eetr"也是一個有效的答案。 示例 2:輸入…

Spring-Security 自定義Filter完成驗證碼校驗

Spring-Security的功能主要是由一堆Filter構成過濾器鏈來實現,每個Filter都會完成自己的一部分工作。我今天要做的是對UsernamePasswordAuthenticationFilter進行擴展,新增一個Filter,完成對登錄頁面的校驗碼的驗證。下面先給一張過濾器的說明…

如何使用Ionic和Firebase在短短三天內創建冠狀病毒跟蹤器應用程序

I am really fond of Hybrid App technologies – they help us achieve so much in a single codebase. Using the Ionic Framework, I developed a cross-platform mobile solution for tracking Coronavirus cases in just 3 days. 我真的很喜歡Hybrid App技術-它們可以幫助…

二、Java面向對象(7)_封裝思想——this關鍵字

2018-04-30 this關鍵字 什么是this: 表示當前對象本身,或當前類的一個實例,通過 this 可以調用本對象的所有方法和屬性。 this主要存在于兩個地方: 1)構造函數:此時this表示調用當前創建的對象 2)成員方法中…

機器學習模型 非線性模型_調試機器學習模型的終極指南

機器學習模型 非線性模型You’ve divided your data into a training, development and test set, with the correct percentage of samples in each block, and you’ve also made sure that all of these blocks (specially development and test set) come from the same di…

leetcode 645. 錯誤的集合

集合 s 包含從 1 到 n 的整數。不幸的是,因為數據錯誤,導致集合里面某一個數字復制了成了集合里面的另外一個數字的值,導致集合 丟失了一個數字 并且 有一個數字重復 。 給定一個數組 nums 代表了集合 S 發生錯誤后的結果。 請你找出重復出…

Linux環境變量總結

現在每天測試到時候會與Linux打交道,自然也會用到環境變量了。看了網上幾篇文章,結合自己到實踐和看法,總結以下Linux的環境變量吧。一、什么是環境變量?環境變量相當于給系統或用戶應用程序設置的一些參數, 具體起什么作用這當然…

目錄指南中的Python列表文件-listdir VS system(“ ls”)通過示例進行解釋

🔹歡迎 (🔹 Welcome) If you want to learn how these functions work behind the scenes and how you can use their full power, then this article is for you.如果您想了解這些功能在后臺如何工作以及如何充分利用它們的功能,那么本文適合…

Java多線程并發學習-進階大綱

1、synchronized 的實現原理以及鎖優化?2、volatile 的實現原理?3、Java 的信號燈?4、synchronized 在靜態方法和普通方法的區別?5、怎么實現所有線程在等待某個事件的發生才會去執行?6、CAS?CAS 有什么缺陷…

大數據定律與中心極限定理_為什么中心極限定理對數據科學家很重要?

大數據定律與中心極限定理數據科學 (Data Science) The Central Limit Theorem is at the center of statistical inference what each data scientist/data analyst does every day.中心極限定理是每個數據科學家/數據分析師每天所做的統計推斷的中心。 Central Limit Theore…

useEffect語法講解

useEffect語法講解 用法 useEffect(effectFn, deps)能力 useEffect Hook 相當于 componentDidMount,componentDidUpdate 和 componentWillUnmount 這三個函數的組合。 可以模擬渲染后、更新后、銷毀三個動作。 案例演示 渲染后更新標題 useEffect(()>{doc…

leetcode 726. 原子的數量

給定一個化學式formula(作為字符串),返回每種原子的數量。 原子總是以一個大寫字母開始,接著跟隨0個或任意個小寫字母,表示原子的名字。 如果數量大于 1,原子后會跟著數字表示原子的數量。如果數量等于 1…

web相關基礎知識1

2017-12-13 09:47:11 關于HTML 1.絕對路徑和相對路徑 相對路徑:相對于文件自身為參考。 (工作中一般是使用相對路徑) 這里我們用html文件為參考。如果說html和圖片平級,那直接使用src 如果說圖片在和html平級的文件夾里面&#xf…

JavaScript循環:標簽語句,繼續語句和中斷語句說明

標簽聲明 (Label Statement) The Label Statement is used with the break and continue statements and serves to identify the statement to which the break and continue statements apply. Label語句與break和continue語句一起使用,用于標識break和continue語…

馬約拉納費米子:推動量子計算的“天使粒子”

據《人民日報》報道,以華人科學家為主體的科研團隊找到了正反同體的“天使粒子”——馬約拉納費米子,從而結束了國際物理學界對這一神秘粒子長達80年的漫長追尋。該成果由加利福尼亞大學洛杉磯分校何慶林、王康隆課題組,美國斯坦福大學教授張…