How to Use the Magic of Pipelines
Surely you have heard of pipelines or ETL (Extract, Transform, Load), seen a pipeline method in some library, or at least heard of tools for creating pipelines. However, you aren't using them yet. So, let me introduce you to the fantastic world of pipelines.
Before learning how to use them, we have to understand what a pipeline is.
A pipeline is a way to wrap and automate a process, which means the process will always be executed in the same way, with the same functions and parameters, and the outcome will always meet the predetermined standard.
So, as you may guess, the goal is to apply pipelines at every development stage, to guarantee that the process as executed never ends up different from the one that was designed.

There are two particularly important uses of pipelines in data science, whether in production or during modelling/exploration. Moreover, they make our lives much easier.
The first is the data ETL. In production its ramifications are far greater, and so is the level of detail invested in it; even so, it can be summed up as:
E (Extract): How am I going to collect the data? From one or several sites, one or more databases, or even a simple pandas CSV? We can think of this stage as the data reading phase.
T (Transform): What do I need to do for the data to become usable? Think of this as the conclusion of the exploratory data analysis: once we know what to do with the data (remove features, transform categorical variables into binary data, clean strings, etc.), we compile it all into a function that guarantees the cleaning will always be done in the same way.
L (Load): This simply means saving the data in the desired format (CSV, database, etc.) somewhere, either in the cloud or locally, to use anytime, anywhere.
Creating this process is so simple that you can do it just by grabbing that exploratory data analysis notebook: put the pandas read_csv inside a function; write the several data preparation functions and compile them into one; and finally create a function that saves the result of the previous one.
With this in place, we can create a main function in a Python file and execute the created ETL with one line of code, without risking any changes. Not to mention the advantage of being able to change/update everything in a single place.
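As a minimal sketch, assuming hypothetical file paths and column names, the whole data ETL could look like this:

import pandas as pd

def extract(path):
    # E: read the raw data (here, a simple pandas CSV)
    return pd.read_csv(path)

def transform(df):
    # T: compile the cleaning decided during the exploratory analysis;
    # the column names below are placeholders for illustration
    df = df.drop(columns=['unused_feature'])         # remove features
    df = pd.get_dummies(df, columns=['category'])    # categorical -> binary
    df['name'] = df['name'].str.strip().str.lower()  # clean strings
    return df

def load(df, path):
    # L: save the result in the desired format, locally or in the cloud
    df.to_csv(path, index=False)

def run_etl(src, dst):
    # the one line of code that executes the created ETL
    load(transform(extract(src)), dst)

if __name__ == '__main__':
    run_etl('raw_data.csv', 'clean_data.csv')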
The second, and likely the most advantageous, pipeline helps solve one of the most common problems in machine learning: parametrization.
How many times have we faced these questions: which model to choose? Should I use normalization or standardization?

Libraries such as scikit-learn offer a Pipeline class in which we can place several models with their respective parameter ranges, add pre-processing such as normalization, standardization or even a custom step, and attach cross-validation at the end. All the possibilities are then tested and the results returned, or even only the best one, as in the following code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def build_model(X, y):
    # `tokenize` is a custom tokenizer defined elsewhere in the project
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
    ])

    # specify parameters for grid search
    parameters = {
        # 'vect__ngram_range': ((1, 1), (1, 2)),
        # 'vect__max_df': (0.5, 0.75, 1.0),
        # 'vect__max_features': (None, 5000, 10000),
        # 'tfidf__use_idf': (True, False),
        # 'clf__estimator__n_estimators': [50, 100, 150, 200],
        # 'clf__estimator__max_depth': [20, 50, 100, 200],
        # 'clf__estimator__random_state': [42]
    }

    # create grid search object
    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=1)
    return cv
At this stage, the sky is the limit! There is no limit to the parameters inside the pipeline. However, depending on the dataset and the chosen parameters, it can take an eternity to finish. Even so, it is a very good tool to focus the research.
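To make the normalization-versus-standardization question concrete, here is a minimal sketch, not from the original project, of how a single Pipeline plus GridSearchCV can compare scalers and models in one search; the step names, models and parameter values are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),                # placeholder, swapped by the grid
    ('clf', LogisticRegression(max_iter=1000)),  # placeholder as well
])

param_grid = [
    {'scaler': [StandardScaler(), MinMaxScaler()],
     'clf': [LogisticRegression(max_iter=1000)],
     'clf__C': [0.1, 1.0, 10.0]},
    {'scaler': [StandardScaler(), MinMaxScaler()],
     'clf': [RandomForestClassifier()],
     'clf__n_estimators': [50, 100]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=1)
# search.fit(X, y)            # X, y: your features and labels
# print(search.best_params_)  # best scaler/model/parameter combination

Because whole pipeline steps can be listed in the grid like any other hyperparameter, the scaler choice is tuned together with the model and its parameters.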
We can add a function to read the data that comes out of the data ETL, and another to save the created model, and we have a model ETL, wrapping up this stage.
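A sketch of that model ETL, assuming the cleaned data has a text column and binary label columns, and that joblib is used for persistence (paths and column names are illustrative):

import joblib
import pandas as pd

def run_model_etl(data_path, model_path):
    df = pd.read_csv(data_path)                   # read the data ETL's output
    X, y = df['text'], df.drop(columns=['text'])  # hypothetical column layout
    model = build_model(X, y)                     # the grid search defined above
    model.fit(X, y)
    joblib.dump(model.best_estimator_, model_path)  # save the created model

run_model_etl('clean_data.csv', 'model.pkl')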
Beyond everything we have talked about, the greatest advantages of creating pipelines are the replicability and maintainability of your code, which improve dramatically.
So, what are you waiting for to start creating pipelines?
An example of these can be found in this project.
Translated from: https://towardsdatascience.com/how-to-use-the-magic-of-pipelines-6e98d7e5c9b7