How to Use the Magic of Pipelines
Surely you have heard of pipelines or ETL (Extract, Transform, Load), seen a pipeline method in some library, or at least heard of tools for creating pipelines. However, you aren't using them yet. So, let me introduce you to the fantastic world of pipelines.
Before learning how to use them, we have to understand what a pipeline is.
A pipeline is a way to wrap and automate a process, which means the process will always be executed in the same way, with the same functions and parameters, and the outcome will always meet the predetermined standard.
So, as you may guess, the goal is to apply pipelines at every development stage, to guarantee that the process as executed never ends up different from the one that was designed.

There are two particularly important uses of pipelines in data science, whether in production or during modelling/exploration. Moreover, they make our lives much easier.
The first is the data ETL. In production its ramifications are far greater, and so is the level of detail invested in it; even so, it can be summed up as:
E (Extract): How am I going to collect the data? From one or several sites, one or more databases, or even a simple pandas CSV? We can think of this stage as the data reading phase.
T (Transform): What do I need to do for the data to become usable? Think of this as the conclusion of the exploratory data analysis: once we know what to do with the data (remove features, transform categorical variables into binary data, clean strings, etc.), we compile it all into a function that guarantees the cleaning will always be done in the same way.
L (Load): This simply means saving the data in the desired format (CSV, database, etc.) somewhere, either in the cloud or locally, to use anytime, anywhere.
Creating this process is so simple that you can do it just by grabbing that exploratory data analysis notebook: put the pandas read_csv inside a function; write the several data preparation functions and compile them into one; and finally create a function that saves the result of the previous one.
With this in place, we can create a main function in a Python file and execute the created ETL with one line of code, without risking any changes. Not to mention the advantage of being able to change/update everything in a single place.
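As a minimal sketch, assuming hypothetical file paths and column names, the whole data ETL could look like this:

import pandas as pd

def extract(path):
    # E: read the raw data (here, a simple pandas CSV)
    return pd.read_csv(path)

def transform(df):
    # T: compile the cleaning decided during the exploratory analysis;
    # the column names below are placeholders for illustration
    df = df.drop(columns=['unused_feature'])         # remove features
    df = pd.get_dummies(df, columns=['category'])    # categorical -> binary
    df['name'] = df['name'].str.strip().str.lower()  # clean strings
    return df

def load(df, path):
    # L: save the result in the desired format, locally or in the cloud
    df.to_csv(path, index=False)

def run_etl(src, dst):
    # the one line of code that executes the created ETL
    load(transform(extract(src)), dst)

if __name__ == '__main__':
    run_etl('raw_data.csv', 'clean_data.csv')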
The second, and likely the most advantageous, pipeline helps solve one of the most common problems in machine learning: parametrization.
How many times have we faced these questions: which model to choose? Should I use normalization or standardization?

Libraries such as scikit-learn offer a Pipeline class in which we can place several models with their respective parameter ranges, add pre-processing such as normalization, standardization or even a custom step, and attach cross-validation at the end. All the possibilities are then tested and the results returned, or even only the best one, as in the following code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def build_model(X, y):
    # `tokenize` is a custom tokenizer defined elsewhere in the project
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
    ])

    # specify parameters for grid search
    parameters = {
        # 'vect__ngram_range': ((1, 1), (1, 2)),
        # 'vect__max_df': (0.5, 0.75, 1.0),
        # 'vect__max_features': (None, 5000, 10000),
        # 'tfidf__use_idf': (True, False),
        # 'clf__estimator__n_estimators': [50, 100, 150, 200],
        # 'clf__estimator__max_depth': [20, 50, 100, 200],
        # 'clf__estimator__random_state': [42]
    }

    # create grid search object
    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=1)
    return cv
At this stage, the sky is the limit! There is no limit to the parameters inside the pipeline. However, depending on the dataset and the chosen parameters, it can take an eternity to finish. Even so, it is a very good tool to focus the research.
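To make the normalization-versus-standardization question concrete, here is a minimal sketch, not from the original project, of how a single Pipeline plus GridSearchCV can compare scalers and models in one search; the step names, models and parameter values are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),                # placeholder, swapped by the grid
    ('clf', LogisticRegression(max_iter=1000)),  # placeholder as well
])

param_grid = [
    {'scaler': [StandardScaler(), MinMaxScaler()],
     'clf': [LogisticRegression(max_iter=1000)],
     'clf__C': [0.1, 1.0, 10.0]},
    {'scaler': [StandardScaler(), MinMaxScaler()],
     'clf': [RandomForestClassifier()],
     'clf__n_estimators': [50, 100]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=1)
# search.fit(X, y)            # X, y: your features and labels
# print(search.best_params_)  # best scaler/model/parameter combination

Because whole pipeline steps can be listed in the grid like any other hyperparameter, the scaler choice is tuned together with the model and its parameters.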
We can add a function to read the data that comes out of the data ETL, and another to save the created model, and we have a model ETL, wrapping up this stage.
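A sketch of that model ETL, assuming the cleaned data has a text column and binary label columns, and that joblib is used for persistence (paths and column names are illustrative):

import joblib
import pandas as pd

def run_model_etl(data_path, model_path):
    df = pd.read_csv(data_path)                   # read the data ETL's output
    X, y = df['text'], df.drop(columns=['text'])  # hypothetical column layout
    model = build_model(X, y)                     # the grid search defined above
    model.fit(X, y)
    joblib.dump(model.best_estimator_, model_path)  # save the created model

run_model_etl('clean_data.csv', 'model.pkl')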
Beyond everything we have talked about, the greatest advantages of creating pipelines are the replicability and maintainability of your code, which improve dramatically.
So, what are you waiting for to start creating pipelines?
An example of these can be found in this project.
Translated from: https://towardsdatascience.com/how-to-use-the-magic-of-pipelines-6e98d7e5c9b7