Which Python Data Science Package Should I Use When?
Python is the most popular language for data science. Unfortunately, it can be tricky to know which of the many data science libraries to use when.
Understanding when to use which library is key for quickly getting up to speed. In this article, I’ll give you the lay of the land for important Python data science libraries. 😀
Every package you’ll see is free and open source software. 👍 Thank you to all the folks who create, support, and maintain these projects! 🎉 If you’re interested in learning about contributing fixes to open source projects, here’s a good guide. And if you’re interested in the foundations that support these projects, I wrote an overview here.

Let’s get to it! 🚀
Pandas
Pandas is a workhorse to help you understand and manipulate your data. Use pandas to manipulate tabular data (think spreadsheet tables). Pandas is great for data cleaning, descriptive statistics, and basic visualizations.
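Here’s a quick sketch of that workflow. The file and column names (sales.csv, order_date, region, revenue) are made up for illustration:

```python
import pandas as pd

# Load a tabular dataset (hypothetical file) into a DataFrame
df = pd.read_csv("sales.csv")

# Peek at the data and get descriptive statistics
print(df.head())
print(df.describe())

# Basic cleaning: drop rows with missing values, fix a column's type
df = df.dropna()
df["order_date"] = pd.to_datetime(df["order_date"])

# Basic visualization straight from pandas
df.groupby("region")["revenue"].sum().plot(kind="bar")
```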

Pandas is relatively brain-friendly, although the API is gigantic. Check out my book on pandas if you want to get started with the most important parts of the API.
Unlike a SQL database, pandas stores all your data in-memory. It’s sort of like a hybrid between Microsoft Excel and a SQL database. Pandas makes operations with lots of data fast and repeatable.
The amount of memory on your machine constrains how many rows and columns pandas can handle. As a rough guideline, if your data has fewer than thousands of columns and hundreds of millions of rows, pandas should work well on most computers.
Like a real panda bear, pandas is warm and fuzzy to work with. 🐼

When you have more data than pandas can handle, you might want to drop down to NumPy.
NumPy
NumPy ndarrays are like more powerful Python lists. They are the data structure on which the edifice of machine learning is built. They hold the data you need in as many dimensions as you need.

Do you have video data with three color channels for each pixel and lots of frames? No problem. 😀
NumPy doesn’t have handy methods for time series data and strings like pandas does. In fact, each NumPy ndarray can have only one data type (hat tip to Kevin Markham for suggesting I include that differentiator).
For tabular data, NumPy is also harder for your brain to work with than pandas. You can’t do things like easily display column names in tables, as you can in pandas.
NumPy has a bit more speed/memory efficiency than pandas, because it doesn’t have the additional overhead. However, there are other approaches that might scale better if you have really big data. I have an outline on that topic so let me know if you’d be interested in hearing about it on Twitter. 👍
What else is NumPy good for?
- mathematical functions for ndarrays.
- basic statistical functions for ndarrays.
- making random variables from common distributions. NumPy has 27 distributions to randomly sample from.
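Here’s a minimal sketch of those capabilities, just to make them concrete:

```python
import numpy as np

# An ndarray can have as many dimensions as you need,
# e.g. video data shaped (frames, height, width, color channels)
video = np.zeros((10, 64, 64, 3))

# Mathematical and basic statistical functions on ndarrays
x = np.arange(12).reshape(3, 4)
print(np.sqrt(x))
print(x.mean(axis=0), x.std())

# Random samples from common distributions
rng = np.random.default_rng(42)
normal_draws = rng.normal(loc=0.0, scale=1.0, size=1_000)
poisson_draws = rng.poisson(lam=3.0, size=1_000)
```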
NumPy is like pandas without convenience functions and column names, but with some speed gains. 🚀
Scikit-learn
The scikit-learn library is the Swiss Army Knife of machine learning. If you are doing a prediction task that does not involve deep learning, scikit-learn is what you want to use. It handles NumPy arrays with no problems and works pretty well with pandas data structures.

Scikit-learn pipelines and model selection functions are great for preparing and manipulating data in ways that avoid accidentally peeking at your hold-out (test set) data.
The scikit-learn API is very consistent for preprocessing transformers and for estimators. This makes it relatively easy to search for the best results over many machine learning algorithms. And it makes it easier to wrap your head around the library. 🧠
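Here’s a small sketch of what that consistency looks like in practice, using a built-in dataset and a simple pipeline:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fit on the training folds only inside the pipeline,
# which helps avoid peeking at hold-out data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The consistent estimator API makes hyperparameter search straightforward
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```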
Scikit-learn accommodates multi-threading so you can speed up your searches. However, it wasn’t built for GPUs, so it can’t take advantage of speedups there.
Scikit-learn also contains handy basic NLP functions.

If you want to do machine learning, being familiar with scikit-learn is essential.
The next two libraries are primarily used for deep neural networks. They work well with GPUs, TPUs, and CPUs.
TensorFlow
TensorFlow is the most popular deep learning library. It is especially common in industry. It was developed by Google.

The Keras high-level API is now tightly integrated with TensorFlow as of TensorFlow 2.0.
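Here’s a minimal tf.keras sketch, trained on made-up data just to show the shape of the API:

```python
import numpy as np
import tensorflow as tf

# Keras is available as tf.keras in TensorFlow 2.x
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data just to show the training call
X = np.random.rand(100, 20).astype("float32")
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```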
In addition to working on CPU chips, TensorFlow can use GPUs and TPUs. These matrix-algebra optimized chips provide big speedups for deep learning.
PyTorch
PyTorch is the second most popular deep learning library and is now the most common in academic research. It was developed by Facebook and has been growing in popularity. You can see my article on the topic here.

PyTorch and TensorFlow now provide very similar functionality. They both have data structures, called tensors, that are similar to NumPy ndarrays. Tensors can be converted into ndarrays easily. Both packages also contain some basic statistical functions.
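A quick sketch of that tensor/ndarray interplay:

```python
import numpy as np
import torch

# Tensors convert to and from NumPy ndarrays easily
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)   # shares memory with the ndarray
back = t.numpy()

# Tensors support similar math and basic statistics
print(t.mean(), t.sum(dim=0))

# Move the tensor to a GPU if one is available
if torch.cuda.is_available():
    t = t.to("cuda")
```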
PyTorch’s API is generally considered a bit more pythonic than TensorFlow’s API.
Skorch, FastAI, and PyTorch Lightning are packages that reduce the amount of code needed to use PyTorch models. PyTorch/XLA lets you use PyTorch with TPUs.
Both PyTorch and TensorFlow will allow you to make top-notch deep learning models. 👍
Statsmodels
Statsmodels is the statistical modeling library. It’s the place for doing inferential frequentist statistics.

Want to run a statistical test and get some p-values? Statsmodels is your tool. 🙂
Statisticians and scientists coming from R can use the statsmodels formula API for a smooth transition to Python land.
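Here’s a small sketch of the formula API on made-up data (the variable names are just for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: does hours studied predict exam score?
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 72, 79, 83],
})

# R-style formula: fit an OLS regression and inspect coefficients and p-values
model = smf.ols("score ~ hours", data=df).fit()
print(model.summary())
```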
In addition to common statistical tests such as the t-test, ANOVA, and linear regression, what is statsmodels good for?
- test how closely your data matches a well-known distribution.
- do time series modeling with ARIMA, Holt-Winters, and other algorithms.
Scikit-learn has some overlap with statsmodels when it comes to common formulas such as linear regression. However, the APIs are different. Also, scikit-learn is much more focused on prediction and statsmodels is much more focused on inference.
Statsmodels is built on NumPy and SciPy and plays nicely with pandas. 🙂
Speaking of SciPy, let’s look at when to use it.
SciPy 🔬
“SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.” — the docs.

SciPy is like NumPy’s twin. Many NumPy array functions can alternatively be called through SciPy. The two packages even share the same docs website.
SciPy sparse matrices are used in scikit-learn. A sparse matrix is one that is optimized to use far less memory than a regular, dense matrix when most elements are zeros.
SciPy contains common constants and linear algebra capabilities. The scipy.stats sub-module is used for probability distributions, descriptive stats, and statistical tests. It has 125 distributions to randomly sample from, nearly 100 more than NumPy. 😲 However, unless you are doing lots of stats, as a practicing data scientist, you’ll likely be fine with the distributions in NumPy.
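A quick sketch of scipy.stats in action:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.3, scale=1.0, size=200)

# A two-sample t-test returns the statistic and a p-value
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)

# One of scipy.stats' many distributions: a gamma distribution object
gamma = stats.gamma(a=2.0)
print(gamma.mean(), gamma.pdf(1.5), gamma.rvs(size=5, random_state=0))
```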
If statsmodels or NumPy doesn’t have the functionality you need, then go look in SciPy. 👀
Dask
Dask has an API that mimics pandas and NumPy. Use Dask when you want to use pandas or NumPy but have more data than you can keep in memory.
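Here’s a minimal sketch of what that looks like. The file pattern and column names are made up for illustration:

```python
import dask.dataframe as dd

# A set of CSVs too large to fit in memory at once (hypothetical files)
df = dd.read_csv("events-*.csv")

# Operations look like pandas but build a lazy task graph
totals = df.groupby("user_id")["amount"].sum()

# Nothing runs until you ask for a result
result = totals.compute()
```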

Dask can also speed up calculations for big datasets. It can run work in parallel across multiple cores and machines. You can combine Dask with Rapids to gain the performance benefits of distributed computing over GPUs. 👍
PyMC3
PyMC3 is the package for Bayesian statistics. Like Markov Chain Monte Carlo (MCMC) simulations? PyMC3 is your jam. 🎉
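Here’s a minimal sketch of a PyMC3 model, estimating the mean and spread of some made-up observations with MCMC:

```python
import numpy as np
import pymc3 as pm

# Made-up observations
data = np.random.normal(loc=2.0, scale=1.0, size=100)

with pm.Model():
    # Priors for the unknown mean and standard deviation
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)

    # Likelihood of the observed data
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)

    # Draw posterior samples with MCMC (NUTS by default)
    trace = pm.sample(1000, tune=1000)
```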

I find the PyMC3 API a bit confusing, but it’s a powerful library. That might be because I don’t use it a lot.
Other Popular Data Science Packages
I won’t dive deeply into visualization, NLP, gradient boosting, time series, or model serving libraries, but I’ll highlight a few popular packages in each area.
Visualization libraries 📊
There are gobs of visualization libraries in Python. Matplotlib, seaborn, and Plotly are three of the most popular. I go through some of the options toward the end of this article.
NLP libraries 🔠
Natural language processing (NLP) is a huge and important area of machine learning. Either spaCy or NLTK will have most of the functionality you’ll need. Both are very popular.
Gradient boosting regression tree libraries 🌳
LightGBM is the most popular gradient boosting package. Scikit-learn has clones of its algorithms. XGBoost and CatBoost are other boosting algorithm packages that are similar to LightGBM. If you look at the leaderboard of a Kaggle machine learning competition and a deep learning algorithm isn’t a good match for the problem, you’ll likely see one of these gradient boosting libraries used by the winners.
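Here’s a tiny sketch of LightGBM’s scikit-learn-style interface, using a built-in scikit-learn dataset just for illustration:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit and score like any scikit-learn estimator
model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```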
Time series libraries 📅
Pmdarima makes fitting an ARIMA time series model less painful. However, the process of choosing the hyperparameters is not entirely automated.
Prophet is another package for making predictions with time series data. “Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.” — the docs. It was created by Facebook. The API and docs are user-friendly. It’s worth a shot if your data fit the description above.
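Here’s a minimal sketch of the Prophet workflow. The CSV file is a made-up stand-in for your own history, which Prophet expects as a DataFrame with ds (date) and y (value) columns:

```python
import pandas as pd
from prophet import Prophet  # older releases: from fbprophet import Prophet

# Hypothetical daily history with columns ds and y
df = pd.read_csv("daily_sales.csv")

m = Prophet()
m.fit(df)

# Forecast 90 days past the end of the history
future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```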
Model serving libraries 🚀
When it comes to model serving, Flask, FastAPI, and Streamlit are three popular libraries to actually do something with your model predictions. 😉 Flask is a battle-tested, basic framework for making an API or serving a website. FastAPI makes setting up REST endpoints faster and easier. Streamlit makes it quick to serve a model in a single-page app. If you’re interested in learning more about Streamlit, I wrote a guide to getting started with it here.
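As a small sketch of the FastAPI flavor of this, with a made-up stand-in for a real model call:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class HouseFeatures(BaseModel):
    rooms: int
    area: float

@app.post("/predict")
def predict(features: HouseFeatures):
    # Stand-in for a real call like model.predict(...)
    price = 50_000 + 30_000 * features.rooms + 1_000 * features.area
    return {"predicted_price": price}

# Run with: uvicorn main:app --reload  (assuming this file is main.py)
```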

Wrap
Here’s a quick recap of when to use which major Python data science library:
- pandas for tabular data exploration and manipulation.
- NumPy for random samples from common distributions, to save memory, or to speed up operations.
- scikit-learn for machine learning.
- TensorFlow or PyTorch for deep learning.
- statsmodels for statistical modeling.
- SciPy for statistical tests or distributions you can’t find in NumPy or statsmodels.
- Dask when you want pandas or NumPy but have really big data.
- PyMC3 for Bayesian stats.
I hope you enjoyed this tour of key Python data science packages. If you did, please share it on your favorite social media so other folks can find it, too. 😀
Now you hopefully have a clearer mental model of how the different Python data science libraries relate to each other and when to reach for each of them.
I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍


Happy exploring! 😀
Translated from: https://towardsdatascience.com/which-python-data-science-package-should-i-use-when-e98c701364c