17 Strategies for Dealing with Data, Big Data, and Even Bigger Data
Dealing with big data can be tricky. No one likes out-of-memory errors. No one likes waiting for code to run. No one likes leaving Python. 🐍
Don't despair! In this article I'll provide tips and introduce up-and-coming libraries to help you efficiently deal with big data. I'll also point you toward solutions for code that won't fit into memory. And all while staying in Python. 👍
Python is the most popular language for scientific and numerical computing. Pandas is the most popular library for data cleaning and exploratory data analysis.
Using pandas with Python allows you to handle much more data than you could with Microsoft Excel or Google Sheets.

SQL databases are very popular for storing data, but the Python ecosystem has many advantages over SQL when it comes to expressiveness, testing, reproducibility, and the ability to quickly perform data analysis, statistics, and machine learning.
Unfortunately, if you are working locally, the amount of data that pandas can handle is limited by the amount of memory on your machine. And if you’re working in the cloud, more memory costs more money.
Regardless of where your code is running, you want operations to happen quickly so you can GSD (Get Stuff Done)! 😀
Things to always do
If you've ever heard or seen advice on speeding up code, you've seen the warning: Don't prematurely optimize!
This is good advice. But it’s also smart to know techniques so you can write clean fast code the first time. 🚀
這是個好建議。 但是了解技術也很聰明,因此您可以在第一時間編寫干凈的快速代碼。 🚀

The following are three good coding practices for any size dataset.
- Avoid nested loops whenever possible. Here's a brief primer on Big-O notation and algorithm analysis. One `for` loop nested inside another `for` loop generally leads to polynomial time calculations. If you have more than a few items to search through, you'll be waiting for a while. See a nice chart and explanation here.
- Use list comprehensions (and dict comprehensions) whenever possible in Python. Creating a list on demand is faster than loading the `append` attribute of the list and repeatedly calling it as a function (hat tip to the Stack Overflow answer here). However, in general, don't sacrifice clarity for speed, so be careful with nesting list comprehensions.
- In pandas, use built-in vectorized functions. The principle is really the same as the reason for dict comprehensions. Applying a function to a whole data structure at once is much faster than repeatedly calling a function.
If you find yourself reaching for `apply`, think about whether you really need to. It's looping over rows or columns. Vectorized methods are usually faster and take less code, so they're a win-win. 🚀
Avoid the other pandas Series and DataFrame methods that loop over your data: `applymap`, `iterrows`, and `itertuples`. Use the `replace` method on a DataFrame instead of any of those other options to save lots of time.
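Here's a small sketch of these habits side by side. The column names and values are placeholders, and exact timings will vary by machine:

```python
import numpy as np
import pandas as pd

# A list comprehension builds the list directly instead of repeatedly calling .append()
squares = [i ** 2 for i in range(1_000_000)]

# Vectorized column math operates on the whole column in one call
df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100})
df["with_tax"] = df["price"] * 1.08                          # vectorized: fast
df["with_tax_slow"] = df["price"].apply(lambda p: p * 1.08)  # loops under the hood: usually much slower

# Bulk substitution with .replace() instead of looping over rows yourself
grades = pd.DataFrame({"grade": ["a", "b", "a", "c"]})
grades["grade"] = grades["grade"].replace({"a": "A", "b": "B", "c": "C"})
print(grades)
```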
Notice that these suggestions might not hold for very small amounts of data, but in that case, the stakes are low, so who cares. 😉
This brings us to our most important rule
If you can, stay in pandas. 🐼
It’s a happy place. 😀
Don’t worry about these issues if you aren’t having problems and you don’t expect your data to balloon. But at some point, you’ll encounter a big dataset and then you’ll want to know what to do. Let’s see some tips.
Things to do with pretty big data (roughly millions of rows)

- Use a subset of your data to explore, clean, and make a baseline model if you're doing machine learning. Solve 90% of your problems fast and save time and resources. This technique can save you so much time!
- Load only the columns that you need with the `usecols` argument when reading in your DataFrame. Less data in = win!
- Use dtypes efficiently. Downcast numeric columns to the smallest dtypes that make sense with `pandas.to_numeric()`. Convert columns with low cardinality (just a few values) to a categorical dtype. Here's a pandas guide on efficient dtypes. (Several of the tips in this list are pulled together in a short sketch right after it.)
- Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine's cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument `n_jobs=-1` when doing cross-validation with GridSearchCV and many other classes.
- Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code here.
- Use `pd.eval` to speed up pandas operations. Pass the function your usual code as a string. It does the operation much faster. Here's a chart from tests with a 100-column DataFrame.
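Here's a minimal sketch that pulls several of the tips above together. It writes a toy CSV first so it's self-contained; the file name, column names, and model are placeholders for your own data and workflow:

```python
import numpy as np
import pandas as pd

# Build a toy CSV so the sketch is self-contained (a stand-in for your real data)
rng = np.random.default_rng(0)
pd.DataFrame({
    "id": np.arange(1_000_000),
    "price": rng.random(1_000_000) * 100,
    "state": rng.choice(["CA", "NY", "TX"], size=1_000_000),
    "notes": "not needed for this analysis",
}).to_csv("toy.csv", index=False)

# Load only the columns you need
df = pd.read_csv("toy.csv", usecols=["id", "price", "state"])

# Downcast numeric columns and convert low-cardinality strings to category
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["state"] = df["state"].astype("category")
print(df.memory_usage(deep=True))  # check how much smaller the frame is now

# Save in a faster binary format for next time (feather requires pyarrow)
df.to_feather("toy.feather")
df.to_pickle("toy.pkl")

# Let scikit-learn use every core during cross-validation with n_jobs=-1
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5_000, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    cv=3,
    n_jobs=-1,  # -1 means "all available cores"
)
search.fit(X, y)
print(search.best_params_)
```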

`df.query` is basically the same as `pd.eval`, but as a DataFrame method instead of a top-level pandas function.
See the docs because there are some gotchas.
Pandas is using numexpr under the hood. Numexpr also works with NumPy. Hat tip to Chris Conlan, who pointed me to numexpr in his book Fast Python. Chris's book is an excellent read for learning how to speed up your Python code. 👍
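Here's a minimal sketch of `pd.eval`, `df.eval`, and `df.query` (the DataFrame and column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 3), columns=["a", "b", "c"])

# pd.eval: pass your usual expression as a string; numexpr evaluates it efficiently
total = pd.eval("df.a + df.b * df.c")

# df.eval and df.query are the same idea as DataFrame methods
df["d"] = df.eval("a + b")
subset = df.query("a > 0.5 and c < 0.25")
print(len(subset))
```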
Things to do with really big data (roughly tens of millions of rows and up)

- Use numba. Numba gives you a big speed boost if you're doing mathematical calculations. Install numba and import it. Then use the `@numba.jit` decorator function when you need to loop over NumPy arrays and can't use vectorized methods. It works only with NumPy arrays. Use `.to_numpy()` on a pandas DataFrame to convert it to a NumPy array. (See the sketch after this list.)
- Use SciPy sparse matrices when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your data is mostly 0s or missing values, you can convert columns to sparse dtypes in pandas. Read more here.
- Use Dask to parallelize the reading of datasets into pandas in chunks. Dask can also parallelize data operations across multiple machines. It mimics a subset of the pandas and NumPy APIs. Dask-ML is a sister package to parallelize machine learning algorithms across multiple machines. It mimics the scikit-learn API. Dask plays nicely with other popular machine learning libraries such as XGBoost, LightGBM, PyTorch, and TensorFlow.
- Use PyTorch with or without a GPU. You can get really big speedups by using PyTorch on a GPU, as I found in this article on sorting.
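Here's a sketch of the numba tip from the list above: a plain Python loop over a NumPy array compiled with `@numba.jit`. The rolling-mean function and column name are just illustrative, not from any particular library:

```python
import numba
import numpy as np
import pandas as pd

@numba.jit(nopython=True)
def rolling_mean(values, window):
    """A plain loop that numba compiles to fast machine code."""
    out = np.empty(values.shape[0])
    acc = 0.0
    for i in range(values.shape[0]):
        acc += values[i]
        if i >= window:
            acc -= values[i - window]
            out[i] = acc / window
        else:
            out[i] = acc / (i + 1)
    return out

df = pd.DataFrame({"price": np.random.rand(5_000_000)})
# numba wants NumPy arrays, so convert the column with .to_numpy()
result = rolling_mean(df["price"].to_numpy(), 20)
print(result[:5])
```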
Things to keep an eye on/experiment with for dealing with big data in the future

The following three packages are bleeding edge as of mid-2020. Expect configuration issues and early stage APIs. If you are working locally on a CPU, these are unlikely to fit your needs. But they all look very promising and are worth keeping an eye on. 🔭
Do you have access to lots of CPU cores? Does your data have more than 32 columns (necessary as of mid-2020)? Then consider Modin. It mimics a subset of the pandas library to speed up operations on large datasets. It uses Apache Arrow (via Ray) or Dask under the hood. The Dask backend is experimental. Some things weren't fast in my tests. For example, reading in data from NumPy arrays was slow, and memory management was an issue.
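If you want to try Modin, the usual pattern is a one-line import swap. This is just a sketch; the engine extra, file name, and column name below are assumptions:

```python
# pip install "modin[ray]"    # Ray engine; "modin[dask]" is the other option
import modin.pandas as pd     # the only change from a regular pandas script

df = pd.read_csv("big_file.csv")        # hypothetical file
print(df.groupby("category").size())    # hypothetical column
```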
You can use jax in place of NumPy. Jax is an open source Google product that's bleeding edge. It speeds up operations by using five things under the hood: autograd, XLA, JIT, vectorizer, and parallelizer. It works on a CPU, GPU, or TPU and might be simpler than using PyTorch or TensorFlow to get speed boosts. Jax is good for deep learning, too. It has a NumPy version but no pandas version yet. However, you could convert a DataFrame to TensorFlow or NumPy and then use jax. Read more here.
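Here's a tiny sketch of the NumPy-like API, assuming jax is installed; the function is just an example:

```python
import jax.numpy as jnp
from jax import jit

@jit  # compile with XLA; runs on CPU, GPU, or TPU
def standardize(x):
    return (x - x.mean()) / x.std()

x = jnp.arange(1_000_000, dtype=jnp.float32)
print(standardize(x)[:5])
```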
Rapids cuDF uses Apache Arrow on GPUs with a pandas-like API. It’s an open source Python package from NVIDIA. Rapids plays nicely with Dask so you could get multiple GPUs processing data in parallel. For the biggest workloads, it should provide a nice boost.
Other stuff to know about code speed and big data
Timing operations
If you want to time an operation in a Jupyter notebook, you can use the `%time` or `%%timeit` magic commands. They both work on a single line or an entire code cell.

`%time` runs once and `%%timeit` runs the code multiple times (the default is seven runs). Do check out the docs to see some subtleties.
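For example, in a notebook cell, assuming a DataFrame named `df` with a numeric `price` column already exists (the cell-magic form, `%%timeit` on the first line of its own cell, times the whole cell instead):

```python
%time df["price"].sum()      # runs once and reports the wall time
%timeit df["price"].sum()    # runs many times and reports an average
```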
If you are in a script or notebook, you can import the `time` module, check the time before and after running some code, and find the difference.
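For example (a minimal sketch; swap in whatever code you want to measure):

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(10_000_000))  # stand-in for your own code
elapsed = time.perf_counter() - start
print(f"Computed {total} in {elapsed:.2f} seconds")
```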

When testing for time, note that different machines and software versions can cause variation. Caching will sometimes mislead if you are doing repeated tests. As with all experimentation, hold everything you can constant. 👍
Storing big data
GitHub's maximum file size is 100MB. You can use the Git Large File Storage extension if you want to version large files with GitHub.
Make sure you aren’t auto-uploading files to Dropbox, iCloud, or some other auto-backup service, unless you want to be.
Want to learn more?
The pandas docs have sections on enhancing performance and scaling to large datasets. Some of these ideas are adapted from those sections.
Have other tips? I’d love to hear them over on Twitter. 🎉
Wrap
You’ve seen how to write faster code. You’ve also seen how to deal with big data and really big data. Finally, you saw some new libraries that will likely continue to become more popular for processing big data.
I hope you’ve found this guide to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀
I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍


Happy big data-ing! 😀
Translated from: https://towardsdatascience.com/17-strategies-for-dealing-with-data-big-data-and-even-bigger-data-283426c7d260