Big data notebook
Data science is hard. Data scientists spend hours figuring out how to install a Python package on their laptops. They read through pages of Google search results to work out how to connect to a database. They write detailed documents so that engineers can deploy machine learning models into production. They prepare polished slides to convince business stakeholders how to improve retention rates. They worry that breaks in their data pipelines will cause data quality issues.
The challenge of data science is real. There are steep learning curves for new languages that data scientists are not familiar with. There are business impact requirements that no one knows how to meet in the limited time available. There are engineering best practices to follow to ensure the quality of deliverables. And there is limited engineering support for the data science team.
What problems do docker containers solve?
For individual data scientists and other data team members, setting up a development environment and keeping the operating environment consistent is a frustrating experience. Installation instructions often do not cover all required dependencies. Some GPU-based AI libraries require data scientists to be familiar with low-level details of the hardware. Error messages are often not informative enough to explain the cause of the error. Dependency conflicts between libraries make it hard to maintain a working development environment for multiple projects. Collaboration between data scientists and engineers requires extra, unnecessary work from both sides.
How about Python virtual environments?
Admittedly, Python virtual environments work nicely for some data scientists. However, they do not meet the diverse requirements of data science tasks:
- It has become more common for data scientists to use Spark, R, and SQL daily. How can a Python virtual environment work for languages and frameworks other than Python?
- Some data scientists mainly work with their engineering teammates to deploy machine learning models to production. How does a Python virtual environment help if there is a dependency on the operating system rather than on a Python library?
The birth of conda alleviates these two issues, and conda is quite popular in the data science community. Installing conda itself is not difficult, and it ships environments with many common data science packages.
However, not all packages that are available through pip are available through conda. If a package cannot be found on conda, data scientists may have to use pip alongside conda, which is a major source of confusion and unexpected issues. For example, in this unsolved GitHub issue, there is a lot of debate over how pip works with conda.
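One way to see how mixed an environment already is: many installed Python distributions carry an `INSTALLER` file naming the tool that put them there. The sketch below groups installed packages by that record using the standard library's `importlib.metadata`. The function name `packages_by_installer` is made up for illustration, and the record is only a hint: pip writes it, conda-built packages often do too, but some distributions have no record at all and are grouped under "unknown" here.

```python
from collections import defaultdict
from importlib import metadata


def packages_by_installer():
    """Group installed distributions by the tool named in their INSTALLER file."""
    groups = defaultdict(list)
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
        except Exception:
            name = None  # broken or incomplete metadata
        if not name:
            continue
        # INSTALLER is optional; read_text returns None when the file is absent.
        installer = (dist.read_text("INSTALLER") or "").strip().lower()
        groups[installer or "unknown"].append(name)
    return {tool: sorted(names) for tool, names in groups.items()}


if __name__ == "__main__":
    for tool, names in sorted(packages_by_installer().items()):
        print(f"{tool}: {len(names)} packages")
```

If both pip and conda show up with a sizeable share of the packages, the environment is a mix of the two, which is the situation described above.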
Ironically, the VP of Anaconda once gave a talk titled “Conda, Docker, and Kubernetes: The cloud-native future of data science”. Solving 99% of environment-related issues is not enough. It is the remaining 1% that makes the developer experience unacceptable.
How does a docker container help?
Loosely speaking, a docker container is a “lightweight virtual machine” that runs from a docker image, which packages everything needed to run an application. Docker images are designed to move between servers and guarantee that the environments are consistent.
As a result, data scientists would no longer worry about broken dependencies when deploying machine learning models into production. The new graduate who was onboarded last week can start contributing to the team as soon as the docker container is running, rather than secretly searching for new positions at companies that have better infrastructure set up for their data science teams.
Why are docker containers not popular among the data science community?
Docker is not a new technology at all, so why have the majority of data scientists not adopted it? There are two main reasons:
- The learning curve is steep.
- The developer experience is bad.
To get started with docker containers, one has to learn at least how to:
- start/stop a container
- attach a shell to a running container
- mount a local volume to a container
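To make the three tasks above concrete, here is a minimal sketch of the docker CLI commands behind them, written as small Python functions that only build the argument lists and do not run anything. The container name `ds-notebook`, the image `jupyter/scipy-notebook`, and the host path are placeholder values chosen for illustration, not something the article prescribes.

```python
import shlex


def start_container(name):
    """Command to start an existing, stopped container."""
    return ["docker", "start", name]


def stop_container(name):
    """Command to stop a running container."""
    return ["docker", "stop", name]


def attach_shell(name, shell="/bin/bash"):
    """Command to open an interactive shell inside a running container."""
    return ["docker", "exec", "-it", name, shell]


def run_with_volume(image, host_dir, container_dir, name):
    """Command to create a background container with a host directory mounted."""
    mount = f"{host_dir}:{container_dir}"
    return ["docker", "run", "-d", "--name", name, "-v", mount, image]


if __name__ == "__main__":
    # Print the commands instead of executing them (placeholder values).
    commands = [
        run_with_volume("jupyter/scipy-notebook", "/home/me/project",
                        "/home/jovyan/work", "ds-notebook"),
        attach_shell("ds-notebook"),
        stop_container("ds-notebook"),
        start_container("ds-notebook"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where docker is installed. Keeping the construction separate from execution makes the commands easy to inspect first.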
In reality, these are not enough. How do I sudo inside a container when I do not know the password? Why did my docker container lose all its data after it was stopped? How do I set up a private docker registry so I can pull the docker image from my remote clusters? How can I kill the processes that are using port 8808?
When it comes to writing a Dockerfile, one has to be familiar with Linux shell commands and Dockerfile syntax. If each project uses its own docker image, a data scientist ends up with more docker images to manage than a software engineer might have.
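For a sense of what that syntax looks like, the sketch below renders a very small Dockerfile from a base image and a list of pip packages. It is an illustration under assumptions rather than a recommended setup: `render_dockerfile` is a hypothetical helper, and the base image `python:3.11-slim`, the package list, and the `/workspace` directory are example values.

```python
def render_dockerfile(base_image, pip_packages, workdir="/workspace"):
    """Return the text of a minimal Dockerfile for a Python project."""
    lines = [
        f"FROM {base_image}",   # image to build on top of
        f"WORKDIR {workdir}",   # working directory inside the image
    ]
    if pip_packages:
        # One RUN instruction installs all Python dependencies.
        lines.append("RUN pip install --no-cache-dir " + " ".join(pip_packages))
    lines.append('CMD ["python"]')  # default command when a container starts
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("python:3.11-slim", ["pandas", "scikit-learn"]))
```

Even this toy version shows the two skill sets involved: Dockerfile instructions (`FROM`, `WORKDIR`, `RUN`, `CMD`) and the shell commands that go inside `RUN`.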
So data scientists either have a hard time fixing environment-related issues, give up reproducibility and suffer from bad engineering practices, or spend too much time learning and operating docker.
It is NOT data scientists’ job to take care of the environment
Data scientists should NOT spend time on environments, so that they can focus on what they are good at: building dashboards, developing machine learning models, and informing business teammates with actionable insights.
Dockerless Notebook is the future
Imagine there is a smart and capable docker helper that does everything for you. When you start the notebook, it automatically starts the container and attaches it to the notebook. When you want to move your notebook to run on a remote cluster, it commits your local docker container, sends it to the remote cluster, and manages it automatically.
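The article does not describe how such a helper would be built. As a rough sketch of the "commit and send to a remote cluster" step, the code below plans the docker commands a helper might issue on your behalf. It assumes an image registry that both the laptop and the cluster can reach, and SSH access to the cluster. The names `plan_move_to_remote` and `run_plan`, the registry address, and the host are all hypothetical.

```python
import shlex
import subprocess


def plan_move_to_remote(container, image_tag, remote_host):
    """List the commands needed to snapshot a container and run it remotely."""
    return [
        ["docker", "commit", container, image_tag],              # snapshot local state
        ["docker", "push", image_tag],                           # upload to the registry
        ["ssh", remote_host, "docker", "pull", image_tag],       # fetch on the cluster
        ["ssh", remote_host, "docker", "run", "-d", image_tag],  # start it remotely
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    plan = plan_move_to_remote(
        "ds-notebook",
        "registry.example.com/team/notebook:snapshot",  # placeholder registry
        "user@cluster.example.com",                      # placeholder host
    )
    run_plan(plan)  # dry run: only prints the commands
```

The point of the "Dockerless" idea is that a data scientist would never type or even see these commands. The helper would run them when the notebook is moved.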
The idea of the “Dockerless notebook” is that it allows you to develop and operate notebooks without thinking about docker containers. It is tightly integrated with the notebooks data scientists use every day. It eliminates the need to learn docker and to perform operating tasks such as starting and stopping containers, attaching a shell to containers, and mounting volumes to containers. You won’t even notice that docker is running on your laptop, in the same way that you don’t notice how Jupyter Notebook exchanges data between the browser and memory.
The “Dockerless notebook” will help the data science community move closer to “reproducible data science” and “frictionless data science” without unacceptable costs.
Translated from: https://towardsdatascience.com/dockerless-notebook-the-long-awaited-future-of-data-science-7cde7707f7ff