Dockerless Notebook: The Long-Awaited Future of Data Science

Data science is hard. Data scientists spend hours figuring out how to install that Python package on their laptops. Data scientists read many pages of Google search results to connect to that database. Data scientists write detailed documents so engineers can deploy machine learning models into production. Data scientists prepare nice slides to convince business people how to improve retention rates. Data scientists worry about data pipeline breaks that cause data quality issues.

The challenge of data science is real. There are steep learning curves for new, unfamiliar languages. There are business impact requirements that no one knows how to meet in the limited time available. There are engineering best practices to follow to ensure the quality of deliverables. And there is limited engineering support for the data science team.

What problems do docker containers solve?

For individual data scientists and other data team members, setting up a development environment and keeping the operating environment consistent is a frustrating experience. Installation instructions often do not cover all required dependencies. Some GPU-based AI libraries require data scientists to be familiar with low-level hardware details. Error messages are often not informative enough to explain what went wrong. Dependency conflicts between libraries make it hard to maintain a working development environment across multiple projects. Collaboration between data scientists and engineers requires extra, unnecessary work from both sides.

What about Python virtual environments?

Admittedly, Python virtual environments work nicely for some data scientists. However, they do not meet the diverse requirements of data science tasks:

  1. It has become more common for data scientists to use Spark, R, and SQL daily. How can a Python virtual environment work for languages and frameworks other than Python?
  2. Some data scientists work mainly with their engineering teammates to deploy machine learning models to production. How does a Python virtual environment help if the dependency is on the operating system rather than on a Python library?

The birth of conda alleviated these two issues, and conda is indeed quite popular in the data science community. Installing conda itself is not difficult, and it ships environments with many common data science packages.

However, not all packages that are available on pip are available on conda. If a package cannot be found on conda, data scientists may have to use pip alongside conda, which is a major source of confusion and unexpected issues. For example, in an unresolved GitHub issue, there are many arguments over how pip works with conda.
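One way to see where the confusion comes from is to check which tool installed each package in the current environment. The sketch below is not from the original article. It relies on the `INSTALLER` file that installers are expected to write into a package's metadata directory (pip writes `pip`, and conda-built packages typically record `conda`), and the helper name `packages_by_installer` is hypothetical. Packages without that file are grouped under `unknown`.

```python
from collections import defaultdict
from importlib import metadata


def packages_by_installer():
    """Group installed distributions by the tool named in their INSTALLER file."""
    groups = defaultdict(list)
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            installer = dist.read_text("INSTALLER")
        except Exception:
            continue  # skip broken or partial installs
        if not name:
            continue
        # INSTALLER holds a single word such as "pip"; it may be missing.
        tool = (installer or "unknown").strip().lower() or "unknown"
        groups[tool].append(name)
    return {tool: sorted(names) for tool, names in groups.items()}


if __name__ == "__main__":
    for tool, names in sorted(packages_by_installer().items()):
        print(f"{tool}: {len(names)} packages")
```

Running this inside a conda environment that also uses pip would usually show more than one installer, which is exactly the mixed state the paragraph above warns about.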

Ironically, the VP of Anaconda once gave a talk titled “Conda, Docker, and Kubernetes: The cloud-native future of data science”. Solving 99% of the environment-related issues is not enough. It is the remaining 1% that makes the developer experience unacceptable.

How does a docker container help?

Loosely speaking, a docker container is a “lightweight virtual machine” that packages everything needed to run an application into one docker image. A docker image is designed to move between servers while guaranteeing that the environment stays consistent.

As a result, data scientists no longer need to worry about dependencies breaking when they deploy machine learning models into production. The new graduate who onboarded last week can start contributing to the team as soon as the docker container is running, rather than secretly searching for new positions at companies that have set up better infrastructure for their data science teams.

Why are docker containers not popular among the data science community?

Docker is not a new technology at all, so why have the majority of data scientists not adopted it? There are two main reasons:

  1. The learning curve is steep.
  2. The developer experience is bad.

To get started with docker containers, one has to learn at least how to

  1. start/stop a container
  2. attach a shell to a running container
  3. mount a local volume to a container
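To make the three basics above concrete, here is a minimal sketch that builds the corresponding `docker` command lines from Python. It is an illustration rather than anything from the original article: the function names, the container name `nb`, the `python:3.11-slim` image, and the `/workspace` mount point are all assumptions chosen for the example. The functions only build the commands, so nothing is executed unless you pass the result to `subprocess.run`.

```python
import os
import shlex


def start_container(name, image, host_dir, container_dir="/workspace"):
    """Start a detached container with a local directory mounted into it."""
    # A bind mount needs an absolute host path.
    mount = f"{os.path.abspath(host_dir)}:{container_dir}"
    # "tail -f /dev/null" is a common idiom to keep the container alive.
    return ["docker", "run", "--detach", "--name", name,
            "--volume", mount, image, "tail", "-f", "/dev/null"]


def attach_shell(name, shell="/bin/bash"):
    """Open an interactive shell inside a running container."""
    return ["docker", "exec", "--interactive", "--tty", name, shell]


def stop_container(name):
    """Stop a running container."""
    return ["docker", "stop", name]


if __name__ == "__main__":
    for command in (
        start_container("nb", "python:3.11-slim", "/home/me/project"),
        attach_shell("nb"),
        stop_container("nb"),
    ):
        print(shlex.join(command))
```

Even this small example already involves a dozen flags and conventions, which illustrates how much there is to remember before any data science gets done.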

In reality, these are not enough. How do I sudo inside a container when I do not know the password? Why did my docker container lose all its data after it was stopped? How do I set up a private docker registry so I can pull the docker image from my remote clusters? How can I kill the processes that are using port 8808?
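Some of these questions have short answers once you know where to look. The sketch below, which is not part of the original article, builds three such command lines: opening a shell as root with `docker exec --user root` (no password needed), keeping data in a named volume so it outlives the container when the container is removed, and starting a local registry from the official `registry:2` image. The function names, the `/data` path, and the default port are assumptions for illustration.

```python
import shlex


def root_shell(name, shell="/bin/bash"):
    """Open a shell as root in a running container, without sudo."""
    return ["docker", "exec", "--user", "root",
            "--interactive", "--tty", name, shell]


def run_with_named_volume(name, image, volume, container_dir="/data"):
    """Start a container whose data directory lives in a named volume."""
    # Data written under container_dir survives removal of the container.
    return ["docker", "run", "--detach", "--name", name,
            "--volume", f"{volume}:{container_dir}",
            image, "tail", "-f", "/dev/null"]


def start_private_registry(port=5000):
    """Start a registry container listening on the given host port."""
    return ["docker", "run", "--detach", "--publish", f"{port}:5000",
            "--restart", "always", "--name", "registry", "registry:2"]


if __name__ == "__main__":
    print(shlex.join(root_shell("nb")))
    print(shlex.join(run_with_named_volume("nb", "python:3.11-slim", "nb-data")))
    print(shlex.join(start_private_registry()))
```

Each answer is one line, but finding it typically costs a search session, which is the hidden price of operating docker by hand.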

When it comes to writing a Dockerfile, one has to be familiar with Linux shell commands and Dockerfile syntax. And if each project uses its own docker image, a data scientist may end up with more docker images to manage than a software engineer has.
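For a sense of what a per-project Dockerfile involves, here is a small sketch that renders one from a base image, a list of pip packages, and a start command. It is an illustration under stated assumptions, not the author's tooling: `render_dockerfile` is a hypothetical helper, and the base image, packages, and Jupyter flags in the usage example are placeholders.

```python
import json


def render_dockerfile(base_image, pip_packages, command, workdir="/workspace"):
    """Render a minimal Dockerfile as text."""
    lines = [f"FROM {base_image}", f"WORKDIR {workdir}"]
    if pip_packages:
        # Sort the packages so the same inputs always give the same file.
        packages = " ".join(sorted(pip_packages))
        lines.append(f"RUN pip install --no-cache-dir {packages}")
    # The exec form of CMD is a JSON array of strings.
    lines.append("CMD " + json.dumps(list(command)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile(
        "python:3.11-slim",
        ["pandas", "notebook"],
        ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"],
    ))
```

Multiply this by every project, each with its own packages and system libraries, and the image management burden described above becomes clear.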

So data scientists either have a hard time fixing environment-related issues, giving up reproducibility and suffering from bad engineering practice, or spend too much time learning and operating docker.

It is NOT data scientists’ job to take care of the environment

Data scientists should NOT spend time on environments, so that they can focus on what they are good at: building dashboards, developing machine learning models, and informing business teammates with actionable insights.

Dockerless Notebook is the future

Imagine there is a smart and capable docker helper that does everything for you. When you start the notebook, it automatically starts the container and attaches it to the notebook. When you want to move your notebook to run on a remote cluster, it commits your local docker container, sends it to the remote cluster, and manages it automatically.

The idea of the “Dockerless notebook” is that it allows you to develop and operate notebooks without thinking about docker containers. It is tightly integrated with the notebooks data scientists use every day. It eliminates the need to learn docker and to perform operating tasks such as starting and stopping containers, attaching a shell to containers, and mounting volumes to containers. You won’t even notice that docker is running on your laptop, just as you don’t notice how Jupyter Notebook exchanges data between the browser and memory.
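What such a helper might do behind the scenes can be sketched as a plan of ordinary docker commands. The code below is a hypothetical illustration, not an existing tool: the function names, the image reference, and the use of `ssh` to reach the remote host are all assumptions. It assumes the image reference points at a registry that both the laptop and the cluster can reach.

```python
import shlex
import subprocess


def plan_notebook_start(container):
    """Commands a helper might run when the notebook is opened."""
    return [["docker", "start", container]]


def plan_move_to_cluster(container, image_ref, remote_host):
    """Commands a helper might run to move a notebook to a remote host."""
    return [
        # Snapshot the local container's current state as an image.
        ["docker", "commit", container, image_ref],
        # Upload the image to a registry both machines can reach.
        ["docker", "push", image_ref],
        # Download and start the same image on the remote host.
        ["ssh", remote_host, "docker", "pull", image_ref],
        ["ssh", remote_host, "docker", "run", "--detach",
         "--name", container, image_ref],
    ]


def execute(plan, dry_run=True):
    """Print each command, and run it unless dry_run is set."""
    for command in plan:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    execute(plan_move_to_cluster(
        "nb", "registry.example.com/ds/nb:snapshot", "gpu-box"))
```

The point of the “Dockerless” idea is that the data scientist never sees this plan; the notebook tooling would generate and run it on their behalf.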

The “Dockerless notebook” will help the data science community move closer to “reproducible data science” and “frictionless data science” without unacceptable costs.

Translated from: https://towardsdatascience.com/dockerless-notebook-the-long-awaited-future-of-data-science-7cde7707f7ff
