by Shubhi Asthana
How to get started with Databricks
When I started learning Spark with PySpark, I came across the Databricks platform and explored it. This platform made it easy to set up an environment to run Spark dataframes and practice coding. This post contains some steps that can help you get started with Databricks.
Databricks is a platform that runs on top of Apache Spark. It conveniently comes with a notebook system already set up. You can easily provision clusters in the cloud, and it also incorporates an integrated workspace for exploration and visualization.
You can also schedule any existing notebook or locally developed Spark code to go from prototype to production without re-engineering.
1. Set up a Databricks account
To get started with the tutorial, navigate to this link and select the free Community Edition to open your account. This option has a single cluster with up to 6 GB of free storage. It allows you to create a basic Notebook. You'll need a valid email address to verify your account.
You will observe this screen once you successfully log in to your account.
2. Creating a new Cluster
We start by creating a new cluster to run our programs on. Click on “Cluster” on the main page and type in a new name for the cluster.
Next, you need to select the “Databricks Runtime” version. Databricks Runtime is a set of core components that run on clusters managed by Databricks. It includes Apache Spark, but also adds a number of components and updates to improve the usability and performance of the tool.
You can select any Databricks Runtime version — I have selected 3.5 LTS (includes Apache Spark 2.2.1, Scala 2.11). You also have a choice between Python 2 and 3.
It’ll take a few minutes to create the cluster. After some time, you should be able to see an active cluster on the dashboard.
3. Creating a new Notebook
Let’s go ahead and create a new Notebook on which you can run your program.
From the main page, hit “New Notebook” and type in a name for the Notebook. Select the language of your choice — I chose Python here. You can see that Databricks supports multiple languages including Scala, R and SQL.
Once the details are entered, you will observe that the layout of the notebook is very similar to a Jupyter notebook. To test the notebook, let's import pyspark.
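A minimal test cell could look like this (printing the version is an extra check I've added here, not part of the original post):

```python
# Confirm the notebook can see Spark's Python bindings.
import pyspark

print(pyspark.__version__)  # e.g. 2.2.1 on the 3.5 LTS runtime
```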
The command ran in 0.15 seconds and also gives the name of the cluster on which it is running. If there are any errors in the code, they will show up below the cmd box.
You can hit the keyboard icon on the top right corner of the page to see operating system-specific shortcuts.
The most important shortcuts here are:
- Shift+Enter to run a cell
- Ctrl+Enter to run the same cell without moving to the next one
Note that these shortcuts are for Windows. You can check the shortcuts specific to your OS via the keyboard icon.
4. Uploading data to Databricks
Head over to the “Tables” section on the left bar, and hit “Create Table.” You can upload a file, or connect to a Spark data source or some other database.
Let's upload the commonly used iris dataset file here (if you don't have the dataset, use this link).
Once you upload the data, create the table with the UI so you can visualize it and preview it on your cluster. You can then observe the attributes of the table. Spark will try to detect the datatype of each of the columns, and lets you edit it too.
Now I need to add headers for the columns, so I can identify each column by its header instead of _c0, _c1, and so on.
I put their headers as Sepal Length, Sepal Width, Petal Length, Petal Width and Class. Here, Spark incorrectly detected the datatype of the first four columns as String, so I changed it to the desired datatype, Float.
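If you prefer to set the headers and types in code rather than through the table UI, a sketch along these lines works; note that the file path is an assumption (the UI shows the actual path after upload, usually under /FileStore/tables/):

```python
from pyspark.sql.types import StructType, StructField, FloatType, StringType

# Declare the schema explicitly so Spark doesn't have to guess the types.
schema = StructType([
    StructField("Sepal Length", FloatType()),
    StructField("Sepal Width", FloatType()),
    StructField("Petal Length", FloatType()),
    StructField("Petal Width", FloatType()),
    StructField("Class", StringType()),
])

# Hypothetical path; check the upload UI for the real one.
df = spark.read.csv("/FileStore/tables/iris.csv", schema=schema)
```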
5. How to access data from the Notebook
Spark is a framework that can be used to analyze big data using SQL, machine learning, graph processing or real-time streaming analysis. We will be working with SparkSQL and Dataframes in this tutorial.
Let's get started working with the data in the Notebook. The data that we have uploaded is now in tabular format. We require a SQL query to read the data and put it into a dataframe.
Type df = sqlContext.sql("SELECT * FROM iris_data") to read the iris data into a dataframe.
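As a complete notebook cell (sqlContext comes pre-created in Databricks Python notebooks on Spark 2.x; on newer runtimes the spark session object offers the same call):

```python
# Read the uploaded table into a Spark dataframe via SQL.
df = sqlContext.sql("SELECT * FROM iris_data")

# Equivalent on newer runtimes, using the SparkSession entry point:
# df = spark.sql("SELECT * FROM iris_data")
```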
To view the first five rows in the dataframe, I can simply run the command:
display(df.limit(5))
Notice the bar chart icon at the bottom. Once you click it, you can view the data that you have imported into Databricks. To view the bar chart of the complete data, run display(df) instead of display(df.limit(5)).
The dropdown button allows you to visualize the data in different charts like bar, pie, scatter, and so on. It also gives you plot options to customize the plot and visualize specific columns only.
You can also display matplotlib and ggplot figures in Databricks. For a demonstration, see Matplotlib and ggplot in Python Notebooks.
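As a minimal matplotlib sketch (this assumes the column headers set in step 4, and uses the toPandas() conversion covered in the next section):

```python
import matplotlib.pyplot as plt

# Pull the small iris table to the driver for local plotting.
pdf = df.toPandas()

fig, ax = plt.subplots()
ax.scatter(pdf["Sepal Length"], pdf["Petal Length"])
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")
display(fig)  # Databricks renders matplotlib figures passed to display()
```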
To view all the columns of the data, simply type df.columns
To count how many rows there are in total in the dataframe (and see how long a full scan from remote disk/S3 takes), run df.count().
6. Converting a Spark dataframe to a Pandas dataframe
Now if you are comfortable using pandas dataframes and want to convert your Spark dataframe to pandas, you can do this with the following command:
import pandas as pd
pandas_df = df.toPandas()
Now you can use pandas operations on the pandas_df dataframe.
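From here any regular pandas operation applies, for example:

```python
print(pandas_df.head())      # first five rows, pandas-style
print(pandas_df.describe())  # per-column summary statistics
```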
7. Viewing the Spark UI
The Spark UI contains a wealth of information needed for debugging Spark jobs. There are a bunch of great visualizations, so let's take a quick look at them.
To go to the Spark UI, you need to go to the top of the page where there are some menu options like “File,” “View,” “Code,” “Permissions,” and others. You will find the name of the cluster at the top next to “Attached” and a dropdown button next to it. Hit the dropdown button and select “View Spark UI.” A new tab will open up with lots of information about your Notebook.
The UI view gives plenty of information on each job executed on the cluster, its stages, the environment, and the SQL queries executed. This UI can be helpful for users to debug their applications. It also gives a good visualization of Spark streaming statistics. To learn about each aspect of the Spark UI in more detail, refer to this link.
Once you are done with the Notebook, you can go ahead and publish it or export the file in different file formats, so that somebody else can use it via a unique link. I have attached my Notebook in HTML format.
Wrapping up
This is a short overview on how you can get started with Databricks quickly and run your programs. The advantage of using Databricks is that it offers an end-to-end service for building analytics, data warehousing, and machine learning applications. The entire Spark cluster can be managed, monitored, and secured using a self-service model of Databricks.
Here are some interesting links for Data Scientists and for Data Engineers. Also, here is a tutorial which I found very useful and is great for beginners.
Originally published at: https://www.freecodecamp.org/news/how-to-get-started-with-databricks-bc8da4ffbccb/