Why Apache Spark is a Crossover Hit for Data Scientists

Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)


Data scientists performing investigative analytics use interactive statistical environments like R to perform ad-hoc, exploratory analytics in order to answer questions and gain insights. By contrast, data scientists building operational analytics systems have more in common with engineers. They build software that creates and queries machine-learning models that operate at scale in real-time serving environments, using systems languages like C++ and Java, and often use several elements of an enterprise data hub, including the Apache Hadoop ecosystem.

And there are subgroups within these groups of data scientists. For example, some analysts who are proficient with R have never heard of Python or scikit-learn, or vice versa, even though both provide libraries of statistical functions that are accessible from a REPL (Read-Evaluate-Print Loop) environment.

A World of Tradeoffs

It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. If I primarily work in Java, should I really need to know a language like Python or R in order to be effective at exploring data? Coming from a conventional data analyst background, must I understand MapReduce in order to scale up computations? The array of tools available to data scientists tells a story of unfortunate tradeoffs:

  • R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that come before the real analysis work can begin. As a language, it bears little resemblance to the mainstream languages developers know.
  • Python is a general-purpose programming language with excellent libraries for data analysis, like Pandas and scikit-learn. But like R, it’s still limited to working with data that fits on one machine.
  • It’s possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level, and complex computations are difficult to express in it.
  • Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I do sometimes wonder with envy: why can’t we have a nice REPL-like investigative analytics environment like the one Python and R users have? One that’s still scalable and distributed? And has the nice distributed-collection design of Crunch? And can equally be used in operational contexts?

Common Ground in Spark

These are the desires that make me excited about Apache Spark. Discussion of Spark for data science has mostly noted its ability to keep data resident in memory, which can speed up iterative machine learning workloads compared to MapReduce, but to me that is perhaps not even the big news. Spark does not solve every problem for everyone. However, it has a number of features that make it a compelling crossover platform for investigative as well as operational analytics:

  • Spark comes with a machine-learning library, MLlib, albeit bare bones so far.
  • Being Scala-based, Spark embeds in any JVM-based operational system, but can also be used interactively in a REPL in a way that will feel familiar to R and Python users.
  • For Java programmers, Scala still presents a learning curve. But at least any Java library can be used from within Scala.
  • Spark’s RDD (Resilient Distributed Dataset) abstraction resembles Crunch’s PCollection, which has proved a useful abstraction in Hadoop and will already be familiar to Crunch developers. (Crunch can even be used on top of Spark.)
  • Spark imitates Scala’s collections API and functional style, which is a boon to Java and Scala developers, but also somewhat familiar to developers coming from Python. Scala is also a compelling choice for statistical computing.
  • Spark itself, and Scala underneath it, are not specific to machine learning. They provide APIs supporting related tasks, like data access, ETL, and integration. As with Python, the entire data science pipeline can be implemented within this paradigm, not just the model fitting and analysis.
  • Code that is implemented in the REPL environment can be used mostly as-is in an operational context.
  • Data operations are transparently distributed across the cluster, even as you type.

Spark, and MLlib in particular, still has a lot of growing to do. For example, the project needs optimizations, fixes, and deeper integration with YARN. It doesn’t yet provide nearly the depth of library functions that conventional data analysis tools do. But as a best-of-most-worlds platform, it is already sufficiently interesting for a data scientist of any denomination to look at seriously.

In Action: Tagging Stack Overflow Questions

A complete example will give a sense of using Spark as an environment for transforming data and building models on Hadoop. The following example uses a dump of data from the popular Stack Overflow Q&A site. On Stack Overflow, developers can ask and answer questions about software. Questions can be tagged with short strings like “java” or “sql”. This example will build a model that can suggest new tags to questions based on existing tags, using the alternating least squares (ALS) recommender algorithm; questions are “users” and tags are “items”.

Getting the Data

Stack Exchange provides complete dumps of all data, most recently from January 20, 2014. The data is provided as a torrent containing different types of data from Stack Overflow and many sister sites. Only the file stackoverflow.com-Posts.7z needs to be downloaded from the torrent.

This file is just a bzip-compressed file. Spark, like Hadoop, can directly read and split some compressed files, but in this case it is necessary to uncompress a copy onto HDFS. In one step, that’s:

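(A sketch of that step, assuming the p7zip tools and an HDFS destination of /user/ds; substitute bzcat for 7za if the dump is plain bzip2 data.)

```bash
# Decompress the archive and stream the XML straight into HDFS, in one step
7za e -so stackoverflow.com-Posts.7z | hadoop fs -put - /user/ds/Posts.xml
```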

Uncompressed, it consumes about 24.4GB, and contains about 18 million posts, of which 2.1 million are questions. These questions have about 9.3 million tags from approximately 34,000 unique tags.

Set Up Spark

Given that Spark’s integration with Hadoop is relatively new, it can be time-consuming to get it working manually. Fortunately, CDH hides that complexity by integrating Spark and managing setup of its processes. Spark can be installed separately with CDH 4.6.0, and is included in CDH 5 Beta 2. This example uses an installation of CDH 5 Beta 2.

This example uses MLlib, which uses the jblas library for linear algebra, which in turn calls native code using LAPACK and Fortran. At the moment, it is necessary to manually install the Fortran library dependency to enable this. The package is called libgfortran or libgfortran3, and should be available from the standard package manager of major Linux distributions. For example, for RHEL 6, install it with:

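```bash
sudo yum install libgfortran
```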

This must be installed on all machines that have been designated as Spark workers.

Log in to the machine designated as the Spark master with ssh. It will be necessary, at the moment, to ask Spark to let its workers use a large amount of memory. The code in MLlib that is used in this example, in version 0.9.0, has a memory issue, one that is already fixed for the next release. To configure for more memory and launch the shell:

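(The exact configuration is version-dependent; a sketch using Spark 0.9’s environment variables, with an arbitrary memory value.)

```bash
# Let workers use more memory (Spark 0.9-era setting), then start the REPL
export SPARK_MEM=8g
spark-shell
```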

Interactive Processing in the Shell

The shell is the Scala REPL. It’s possible to execute lines of code, define methods, and in general access any Scala or Spark functionality in this environment, one line at a time. You can paste the following steps into the REPL, one by one.

First, get a handle on the Posts.xml file:

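(A sketch, assuming the upload path /user/ds/Posts.xml used earlier.)

```scala
val postsXML = sc.textFile("hdfs:///user/ds/Posts.xml")
```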

In response, the REPL will print something like:

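```
postsXML: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
```

(The RDD id and console line number vary by session.)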

The text file is an RDD (Resilient Distributed Dataset) of Strings, which are the lines of the file. You can query it by calling methods of the RDD class. For example, to count the lines:

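```scala
postsXML.count
```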

This command yields a great deal of output from Spark as it counts lines in a distributed way, and finally prints 18066983.

The next snippet transforms the lines of the XML file into a collection of (questionID, tag) tuples. This demonstrates Scala’s functional programming style, and other quirks. (Explaining them is out of scope here.) RDDs behave like Scala collections, and expose many of the same methods, like map:
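(A sketch of the transformation, assuming the dump’s row format: each post is a <row/> element with an Id attribute and a Tags attribute whose tags are XML-escaped, like &lt;java&gt;&lt;sql&gt;.)

```scala
val postIDTags = postsXML.flatMap { line =>
  // Pull the Id and Tags attributes out of each post's <row .../> element
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Individual tags inside the Tags attribute look like &lt;java&gt;&lt;sql&gt;
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => Nil
    case Some(m) =>
      val postID = m.group(1).toInt
      tagRegex.findAllMatchIn(m.group(2)).map(t => (postID, t.group(1))).toList
  }
}
```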


You will notice that this returns immediately, unlike before. So far, nothing requires Spark to actually perform the transformation. It is possible to force the computation by, for example, calling a method like count, or by telling Spark to compute and persist the result through checkpointing.

The MLlib implementation of ALS operates on numeric IDs, not strings. The tags (“items”) in this data set are strings. It will be sufficient here to hash tags to a nonnegative integer value, use the integer values for the computation, and then use a reverse mapping to translate back to tag strings later. Here, a hash function is defined since it will be reused shortly.

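A minimal sketch; the mapping is declared as a var so that it can be swapped for a cached copy later:

```scala
// Hash a tag to a stable nonnegative Int by clearing the sign bit
def nnHash(tag: String) = tag.hashCode & 0x7FFFFFFF
// Reverse mapping from hashed ID back to tag string, used to report results
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))
```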

Now, you can convert the tuples from before into the format that the ALS implementation expects, and the model can be computed:

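(A sketch using MLlib’s implicit-feedback ALS; the rank of 40 and 10 iterations are illustrative choices.)

```scala
import org.apache.spark.mllib.recommendation._

// Each (question, tag) pair becomes an implicit "rating" of strength 1.0
val alsInput = postIDTags.map { case (postID, tag) => Rating(postID, nnHash(tag), 1.0) }

// Factor the question-tag matrix
val model = ALS.trainImplicit(alsInput, 40, 10)
```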

This will take minutes or more, depending on the size of your cluster, and will spew a large amount of output from the workers. Take a moment to find the Spark master web UI, which can be found from Cloudera Manager, and runs by default at http://[master]:18080. There will be one running application. Click through, then click “Application Detail UI”. In this view it’s possible to monitor Spark’s distributed execution of the lines of code in ALS.scala.

When it is complete, a factored matrix model is available in Spark. It can be used to predict question-tag associations by “recommending” tags to questions. At this early stage of MLlib’s life, there is not yet a proper recommend method that would give suggested tags for a question. However, it is easy to define one:

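(A sketch, built from the model’s predict method and the tagHashes mapping defined earlier.)

```scala
// Score every known tag for one question and return the best as (tag, score)
def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  // Predict a score for this question paired with every distinct tag
  val predictions = model.predict(tagHashes.map(t => (questionID, t._1)))
  // Keep the highest-scoring tags
  val topN = predictions.top(howMany)(Ordering.by[Rating, Double](_.rating))
  // Translate hashed IDs back to tag strings (each lookup is a distributed search)
  topN.map(r => (tagHashes.lookup(r.product).head, r.rating))
}
```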

And to call it, pick any question with at least four tags, like “How to make substring-matching query work fast on a large table?”, and get its ID from the URL. Here, that’s 7122697:

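```scala
recommend(7122697).foreach(println)
```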

This method will take a minute or more to complete, which is slow. The lookups in the last line are quite expensive since each requires a distributed search. It would be somewhat faster if this mapping were available in memory. It’s possible to tell Spark to do this:

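```scala
// Keep the hash-to-tag mapping in memory once it has been computed
tagHashes = tagHashes.cache
```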

Because of the magic of Scala closures, this does in fact affect the object used inside the recommend method just defined. Run the method call again and it will return faster. The result in both cases will be something similar to the following, with scores elided here since they vary from run to run:

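```
(sql,...)
(database,...)
(oracle,...)
(ruby-on-rails,...)
```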

(Your result will not be identical, since ALS starts from a random solution and iterates.) The original question was tagged “postgresql”, “query-optimization”, “substring”, and “text-search”. It’s reasonable that the question might also be tagged “sql” and “database”. “oracle” makes sense in the context of questions about optimization and text search, and “ruby-on-rails” often comes up with PostgreSQL, even though these tags are not in fact related to this particular question.

Something for Everyone

Of course, this example could be more efficient and more general. But for the practicing data scientists out there — whether you came in as an R analyst, Python hacker, or Hadoop developer — hopefully you saw something familiar in different elements of the example, and have discovered a way to use Spark to access some benefits that the other tribes take for granted.

Learn more about Spark’s role in an EDH, and join the discussion in our brand-new Spark forum.

Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Apache Mahout. Sean was primary author of recommender components in Mahout, and has been an active committer and PMC member for the project. He is co-author of Mahout in Action.

-----

FWD from post: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

一、基本概念 Java Lambda 表達式是 Java 8 引入的一個新的功能,主要用途是提供一個函數化的語法來簡化編碼。Lambda表達式本質上是一個匿名方法。Java Lambda 表達式以函數式接口為應用基。內部類(inner class)是定義在另一個類內部的類。二、幾點注意 使用內部類的…