Problems Encountered with Spark: ScienceCluster Meets Spark
Published: 2019-05-11



Getting Started

When did all the ‘big data’ hoopla start? By the very first definition, set out in a 1997 paper, a data set that is too big to fit on a local disk has officially graduated to big-data-dom.


Whatever kind of data you’re working with or processing, having the right tool(s) for the job is extremely important. This is especially true when dealing with large, distributed datasets.


Enter Apache Spark.


What is Apache Spark?

When you have a job that is too big to process on a laptop or single server, Spark enables you to divide that job into more manageable pieces. Spark then runs these job pieces in-memory, on a cluster of servers to take advantage of the collective memory available on the cluster.


Apache Spark is an open source processing engine originally developed at UC Berkeley in 2009.


Spark is able to do this thanks to its Resilient Distributed Dataset (RDD) application programming interface (or API). If you want to know more about what happens under the Spark hood, check out this post on how Spark’s RDD API and the original Apache Mapper and Reducer API differ.
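
To give a rough feel for what the RDD API looks like in practice, here is a minimal PySpark sketch (not from the original post; the app name is made up, and it assumes a plain PySpark session where you create your own SparkContext). The key idea is that transformations such as map and filter only describe the computation, and nothing runs on the cluster until an action such as count or collect is called.

from pyspark import SparkContext

sc = SparkContext(appName="RDDSketch")  # hypothetical app name

# parallelize() distributes a local collection across the cluster as an RDD
numbers = sc.parallelize(range(1, 1001))

# Transformations are lazy: these lines only build up a lineage graph
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# An action triggers the actual distributed computation
print(evens.count())  # 500

sc.stop()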


Spark has gained a lot of popularity in the big data world recently due to its lightning-fast computing speed and its wide array of libraries, including SQL and DataFrames, MLlib, GraphX and Spark Streaming. Given how useful and efficient Spark is for interactive queries and iterative big data processing, we decided it was time to invite Spark to the ScienceCluster party.
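
As one quick illustration of those libraries (not part of the original post), here is a sketch of a similar word count written against the DataFrame API. It assumes a Spark version new enough to provide SparkSession (2.x or later), which is more recent than the Spark the original article was written against.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("WordCountDF").getOrCreate()

# Read the text file as a DataFrame with a single "value" column (one row per line)
lines = spark.read.text("shakespeare.txt")

# Split each line into words, one word per row, then count occurrences
counts = (lines
          .select(explode(split(col("value"), " ")).alias("word"))
          .groupBy("word")
          .count())

counts.show(20)
spark.stop()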


Welcome to the Party: Spark in ScienceCluster

ScienceCluster is Yhat’s enterprise workplace for data science teams to collaborate on projects and harness the power of distributed computing to allocate tasks across a cluster of servers. Now that we’ve added support for Spark to ScienceCluster, you can work either interactively or by submitting standalone Python jobs to Spark.


ScienceCluster is a data science platform developed by Yhat (that’s us).

Working interactively means launching a Spark shell in Jupyter in order to prototype or experiment with Spark algorithms. If and when you decide to run your algorithm on your team’s Spark cluster, you can easily submit a standalone job from your Jupyter notebook and monitor its progress, without ever leaving ScienceCluster.


Interactive Spark Analysis with Bill Shakespeare

For this example, suppose that you are interested in performing a word count on a text document using Spark interactively. You’ve chosen to run it on the entire works of William Shakespeare, the man responsible for giving the English language words like swagger, scuffle and multitudinous, among others.


Shakespeare’s plays also feature the first written instances of the phrases “wild goose chase” and “in a pickle.”


Begin by uploading the complete works of William Shakespeare to ScienceCluster. Next, launch the interactive Spark shell that is integrated right into the Jupyter IDE.


Launching an interactive Spark shell in the Jupyter IDE.


This interactive shell works by running a Python-Spark kernel inside a container that a Jupyter notebook can communicate with. Each instance of a Jupyter notebook running the Python-Spark kernel gets its own container.


Next, add the following code to your Jupyter notebook:


from __future__ import print_function
from operator import add

# Use the SparkContext sc provided by the kernel here, see below.
lines = sc.textFile("shakespeare.txt")

counts = (lines.flatMap(lambda x: x.split(' '))
               .map(lambda x: (x, 1))
               .reduceByKey(add))

output = counts.collect()

# Print the first 100 word counts as pairs
for (word, count) in output[:100]:
    print("%s: %i" % (word, count))
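
If you would rather see the most common words first, a small variation (a sketch, not from the original post) is to replace the collect() call with takeOrdered(), which sorts on the cluster and only brings back the rows you ask for:

# Bring back only the 100 most frequent words, sorted by descending count
top_words = counts.takeOrdered(100, key=lambda pair: -pair[1])
for (word, count) in top_words:
    print("%s: %i" % (word, count))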

Note that you do not need to instantiate the SparkContext. The PySpark kernel takes care of this for you; use the sc object as the SparkContext in your notebook session. The path that you use to instantiate an RDD with textFile is just the filename. After you run this code in your notebook, you see the following output:


Output: (the word-count pairs printed by the notebook appear here in the original post)


Running Standalone Spark Jobs

Once you have perfected your Shakespeare word count algorithm, let’s say that you decide it needs to be run on your team’s Spark cluster.


The first step is to convert your notebook code into a standalone Python script. To convert your Shakespeare algorithm, you might borrow the wordcount.py file below from the Spark examples.


from __future__ import print_function

import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()

Note that it is best practice to stop a SparkContext after your code is finished running.
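
One way to make sure the stop happens even when the job raises an exception (a sketch, not something the original post shows) is to wrap the work in try/finally:

from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")
try:
    lines = sc.textFile("shakespeare.txt")
    print(lines.count())
finally:
    # Release the cluster resources even if the job fails part-way through
    sc.stop()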


Next, upload your standalone script to your ScienceCluster project. To run your Spark job from ScienceCluster, indicate that “This is a Spark Job” on your job form and complete the Spark submission fields.
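
For orientation, those submission fields correspond roughly to what you would otherwise pass to spark-submit by hand, something along the lines of spark-submit --master spark://<your-master-host>:7077 wordcount.py shakespeare.txt (the master URL here is a placeholder, not a value from the original post).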


Once you’ve completed the job form, send your standalone job to your Spark Cluster by clicking Run. Your Spark Cluster logs will now stream back to ScienceCluster, accessible via the Running Jobs tab. You can either monitor your job in ScienceCluster to see when it is finished, or opt in to an email alert. Either way, rest you merry as Spark processes your Shakespearian query!


Running a Spark job on ScienceCluster.

Tell Me More About ScienceCluster

Translated from:


Reposted from: http://lvqwd.baihongyu.com/
