Amazon EC2 – SparkException: Master removed our application: FAILED

I know that there are other very similar questions on Stack Overflow, but those questions are either unanswered or did not help me. In contrast to them, I have put more stack trace and log file information into this question. I hope this helps, even though it makes the question lengthy and ugly. Sorry.

Setup

I am running a 9-node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. Each workload (Cassandra, Search, and Analytics) uses 3 nodes. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.

Problem

Even if I do not run any queries, the application (Spark/Shark shell) is removed after about 3 minutes. Queries on small data sets run successfully only if they finish within those ~3 minutes.

I want to analyze a larger data set, so I need the application (shell) not to be removed after about 3 minutes.

Error Description

In the Spark or Shark shell, after being idle for ~3 minutes or after executing (long-running) queries, Spark eventually aborts and prints the following stack trace:

15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask

This is not very helpful to me on its own, which is why I want to show you more log file information.

Error details/log file

Master

Starting with master.log, I think the interesting parts are:

INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://[email protected]:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://[email protected]:42136 got disassociated, removing it.

And

ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007

Why are the worker nodes disassociated?

In case you need to see it, I also attached the master's executor (ID 1) stdout. The executor's stderr is empty. However, I don't think it is helpful here.

On the Spark Master UI, I verified that all worker nodes are ALIVE. The second screenshot shows the application details.

Spark Master UI

Spark Master UI Application details

One executor is spawned on the master instance, while the executors on the two worker nodes are respawned over and over again until the entire application is removed. Is this okay, or does it indicate a problem? I think it might be related to the "(it) failed 10 times" error message above.

Worker logs

In addition, I can show you the logs of the two Spark worker nodes. I removed most of the classpath arguments to shorten the logs; if you need to see them, please let me know. Since each worker node spawns multiple executors, I attached links to the stdout and stderr dumps of only some (not all) executors. The dumps of the remaining executors look basically the same.

Worker 1

> worker.log
> Executor (ID 10) stdout
> Executor (ID 10) stderr

Worker 2

> worker.log
> Executor (ID 3) stdout
> Executor (ID 3) stderr

The executor dumps seem to indicate some issues with permissions and/or timeouts, but I can't figure out any details from them.

Attempts

As mentioned above, there are similar questions, but none of them was answered or helped me solve the problem. Anyway, here is what I tried and verified:

> Opened port 2552. Nothing changed.
> Increased spark.akka.askTimeout, which made the Spark/Shark application live longer, but it was still removed eventually (see the sketch after this list).
> Ran the Spark shell locally with spark.master=local[4]. On the one hand, this allowed me to successfully run queries that take longer than 3 minutes; on the other hand, it obviously does not take advantage of the distributed environment.
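
For reference, here is a minimal sketch of how the two settings from the list above can be applied programmatically. It assumes a plain SparkConf-based setup rather than the dse spark wrapper, and the timeout value is only illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("timeout-sketch")
      // In Spark 1.1 this value is interpreted as seconds (default: 30).
      .set("spark.akka.askTimeout", "120")
      // Local fallback that bypasses the standalone master entirely;
      // replace with the cluster master URL for the distributed case.
      .setMaster("local[4]")

    val sc = new SparkContext(conf)
    // Trivial job just to confirm that the context stays alive.
    println(sc.parallelize(1 to 1000).sum())
    sc.stop()
  }
}
```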

Summary

To sum up, the timeouts, combined with the fact that long-running queries execute successfully in local mode, suggest some kind of misconfiguration, although I am not sure, and I do not know how to fix it.

Any help will be greatly appreciated.

Edit: I added two Analytics and two Solr nodes after the initial setup of the cluster, just in case that matters.

Edit (2): I was able to solve the problem described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on a larger data set and the shell is no longer removed. I do not intend to post this as an answer to the question, because it is still unclear what was wrong with the three original Analytics nodes. However, since it is a cluster for testing purposes, the nodes could simply be replaced (after replacing them, I performed a nodetool rebuild on each new node so that Cassandra restores its data from the Cassandra data center).

Answer

As mentioned in the Attempts section, the root cause is a timeout between the master node and one or more workers.

Another thing to try: verify, via DNS or via entries in the /etc/hosts file, that the master is reachable from all workers by hostname.
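
A quick way to check this on each node is a small resolution test like the following sketch. The hostnames are placeholders; use the actual master and worker hostnames of your cluster, which should resolve via DNS or via /etc/hosts lines such as "172.31.46.49  ip-172-31-46-49":

```scala
import java.net.InetAddress
import scala.util.{Failure, Success, Try}

object ResolveCheck {
  def main(args: Array[String]): Unit = {
    // Replace with the real hostnames of the master and all workers.
    val hosts = Seq("ip-172-31-46-49", "ip-172-31-46-50", "ip-172-31-46-51")
    hosts.foreach { h =>
      Try(InetAddress.getByName(h)) match {
        case Success(addr) => println(s"$h resolves to ${addr.getHostAddress}")
        case Failure(e)    => println(s"$h does NOT resolve: ${e.getMessage}")
      }
    }
  }
}
```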

In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster had grown over time by starting nodes and adding them to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file. When dse spark was run from one of the "new" nodes, communication from the master using the worker's hostname failed, and the master killed the job.
