Hitachi Vantara Pentaho Community Forums

Thread: Weka & Spark

  1. #1
    Join Date
    Aug 2006
    Posts
    1,741

    Default Weka & Spark

    Hi folks,

    If anyone is interested in the Spark distributed processing framework
    there is now a distributedWekaSpark package for Weka 3.7.12. Some info can
    be found at:

    http://markahall.blogspot.co.nz/2015...and-spark.html


    Feedback and bug reports welcome.

    Cheers,
    Mark.

  2. #2
    Join Date
    Dec 2015
    Posts
    3

    Default

    Dear Mark,

    Over the past week I have been reading about and trying the nice features you have built into Weka for Big Data (especially for the Spark platform). In local mode everything works perfectly, and I have been able to connect Weka with Hadoop and get data from HDFS with no problems.
    However, when I try to connect to Spark in yarn-client mode I run into problems. The Spark version I have is spark-1.3.0-cdh5.4.0, installed using Cloudera Manager. I have replaced the Spark jars in the lib folder inside the distributedWekaSpark package with the assembly file from my Spark distribution. The problem I find is the following:

    15/12/03 13:59:28 INFO yarn.Client: Application report for application_1449150977972_0001 (state: FAILED)
    15/12/03 13:59:28 INFO yarn.Client:
    client token: N/A
    diagnostics: Application application_1449150977972_0001 failed 2 times due to AM Container for appattempt_1449150977972_0001_000002 exited with exitCode: -1000
    For more detailed output, check application tracking page:http://cloudera1.localdomain:8088/pr...7972_0001/Then, click on links to logs of each attempt.
    Diagnostics: File file:/home/bdcoe/wekafiles/packages/distributedWekaSpark/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar does not exist
    java.io.FileNotFoundException: File file:/home/bdcoe/wekafiles/packages/distributedWekaSpark/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    Failing this attempt. Failing the application.
    ApplicationMaster host: N/A
    ApplicationMaster RPC port: -1
    queue: root.bdcoe
    start time: 1449151164038
    final status: FAILED
    tracking URL: http://cloudera1.localdomain:8088/cl...150977972_0001
    user: bdcoe
    It seems unable to find the assembly file, although it is placed there. I have also tried to recompile the package using Maven, but I am new to it and I get some errors.

    Do you have a clue on what is happening and how to solve it?

    Thank you so much,

    Ricard

  3. #3
    Join Date
    Aug 2006
    Posts
    1,741

    Default

    Hi Ricard,

    I'm not sure why Yarn is trying to access the copy of the spark assembly that is included in Weka's distributedWekaSpark lib directory. The code configures the job with the Weka jar files that are required to run on the cluster, and this does not include the assembly because that should already be in the CLASSPATH for Spark on the cluster! It does look (from the stack trace) like Yarn is trying to download the assembly to the local disk - I can only assume that it is doing this because Spark on the client/driver side has this in its CLASSPATH (argh!).
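
    If that is what's happening, then one thing that might be worth a try (just a sketch, untested on my end; the HDFS directory below is only an example) is to stage the assembly in HDFS yourself and point the Spark 1.x property spark.yarn.jar at that copy, so that Yarn localizes it from HDFS rather than resolving a file: path on the node:

    # Stage the assembly once in HDFS (the target directory is arbitrary):
    hadoop fs -mkdir -p /user/spark/share/lib
    hadoop fs -put spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar \
        /user/spark/share/lib/

    # Then set this in the driver-side Spark configuration, so containers
    # fetch the assembly from HDFS (spark.yarn.jar is the Spark 1.x name;
    # Spark 2.x replaced it with spark.yarn.jars/spark.yarn.archive):
    spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar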

    Are you launching the job from a client machine that is separate from the Cloudera cluster? I think I did try on a Cloudera sandbox VM at one stage, but I launched Weka from the VM itself (due to the nightmare involved in configuring DNS and opening ports so that clients outside the VM can talk to Spark). I have executed successfully from one machine on my LAN against Apache Hadoop/Yarn running on another, but in that case my directory structure (though not shared) was identical on both machines. It's possible I might have encountered this problem too if my Yarn machine hadn't had distributedWekaSpark installed locally. The whole Yarn thing is ugly in Spark, especially if you are trying to work with Spark programmatically. They really want everyone to package up their application in a jar and use the spark-submit shell script to run it on the cluster, along the lines of the sketch below.
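
    For anyone who hasn't seen it, a spark-submit invocation looks roughly like the following (the application class and jar are placeholders, not anything from distributedWekaSpark):

    # Standard Spark 1.x submission; com.example.MySparkApp and
    # my-spark-app.jar are placeholders for your own packaged application.
    spark-submit \
      --master yarn-client \
      --class com.example.MySparkApp \
      --num-executors 4 \
      --executor-memory 2g \
      my-spark-app.jar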

    My large-scale testing with distributed Weka has been done using Spark standalone clustering + Hadoop HDFS, running on a Torque cluster with an NFS filesystem. This has worked perfectly, and seems to scale quite nicely. I have run up to 15 nodes and processed datasets on the order of 100 million instances.
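
    If anyone wants to replicate that kind of setup, the standalone daemons are straightforward to bring up (paths here assume a tarball install; CDH puts the scripts under /usr/lib/spark/sbin):

    # On the master host (serves spark://<master-host>:7077 by default):
    $SPARK_HOME/sbin/start-master.sh

    # List the worker hosts in $SPARK_HOME/conf/slaves, then from the master:
    $SPARK_HOME/sbin/start-slaves.sh

    # The Weka Spark jobs then just need their master option pointed at
    # spark://<master-host>:7077 instead of yarn-client or local[*].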

    If I get time next week I'll see if I can replicate this error with Apache Hadoop on my Macs. If my suspicions (above) are correct, then I'm not too sure what the solution is at this stage.

    Cheers,
    Mark.

    P.S. I did get emails from Blogger with your comment, but the comment does not actually seem to be on my blog?!? So I couldn't reply.

  4. #4
    Join Date
    Dec 2015
    Posts
    3

    Default

    Thanks Mark,

    The setup I am currently using is the following:

    I am using an Amazon EC2 VM with CentOS 6.7 installed. I have CDH 5.4.0 with these components: HDFS, HBase, Spark, Yarn, ZooKeeper.
    Everything is on one single machine, so the client is on the same machine as the "cluster" (for test purposes only; the cluster has just one node).

    Cheers,
    Ricard

  5. #5
    Join Date
    Aug 2006
    Posts
    1,741

    Default

    Hi Ricard,

    I finally got a chance to have a go at this myself (if you are still interested). I downloaded cloudera-quickstart-vm-5.5.0 for VMware. This uses Spark 1.5.0 by the looks of it (spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar). Anyhow, I had success after doing the following (there's a consolidated shell sketch after the list):

    * Delete all jar files in $WEKA_HOME/packages/distributedWekaSpark/lib
    * Copy /usr/lib/spark/assembly/lib/spark-assembly.jar to $WEKA_HOME/packages/distributedWekaSpark/lib
    * Copy all jars from /usr/lib/hadoop/client/ to $WEKA_HOME/packages/distributedWekaSpark/lib
    * Obtain a copy of jersey-bundle-1.17.1.jar and place it in $WEKA_HOME/packages/distributedWekaSpark/lib
    - I could not find this jar on the Cloudera VM under /usr/lib. It is needed for some servlet stuff that the Spark driver process does.
    * When running the graphical KnowledgeFlow, MaxPermSize needs to be increased (-XX:MaxPermSize=256m). This is not necessary when running the command-line FlowRunner (or the jobs directly, I expect).
    * The Hadoop conf dir (/etc/hadoop/conf) needs to be on the classpath when starting Weka
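
    Roughly, as a shell session (the jersey jar path is wherever you downloaded it to, and I'm launching via the GUIChooser; $WEKA_HOME defaults to ~/wekafiles):

    # Swap the packaged jars for the cluster's own client jars:
    LIB=$WEKA_HOME/packages/distributedWekaSpark/lib
    rm $LIB/*.jar
    cp /usr/lib/spark/assembly/lib/spark-assembly.jar $LIB/
    cp /usr/lib/hadoop/client/*.jar $LIB/
    cp /path/to/jersey-bundle-1.17.1.jar $LIB/

    # Start Weka with the Hadoop conf dir on the classpath and more PermGen:
    java -XX:MaxPermSize=256m \
      -cp /etc/hadoop/conf:/path/to/weka.jar \
      weka.gui.GUIChooser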

    Anyhow, this was sufficient to run the ArffHeader job in yarn-client mode (and in local mode too). I didn't bother trying to start the Spark stand-alone cluster, but I would expect that to work too (since yarn does, and that is arguably more complex :-)). I didn't see the error that you reported, and it looks like Spark was successfully staging its assembly jar in HDFS, ready for the resource manager to use when launching Spark workers. Perhaps some bugs were fixed/stuff changed since Spark 1.3?
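
    If you'd rather drive it headless, the same job can be run via weka.Run; something like the following should at least print the job's options (same classpath and PermGen settings as above):

    # List the options of the ArffHeaderSparkJob from the
    # distributedWekaSpark package:
    java -XX:MaxPermSize=256m \
      -cp /etc/hadoop/conf:/path/to/weka.jar \
      weka.Run weka.distributed.spark.ArffHeaderSparkJob -h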

    Note that this was running Weka from the actual Cloudera VM, since the yarn resource manager etc. is configured to run on localhost. I didn't try to reconfigure it to bind to an actual IP/hostname (along with all the port-related nightmare that would ensue) in order to test trying Weka/spark-driver from another machine/OS :-)

    Cheers,
    Mark.

  6. #6
    Join Date
    Dec 2015
    Posts
    3

    Default

    Hi Mark,

    I have tried what you suggested, which is quite similar to the procedure I had previously used, and I keep getting the same error. So I guess the problem may reside in the version of Spark actually installed with Cloudera (I have also had problems with some other applications using this Apache Spark version).

    I will upgrade CDH to a newer version, try it again, and let you know the result!

    Thank you Mark!
