Hitachi Vantara Pentaho Community Forums

Thread: Using Spark with PDI

  1. #1

    Using Spark with PDI

    I am trying to integrate PDI 7.0 (on Windows 7) with Spark (a CDH 5.8 cluster on VMware Workstation, hosted on the same Windows 7 machine) and submit the "JavaWordCount" example, which I believe ships as part of "spark-examples.jar". The Spark job is not getting submitted to the CDH 5.8 cluster from PDI 7.0, and I am not getting a useful error message either. All I get is the following:

    2017/04/03 11:40:31 - Spark PI - ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : Could not submit Spark task: Cannot run program "C:\Chandra\Training\Pentaho_Mywork\Spark\Spark submit.kjb": CreateProcess error=193, %1 is not a valid Win32 application

    2017/04/03 11:40:31 - Spark PI - ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : java.io.IOException: Cannot run program "C:\Chandra\Training\Pentaho_Mywork\Spark\Spark submit.kjb": CreateProcess error=193, %1 is not a valid Win32 application

    2017/04/03 11:40:31 - Spark PI - Caused by: java.io.IOException: CreateProcess error=193, %1 is not a valid Win32 application

    I followed the steps for using Spark with PDI suggested in the Pentaho help documentation - https://help.pentaho.com/Documentati...0/Spark_Submit

    1) Spark is running successfully on the CDH 5.8 QuickStart VM on VMware Workstation
    2) Created a new environment variable HADOOP_CONF_DIR pointing to the PDI shim directory (one way to set and verify it from a Command Prompt is sketched after this list):
    - Variable name: HADOOP_CONF_DIR
    - Variable value: C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh58
    3) On the CDH 5.8 cluster VM (VMware Workstation), added the following line to spark-defaults.conf in /etc/spark/conf:
    spark.yarn.jar=hdfs://root:cloudera@quickstart.cloudera:8020/usr/lib/spark/assembly/lib/spark-examples.jar
    4) Since I am using the CDH 5.8 QuickStart VM, I did not comment out the net.topology.script.file.name property in core-site.xml
    5) Created a home folder in CDH 5.8, with write permissions, for the Pentaho user who will be running the Spark job
    6) Modified the sample job provided in the PDI samples (Spark Submit.kjb) as follows (the equivalent spark-submit command is sketched after this list):
    - Class: org.apache.spark.examples.JavaWordCount
    - Application Jar: hdfs://root:cloudera@quickstart.cloudera:8020/usr/lib/spark/assembly/lib/spark-examples.jar
    - Spark Submit Utility: C:\Chandra\Training\Pentaho_Mywork\Spark\Spark submit.kjb (this is where my .kjb is stored)
    - Master URL: yarn-client
    - Argument (path to the file you want to run Word Count on): hdfs://root:cloudera@quickstart.cloudera:8020/user/training/di2000/input
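
    For step 2, here is one way to set and verify the variable from a Windows Command Prompt (a sketch; the value is the shim path from my installation):

        setx HADOOP_CONF_DIR "C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh58"
        REM setx only takes effect in new processes - open a fresh Command Prompt and check:
        echo %HADOOP_CONF_DIR%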
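
    For reference, step 6 corresponds roughly to this spark-submit invocation (a sketch using the same class, jar, master, and argument as above; running it directly on the cluster VM is a way to check the cluster side independently of PDI):

        spark-submit \
          --class org.apache.spark.examples.JavaWordCount \
          --master yarn-client \
          hdfs://root:cloudera@quickstart.cloudera:8020/usr/lib/spark/assembly/lib/spark-examples.jar \
          hdfs://root:cloudera@quickstart.cloudera:8020/user/training/di2000/input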

    Did I miss any configuration step above? Any help or suggestion would be highly appreciated.

  2. #2


    Your Spark Submit Utility should point to something like "spark/bin/spark-submit", wherever your Spark installation lives - not to your .kjb file.
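
    For example, on a Windows client the entry would look something like this (a sketch; the install location is hypothetical and depends on where your Spark client is unpacked):

        Spark Submit Utility: C:\spark\bin\spark-submit.cmd

    The CreateProcess error=193 in your log is Windows refusing to execute the .kjb file as a program, which is exactly what happens when that field points at a non-executable file.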
