Hitachi Vantara Pentaho Community Forums

Thread: Issues connecting Cloudera CDH5 (VM) on PDI

  1. #1
    Join Date
    Feb 2008
    Posts
    1

    Default Issues connecting Cloudera CDH5 (VM) on PDI

    I am attempting to do a simple Hadoop File Input via PDI/Spoon and am getting a BlockMissingException error. I can browse the HDFS file system and locate the file I need to load, but any attempt to preview it (or push it to an Output file) fails with the same error.

    Here are the system configurations:
    Pentaho (running on Windows 7 Ultimate SP1, 64-bit OS, 8GB RAM, Intel i7-4770):
    BA Suite 5.1.0
    Kettle / Spoon 5.1.0 GA Release

    Cloudera (running on VirtualBox VM, Red Hat 64 bit OS):
    [cloudera@localhost ~]$ hadoop version
    Hadoop 2.3.0-cdh5.0.0
    Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
    Compiled by jenkins on 2014-03-28T04:30Z
    Compiled with protoc 2.5.0
    From source with checksum fae92214f92a3313887764456097e0
    This command was run using /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/hadoop-common-2.3.0-cdh5.0.0.jar

    I have followed the instructions found here (http://wiki.pentaho.com/display/BAD/...ro+and+Version) and updated every instance of /pentaho-big-data-plugin/plugin.properties (i.e. not just the DI server, but also the BA Server, Spoon, Report Designer, and Metadata Editor) to set active.hadoop.configuration=cdh50, and confirmed the proper shims are in each sub-directory.
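
    For reference, here is the relevant line as I set it in each copy of plugin.properties (the path shown is the Spoon one; each tool keeps its own copy):

    Code:
      # data-integration/plugins/pentaho-big-data-plugin/plugin.properties
      # selects the CDH 5.0 shim under hadoop-configurations/
      active.hadoop.configuration=cdh50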

    I've also edited the core-site.xml file on the CDH5 VM to set fs.default.name to hdfs://{vm-ip}:8020 (in my case hdfs://10.0.2.15:8020), as cited in a similar forum post (http://forums.pentaho.com/showthread...nnection-issue).
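
    The property as it now stands in the VM's core-site.xml (values from my setup above):

    Code:
      <property>
        <name>fs.default.name</name>
        <value>hdfs://10.0.2.15:8020</value>
      </property>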

    Below is a paste of the error message returned by PDI when attempting to read the first 100 lines (a similar message is returned when I attempt to Get Fields):
    ~~~~~~~~~~~~~~~~~~~~~~~~
    org.pentaho.di.core.exception.KettleException:
    Error getting first 100 from file hdfs://cloudera:cloudera@10.0.2.15:8020/user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv

    Exception reading line: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2063677620-10.0.2.15-1396649405250:blk_1073742633_1817 file=/user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv
    Could not obtain block: BP-2063677620-10.0.2.15-1396649405250:blk_1073742633_1817 file=/user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv

    at org.pentaho.di.ui.trans.steps.hadoopfileinput.HadoopFileInputDialog.getFirst(HadoopFileInputDialog.java:2829)
    at org.pentaho.di.ui.trans.steps.hadoopfileinput.HadoopFileInputDialog.first(HadoopFileInputDialog.java:2734)
    at org.pentaho.di.ui.trans.steps.hadoopfileinput.HadoopFileInputDialog.access$200(HadoopFileInputDialog.java:116)
    at org.pentaho.di.ui.trans.steps.hadoopfileinput.HadoopFileInputDialog$3.handleEvent(HadoopFileInputDialog.java:472)
    at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
    at org.pentaho.di.ui.trans.steps.hadoopfileinput.HadoopFileInputDialog.open(HadoopFileInputDialog.java:679)
    at org.pentaho.di.ui.spoon.delegates.SpoonStepsDelegate.editStep(SpoonStepsDelegate.java:124)
    at org.pentaho.di.ui.spoon.Spoon.editStep(Spoon.java:8648)
    at org.pentaho.di.ui.spoon.trans.TransGraph.editStep(TransGraph.java:3020)
    at org.pentaho.di.ui.spoon.trans.TransGraph.mouseDoubleClick(TransGraph.java:737)
    at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown Source)
    at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
    at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:1297)
    at org.pentaho.di.ui.spoon.Spoon.waitForDispose(Spoon.java:7801)
    at org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:9130)
    at org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:638)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.pentaho.commons.launcher.Launcher.main(Launcher.java:151)
    Caused by: org.pentaho.di.core.exception.KettleFileException:
    Exception reading line: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2063677620-10.0.2.15-1396649405250:blk_1073742633_1817 file=/user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv
    Could not obtain block: BP-2063677620-10.0.2.15-1396649405250:blk_1073742633_1817 file=/user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv
    at org.pentaho.di.trans.steps.textfileinput.TextFileInput.getLine(TextFileInput.java:157)
    at org.pentaho.di.trans.steps.textfileinput.TextFileInput.getLine(TextFileInput.java:95)
    at org.pentaho.di.ui.trans.steps.hadoopfileinput.HadoopFileInputDialog.getFirst(HadoopFileInputDialog.java:2822)
    ... 26 more
    Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2063677620-10.0.2.15-1396649405250:blk_1073742633_1817 file=/user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:883)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:560)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
    at java.io.DataInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at org.apache.commons.vfs.util.MonitorInputStream.read(Unknown Source)
    at org.pentaho.di.core.compress.CompressionInputStream.read(CompressionInputStream.java:36)
    at java.io.InputStream.read(Unknown Source)
    at sun.nio.cs.StreamDecoder.readBytes(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at sun.nio.cs.StreamDecoder.read0(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at org.pentaho.di.trans.steps.textfileinput.TextFileInput.getLine(TextFileInput.java:106)
    ... 28 more

    ~~~~~~~~~~~~~~~~~~~~~~~~

    The bad-block error message suggests HDFS corruption, so I checked as follows:
    [cloudera@localhost ~]$ hadoop fsck /user/cloudera
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    Connecting to namenode via http://localhost.localdomain:50070
    FSCK started by cloudera (auth:SIMPLE) from /127.0.0.1 for path /user/cloudera at Tue Jul 15 16:51:34 PDT 2014
    ..Status: HEALTHY
    Total size: 13611942 B
    Total dirs: 10
    Total files: 2
    Total symlinks: 0
    Total blocks (validated): 2 (avg. block size 6805971 B)
    Minimally replicated blocks: 2 (100.0 %)
    Over-replicated blocks: 0 (0.0 %)
    Under-replicated blocks: 0 (0.0 %)
    Mis-replicated blocks: 0 (0.0 %)
    Default replication factor: 1
    Average block replication: 1.0
    Corrupt blocks: 0
    Missing replicas: 0 (0.0 %)
    Number of data-nodes: 1
    Number of racks: 1
    FSCK ended at Tue Jul 15 16:51:34 PDT 2014 in 38 milliseconds

    The filesystem under path '/user/cloudera' is HEALTHY

    I was also able to cat the file and run a Pig script against it (see the commands below), and the Cloudera Manager health check shows no bad blocks either. So it seems the CDH5 instance is fine, and something in the way I'm connecting via PDI is the root cause.
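
    For completeness, the sanity checks I ran on the VM (the Pig script is omitted here):

    Code:
      # both complete without errors, so the file and its blocks are readable on the VM itself
      hadoop fs -cat /user/cloudera/pentaho_sample_pdi_testing/test_recip2.csv | head
      hadoop fsck /user/cloudera -files -blocks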

    Does anybody have any ideas? Any help, advice, or guidance would be greatly appreciated.

    Cheers,
    -Jivechops

  2. #2
    Join Date
    Jan 2014
    Posts
    3

    Default

    Hi Jivechops,

    Not sure if this will resolve the issue, but did you also configure the yarn-site, mapred-site, and hive-site files?

    Instructions on how to do that appear here:
    http://wiki.pentaho.com/display/BAD/...for+YARN+Shims
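
    Those files live alongside plugin.properties in the active shim folder; on a default Spoon install that would be something like:

    Code:
      data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh50/yarn-site.xml
      data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh50/mapred-site.xml
      data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh50/hive-site.xml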

  3. #3
    rfellows Guest

    Default dfs.client.use.datanode.hostname

    Hi, I was recently having this same problem when connecting PDI to the CDH 5.7 quickstart docker container. The solution for me was to modify the hdfs-site.xml in the shim configuration on the PDI client machine, setting the dfs.client.use.datanode.hostname property to true. That property is false by default.

    In file data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh57/hdfs-site.xml
    Code:
      <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
      </property>
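
    As I understand it, the reason this helps is that with a NAT'ed VM or docker container the namenode hands the client the datanode's internal IP, which isn't reachable from the host; that is exactly why browsing HDFS works but reading blocks fails. With the property set to true, the client asks for the datanode's hostname instead, which you can map yourself on the PDI machine (quickstart.cloudera is the quickstart image's default hostname; adjust for your setup):

    Code:
      # hosts file on the PDI client machine
      # Windows: C:\Windows\System32\drivers\etc\hosts, Linux: /etc/hosts
      127.0.0.1   quickstart.cloudera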

    I know this is a few years too late for the original problem, but maybe this will help others in the future.
