Pentaho Data Integration - Kettle
PDI-16829

The "Get Records from Stream" step retrieves topic metadata in an order the user cannot change, but only when running on Spark


    Details

    • Type: Bug
    • Status: Closed
    • Severity: Urgent
    • Resolution: Fixed
    • Affects Version/s: 8.0.0 GA, Master
    • Fix Version/s: 8.1.0 GA
    • Component/s: None
    • Labels:
    • Environment:
      • Windows 10 64-bit
      • Ubuntu 16.04 LTS 64-bit
      • pentaho-business-analytics-8.0.0.0-28-x64
    • Story Points:
      0
    • Notice:
      When an issue is open, the "Fix Version/s" field conveys a target, not necessarily a commitment. When an issue is closed, the "Fix Version/s" field conveys the version that the issue was fixed in.
    • Sprint Team:
      Tatooine (Maint)
    • Steps to Reproduce:
      1. Configure Spoon to connect to a Hadoop cluster of your choice.
        • e.g. CDH 512 Unsecure
      2. Configure and run a Spark daemon on the chosen Hadoop cluster.
      3. Open Spoon (PDI).
      4. Open or Import the attached three transformations:
        • kafkaConsumer.ktr
        • kafkaSubTrans.ktr
        • kafkaProducer.ktr
      5. Create a new "Run configuration" in Spoon pointing to the Spark daemon on the cluster.
      6. Create a "Hadoop cluster" configuration in Spoon using the information of the chosen cluster.
      7. Modify the following steps in the three transformations to point to the correct Hadoop cluster configured:
        • Kafka Consumer
        • Kafka Producer
        • Hadoop file output
      8. Run the Kafka Consumer transformation on the Spark instance.
      9. Run the Kafka Producer transformation locally in Kettle. (This is only to generate Kafka messages; it does not need to run on Spark.)
      10. Check the Hadoop file system for output from the Kafka topics and observe NO content in the "part-#####" files (a verification sketch follows this list).
      11. Stop all running transformations.
      12. Return to the Kafka Consumer SUB transformation.
      13. Edit the "Get Records from Stream" step.
      14. Move the "Topic" field to the BOTTOM of the fields list and save the step.
      15. Edit the "Hadoop File Output" step and update the field list to match the order of the previous step.
      16. Rerun the Kafka Consumer transformation on the Spark instance.
      17. Rerun the Kafka Producer transformation locally in Kettle to generate more messages.
      18. Check the Hadoop file system for output from the Kafka topics and observe that the "part-#####" files now contain content.
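      The part-file checks in steps 10 and 18 can also be scripted. The sketch below is only an illustration, assuming the Hadoop client libraries are on the classpath; the namenode address and output directory are placeholders and must match the path configured in the "Hadoop File Output" step.

{code:java}
// Minimal check of the "part-#####" files written by the "Hadoop File Output" step.
// The fs.defaultFS value and the output directory are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckPartFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/user/pdi/kafka-output"))) { // placeholder
                if (!status.getPath().getName().startsWith("part-")) {
                    continue;
                }
                System.out.printf("%s (%d bytes)%n", status.getPath(), status.getLen());
                // Print the first record so the field order can be inspected by eye.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    String firstLine = reader.readLine();
                    System.out.println(firstLine == null ? "  <empty>" : "  " + firstLine);
                }
            }
        }
    }
}
{code}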

      Description

      Actual Result
      If a user pulls data from one or more Kafka topics, the metadata fields (key, message, topic, partition, offset, and timestamp) are retrieved and output in a fixed order when the transformation runs on Spark. The field types the user specifies in the "Get records from stream" step can therefore be matched against the wrong incoming values whenever the user's field order differs from the order in which the data arrives from Kafka.

      Note: This does not occur if the user runs the same transformation in Kettle.
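      As an illustration of the suspected failure mode (not taken from the PDI or Spark sources), the snippet below shows what happens if the fields declared in the step dialog are bound to the incoming Kafka values by position rather than by name; both orderings are placeholder examples.

{code:java}
import java.util.Arrays;
import java.util.List;

public class PositionalBinding {
    public static void main(String[] args) {
        // Order in which the Kafka metadata arrives (assumed fixed by the consumer).
        List<String> incoming = Arrays.asList("key", "message", "topic", "partition", "offset", "timestamp");
        // Order the user declared in "Get records from stream" (placeholder example).
        List<String> declared = Arrays.asList("topic", "key", "message", "partition", "offset", "timestamp");

        // Positional binding: the value of "key" lands in the field declared as "topic",
        // so a field can end up holding a value of a different type and meaning.
        for (int i = 0; i < declared.size(); i++) {
            System.out.printf("declared field '%s' receives incoming value '%s'%n",
                    declared.get(i), incoming.get(i));
        }
    }
}
{code}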

      Expected Result
      The user can place the topic metadata fields in the "Get records from stream" step in any order they choose and have each value output under the correct field and type.
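      Again as a sketch rather than the actual implementation, the expected behaviour corresponds to resolving each declared field against the incoming record by name, so the order chosen in the dialog is irrelevant:

{code:java}
import java.util.Arrays;
import java.util.List;

public class NameBasedBinding {
    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("key", "message", "topic", "partition", "offset", "timestamp");
        List<String> declared = Arrays.asList("topic", "key", "message", "partition", "offset", "timestamp");

        // Name-based binding: each declared field looks up its value by name,
        // so reordering the fields in the step dialog cannot mis-assign values.
        for (String field : declared) {
            System.out.printf("declared field '%s' receives incoming value '%s'%n",
                    field, incoming.get(incoming.indexOf(field)));
        }
    }
}
{code}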

        Attachments

          Activity

            People

             • Assignee:
               Unassigned
             • Reporter:
               mbatchelor Marc Batchelor
             • Votes:
               0
             • Watchers:
               0

              Dates

              • Created:
              • Updated:
              • Resolved: