Pentaho Data Integration - Kettle / PDI-5313

XML: Create a new step that is capable of processing very large and complex XML files very fast



    • Type: New Feature
    • Status: Closed
    • Severity: Medium
    • Resolution: Fixed
    • Affects Version/s: 4.1.0 GA (Platform Release 3.7.0)
    • Component/s: None


      The use case is to read XML files with different logical blocks, e.g. customers and products, with flexible hierarchies within each of those blocks. Another requirement is to read such files very fast, even when they are very big (several GBs).
      The current "Get Data from XML" implementation uses DOM parsers that process the whole document in memory; even purging parts of the file is not sufficient, since these parts are far too big and produce an OutOfMemoryError (OOME).

      I investigated different XML parsers and, considering the licensing models and the processing style (in-memory or streaming), found that a StAX parser is suitable for this. Different implementations exist, and tests with the Java 6 default are very satisfying. But we should also have the option to use others (e.g. Woodstox).

      The design goals are:
      1) very fast and memory-independent, regardless of the file size
      2) very flexible: read different parts of the XML file in different ways (and avoid parsing the file many times)
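      The streaming approach behind these goals can be sketched with the JDK's built-in StAX API (javax.xml.stream). This is only an illustrative sketch, not the actual step's code: the class name, method name, and the customers/products file layout are assumptions. Note that XMLInputFactory.newInstance() resolves the parser implementation from the classpath, so swapping in e.g. Woodstox requires no code change.

      ```java
      import javax.xml.stream.XMLInputFactory;
      import javax.xml.stream.XMLStreamConstants;
      import javax.xml.stream.XMLStreamException;
      import javax.xml.stream.XMLStreamReader;
      import java.io.Reader;
      import java.io.StringReader;
      import java.util.ArrayList;
      import java.util.List;

      public class StaxBlockReader {

          // Pulls one parse event at a time, so memory use stays flat
          // regardless of file size. Here we extract customer names from
          // the <customers> block and ignore the <products> block.
          public static List<String> readCustomerNames(Reader input) throws XMLStreamException {
              // Picks whatever StAX implementation is on the classpath
              // (the Java 6 default, or Woodstox if its jar is present).
              XMLInputFactory factory = XMLInputFactory.newInstance();
              XMLStreamReader reader = factory.createXMLStreamReader(input);
              List<String> names = new ArrayList<String>();
              boolean inCustomers = false;
              while (reader.hasNext()) {
                  int event = reader.next();
                  if (event == XMLStreamConstants.START_ELEMENT) {
                      String local = reader.getLocalName();
                      if ("customers".equals(local)) {
                          inCustomers = true;
                      } else if (inCustomers && "name".equals(local)) {
                          // getElementText() reads the text content and
                          // advances to the matching end element.
                          names.add(reader.getElementText());
                      }
                  } else if (event == XMLStreamConstants.END_ELEMENT
                          && "customers".equals(reader.getLocalName())) {
                      inCustomers = false;
                  }
              }
              reader.close();
              return names;
          }

          public static void main(String[] args) throws Exception {
              String xml = "<root><customers>"
                         + "<customer><name>Acme</name></customer>"
                         + "<customer><name>Globex</name></customer>"
                         + "</customers><products>"
                         + "<product><name>Widget</name></product>"
                         + "</products></root>";
              System.out.println(readCustomerNames(new StringReader(xml)));
          }
      }
      ```

      Because the reader walks the document exactly once, several such block handlers (customers, products, ...) could be driven from the same event loop, which addresses design goal 2 without re-parsing the file.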


              • Assignee:
                gdavid Golda Thomas
                jbleuel Jens Bleuel
              • Votes: 0
              • Watchers: 2


                • Created: