  1. Pentaho Data Integration - Kettle
  2. PDI-12410

Get Data From XML step fails with "Content is not allowed in prolog." if BOM is present


    Details

    • Type: Bug
    • Status: Closed
    • Severity: High
    • Resolution: Fixed
    • Affects Version/s: 5.1.0 GA
    • Fix Version/s: 7.0.0 GA
    • Component/s: Step
    • Labels:
    • Story Points:
      3
    • PDI Sub-component:

      Description

      Some web services serve UTF-8 XML documents with a byte order mark (BOM). Although that practice is discouraged, it is allowed by the Unicode standard. However, Java XML parsers often fail to skip the BOM, which results in an XML well-formedness error.

      In Kettle, the error message (which originates in dom4j) is somewhat cryptic and gives the user no hint that the BOM is the cause:

      "Content is not allowed in prolog."
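      For illustration, here is a minimal sketch of the failure and of one possible workaround. It uses the JDK's built-in DOM parser rather than dom4j or Kettle's own code, so it may not match how the step reads its input, and the class name, the `stripBom` helper and the sample document are hypothetical. It assumes the XML has already been decoded to a string, so the BOM appears as the character U+FEFF in front of the XML declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class BomStripSketch {
    // The BOM as it appears once UTF-8 bytes EF BB BF have been decoded to text.
    private static final char BOM = '\uFEFF';

    /** Returns the text without a leading U+FEFF, if one is present (hypothetical helper). */
    static String stripBom(String xml) {
        return (!xml.isEmpty() && xml.charAt(0) == BOM) ? xml.substring(1) : xml;
    }

    /** Parses already-decoded XML text with the JDK's built-in DOM parser. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; not the actual World Bank response.
        String withBom = BOM + "<?xml version=\"1.0\"?><countries/>";

        try {
            parse(withBom);
            System.out.println("parsed without error");
        } catch (SAXException e) {
            // The parser sees a character before the XML declaration and rejects the input.
            System.out.println("parse failed: " + e.getMessage());
        }

        // Removing the leading U+FEFF before parsing avoids the error.
        Document doc = parse(stripBom(withBom));
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

      Stripping only a single leading U+FEFF keeps the change narrow: any other content before the XML declaration is still reported as an error.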

      The problem can be reproduced with this URL:

      http://api.worldbank.org/en/countries?page=1

      The attached transformation PDI-12410.ktr illustrates the error.

      Interestingly, the BOM doesn't seem to bother Kettle if the source of the XML is a file. I am attaching another transformation and an XML file with a BOM to demonstrate this.
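      One possible explanation, offered as an assumption and not confirmed anywhere in this issue, is that the two cases reach the parser in different forms: a file may be handed over as raw bytes, where the parser detects the encoding itself, while the web service response may be decoded to text first. The sketch below shows that difference with the JDK's built-in DOM parser only; it does not use dom4j or Kettle code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class BomSourceComparison {
    /** True if the JDK's DOM parser accepts the given source without error. */
    static boolean parses(InputSource source) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(source);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    /** UTF-8 bytes of a small made-up document, preceded by the BOM bytes EF BB BF. */
    static byte[] sampleBytes() {
        String xml = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-8\"?><countries/>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = sampleBytes();

        // Raw bytes: the parser performs its own encoding detection on the stream.
        boolean fromBytes = parses(new InputSource(new ByteArrayInputStream(bytes)));
        System.out.println("byte stream parses: " + fromBytes);

        // Decoded text: Java keeps the BOM as the character U+FEFF at position 0.
        String text = new String(bytes, StandardCharsets.UTF_8);
        boolean fromText = parses(new InputSource(new StringReader(text)));
        System.out.println("decoded text parses: " + fromText);
    }
}
```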

        Attachments

        1. PDI-12410.ktr
          15 kB
        2. PDI-12410-file.txt.ktr
          19 kB
        3. xmlUTF8.xml
          1 kB


            People

            Assignee:
            Unassigned
            Reporter:
            rbouman Roland Bouman (Inactive)
            Votes:
            2
            Watchers:
            4
