Pentaho Data Integration - Kettle / PDI-16942

When using the Excel Input Step an org.apache.poi.POIXMLException error can result if the compression ratio of the file is too high

    Details

    • Type: Bug
    • Status: Closed
    • Severity: High
    • Resolution: Fixed
    • Affects Version/s: 7.0.0 GA, 7.1.0 GA, 8.0.0 GA
    • Fix Version/s: 8.1.0 GA
    • Component/s: Step
    • Story Points:
      0
    • Sprint Team:
      Tatooine (Maint)
    • Steps to Reproduce:
      1. Download the ktr and excel file in the linked ESR
      2. Run the transformation in Spoon 7.1 or 8.0

      Description

      In Pentaho 7.0 and above, the Excel Input step can throw an exception when the compression ratio of an xlsx file is too high. The issue seems to be related to the ZipSecureFile class, which was introduced in version 3.14 of the Apache POI library; Pentaho 7.0 and above use POI 3.15. The class appears to be a security feature that tries to determine whether the xlsx file is a zip bomb. If the check fails, Pentaho offers no way to mark the file as trusted.
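To see in advance whether a given xlsx file is likely to trip this check, the compressed and expanded sizes of its zip entries can be compared. The following is a minimal sketch that uses only java.util.zip, not POI itself. The class name InflateRatioReport and the helper ratio() are hypothetical, and the 0.01 threshold is assumed from the MIN_INFLATE_RATIO value that POI prints in its error message. POI applies its check while streaming the data, so a whole-entry ratio is only an approximation of what POI sees.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class InflateRatioReport {
    // Assumption: default limit as shown in POI's message ("MIN_INFLATE_RATIO: 0.01").
    static final double MIN_INFLATE_RATIO = 0.01;

    /** Compressed-to-expanded ratio; unknown or empty entries are reported as 1.0. */
    static double ratio(long compressedSize, long expandedSize) {
        if (compressedSize < 0 || expandedSize <= 0) {
            return 1.0;
        }
        return (double) compressedSize / expandedSize;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the xlsx file (an xlsx file is a zip archive).
        try (ZipFile zip = new ZipFile(args[0])) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                double r = ratio(entry.getCompressedSize(), entry.getSize());
                String flag = r < MIN_INFLATE_RATIO ? "  <-- below limit" : "";
                System.out.printf("%-40s ratio=%.6f%s%n", entry.getName(), r, flag);
            }
        }
    }
}
```

Running it with the path of the workbook as the only argument lists every part of the file, and any entry flagged as below the limit is a candidate for the failure described here.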

      Example of what the exception might look like:

      2018/01/16 12:20:05 - Microsoft Excel Input.0 - ERROR (version 7.1.0.3-57, build 1 from 2017-08-23 11.43.14 by buildguy) : Error processing row from Excel file [/pentaho/ERI/source/MARS/UAT2/working_dir/20180116/ERI/Balance_Type_Codes.xlsx] : org.pentaho.di.core.exception.KettleException: 
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - java.lang.reflect.InvocationTargetException
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - ERROR (version 7.1.0.3-57, build 1 from 2017-08-23 11.43.14 by buildguy) : org.pentaho.di.core.exception.KettleException: 
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - java.lang.reflect.InvocationTargetException
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.pentaho.di.trans.steps.excelinput.poi.PoiWorkbook.<init>(PoiWorkbook.java:81)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.pentaho.di.trans.steps.excelinput.WorkbookFactory.getWorkbook(WorkbookFactory.java:41)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.pentaho.di.trans.steps.excelinput.ExcelInput.getRowFromWorkbooks(ExcelInput.java:553)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.pentaho.di.trans.steps.excelinput.ExcelInput.processRow(ExcelInput.java:431)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at java.lang.Thread.run(Thread.java:745)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - Caused by: org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:63)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:625)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:186)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:263)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:222)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:201)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.pentaho.di.trans.steps.excelinput.poi.PoiWorkbook.<init>(PoiWorkbook.java:73)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	... 5 more
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - Caused by: java.lang.reflect.InvocationTargetException
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:56)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	... 12 more
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - Caused by: java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data. This may indicate that the file is used to inflate memory usage and thus could pose a security risk. You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit. Counter: 4916901, cis.counter: 49152, ratio: 0.009996540503866155Limits: MIN_INFLATE_RATIO: 0.01
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.advance(ZipSecureFile.java:257)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:214)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLEntityScanner.scanQName(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:115)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.openxmlformats.schemas.spreadsheetml.x2006.main.StyleSheetDocument$Factory.parse(Unknown Source)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.xssf.model.StylesTable.readFrom(StylesTable.java:203)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	at org.apache.poi.xssf.model.StylesTable.<init>(StylesTable.java:146)
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - 	... 18 more
      2018/01/16 12:20:05 - Microsoft Excel Input.0 - Finished processing (I=0, O=0, R=0, W=0, U=0, E=1)
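The numbers in the final "Caused by" line can be checked by hand. The sketch below assumes that "Counter" is the number of expanded bytes and "cis.counter" the number of compressed bytes read so far; this reading is inferred from the message text and from the fact that their quotient reproduces the logged ratio. The class name ZipBombLogCheck and the helper inflateRatio() are hypothetical.

```java
class ZipBombLogCheck {
    // Limit as printed in the log ("MIN_INFLATE_RATIO: 0.01").
    static final double MIN_INFLATE_RATIO = 0.01;

    /** Ratio of compressed bytes consumed to expanded bytes produced. */
    static double inflateRatio(long compressedBytes, long expandedBytes) {
        return (double) compressedBytes / expandedBytes;
    }

    public static void main(String[] args) {
        long expanded = 4916901L;  // "Counter" in the log (assumed: expanded bytes)
        long compressed = 49152L;  // "cis.counter" in the log (assumed: compressed bytes)
        double ratio = inflateRatio(compressed, expanded);
        System.out.println("ratio = " + ratio);
        // 49152 is less than 1% of 4916901 (49169.01), so this prints "true".
        System.out.println("below limit = " + (ratio < MIN_INFLATE_RATIO));
    }
}
```

The file misses the limit only narrowly: roughly 0.009997 against 0.01. The message itself names ZipSecureFile.setMinInflateRatio() as the way to adjust the limit, which is the setting the Excel Input step does not expose.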
      

              People

              • Assignee:
                upihin Uladzimir Pihin (Inactive)
              • Reporter:
                jeicher@pentaho.com Chris Eicher