Uploaded image for project: 'Pentaho Data Integration - Kettle'
  1. Pentaho Data Integration - Kettle
  2. PDI-7235

NullPointerException in DimensionLookup.addToCache()

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Severity: High
    • Resolution: Fixed
    • Affects Version/s: 4.2.0 GA (4.0.0 GA Suite Release)
    • Component/s: Step
    • Labels:
      None
    • Environment:
      MS SQL Server 2008 R2, using Windows Authentication and jTDS 1.2.5
    • PDI Sub-component:
    • Notice:
      When an issue is open, the "Fix Version/s" field conveys a target, not necessarily a commitment. When an issue is closed, the "Fix Version/s" field conveys the version that the issue was fixed in.
    • Operating System/s:
      Windows Server 2008 R2 (64-bit)

      Description

      A NullPointerException is thrown in two places (in two scenarios) from DimensionLookup.addToCache, although both scenarios seem to have the same cause, which is related to reading data from MS SQL Server and/or the way the Dimension cache is managed.

      Key elements of the circumstances in which this occurs: (a) Using MS SQL Server (in my case MS SQL Server 2008 R2); and (b) a type 2 slowly changing dimension is used, with its technical key being an auto-incrementing integer; and (c) cache enabled in the dimension; and (d) the dimension being updated with more rows than the cache holds.

      In this bug report:

      A. Setup
      B. Scenario 1
      C. Scenario 2
      D. Discussion of apparent source of problem
      E. Workarounds

      A. Setup

      Both scenarios below rely on this setup:

      Create a new, empty, database in MS SQL Server. Then add a dimension table with a seed row using the following SQL script:

      create table dim_demo
      ( demo_id int identity(1,1) primary key,
      demo_key varchar(10) not null,
      demo_data varchar(30) not null,

      demo_version int not null,
      demo_valid_from datetime not null,
      demo_valid_to datetime not null,
      demo_last_version char(1) not null
      );

      set identity_insert dim_demo on;

      insert into dim_demo
      ( demo_id,
      demo_key,
      demo_data,

      demo_version,
      demo_valid_from,
      demo_valid_to,
      demo_last_version )

      values ( 1,
      'DEMOSEED',
      'Demo dimension seed',

      1,
      '1970-01-01 00:00:00',
      '1970-01-01 00:00:00',
      'Y'
      );

      set identity_insert dim_demo off;

      (I will add this SQL file as an attachment.)

      B. Scenario 1

      Steps to reproduce Scenario 1 error:

      i. Perform the setup as above.
      ii. Use Pan to run scenario-1.ktr. (You can also run it from Spoon, but that doesn't seem to give stack traces.) It should run fine.
      iii. Run scenario-1.ktr a second time. You should get the following exception:

      ERROR 16-01 19:12:33,569 - Dimension lookup/update - Unexpected error
      ERROR 16-01 19:12:33,569 - Dimension lookup/update - java.lang.NullPointerException
      at java.util.Arrays.mergeSort(Unknown Source)
      at java.util.Arrays.sort(Unknown Source)
      at java.util.Collections.sort(Unknown Source)
      at org.pentaho.di.trans.steps.dimensionlookup.DimensionLookup.addToCache(DimensionLookup.java:1433)
      at org.pentaho.di.trans.steps.dimensionlookup.DimensionLookup.lookupValues(DimensionLookup.java:424)
      at org.pentaho.di.trans.steps.dimensionlookup.DimensionLookup.processRow(DimensionLookup.java:221)
      at org.pentaho.di.trans.step.RunThread.run(RunThread.java:40)
      at java.lang.Thread.run(Unknown Source)

      C. Scenario 2

      Steps to reproduce Scenario 2 error:

      i. Perform the setup as above.
      ii. Use Pan to run scenario-2-part-1.ktr
      iii. Use Pan to run scenario-2-part-2-.ktr. You should get the following (slightly different) exception:

      ERROR 16-01 19:11:42,550 - Dimension lookup/update - Unexpected error
      ERROR 16-01 19:11:42,550 - Dimension lookup/update - java.lang.NullPointerException
      at org.pentaho.di.trans.steps.dimensionlookup.DimensionLookup.addToCache(DimensionLookup.java:1457)
      at org.pentaho.di.trans.steps.dimensionlookup.DimensionLookup.lookupValues(DimensionLookup.java:424)
      at org.pentaho.di.trans.steps.dimensionlookup.DimensionLookup.processRow(DimensionLookup.java:221)
      at org.pentaho.di.trans.step.RunThread.run(RunThread.java:40)
      at java.lang.Thread.run(Unknown Source)

      (There is a small element of randomness in the critical code, so you may need to run it a couple of times to get the above exception, although I've found that it appears consistently.)

      D. Discussion of apparent source of problem

      Both exceptions above stem from DimensionLookup.addToCache(), but one lies deeper than the other. The null can be tracked back to the following code:

      // Lines 1428 and 1429, relevant to Scenario 1:

      Long tk = data.cacheValueRowMeta.getInteger(values, 0);
      samples.add(tk);

      // Line 1457, relevant to Scenario 2:

      long tk = data.cacheValueRowMeta.getInteger(values, 0).longValue();

      The Dimension's cache is expected to contain the technical key for each row cached, and this is obtained using data.cacheValueRowMeta.getInteger(values, 0). However, in our two scenarios a null has been used instead of the technical key, so data.cacheValueRowMeta.getInteger(values, 0) returns a null instead of a Long.

      In Scenario 1 the null has been added to a Collection called samples, which is used to identify suitably old rows to evict from the cache. But when we subsequently sort our collection of samples (line 1433) Java complains that it's trying to sort a null value.

      In Scenario 2 the technical key is null, hence calling method longValue() on null throws the exception.

      So how does a null technical key get into the cache in the first place? It happens earlier in the transformation, when a new version of a row gets inserted into the dimension, at DimensionLookup line 647. This line calls dimInsert() which should return the technical key of the new row, which then gets put in the cache. However, dimInsert() fails to get the new technical key, and instead it returns null. Thus null gets put into the cache in place of a technical key. This becomes a problem when it is accessed later.

      E. Workarounds

      I can think of the following workarounds for the moment:

      (a) don't enable the dimension cache;
      (b) if you have to enable the dimension cache then make sure it's larger than the number of rows that might go through it in any one run of the transformation;
      (c) don't use an auto-increment for the technical key.

      Or at a push, I suppose you could also try:

      (d) don't use MS SQL Server 2008 R2 But this bug may of course exist on other databases, too.

        Attachments

        1. Clinical Advice Dim Look up error.ktr
          34 kB
        2. pentaho.log
          6 kB
        3. scenario-1.ktr
          14 kB
        4. scenario-2-part-1.ktr
          14 kB
        5. scenario-2-part-2.ktr
          14 kB
        6. setup.sql
          0.6 kB

          Activity

            People

            • Assignee:
              jdorlus John Dorlus (Inactive)
              Reporter:
              nik.silver Nik Silver
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: