Pentaho Data Integration - Kettle / PDI-7414

Using sorted merge to re-combine data stream results in transformation getting stuck (deadlock)


    Details

    • Type: Improvement
    • Status: Reopened
    • Severity: Medium
    • Resolution: Unresolved
    • Affects Version/s: 4.2.0 GA (4.0.0 GA Suite Release)
    • Fix Version/s: Backlog
    • Component/s: Step
    • Labels: None
    • PDI Sub-component:
    • Notice:
      When an issue is open, the "Fix Version/s" field conveys a target, not necessarily a commitment. When an issue is closed, the "Fix Version/s" field conveys the version that the issue was fixed in.

      Description

      When a data stream needs to be separated into different flows for conditional logic and then combined back together while maintaining row order, using a sorted merge can leave the transformation stuck with all input buffers full but no rows being processed. This happens when, over some portion of the data, one condition occurs less often than once per row set: rows for the common condition can then fill every input buffer between the conditional step and the sorted merge before a single row for the rare condition arrives, and the merge cannot emit anything until it has a row from each input.
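      The buffer-filling mechanism above can be sketched outside of Kettle. In this hedged Python model (not PDI code), two bounded queues stand in for the row sets feeding the sorted merge, which must read one row from each branch before emitting; because the rare branch receives nothing early on, the producer stalls with the common branch's buffer full:

```python
# Minimal sketch of the described deadlock. Two bounded queues model PDI
# row sets; the "sorted merge" needs one row from EACH branch before it
# can emit, but the rare branch gets no rows early in the stream.
import queue
import threading

ROWSET_SIZE = 10                       # like the example's row set size of 10
common = queue.Queue(maxsize=ROWSET_SIZE)
rare = queue.Queue(maxsize=ROWSET_SIZE)
stalled = threading.Event()

def producer():
    for row in range(1000):
        # The rare condition never fires in this window, so every row
        # goes to the common branch.
        target = rare if row >= 1000 else common
        try:
            target.put(row, timeout=1.0)   # real PDI would block forever here
        except queue.Full:
            stalled.set()                  # common buffer full: deadlock reached
            return

def merger():
    common.get()                           # consumes row 0
    try:
        rare.get(timeout=2.0)              # never arrives -> merge is stuck
    except queue.Empty:
        pass

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=merger)
t1.start(); t2.start(); t1.join(); t2.join()
print("deadlocked:", stalled.is_set())     # prints "deadlocked: True"
```

      Increasing `ROWSET_SIZE` past the longest run of common-only rows makes the sketch finish, mirroring the workaround of raising the row set size in the transformation settings.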

      In the attached example transformation, the number of rows in row set has been set to 10 to demonstrate the problem quickly. When you launch or preview the example, the final sorted merge step reads one row and then freezes with its input buffers full, and the transformation never finishes. If you increase the number of rows in row set, it finishes normally.

      In the actual use case where I encountered this, I was dealing with millions of records and conditions that occurred less often than once every hundred thousand consecutive rows. For example, processing data that is a mix of legacy-system and new-system records, where everything before row A is 100% legacy data, rows between A and B are a mix, and everything after row B is 100% new-system data. The transformation was already running into memory limitations, so once I finally realized what was happening, I was not able to increase the row set size to avoid it.

      I don't think the problem is so much that the deadlock happens, which is probably unavoidable, as the fact that PDI never detects the deadlock and stops the transformation. Since the behavior is data- and volume-dependent, the transformation can appear to work fine on a smaller test data set, and then mysteriously fail to finish when run against the full data set non-interactively. It can take a fair amount of investigation to determine that the transformation is stuck rather than merely running slowly, and to determine which step is the origin of the problem.
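      One possible shape for the detection the report asks for is a watchdog that declares the transformation stuck when no step has processed a row for a full interval even though step threads are still alive. This is only an illustrative sketch with invented names, not the Kettle API:

```python
# Hedged sketch of deadlock detection: if a whole watch interval passes
# with zero rows processed anywhere while steps are still running,
# flag the transformation as stuck instead of letting it hang silently.
import threading
import time

class DeadlockWatchdog:
    def __init__(self, interval=5.0):
        self.interval = interval
        self.rows_processed = 0          # incremented by steps on every row
        self.stuck = threading.Event()

    def row_written(self):
        self.rows_processed += 1

    def watch(self, still_running):
        last = -1
        while still_running() and not self.stuck.is_set():
            if self.rows_processed == last:
                self.stuck.set()         # a full interval with no progress
            last = self.rows_processed
            time.sleep(self.interval)

# Simulate a transformation that makes some progress, then stalls:
wd = DeadlockWatchdog(interval=0.1)
wd.row_written()
threading.Thread(target=wd.watch, args=(lambda: True,)).start()
time.sleep(0.5)
print("stuck:", wd.stuck.is_set())       # prints "stuck: True"
```

      A real implementation would need to distinguish a stuck transformation from one that is legitimately waiting on slow input, e.g. by also checking that every step thread is blocked on a full or empty row set.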

        Attachments

          Issue Links

            Activity

              People

              • Assignee: Unassigned
              • Reporter: kcaswick Kevin Caswick
              • Votes: 1
              • Watchers: 10

                Dates

                • Created:
                • Updated: