Pentaho Data Integration - Kettle
PDI-8202

Allow MongoDB Input to read from secondaries/slaves

    Details

    • Operating System/s:
      RedHat Enterprise Linux 5
    • QA Validation Status:
      Validated by QA

      Description

      If I attempt to connect to a secondary (slave) Mongo instance and issue a query using the MongoDB Input step, I get the following error:

      INFO 17-07 13:17:28,459 - etl_test - Dispatching started for transformation [etl_test]
      ERROR 17-07 13:17:29,341 - MongoDb Input - Unexpected error
      ERROR 17-07 13:17:29,341 - MongoDb Input - com.mongodb.MongoException: not talking to master and retries used up
      at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:246)
      at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:248)
      at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:248)
      at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:305)
      at com.mongodb.DBCursor._check(DBCursor.java:369)
      at com.mongodb.DBCursor._hasNext(DBCursor.java:498)
      at com.mongodb.DBCursor.hasNext(DBCursor.java:523)
      at org.pentaho.di.trans.steps.mongodbinput.MongoDbInput.processRow(MongoDbInput.java:72)
      at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
      at java.lang.Thread.run(Thread.java:662)

      I get similar errors if I use the mongo client to connect to the secondary and issue queries. The problem goes away in the client if I issue the command:

      rs.slaveOk()

      and then run my queries. Unfortunately, I see no way to execute this command when using the MongoDB Input step, so I am unable to query any data from the secondary MongoDB instance. In production environments, I would expect most queries to be run against secondaries rather than the master, especially for ETL, so I imagine this would be a blocker for many people. Hopefully it can be addressed soon.

          Activity

          Doug Moran added a comment -

          Mark, would it make sense to have this as a checkbox option on the dialog?

          Mark Hall added a comment -

          Actually, it looks like we should update the connection code in both Mongo steps to allow "replica sets" to be specified instead of a single host:port. The API also allows various options to be specified, some of which we should expose too, such as the connection timeout, socket timeouts and the slaveOk setting.

          http://www.mongodb.org/display/DOCS/Connecting+to+Replica+Sets+from+Clients
          http://api.mongodb.org/java/2.6/com/mongodb/MongoOptions.html

          Cheers,
          Mark.
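
The options mentioned in this comment (a replica set instead of a single host, timeouts, and reading from secondaries) can all be expressed in a MongoDB connection string. The following is only a minimal sketch of how such a string is assembled, not the code used by the Pentaho steps. The class and method names, the host names, and the replica set name `rs0` are hypothetical; `replicaSet`, `readPreference`, `connectTimeoutMS` and `socketTimeoutMS` are option names from the MongoDB connection string format, and whether a given driver version honours each of them (the 2.6 Java driver linked above may not) is an assumption to verify.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class MongoUriSketch {

    // Joins the seed hosts and appends the connection options as a query string.
    static String buildUri(List<String> hosts, String replicaSet, String readPreference,
                           int connectTimeoutMs, int socketTimeoutMs) {
        return "mongodb://" + String.join(",", hosts)
                + "/?replicaSet=" + replicaSet
                + "&readPreference=" + readPreference
                + "&connectTimeoutMS=" + connectTimeoutMs
                + "&socketTimeoutMS=" + socketTimeoutMs;
    }

    public static void main(String[] args) {
        // Example hosts and replica set name are made up.
        List<String> hosts = Arrays.asList("db1.example.com:27017", "db2.example.com:27017");
        System.out.println(buildUri(hosts, "rs0", "secondaryPreferred", 5000, 30000));
    }
}
```

A read preference of `secondaryPreferred` lets queries run against a secondary when one is available, which is the effect the reporter obtained in the shell with `rs.slaveOk()`.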

          Andy Tompkins added a comment -

          I want to fix this! Actually, I've modified the code, but I am completely new to jira, and I am not sure how to add my contribution.
          It seems like I should create a new case and add the code to that?
          Can someone help me please?

          Matt Casters added a comment -

          Andy, you can attach a java patch to this case which we'll review. One of the BigData folks will assign it for review and then probably commit the code change slated for the next stable release.

          Andy Tompkins added a comment -

          Matt, I created a new issue and attached the code there.
          http://jira.pentaho.com/browse/PDI-8253
          Sorry if that was not the right thing to do!
          If a patch would be better I could generate one, but those were just the 3 changed files.

          Matt Casters added a comment -

          It's perfect Andy, thanks for caring.

          Mark Hall added a comment -

          https://github.com/pentaho/big-data-plugin/pull/84

          The Hostname(s) field now accepts a comma-separated list of host[:port] specifications that define a replica set. The default port (specified in the original Port field) is used for every host that does not supply its own port. The "read preference" can now be specified in order to allow reads from secondary servers.

          Validation will require testing against a cluster of Mongo servers.
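
The host-list behaviour described in this comment can be sketched as follows. This is an illustration of the stated rule (comma-separated entries, with a default port applied where none is given), not the code from the pull request; the class name, method name, and host names are hypothetical. The sketch does not handle IPv6 literals, and a malformed port raises a `NumberFormatException`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class HostListSketch {

    // Splits "host1, host2:27018" into normalized "host:port" entries,
    // applying defaultPort to entries that carry no port of their own.
    static List<String> parseHosts(String hostnames, int defaultPort) {
        List<String> result = new ArrayList<>();
        for (String part : hostnames.split(",")) {
            String spec = part.trim();
            if (spec.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            int colon = spec.lastIndexOf(':');
            if (colon < 0) {
                result.add(spec + ":" + defaultPort);
            } else {
                String host = spec.substring(0, colon).trim();
                int port = Integer.parseInt(spec.substring(colon + 1).trim());
                result.add(host + ":" + port);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [db1:27017, db2:27018, db3:27017]
        System.out.println(parseHosts("db1, db2:27018 ,db3", 27017));
    }
}
```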

          Carter Everett added a comment -

          validated


            People

            • Assignee:
              Unassigned User
              Reporter:
              Kaushal Sheth