Pentaho Data Integration - Kettle

Named parameters

Details

  • Customer Case:

Description

This is also one that comes back from time to time.

As proposal:
- Named parameters of key/value pairs. That you can supply to pan, kitchen, spoon without having to change scripts. the type of the key/value would be String/String. In the command line tools the arguments would like:
    pan.bat .... /param dbname=ADS_P1 /param direction=OUT

- A parameter in a job is described as
      key: name of the parameter
      description: description of the parameter

      the type of a parameter is always string (current variables are also only string)

- The parameters would be injected as variables right before the startup and would overwrite variables if they already existed (parameters have priority over existing variables).

- When you execute a job or transformation via spoon a check should be done whether they have named arguments. If they do, the dialog should be changed to allow people to enter values for them (in place of the current positional arguments).

- If you start a sub transformation without named parameters you can still used the variables defined in the top execution. If the subtransformation has named parameters, the job entry dialog should "ask" for them. If they are not set parameters are assumed to be empty and will overwrite the existing variable in sub transformations.
For an example of use in subtransformations:

Suppose the top job has as named parameter "dbname", when a subtransformations wants to take over the dbname it can either:
a) define its own named parameter dbname (or another name) and define it as "key dbname, value ${dbname}" in the sub transformation job entry
b) don't define named parameters and use ${dbname} as the top execution will set the variable defined as parameter.
If however you use method a) and ${dbname} would not have been set dbname will be set to "" a.k.a. null

- On the command line no check will be done whether the names of the parameters match the names of the job being executed.... unless an extract parameter is specified called "strictparam". With strictparam on a check will be done whether all parameters defined on the job/transformation exist on the command line (and vica versa). Without strictparam it allows people to run transformations and supply more arguments than required for some transformation (think of it as optional parameters): as use-case e.g. 1 script running transformations with some username and password as parameters,. but 1 or 2 jobs don't need those username and password (but you would still like to run them using the same script)
Only the parameters which are define on job/transformation will be set up as variables. I don't use strictparam, I have a job with dbname as parameter, but I supply -param dbname=ADS -user=sipsch, ${user} will not have been setup as variable afterwards.


The advantages:
- Most people are used to named arguments from other ETL tools
- If you want to use parameters as variables it avoids 1 extra layer of setting up the variables via a get system info/set variable step.
- Named parameters can coexist with the current positional parameters

Other things to change
- Get System info to be able to do the same for named parameters as positional parameters (getting them in a row)
- transformation and job job entry which would then also need to be able to pass named parameters.
- Maybe the start dialogs in spoon and the arguments dialog.

Activity

Hide
Damien added a comment - 25/Jun/08 8:00 AM
I think it can be a good improvement !

If you want to run transformations save in XML using variable to define the path of the files, it will be possible with named parameters ! Actually we must create a transformation to define path variable, so it's impossible to use variable to define the path of this transformation :(

Another solution would be to have an access to command line arguments in a job and not only in a transformation.

Show
Damien added a comment - 25/Jun/08 8:00 AM I think it can be a good improvement ! If you want to run transformations save in XML using variable to define the path of the files, it will be possible with named parameters ! Actually we must create a transformation to define path variable, so it's impossible to use variable to define the path of this transformation :( Another solution would be to have an access to command line arguments in a job and not only in a transformation.
Hide
Nicola Benaglia added a comment - 21/Jan/09 5:08 AM
I agree with that proposal.
I look forward to use a new kettle release with that improvement ;)
Show
Nicola Benaglia added a comment - 21/Jan/09 5:08 AM I agree with that proposal. I look forward to use a new kettle release with that improvement ;)
Hide
Matt Casters added a comment - 21/Jan/09 7:32 AM
To stick with the old Pan/Kitchen syntax, I propose to use something like this:

/param:VAR=value

or indeed

-param=VAR=value
Show
Matt Casters added a comment - 21/Jan/09 7:32 AM To stick with the old Pan/Kitchen syntax, I propose to use something like this: /param:VAR=value or indeed -param=VAR=value
Hide
Matt Casters added a comment - 21/Jan/09 7:35 AM
By the way, I think in general, this is *way* too complicated. I couldn't make heads nor tails from Sven description.
I see little to no advantage over the current Variables implementation. I propose to just set a bunch of variables on the command line and get it over with.
Show
Matt Casters added a comment - 21/Jan/09 7:35 AM By the way, I think in general, this is *way* too complicated. I couldn't make heads nor tails from Sven description. I see little to no advantage over the current Variables implementation. I propose to just set a bunch of variables on the command line and get it over with.
Hide
Sven Boden added a comment - 22/Jan/09 1:58 PM

What I had in mind is actually quite simple. Simplified:
- Each job/transformation can have named parameters (key/value pairs).
- Whenever a job/transformation is executed (and before initialization of the job entries/steps) the values of the parameters are evaluated in the parent variable space with environmentSubstitute() and the results are setup as variables.

This allows for recursive use, and would probably cater for most uses I can see medium term. I had the change for the run-time engine again but it was wiped, the GUI changes might take a bit more time.
Show
Sven Boden added a comment - 22/Jan/09 1:58 PM What I had in mind is actually quite simple. Simplified: - Each job/transformation can have named parameters (key/value pairs). - Whenever a job/transformation is executed (and before initialization of the job entries/steps) the values of the parameters are evaluated in the parent variable space with environmentSubstitute() and the results are setup as variables. This allows for recursive use, and would probably cater for most uses I can see medium term. I had the change for the run-time engine again but it was wiped, the GUI changes might take a bit more time.
Hide
Matt Casters added a comment - 22/Jan/09 2:04 PM
OK, so there are 2 things that I had mixed up:

1) The ability to pass variables/information from the command line (pan/kitchen) to the transformation/jobs in the form of variables rather than command line parameters. All in favor of building this ASAP.
2) The ability to create named parameters inside transformations and jobs, to be evaluated before execution. Also in favor of this one, but not as urgent I think.
Show
Matt Casters added a comment - 22/Jan/09 2:04 PM OK, so there are 2 things that I had mixed up: 1) The ability to pass variables/information from the command line (pan/kitchen) to the transformation/jobs in the form of variables rather than command line parameters. All in favor of building this ASAP. 2) The ability to create named parameters inside transformations and jobs, to be evaluated before execution. Also in favor of this one, but not as urgent I think.
Hide
Sven Boden added a comment - 22/Jan/09 4:38 PM
I can give 1) a try during the weekend.... as I see it if we implement 1) in a nice way, 2) almost comes for free (without the GUI changes).
Show
Sven Boden added a comment - 22/Jan/09 4:38 PM I can give 1) a try during the weekend.... as I see it if we implement 1) in a nice way, 2) almost comes for free (without the GUI changes).
Hide
Matt Casters added a comment - 04/Feb/09 10:31 AM
Sven, this works great AFAICT.
2 questions:

1) Should we also add support for this in Jobs?
2) Shouldn't we add a "mandatory" / "optional" selection in the parameters tab?

With 2) we could prevent the transformation from being started and we could display a meaningful error message like:

The following parameters are missing:
- LIMIT : The number of rows to generate

etc.
Show
Matt Casters added a comment - 04/Feb/09 10:31 AM Sven, this works great AFAICT. 2 questions: 1) Should we also add support for this in Jobs? 2) Shouldn't we add a "mandatory" / "optional" selection in the parameters tab? With 2) we could prevent the transformation from being started and we could display a meaningful error message like: The following parameters are missing: - LIMIT : The number of rows to generate etc.
Hide
Sven Boden added a comment - 04/Feb/09 12:44 PM

For 1) absolutely... but I was waiting for comments on transformations. Jobs are in the "planning" for the next weekend.
For 2) either optional or mandatory... or a default value... I was still thinking on this. If a value is given for a parameter use that, if none is given use the default. If no default value exists use "" as value.

If this is implemented a next step would be to change the job entry step/transformation step to allow some kind of linking to of incoming rows to named parameters. But this can be done later.

Other open thing is spaces in parameter values, but I'm also playing with that.
Show
Sven Boden added a comment - 04/Feb/09 12:44 PM For 1) absolutely... but I was waiting for comments on transformations. Jobs are in the "planning" for the next weekend. For 2) either optional or mandatory... or a default value... I was still thinking on this. If a value is given for a parameter use that, if none is given use the default. If no default value exists use "" as value. If this is implemented a next step would be to change the job entry step/transformation step to allow some kind of linking to of incoming rows to named parameters. But this can be done later. Other open thing is spaces in parameter values, but I'm also playing with that.
Hide
Sven Boden added a comment - 11/Feb/09 3:07 AM

Basic version of the named parameters for jobs also checked in. In the following days I will clean up some small stuff around this change (small gui improvements, ...)

Guidance still needed on the following:
- Where to save the parameters for jobs in the repository? transformations have a attribute table, jobs seem to lack this. Either we can make a job attribute table, or try to cram them in somewhere else.
- optional/mandatory or a default value. personally I'm more for default value. Only disadvantage I can see with using defaults is that it's not possible to have an empty parameter if a default is set for the parameter. If an empty parameter like that is provided the default would be taken.


After the above, the next step would be some doc changes, more testing, unit test cases.

And then the next step would be changing the job/transformation job entries (and possibly mapping step) to allow linking row fields to named parameters, so that you can execute e.g. jobs supplying named parameters from a row. What I had more or less in mind (but it's still just an idea):
1) 3 columns for parameters in job/transformation entry: first being the parameter name, one column for hard coding values like 'ODS_P1', ${DATABASE},... and a third to pick row fields... reason for jobs e.g. you may use a mix of assigning named parameters... some hardcoded, some from an input row. For a named parameter you either pick a row field or you hardcode a value
2) Introduce a new naming scheme to get at values... 2 columns: first being parameter name, second the value. And using something as $[ROWFIELD] to get at the value of a field of an incoming row e.g. ... an extension to VariableSpace taking an extra input row/metadata to figure out how to translate $[ROWFIELD]

Show
Sven Boden added a comment - 11/Feb/09 3:07 AM Basic version of the named parameters for jobs also checked in. In the following days I will clean up some small stuff around this change (small gui improvements, ...) Guidance still needed on the following: - Where to save the parameters for jobs in the repository? transformations have a attribute table, jobs seem to lack this. Either we can make a job attribute table, or try to cram them in somewhere else. - optional/mandatory or a default value. personally I'm more for default value. Only disadvantage I can see with using defaults is that it's not possible to have an empty parameter if a default is set for the parameter. If an empty parameter like that is provided the default would be taken. After the above, the next step would be some doc changes, more testing, unit test cases. And then the next step would be changing the job/transformation job entries (and possibly mapping step) to allow linking row fields to named parameters, so that you can execute e.g. jobs supplying named parameters from a row. What I had more or less in mind (but it's still just an idea): 1) 3 columns for parameters in job/transformation entry: first being the parameter name, one column for hard coding values like 'ODS_P1', ${DATABASE},... and a third to pick row fields... reason for jobs e.g. you may use a mix of assigning named parameters... some hardcoded, some from an input row. For a named parameter you either pick a row field or you hardcode a value 2) Introduce a new naming scheme to get at values... 2 columns: first being parameter name, second the value. And using something as $[ROWFIELD] to get at the value of a field of an incoming row e.g. ... an extension to VariableSpace taking an extra input row/metadata to figure out how to translate $[ROWFIELD]
Hide
Daniel Einspanjer added a comment - 25/Feb/09 6:39 PM
Sven, what is the current status of this implementation in regards to transformations?
I tried playing around with it a little today and couldn't get it to work.
I found a few bugs that kept it from working in spoon or from picking up default values, but I wasn't sure if they were actually bugs or just only partially implemented so far.
Show
Daniel Einspanjer added a comment - 25/Feb/09 6:39 PM Sven, what is the current status of this implementation in regards to transformations? I tried playing around with it a little today and couldn't get it to work. I found a few bugs that kept it from working in spoon or from picking up default values, but I wasn't sure if they were actually bugs or just only partially implemented so far.
Hide
Sven Boden added a comment - 26/Feb/09 4:07 AM - edited
I changed additional stuff yesterday evening (25/2) ... basic functionality should be working now for all things (except maybe carte, and BI server things related to PDI).

- Send me the transformations/jobs that don't work for you and I'll check it out.
- Also try the last zip file created by hudson
Show
Sven Boden added a comment - 26/Feb/09 4:07 AM - edited I changed additional stuff yesterday evening (25/2) ... basic functionality should be working now for all things (except maybe carte, and BI server things related to PDI). - Send me the transformations/jobs that don't work for you and I'll check it out. - Also try the last zip file created by hudson
Hide
Matt Casters added a comment - 26/Feb/09 7:44 AM
As you requested Sven, I'll start to review this tomorrow. Things have been a bit too busy for me to get involved, sorry about that.
Show
Matt Casters added a comment - 26/Feb/09 7:44 AM As you requested Sven, I'll start to review this tomorrow. Things have been a bit too busy for me to get involved, sorry about that.
Hide
Sven Boden added a comment - 26/Feb/09 4:52 PM
Thanks... not that urgent yet, unless you're going to make an RC1 version.

I'm going to do some further testing and cleanup (e.g. when parameters collide) during the weekend. I'll also attach some examples showing the use of parameters here.
Show
Sven Boden added a comment - 26/Feb/09 4:52 PM Thanks... not that urgent yet, unless you're going to make an RC1 version. I'm going to do some further testing and cleanup (e.g. when parameters collide) during the weekend. I'll also attach some examples showing the use of parameters here.
Hide
Sven Boden added a comment - 26/Feb/09 4:54 PM

examples:

- simple transformation using parameters
- simple job using parameters
- complex job passing parameter to to transformation which then makes rows and the rows are bound to the parameters of a job.
Show
Sven Boden added a comment - 26/Feb/09 4:54 PM examples: - simple transformation using parameters - simple job using parameters - complex job passing parameter to to transformation which then makes rows and the rows are bound to the parameters of a job.
Hide
Matt Casters added a comment - 26/Feb/09 5:49 PM
I was thinking about doing an RC1 in a few weeks but I'll review after the weekend in that case, write some wiki docs etc.
At the moment, still pretty busy fixing things all over the place :-) Plenty of things to do.
Show
Matt Casters added a comment - 26/Feb/09 5:49 PM I was thinking about doing an RC1 in a few weeks but I'll review after the weekend in that case, write some wiki docs etc. At the moment, still pretty busy fixing things all over the place :-) Plenty of things to do.
Hide
Mike Melby added a comment - 05/Mar/09 3:25 AM
I think this is a great feature and I have immediate use for it. I am trying it now and am having an issue. When testing a transformation with Pan from the command line, I find that the command line arg params are set, but are not available to TransMeta once activated in the Trans itself. Since this is new for me I can't say if this intended, or my expectations are off. I would suspect that having the parameters available to all VariableSpace objects would be ideal, but the parameters are added after the TransMeta variables are set from what I can tell.

Just for reference I was trying this from the source head. Thank you.
Show
Mike Melby added a comment - 05/Mar/09 3:25 AM I think this is a great feature and I have immediate use for it. I am trying it now and am having an issue. When testing a transformation with Pan from the command line, I find that the command line arg params are set, but are not available to TransMeta once activated in the Trans itself. Since this is new for me I can't say if this intended, or my expectations are off. I would suspect that having the parameters available to all VariableSpace objects would be ideal, but the parameters are added after the TransMeta variables are set from what I can tell. Just for reference I was trying this from the source head. Thank you.
Hide
Sven Boden added a comment - 05/Mar/09 5:31 AM

- Parameters override already existing variables.
- When you set variables with e.g. set variables set, also only the trans ones are affected (from memory).

Attach your sample transformation and the command line you use I can have a look at it. And maybe more detail of what you want to do... why not take the trans ones e.g.
Show
Sven Boden added a comment - 05/Mar/09 5:31 AM - Parameters override already existing variables. - When you set variables with e.g. set variables set, also only the trans ones are affected (from memory). Attach your sample transformation and the command line you use I can have a look at it. And maybe more detail of what you want to do... why not take the trans ones e.g.
Hide
Matt Casters added a comment - 09/Mar/09 7:06 AM
Found some issues in Repository handling of parameters. R_JOB_ATTRIBUTES tables is not created correctly.
Show
Matt Casters added a comment - 09/Mar/09 7:06 AM Found some issues in Repository handling of parameters. R_JOB_ATTRIBUTES tables is not created correctly.
Hide
Matt Casters added a comment - 09/Mar/09 10:12 AM
Repository issues so far fixed in revision 10328
Show
Matt Casters added a comment - 09/Mar/09 10:12 AM Repository issues so far fixed in revision 10328
Hide
Sven Boden added a comment - 09/Mar/09 6:06 PM
thanks for the fix
Show
Sven Boden added a comment - 09/Mar/09 6:06 PM thanks for the fix
Hide
Matt Casters added a comment - 10/Mar/09 9:52 AM
Defined parameters (on the command line for example) should pass down to sub-jobs and sub-transformations after applying override values of the respective job entries.
Also added a "listparam" option to see the defined parameters, their values and defaults in Pan and Kitchen.

All this in revision 10341 (and before)
Show
Matt Casters added a comment - 10/Mar/09 9:52 AM Defined parameters (on the command line for example) should pass down to sub-jobs and sub-transformations after applying override values of the respective job entries. Also added a "listparam" option to see the defined parameters, their values and defaults in Pan and Kitchen. All this in revision 10341 (and before)
Hide
Sven Boden added a comment - 10/Mar/09 1:43 PM

On the first we disagree... I wouldn't automatically pass it down, it breaks some recursive stuff I had in mind and creates some grey areas. How I saw/see it working.

You define parameters on a job/transformation. At the moment it's executed the variables (with the named parameters) are set overriding the variables sub-environment (inherited from the parent).

If you use a job job entry/trans entry job nothing is passed automatically... they either take a field from a results row or a "constant" variable sensitive field. If you don't specify anything the sub job/transformation will take the default parameters.

The case of passing down parameters would e.g. ... assume in the parent you have a parameter USERID, and you want the USERID parameter of a subjob set to the same you could just define ${USERID} as value for the job entry parameter (getting an automatic pass down). The advantage is that when the job is executed multiple times you can reset e.g the USERID variable (when passing down variables automatically this probably won't work anymore).
Show
Sven Boden added a comment - 10/Mar/09 1:43 PM On the first we disagree... I wouldn't automatically pass it down, it breaks some recursive stuff I had in mind and creates some grey areas. How I saw/see it working. You define parameters on a job/transformation. At the moment it's executed the variables (with the named parameters) are set overriding the variables sub-environment (inherited from the parent). If you use a job job entry/trans entry job nothing is passed automatically... they either take a field from a results row or a "constant" variable sensitive field. If you don't specify anything the sub job/transformation will take the default parameters. The case of passing down parameters would e.g. ... assume in the parent you have a parameter USERID, and you want the USERID parameter of a subjob set to the same you could just define ${USERID} as value for the job entry parameter (getting an automatic pass down). The advantage is that when the job is executed multiple times you can reset e.g the USERID variable (when passing down variables automatically this probably won't work anymore).
Hide
Matt Casters added a comment - 10/Mar/09 1:59 PM
Your use-case is pretty far fetched, but I'll go along with it for the time being.

That being said, the need to specify the parameter in the job entry again is counter-intuitive and wasn't clear from looking at the code, nor from the dialog.
We had the same ordeal in the mapping step and there too we created a "pass all variables" option.
So I guess that's what I'm going to create next since otherwise my resource export is never going to work.



Show
Matt Casters added a comment - 10/Mar/09 1:59 PM Your use-case is pretty far fetched, but I'll go along with it for the time being. That being said, the need to specify the parameter in the job entry again is counter-intuitive and wasn't clear from looking at the code, nor from the dialog. We had the same ordeal in the mapping step and there too we created a "pass all variables" option. So I guess that's what I'm going to create next since otherwise my resource export is never going to work.
Hide
Matt Casters added a comment - 10/Mar/09 3:12 PM
"Pass all variables" option added, enabled by default
All in revision 10346
Show
Matt Casters added a comment - 10/Mar/09 3:12 PM "Pass all variables" option added, enabled by default All in revision 10346
Hide
Matt Casters added a comment - 10/Mar/09 3:20 PM
That's about it for me, it all seems to work fine.
Show
Matt Casters added a comment - 10/Mar/09 3:20 PM That's about it for me, it all seems to work fine.
Hide
Matt Casters added a comment - 10/Mar/09 3:21 PM
Back to you
Show
Matt Casters added a comment - 10/Mar/09 3:21 PM Back to you
Hide
Sven Boden added a comment - 10/Mar/09 5:42 PM
Functionality still works for me... I had some junit test cases (but for the moment they blow up on get variable step)... I'll do some further tests in the rest of the week.

Grey area for the "pass all parameters" is now what happens when execute for each row is on and parameters have a field defined from the result row. But that's not high on my list.
Show
Sven Boden added a comment - 10/Mar/09 5:42 PM Functionality still works for me... I had some junit test cases (but for the moment they blow up on get variable step)... I'll do some further tests in the rest of the week. Grey area for the "pass all parameters" is now what happens when execute for each row is on and parameters have a field defined from the result row. But that's not high on my list.
Hide
Matt Casters added a comment - 18/Mar/09 3:19 PM
I did some more tweaks in the clustering / remote execution department but I would call this one "Resolved".
Show
Matt Casters added a comment - 18/Mar/09 3:19 PM I did some more tweaks in the clustering / remote execution department but I would call this one "Resolved".
Hide
Kurtis Cruzada added a comment - 14/Sep/10 3:16 PM
Bulk change to close issue for QA Validation.
Show
Kurtis Cruzada added a comment - 14/Sep/10 3:16 PM Bulk change to close issue for QA Validation.

People

Vote (5)
Watch (3)

Dates

  • Created:
    19/Jun/08 8:57 AM
    Updated:
    14/Sep/10 3:16 PM
    Resolved:
    18/Mar/09 3:19 PM