The H2O transformation step lets you use data science models created in an H2O cloud instance within Yellowfin to generate predictions or other results. The step is available in the data transformation module.
Before using this step, ensure that you have access to an instance of H2O with at least one model created in it. You will also need to have the H2O plug-in installed in Yellowfin.
It is best to familiarize yourself with the data transformation module before continuing with this guide.
The following types of models are currently supported:
Here is a quick overview of the process. This guide will cover most of these steps in detail.
To configure the H2O step, connect to an instance of H2O by providing a valid URL, then choose a data science model. Next, set up the input fields of the model by mapping them to the data from the transformation flow. Unlike other data science model formats (such as PMML), there is no need to configure an output field, because the generated result is defined when the model is created.
The types of models that Yellowfin supports can be generalized into four categories, listed below. Note: To check the category of a model, refer to the model’s Output section in H2O.
The following describes the type of output each of these categories generates:
In most cases, you will already know the output of the model. If not, you can determine it by selecting the model in your instance of H2O and checking the output settings. For example, for a binary model, the output can be checked in the model’s parameters.
The datatype of the output column will also depend on what has been configured in the model. It will be NUMERIC for models belonging to the “Clustering” and “Regression” categories. For other cases it will be TEXT.
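The datatype rule above can be expressed as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and the category label "Binomial" is an assumed example of a non-numeric category, not a name taken from this guide.

```python
# Categories whose result column is numeric, as stated in the text above.
NUMERIC_CATEGORIES = {"Clustering", "Regression"}


def output_datatype(model_category: str) -> str:
    """Return the datatype of the result column for a model category.

    Hypothetical helper: "Clustering" and "Regression" yield NUMERIC,
    every other category yields TEXT.
    """
    if model_category.strip() in NUMERIC_CATEGORIES:
        return "NUMERIC"
    return "TEXT"


if __name__ == "__main__":
    for category in ("Regression", "Clustering", "Binomial"):
        print(category, "->", output_datatype(category))
```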
H2O is a modern open source AI platform that allows users to work with predictive models. You can download the latest version of H2O from here: http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/7/index.html
You can use H2O either locally, by running it on your own machine, or remotely, through a publicly available instance accessible via a URL.
Use the following procedure to run H2O locally:
To establish a connection to an instance of H2O, you need to provide the instance’s URL. For a local setup, you can use the default URL (e.g. http://localhost:54321). Alternatively, include the IP address (such as http://127.0.0.1:54321), which works for both a local setup and remote access. If you are accessing the instance remotely, ensure that you have a stable internet connection.
Note: You need to include http:// as part of the URL for it to be recognized correctly. The transformation step will not function if an incorrect URL is provided.
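If you want to check a URL before pasting it into the step, a minimal sketch such as the following may help. The function name is hypothetical, and the check only mirrors the note above (an http:// prefix and a host); whether other schemes are accepted is not covered by this guide.

```python
from urllib.parse import urlparse


def is_valid_h2o_url(url: str) -> bool:
    """Return True if the URL starts with http:// and names a host.

    Hypothetical pre-check; it does not contact the H2O instance.
    """
    parts = urlparse(url.strip())
    return parts.scheme == "http" and bool(parts.hostname)


if __name__ == "__main__":
    for candidate in ("http://localhost:54321", "localhost:54321"):
        print(candidate, "->", is_valid_h2o_url(candidate))
```

A URL that passes this check can still fail to connect, for example if the instance is not running.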
This guide shows you how to integrate your data science model into Yellowfin using the H2O transformation step.
Choose the model that you want to use.
If no models appear in the list, ensure that the instance has models created in it and that they are supported by Yellowfin. If a model you have selected, or are trying to select, is deleted from the H2O instance, an error will appear. You will then need to manually update the model list in the step configuration panel, e.g. by clicking Connect to H2O again.
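To confirm outside Yellowfin that an instance actually contains models, you could query it directly. The sketch below assumes the H2O-3 REST endpoint `GET /3/Models` and a response of the form `{"models": [{"model_id": {"name": ...}}, ...]}`; verify both against the documentation for your H2O version. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list:
    """Extract model IDs from a /3/Models response (assumed layout)."""
    return [m["model_id"]["name"] for m in payload.get("models", [])]


def fetch_model_ids(base_url: str, timeout: float = 10.0) -> list:
    """Ask the H2O instance at base_url for its models (assumed endpoint)."""
    with urlopen(base_url.rstrip("/") + "/3/Models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    # Offline demonstration with a hand-written payload in the assumed layout.
    sample = {"models": [{"model_id": {"name": "example_model"}}]}
    print(model_ids(sample))
```

An empty list from `fetch_model_ids("http://localhost:54321")` would indicate that the instance has no models yet.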
Match the data to the input columns required by the model. For example, our model requires input in the form of age, income, and gender.
All fields must be matched for this step to run and generate a result. You must also match the data correctly, ensuring that each input is mapped to values of the same datatype. (If an incorrect mapping is made, the Errors field will generate errors for each of the data values.)
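The two mapping requirements (every input mapped, datatypes matching) can be illustrated with a small check. This is a sketch under assumptions, not how Yellowfin validates the step: the function, the column names, and the datatypes assigned to age, income, and gender are all hypothetical.

```python
def mapping_errors(required: dict, mapping: dict, flow_columns: dict) -> list:
    """List problems with an input mapping.

    required:     model input name -> expected datatype
    mapping:      model input name -> flow column name
    flow_columns: flow column name -> datatype
    """
    errors = []
    for name, expected in required.items():
        column = mapping.get(name)
        if column is None:
            errors.append(f"{name}: not mapped")
        elif column not in flow_columns:
            errors.append(f"{name}: column '{column}' is not in the flow")
        elif flow_columns[column] != expected:
            errors.append(
                f"{name}: expected {expected}, got {flow_columns[column]}"
            )
    return errors


if __name__ == "__main__":
    # Assumed datatypes for the example inputs age, income, and gender.
    required = {"age": "NUMERIC", "income": "NUMERIC", "gender": "TEXT"}
    flow_columns = {"Age": "NUMERIC", "Income": "TEXT", "Gender": "TEXT"}
    mapping = {"age": "Age", "income": "Income"}
    for problem in mapping_errors(required, mapping, flow_columns):
        print(problem)
```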
Models created in H2O can be edited and saved again using the same ID. The edit can range from a minor change to a complete replacement of the model. If the model that you’re using in your transformation flow changes, then depending on the nature of the change, you might need to reconfigure the H2O transformation step, for example when the new model has new fields that need to be mapped.
Additional new fields: If changes to the model include the addition of new input fields, these must be configured manually, otherwise the step will break. Note: Even though the new model includes additional fields, they will not automatically appear in the step configuration panel. You must take an action manually to make these fields visible. For example, clicking the Save button in the step configuration panel, or selecting another model from the list and then the new model again, will bring up the new input fields.
For instance, suppose the H2O model used in this example has been changed to include an additional input field (say, Education). If the flow is run without configuring this field, the step will break.
As seen, the configuration panel does not display any changes in the input form.
Click on the Save button. The form will get updated to display the input required by the new model (in this example, it’s the addition of a new field).
Map this field and then save it to execute the flow properly.
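When a model has been re-saved under the same ID, comparing its old and new input lists shows which fields still need attention. The following is a minimal sketch with a hypothetical function and field names based on the example in this guide.

```python
def input_field_changes(old_inputs: list, new_inputs: list) -> dict:
    """Compare the input fields of two versions of a model.

    Returns the fields that were added (and so need mapping) and the
    fields that were removed, each in their original order.
    """
    return {
        "added": [f for f in new_inputs if f not in old_inputs],
        "removed": [f for f in old_inputs if f not in new_inputs],
    }


if __name__ == "__main__":
    old = ["age", "income", "gender"]
    new = ["age", "income", "gender", "education"]
    print(input_field_changes(old, new))
```

Any field reported under `added` has to be mapped in the step configuration panel before the flow is run.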
As discussed in our workflow, the output generated by the model will appear in a new column named “H2O Model Result” in the data preview panel. If changes are made to the H2O step, they will affect the subsequent steps in the flow, and could even cause those steps to fail.
Changed output type: If the model in the H2O step changes, its output type might change as well. If the next connected step was expecting, say, a numeric value, but receives a text value because of the change, it will fail.
For example: Here is a flow in which the H2O step transforms data from an input step using a model. It is designed to produce a numeric result, which is used by the next step, Aggregate, to calculate the sum of this result. This sum is then sent to a Calculated Field step.
But if the output of the model is changed to generate a text result, the Aggregate step will fail (or break), as it requires a numeric value to run properly. All steps following the failed step will be cancelled.
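The failure and cancellation behaviour described here can be modelled in a few lines. This is a simplified illustration, not Yellowfin's execution engine: the function, the status labels, and the way steps are represented are all assumptions.

```python
def run_flow(h2o_output_type: str, steps: list) -> dict:
    """Simulate the steps that follow the H2O step.

    steps: list of (name, required_input_type or None, output_type).
    A step fails if it requires a type it does not receive; every step
    after a failed step is cancelled.
    """
    statuses = {}
    current_type = h2o_output_type
    failed = False
    for name, required, produces in steps:
        if failed:
            statuses[name] = "CANCELLED"
        elif required is not None and required != current_type:
            statuses[name] = "FAILED"
            failed = True
        else:
            statuses[name] = "OK"
            current_type = produces
    return statuses


if __name__ == "__main__":
    # Aggregate needs a numeric input; Calculated Field accepts any input here.
    steps = [
        ("Aggregate", "NUMERIC", "NUMERIC"),
        ("Calculated Field", None, "NUMERIC"),
    ]
    print(run_flow("NUMERIC", steps))
    print(run_flow("TEXT", steps))
```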