Overview
The Input-based Random Number Generator Processor creates a data set filled with random values in one or more columns. The configuration of the random generators is done via the Processor's input. This Processor can be useful for generating a large data set for testing in a short amount of time.
Input
The Processor reads its generator configuration from the input dataset. Every row must therefore contain a non-empty, SQL-compatible name, a distribution type and a numeric seed. Invalid configuration entries in the input data result in errors, unless the Processor option "Error on invalid configuration" is disabled.
Three types of distributions can be configured: uniform, normal and discrete. A valid input would be the following:
type | name | seed | min | mean | max | standard_deviation | value | probability |
---|---|---|---|---|---|---|---|---|
uniform | column1 | 123 | 0 | | 100 | | | |
normal | column2 | 456 | | 100 | | 10.5 | | |
normal | column3 | 789 | | 0.50 | | 0.25 | | |
discrete | column4 | 100 | | | | | val1 | 0.75 |
discrete | column4 | 100 | | | | | val2 | 0.25 |
Uniform:
- Must always have a numeric value for their min and max settings (min < max).
- There should only be one line for each configuration (mapped via name).
- Result columns will contain numeric content.
Normal:
- Must always have a numeric value for their mean and a positive numeric value for their standard_deviation.
- There should only be one line for each configuration (mapped via name).
- Result columns will contain numeric content.
Discrete:
- Must always have a non-null value in their value column and a positive numeric value in their probability column.
- Usually span multiple lines, one per possible value. Lines are mapped to one generator via their name.
- Result columns will always contain strings.
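The three distribution types can be illustrated with a minimal Python sketch. It is not the Processor's implementation: the dictionary layout, the helper names `build_generators` and `draw_rows`, and the use of Python's `random` module are assumptions made for illustration, so the drawn values will not match what the Processor produces for the same seeds.

```python
import random
from collections import defaultdict

# Hypothetical in-memory copy of the example input (empty cells omitted).
CONFIG_ROWS = [
    {"type": "uniform", "name": "column1", "seed": 123, "min": 0, "max": 100},
    {"type": "normal", "name": "column2", "seed": 456, "mean": 100, "standard_deviation": 10.5},
    {"type": "normal", "name": "column3", "seed": 789, "mean": 0.50, "standard_deviation": 0.25},
    {"type": "discrete", "name": "column4", "seed": 100, "value": "val1", "probability": 0.75},
    {"type": "discrete", "name": "column4", "seed": 100, "value": "val2", "probability": 0.25},
]


def build_generators(rows):
    """Group configuration rows by name and return {name: draw function}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["name"]].append(row)

    generators = {}
    for name, group in grouped.items():
        first = group[0]
        rng = random.Random(first["seed"])  # one seeded RNG per generator
        kind = first["type"]
        if kind == "uniform":
            generators[name] = lambda rng=rng, r=first: rng.uniform(r["min"], r["max"])
        elif kind == "normal":
            generators[name] = lambda rng=rng, r=first: rng.gauss(r["mean"], r["standard_deviation"])
        elif kind == "discrete":
            # Discrete generators span several rows: one per possible value.
            values = [r["value"] for r in group]
            weights = [r["probability"] for r in group]
            generators[name] = lambda rng=rng, v=values, w=weights: rng.choices(v, weights=w, k=1)[0]
        else:
            raise ValueError(f"unknown distribution type: {kind!r}")
    return generators


def draw_rows(rows, count):
    """Draw `count` output rows, one value per generator and row."""
    generators = build_generators(rows)
    return [{name: draw() for name, draw in generators.items()} for _ in range(count)]


sample = draw_rows(CONFIG_ROWS, 5)
```

Because each generator owns a random number generator seeded from its seed column, calling `draw_rows` twice with the same input yields the same rows in this sketch.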
Configuration
Partitions: Specifies the number of partitions used to generate the data. The level of parallelism that Spark uses to generate the actual data increases with the number of partitions.
This refers to Spark Resilient Distributed Dataset (RDD) partitions. The setting has no noticeable effect for small tables, but for very large generated tables it is advisable to use more partitions, because each single partition must fit into the working memory of the server at any given time.
Rows per partition: The effect of this configuration depends on "Use long-list format" being enabled or disabled:
- Enabled: Number of rows that are created for every configured generator. The generators are distributed across the configured partitions.
- Disabled: Number of rows that are generated per partition (range: 1 to 1,000,000). Consider using more partitions if you want to generate more rows. The total number of generated rows is the number of partitions multiplied by the number of rows per partition.
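The row arithmetic can be written down as a small Python sketch. The function name `total_rows` and its parameters are hypothetical. The broad-list formula follows the description above; the long-list formula (generators multiplied by rows per generator) is an assumption derived from the "Enabled" description and is not confirmed by the Processor documentation.

```python
MAX_ROWS_PER_PARTITION = 1_000_000  # upper bound stated for the broad-list case


def total_rows(partitions, rows_per_partition, generators, long_list):
    """Estimate the number of output rows for a given configuration."""
    if long_list:
        # Assumption: each generator contributes `rows_per_partition` rows.
        return generators * rows_per_partition
    if not 1 <= rows_per_partition <= MAX_ROWS_PER_PARTITION:
        raise ValueError("rows per partition must be between 1 and 1,000,000")
    # Broad list: every partition contributes `rows_per_partition` rows.
    return partitions * rows_per_partition


broad_total = total_rows(partitions=2, rows_per_partition=3, generators=4, long_list=False)
long_total = total_rows(partitions=2, rows_per_partition=3, generators=4, long_list=True)
```

With two partitions and three rows per partition, the broad-list case gives six rows.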
Use Long-list format: As an example, we use three random generators named generator_1, generator_2 and generator_3.
With this option disabled, a broad list is generated. The output contains one column per generator:
generator_1 | generator_2 | generator_3 |
---|---|---|
1.03 | 0.01 | val1 |
0.34 | 0.32 | val1 |
2.56 | 0.09 | val2 |
3.21 | 0.87 | val1 |
4.20 | 0.56 | val2 |
With this option enabled, there are three columns regardless of the number of generators:
- name is the generator name, which was the column name in the table above
- drawing corresponds to the row number in the table above
- value is the drawn value of the generator (always of type string)
name | drawing | value |
---|---|---|
generator_1 | 1 | 1.03 |
generator_1 | 2 | 0.34 |
generator_1 | 3 | 2.56 |
generator_1 | 4 | 3.21 |
generator_1 | 5 | 4.20 |
generator_2 | 1 | 0.01 |
generator_2 | 2 | 0.32 |
generator_2 | 3 | 0.09 |
generator_2 | 4 | 0.87 |
generator_2 | 5 | 0.56 |
generator_3 | 1 | val1 |
generator_3 | 2 | val1 |
generator_3 | 3 | val2 |
generator_3 | 4 | val1 |
generator_3 | 5 | val2 |
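The relationship between the two formats can be sketched in Python as a reshaping step. This only illustrates how the columns correspond; the function name `to_long_list` is hypothetical and the Processor may well generate the long list directly rather than converting a broad one.

```python
def to_long_list(broad_rows):
    """Reshape broad-list rows (one column per generator) into long-list rows."""
    long_rows = []
    if not broad_rows:
        return long_rows
    for name in broad_rows[0]:  # one block of rows per generator
        for drawing, row in enumerate(broad_rows, start=1):
            long_rows.append({
                "name": name,
                "drawing": drawing,        # row number in the broad list
                "value": str(row[name]),   # long-list values are strings
            })
    return long_rows


broad = [
    {"generator_1": 1.03, "generator_2": 0.01, "generator_3": "val1"},
    {"generator_1": 0.34, "generator_2": 0.32, "generator_3": "val1"},
]
long_list = to_long_list(broad)
```

Numeric and string generators end up in the same value column, which is why that column has to be of type string.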
Error on invalid configuration:
By default, errors are thrown and Workflow execution is stopped when invalid configurations are encountered in the Processor input. When this setting is disabled, the Processor is more permissive and ignores such invalid configurations. Instead of errors, the Processor then generates a warning for each invalid configuration.
If all input configurations are invalid or there are problems with the input schema, the Processor will generate an error regardless of this setting.
Disabling this setting can lead to unexpected RNG configurations when executing with varying input.
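The difference between the strict and the permissive behaviour can be sketched in Python as follows. The helper names `row_problem` and `filter_config`, the dictionary layout and the error messages are assumptions for illustration; the check for SQL-compatible names is omitted, and the Processor's real validation may differ.

```python
import warnings


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def row_problem(row):
    """Return a description of what is wrong with a row, or None if it is valid."""
    if not row.get("name"):
        return "name must not be empty"
    if not _is_number(row.get("seed")):
        return "seed must be numeric"
    kind = row.get("type")
    if kind == "uniform":
        low, high = row.get("min"), row.get("max")
        if not (_is_number(low) and _is_number(high) and low < high):
            return "uniform needs numeric min < max"
    elif kind == "normal":
        deviation = row.get("standard_deviation")
        if not (_is_number(row.get("mean")) and _is_number(deviation) and deviation > 0):
            return "normal needs a numeric mean and a positive standard_deviation"
    elif kind == "discrete":
        probability = row.get("probability")
        if row.get("value") is None or not (_is_number(probability) and probability > 0):
            return "discrete needs a value and a positive probability"
    else:
        return f"unknown type {kind!r}"
    return None


def filter_config(rows, error_on_invalid=True):
    """Return the valid rows; raise or warn on invalid ones depending on the flag."""
    valid = []
    for index, row in enumerate(rows):
        problem = row_problem(row)
        if problem is None:
            valid.append(row)
        elif error_on_invalid:
            raise ValueError(f"row {index}: {problem}")
        else:
            warnings.warn(f"row {index} ignored: {problem}")
    if not valid:
        # Mirrors the rule that an entirely invalid input always fails.
        raise ValueError("no valid configuration rows")
    return valid
```

In permissive mode a generator silently disappears from the output when its row is invalid, which is one way the "unexpected RNG configurations" mentioned above can arise.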
Output
A data set with random values according to the configuration is generated. The number of rows depends on the number of partitions and rows per partition, as well as the chosen format, as already seen in the explanation for configuration "Use Long-list format".
Example
In this example, we show the results in a Result Table, but you can also use it as direct input for another processor or save it as a Data Table.
Workflow
Example Input
As input, the example input from before was used.
type | name | seed | min | mean | max | standard_deviation | value | probability |
---|---|---|---|---|---|---|---|---|
uniform | column1 | 123 | 0 | | 100 | | | |
normal | column2 | 456 | | 100 | | 10.5 | | |
normal | column3 | 789 | | 0.50 | | 0.25 | | |
discrete | column4 | 100 | | | | | val1 | 0.75 |
discrete | column4 | 100 | | | | | val2 | 0.25 |
Example Configuration
The configuration is set as the image shows, with two partitions and three rows per partition. We first run the workflow with "Use Long-list format" disabled, and then enabled.
Result
The result columns will then be filled with values according to the configuration in the input.
- column1: between 0 and 100
- column2 and column3: around the mean values
- column4: either val1 or val2 (in this case only val1)
Having "Use Long-list format" disabled, the result will be the following.
Having "Use Long-list format" enabled, the result will be the following. As explained previously, the "value" column is of type string and the "drawing" column shows the row number of the rows in the broad-list format.