Bucketing
UUID: 00000000-0000-0000-0026-000000000001
Description
Assigns a bucket number (starting at 1) to every value in a selected column. The bucket count is set by the user. All buckets have the same size: ((maximum column value - minimum column value) / bucket count).
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.bucketed - Output
Configurations
Selected Column * [single column selection]
The column whose minimum and maximum values are determined and whose values are each assigned a bucket number.
Bucket Count * [integer]
The number of buckets to create. The minimum is 1.
Bucket Column Name * [column name]
The name of the additional column containing the bucket number.
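The bucket size formula above can be sketched in a few lines of Python. This is only an illustration of equal-width bucketing, not the processor's actual implementation: the function name `assign_buckets` is hypothetical, and the handling of the maximum value (placed in the last bucket) and of a column with a single distinct value (everything in bucket 1) are assumptions.

```python
def assign_buckets(values, bucket_count):
    """Return a 1-based equal-width bucket number for every value."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    lowest, highest = min(values), max(values)
    # Bucket size as given in the description: (max - min) / bucket count.
    size = (highest - lowest) / bucket_count
    if size == 0:
        # Assumption: all values are identical, so they share bucket 1.
        return [1] * len(values)
    # Assumption: the maximum value is clamped into the last bucket.
    return [min(int((v - lowest) / size) + 1, bucket_count) for v in values]


print(assign_buckets([0, 2.5, 5, 7.5, 10], 4))  # [1, 2, 3, 4, 4]
```

With a bucket count of 4 and values ranging from 0 to 10, each bucket spans 2.5 units.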
Column Summary
UUID: 00000000-0000-0000-0145-000000000002
Description
Computes statistical measures for the attributes (columns) of the data.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.metrics - Computed Metrics with Identifier
Correlation
UUID: 00000000-0000-0000-0038-000000000001
Description
Computes the correlation matrix for the given dataset.
For more details refer to the following article.
Input(s)
- in.input - Input
Output(s)
- out.output - Correlation Matrix
Configurations
Correlation Method * [single enum selection]
Specifies the correlation method.
Columns for correlation [multiple columns selection]
The columns selected for computing the correlation. If no column is selected here, all suitable (Double, Integer, Numeric) columns in the input will be correlated among each other.
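As a rough illustration of what a correlation matrix over numeric columns looks like, the sketch below computes pairwise Pearson coefficients. The methods offered by the Correlation Method setting are not listed here, so Pearson is an assumption, and the function names and the dictionary-of-columns input are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / (dev_x * dev_y)


def correlation_matrix(columns):
    """Correlate every column with every other; `columns` maps name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


matrix = correlation_matrix({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
print(matrix["x"]["y"], matrix["x"]["z"])
```

Here `x` and `y` rise together (coefficient close to 1), while `x` and `z` move in opposite directions (coefficient close to -1).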
Distinct Summary
UUID: 00000000-0000-0000-0011-000000000002
Description
Creates summaries by grouping nominally scaled column values and counting the number of rows for each distinct column value.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.output - Distinct Summaries
Configurations
Maximum number of distinct values (defaults to 500) [integer]
Columns to include [multiple columns selection]
Selected columns that are processed in addition to all columns with nominally and ordinally scaled values.
Enable Failsafe Mode * [boolean]
If you expect your data set to have a very large number of distinct values in its cells (> 100,000), consider enabling this failsafe mode. It triggers a memory-friendly version of the summary computation. However, the memory-friendly version takes longer because the data is grouped multiple times. In short, this toggle trades short execution times for stability.
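The core idea of a distinct summary, counting rows per distinct value of a column, can be sketched as follows. The function name and the row format (a list of dictionaries) are hypothetical, and how the processor applies the maximum number of distinct values is not documented here; keeping only the most frequent values up to that limit is an assumption.

```python
from collections import Counter


def distinct_summary(rows, column, max_distinct=500):
    """Count rows per distinct value of `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    # Assumption: the limit keeps the most frequent distinct values.
    return counts.most_common(max_distinct)


rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
print(distinct_summary(rows, "color"))  # [('red', 2), ('blue', 1)]
```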
Distinct Textual Summary
UUID: 00000000-0000-0000-0151-000000000001
Description
Creates summaries for every column to which the statistics apply. The statistics include the most frequent values, the most frequent patterns (value formats, e.g. combinations of numbers, uppercase and lowercase letters), the number of invalid rows (the invalid value can be specified) and valid rows, the number of distinct values, and the minimum, mean and maximum value length (of the textual representation). The processor outputs these statistics.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.distincts - Computed Textual Statistics
Configurations
Invalid String (special treatment in summary) [string]
String value that should be treated as invalid and not be used as a distinct value. The number of invalid cells is also tracked per column. Defaults to an empty string.
Distinct values to take [integer]
The number of most frequent distinct values to output in the analysis. The values are separated by a "|" token and can be found in the "Most_Frequent_Distinct_Values" column. Must be positive.
Distinct Cell formats to take [integer]
The number of most frequent distinct formats to output in the analysis. The values are separated by a "|" token and can be found in the "Most_Frequent_Column_Format" column. Must be positive.
Special Characters [string]
Characters that will be indicated with an "S" in the "Most_Frequent_Column_Format" column. Can also have an effect on the amount of distinct formats recorded in the "Column_Formats" column. Defaults to: /*!@#$%^&*()"{}_[]|\?/<>,
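To make the notion of a value format more concrete, the sketch below maps each character of a value to a class symbol and joins the most frequent formats with "|". Only the "S" for special characters and the "|" separator come from the description above. The symbols "N", "U" and "L" for digits, uppercase and lowercase letters, as well as the function names, are hypothetical and may differ from what the processor emits.

```python
from collections import Counter

# Default special characters as listed in the configuration above.
DEFAULT_SPECIAL = '/*!@#$%^&*()"{}_[]|\\?/<>,'


def value_format(value, special=DEFAULT_SPECIAL):
    """Replace every character by a class symbol (N, U, L are assumptions)."""
    symbols = []
    for ch in value:
        if ch in special:
            symbols.append("S")
        elif ch.isdigit():
            symbols.append("N")
        elif ch.isupper():
            symbols.append("U")
        elif ch.islower():
            symbols.append("L")
        else:
            symbols.append(ch)  # anything else is kept unchanged
    return "".join(symbols)


def most_frequent_formats(values, take=3):
    """Join the `take` most frequent formats with the '|' token."""
    counts = Counter(value_format(v) for v in values)
    return "|".join(fmt for fmt, _ in counts.most_common(take))


print(value_format("Ab1!"))                          # ULNS
print(most_frequent_formats(["A1", "B2", "c3"], 2))  # UN|LN
```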
Forecast Metrics
UUID: 00000000-0000-0000-0143-000000000001
Description
Calculates different error measures from forecasts.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.outMetrics - Metrics
- out.out - Errors
Configurations
Prediction column * [single column selection]
Column with double value as prediction.
Original value column * [single column selection]
Column with double value as original value.
Grouping columns [multiple columns selection]
Can be used to specify columns over which the performance measures are aggregated.
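The error measures this processor reports are not listed here. As an assumed example, the sketch below computes two common ones, mean absolute error (MAE) and root mean squared error (RMSE), and aggregates them per group. All function and key names are hypothetical.

```python
import math
from collections import defaultdict


def error_metrics(pairs):
    """MAE and RMSE for a list of (prediction, original) pairs."""
    errors = [prediction - original for prediction, original in pairs]
    n = len(errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


def grouped_metrics(rows, prediction, original, group_by=()):
    """Aggregate the metrics per combination of the grouping columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)  # () when not grouping
        groups[key].append((row[prediction], row[original]))
    return {key: error_metrics(pairs) for key, pairs in groups.items()}


rows = [
    {"store": "a", "forecast": 3.0, "actual": 1.0},
    {"store": "a", "forecast": 1.0, "actual": 3.0},
    {"store": "b", "forecast": 5.0, "actual": 5.0},
]
print(grouped_metrics(rows, "forecast", "actual", group_by=("store",)))
```

Without grouping columns, all rows fall into a single group and one set of metrics is returned for the whole data set.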
Forecast Metrics For Foreach
UUID: 00000000-0000-0000-0143-000000000002
Description
Calculates different error measures from forecasts generated in a foreach branch. Outputs a single-row data set with information (the first value of the selected column) about the foreach run it was produced in.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.out - Input carried through
- out.metrics - Computed Metrics with Identifier
Configurations
Prediction column * [single column selection]
Column with double value as prediction.
Original value column * [single column selection]
Column with double value as original value.
Identifier for the foreach run. Select the same column as in the Foreach Distinct Processor preceding this Processor! * [single column selection]
The selected column is assumed to always have the same value in all rows of the input data set. The value of the column is used to identify the foreach-run it has been produced in.
Heuristic Summaries
UUID: 00000000-0000-0000-0004-000000000002
Description
Computes statistical measures for the attributes (columns) of the data.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.output - Summary Results
Configurations
Compression size. [integer]
Can override the logarithmic compression with a positive fixed value. Defaults to 0 which triggers logarithmic compression. Will ignore values <= 0. Override is experimental!
Merge Interval [integer]
Interval for local AVLTreeDigest merges. Defaults to 1000. Override with positive integer values. Override is experimental!
Grouping columns [multiple columns selection]
Columns used for grouping the dataset before the statistical measures are computed.
Row Count
UUID: 00000000-0000-0000-0027-000000000001
Description
Counts the rows in the dataset and, optionally, the distinct rows.
For more details refer to the following article.
Input(s)
- in.data - Input
Output(s)
- out.out - Count Output
Configurations
Create distinct row count (additionally to overall row count) * [boolean]
When toggled, distinct rows are counted, too.
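The difference between the overall and the distinct row count can be illustrated with a short sketch. The function name and the output keys are hypothetical, and rows are assumed to be sequences of hashable cell values.

```python
def row_counts(rows, distinct=False):
    """Return the overall row count and, optionally, the distinct row count."""
    result = {"row_count": len(rows)}
    if distinct:
        # Two rows are identical when all of their cells are equal.
        result["distinct_row_count"] = len({tuple(row) for row in rows})
    return result


print(row_counts([(1, "a"), (1, "a"), (2, "b")], distinct=True))
# {'row_count': 3, 'distinct_row_count': 2}
```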
Summaries (Deprecated!)
UUID: 00000000-0000-0000-0004-000000000001
Deprecated: This Processor calculates exact values but is rather slow. To get an impression of the data, use the Heuristic Summaries Processor. It uses heuristics for median and percentile computation that perform well even on large datasets. Only use this Processor if you need exact values for median and percentiles.
Replaced by: Heuristic Summaries
Removed: true
Description
Computes statistical measures for the attributes (columns) of the data.
Input(s)
- in.data - Input
Output(s)
- out.output - Summary Results