# ALS Recommender

**UUID:** 00000000-0000-0000-0252-000000000001

## Description

Recommends items for users based on their previous item ratings using the alternating least squares (ALS) approach. Requires numeric user and item IDs. To convert non-numeric IDs, use the Alphanumeric to Numeric ID Processor.
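
The alternating step can be illustrated with a minimal, self-contained sketch (plain Python, rank 1, dense ratings). This is illustrative only; the processor itself runs a distributed ALS with the configurations below. Fixing one factor turns each update into a closed-form regularized least-squares problem:

```python
# Conceptual sketch of alternating least squares (rank 1, dense ratings).
# Illustrative only; the processor runs a distributed ALS.

def als_rank1(ratings, iterations=20, lam=0.1):
    """Factor a small dense rating matrix R ~ u * v^T (rank 1)."""
    n_users, n_items = len(ratings), len(ratings[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors v, solve for each user factor in closed form.
        for i in range(n_users):
            num = sum(ratings[i][j] * v[j] for j in range(n_items))
            den = sum(v[j] ** 2 for j in range(n_items)) + lam
            u[i] = num / den
        # Fix user factors u, solve for each item factor in closed form.
        for j in range(n_items):
            num = sum(ratings[i][j] * u[i] for i in range(n_users))
            den = sum(u[i] ** 2 for i in range(n_users)) + lam
            v[j] = num / den
    return u, v

ratings = [[5.0, 4.0], [2.5, 2.0]]  # user 1 rates twice as high as user 2
u, v = als_rank1(ratings)
approx = u[0] * v[0]  # reconstructed rating for user 0, item 0 (close to 5.0)
```

A higher Rank corresponds to longer factor vectors `u[i]`, `v[j]`; Lambda is `lam` above.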

For more details, refer to the following **article**.

## Input(s)

*in.data*- Input

## Output(s)

*out.outputRecommendatoins*- Output Recommendations
*out.ouputCustomerFeatures*- Output Customer Features
*out.outputItemFeatures*- Output Item Features

## Configurations

#### Customers * *[single column selection]*

Column containing the Customer IDs (should be different from Item and Ratings columns).

#### Items * *[single column selection]*

Column containing the Item IDs (should be different from Customers and Rating columns).

#### Rating * *[single column selection]*

Column containing the Ratings (should be different from Customers and Items columns).

#### Rank * *[integer]*

The length of the feature vector. The higher the rank, the better the result, but performance may suffer. Recommended: start with the default value of 10 and increase it until no further improvement is observed.

#### Number of Recommendations * *[integer]*

Number of recommendations per user.

#### Iterations * *[integer]*

Number of iterations the algorithm uses to improve.

#### Lambda * *[double]*

Regularization parameter to counter sparseness/overfitting.

#### Only positive feature values * *[boolean]*

Set whether the least-squares problems solved at each iteration should have non-negativity constraints.

#### Number of Blocks *[integer]*

Set the number of blocks for both user blocks and product blocks to parallelize the computation into.

#### Seed *[integer]*

Seed for the generation of random elements. Setting a seed allows for deterministic behavior.

# Assert

**UUID:** 00000000-0000-0000-1137-000000000001

## Description

Evaluates a boolean SQL query and creates an error or warning if the query evaluates to false.

For more details, refer to the following **article**.

## Input(s)

*in.input*- Input for inputTable

## Output(s)

*out.output*- Output

## Configurations

#### Boolean Query * *[string]*

Specify a query that defines the test condition. This query must return a single boolean or numeric column. If any of the values in this column is false or zero the assertion fails.
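
The check can be illustrated with a small sketch using Python's built-in sqlite3 module. This is a hypothetical stand-in for the processor; the table and column names are examples, and the processor's SQL dialect may differ:

```python
import sqlite3

# Illustration of the assertion pattern: run a query that returns a single
# boolean/numeric column and fail if any value is false or zero.
# (Hypothetical example data; not the processor's implementation.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inputTable (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO inputTable VALUES (?, ?)",
                 [(1, 10.0), (2, 3.5), (3, 7.2)])

# Test condition: every amount must be positive.
rows = conn.execute("SELECT amount > 0 FROM inputTable").fetchall()
assertion_ok = all(value for (value,) in rows)  # any 0/false fails the assertion
```

If `assertion_ok` is false, the processor would raise the configured error or warning.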

#### Action on Assertion Failure * *[single enum selection]*

The action to perform if the assertion fails. Either create an error and abort further execution or create a warning continuing execution.

#### Message on Failure * *[string]*

The message of the error/warning that is created if the assertion fails.

#### Warning Details Message * *[string]*

Specifies the message that is shown in the details of the created warnings. This message is ignored if the action is set to 'Error'.

# Association Rule Application

**UUID:** 00000000-0000-0000-0165-000000000001

## Description

This processor uses association rules to compute recommendations for users depending on their transactions in the longlist. As input it requires a list of single-item transactions (one item to one user in each row) and a list of association rules to be applied. The rules can be computed with the Association Rule Generation or the Network Rule Generation.
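
A minimal sketch of the application step in plain Python (hypothetical data layout; the processor reads transactions and rules from its two inputs):

```python
# Sketch of applying association rules to single-item transactions.
# Hypothetical example data; illustrative only.

transactions = [          # (user, item) longlist
    ("u1", "bread"), ("u1", "butter"),
    ("u2", "bread"),
]
rules = [                 # (rule_id, lhs, rhs, confidence), items "|"-separated
    ("r1", "bread|butter", "jam", 0.8),
    ("r2", "bread", "butter", 0.6),
]

# Collect each user's item set.
items_by_user = {}
for user, item in transactions:
    items_by_user.setdefault(user, set()).add(item)

# A rule fires for a user when its whole LHS is contained in the user's items;
# recommend the RHS items the user does not own yet, with the rule's confidence.
recommendations = []
for user, owned in sorted(items_by_user.items()):
    for rule_id, lhs, rhs, conf in rules:
        if set(lhs.split("|")) <= owned:
            for item in rhs.split("|"):
                if item not in owned:
                    recommendations.append((user, item, conf))
# recommendations == [("u1", "jam", 0.8), ("u2", "butter", 0.6)]
```

Here `"|"` plays the role of the 'List Separator' configuration below.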

For more details, refer to the following **article**.

## Input(s)

*in.transaction*- Longlist of Transactions
*in.rules*- Rules

## Output(s)

*out.recommendations*- Recommendations

## Configurations

#### Users * *[single column selection]*

Column containing the user IDs.

#### Items * *[single column selection]*

Column containing the item IDs.

#### Rule IDs * *[single column selection]*

Column containing the unique rule IDs.

#### LHS * *[single column selection]*

Column containing the left hand side of the rules where the different item IDs are separated by the string chosen as 'List Separator'.

#### RHS * *[single column selection]*

Column containing the right hand side of the rules where the different item IDs are separated by the string chosen as 'List Separator'.

#### List Separator * *[string]*

String which separates the different items listed in the LHS/RHS of the rules.

#### Confidence * *[single column selection]*

Column containing the confidence of the applied rules.

#### Check for uniqueness of rule IDs * *[boolean]*

Check that the rule identifiers are unique. When this option is enabled and the identifiers are not unique, an error is returned; otherwise nothing happens. This check may have a considerable impact on performance.

# Association Rule Generation

**UUID:** 00000000-0000-0000-0154-000000000001

## Description

This processor generates association rules with a single item as right hand side using frequent item sets generated by an FPGrowth Algorithm. As input it requires a list containing single-item transactions (one item to one user in each row). The generated rules can be applied using the Association Rule Application Processor. Note: For the output the items will be listed divided by `,`. Thus the item names should not contain this character.
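
Support and confidence, the two quantities configured below, can be sketched in plain Python (illustrative example data; the processor uses a distributed FP-Growth):

```python
# Sketch of how support and confidence are derived from single-item
# transactions. Hypothetical example data; illustrative only.

longlist = [("u1", "bread"), ("u1", "butter"), ("u1", "jam"),
            ("u2", "bread"), ("u2", "butter"),
            ("u3", "bread")]

# Group rows into one basket per customer.
baskets = {}
for customer, item in longlist:
    baskets.setdefault(customer, set()).add(item)
n = len(baskets)

def support(itemset):
    """Fraction of baskets containing every item of the set."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets.values()) / n

# Rule {bread} -> {butter}: confidence = support(both) / support(lhs).
conf_bread_butter = support({"bread", "butter"}) / support({"bread"})
# conf_bread_butter == 2/3: butter appears in 2 of the 3 baskets containing bread
```

'Min support' and 'Min confidence' below filter the item sets and rules by exactly these two measures.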

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- Longlist of Sales

## Output(s)

*out.rules*- Rules
*out.itemSets*- Frequent Item Sets

## Configurations

#### customers * *[single column selection]*

Column containing the Customer IDs.

#### items * *[single column selection]*

Column containing the Item IDs.

#### Min support *[double]*

The minimal support for the frequent item sets.

#### Min confidence *[double]*

The minimal confidence for the association rules.

# BNN

**UUID:** 00000000-0000-0000-0017-000000000001

## Description

Combined BNN Training and Forecast Processor

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- trainingData
*in.forecastData*- forecastData

## Output(s)

*out.forecast_out*- forecast_out
*out.debug_out*- debug_out

## Configurations

#### Dependent column * *[single column selection]*

Column of the training data containing the dependent data

#### Independents *[multiple columns selection]*

Name of the independent columns.

#### Forecast * *[column name]*

Name of the column to add to the forecastData input containing the forecast

#### Hidden Nodes * *[integer]*

Number of hidden nodes.

#### Cooling factor * *[double]*

Factor for cooling [0..1]. Higher values trigger slower cooling.

#### Initial Temperature * *[double]*

Initial System Temperature used for exponential cooling schedule.

#### Burn-In iterations * *[integer]*

Number of iterations for burn-in (no samples taken in this phase)

#### Sample Count * *[integer]*

Number of samples taken after the burn-in phase.

#### Crossover Probability *[double]*

Crossover Probability for DE algorithm (default: 0.5)

#### Differential Weight *[double]*

Differential Weight for DE algorithm (default: 0.9)

#### Population Size * *[integer]*

Size of Config population for the DE algorithm (default: 50)

#### Mutation Rate *[double]*

Probability of mutation for one single parameter [0..1]

#### Standard Deviation Alpha *[double]*

Standard Deviation for input-to-output weight generation (must be positive, default: 0.1)

#### Standard Deviation Beta *[double]*

Standard Deviation for hidden-to-output weight generation (must be positive, default: 0.1)

#### Standard Deviation Gamma *[double]*

Standard Deviation for input-to-hidden weight generation (must be positive, default: 0.1)

#### Lambda *[double]*

Weight for network complexity penalty (must be positive, default: 1.0)

#### Random Generator Seed *[integer]*

Integer Seed that overrides the built-in default. The seed is used to initialize the random generator used for model generation and evolution

# Centroid Clustering

**UUID:** 00000000-0000-0000-0034-000000000001

## Description

Assigns rows to k clusters by k-means algorithm

For more details, refer to the following **article**.

## Input(s)

*in.data*- Input

## Output(s)

*out.Output*- Output

## Configurations

#### Select columns *[multiple columns selection]*

Columns to be considered in clustering. Only columns with numeric type can be selected.

#### K (Single or minimal K) * *[integer]*

Number of clusters. If used in combination with 'K (Empty or maximum K)' config element this value is the lower value (inclusive) of range of Ks to calculate. Must be greater than 1.

#### K (Empty or maximum K) *[integer]*

Optional upper value (inclusive) of range of Ks to calculate. When specified, must be greater than 1 and also greater than value of 'K (Single or minimal K)' config element

#### Epsilon * *[double]*

Determines the distance threshold within which k-means is considered to have converged. Must not be negative.

#### Maximum number of iterations * *[integer]*

The maximum number of iterations to run. Must be positive.

#### Initialization Mode * *[single enum selection]*

Specifies the initialization mode: either random or k-means parallel.

#### Initialization Steps * *[integer]*

Set the number of steps for the k-means parallel initialization mode. This is an advanced setting - the default of 5 is almost always enough.

#### Random seed *[integer]*

Seed used in random generators while carrying out clustering.

#### Silhouette Coefficient *[composed]*

Calculates silhouette coefficient for each 'K' cluster. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster. The 'K' with the highest value for this coefficient represents the best number of clusters. Calculated value is added to the output in 'Silhouette_Coefficient' column.

###### Silhouette Coefficient > Output Mode * *[single enum selection]*

Specifies for which K values the data will be output:

- OUTPUT_ALL - Output clusters for each K in the range,
- OUTPUT_BEST - Output only clusters for the K with the best silhouette coefficient.
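
The coefficient itself can be sketched in plain Python (one-dimensional points for brevity; cluster memberships are assumed given):

```python
# Sketch of the silhouette coefficient for one data point:
# s = (b - a) / max(a, b), where a is the mean distance to the other members
# of the point's own cluster and b the mean distance to the nearest other cluster.
# Values near 1 mean the point is well matched to its cluster.

def silhouette(point, own_cluster, other_cluster):
    a = sum(abs(point - p) for p in own_cluster) / len(own_cluster)
    b = sum(abs(point - p) for p in other_cluster) / len(other_cluster)
    return (b - a) / max(a, b)

# Point 1.0 sits tightly in cluster {0.0, 2.0} and far from {10.0, 12.0}.
s = silhouette(1.0, [0.0, 2.0], [10.0, 12.0])
# s == 0.9: a = 1.0, b = 10.0, (10 - 1) / 10
```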

# Collaborative Filtering Forecast

**UUID:** 00000000-0000-0000-0053-000000000001

## Description

Collaborative filtering processor recommending products for a new user utilizing the existing user base and a selection of products the new user has already chosen. Currently in an experimental state.

For more details, refer to the following **article**.

## Input(s)

*in.training*- Data Set for Model Training
*in.forecast*- Products of the new User

## Output(s)

*out.output*- Recommendations

## Configurations

#### User Column * *[single column selection]*

Column containing user IDs

#### Product Column * *[single column selection]*

Column containing product IDs

#### Rating Column * *[single column selection]*

Column containing ratings

#### Item Column in forecast * *[single column selection]*

Column containing forecast items, assuming all items from a single user

#### Amount of Recommendations * *[integer]*

Number of additional products recommended (1-20). Actual value may be less than the amount entered in case there are not enough suitable recommendations.

#### Size of the User Base *[integer]*

The recommendation is based on a set of similar users. This variable defines the size of this set. Defaults to 20.

#### Seed *[integer]*

Random Seed for model training

# Collaborative Filtering Forecast2 (experimental)

**UUID:** 00000000-0000-0000-0053-000000000003

## Description

Collaborative Filtering Forecast for testing purpose only (uses ALS model and Spark-only-forecasts)

For more details, refer to the following **article**.

## Input(s)

*in.training*- training
*in.forecast*- forecast

## Output(s)

*out.output*- output

## Configurations

#### user * *[single column selection]*

user

#### item * *[single column selection]*

item

#### rating * *[single column selection]*

rating

# Cross Correlation

**UUID:** 00000000-0000-0000-0038-000000000002

## Description

Computes the cross / auto correlation for given time series data.

For more details, refer to the following **article**.

## Input(s)

*in.input*- Input

## Output(s)

*out.correlation*- Correlation Output

## Configurations

#### The columns that should be correlated * *[multiple columns selection]*

The columns that should be correlated pair-wise. If only one column is chosen, its autocorrelation is computed. If more than one column is chosen, the autocorrelation of each column is computed in addition to the cross correlation between all given columns.
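
The computation can be sketched in plain Python (illustrative, not the processor's implementation):

```python
# Sketch of cross correlation at a given lag. Hypothetical example data.

def mean(xs):
    return sum(xs) / len(xs)

def correlation(x, y):
    """Pearson correlation of two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def cross_correlation(x, y, lag):
    """Correlation of x[t] with y[t + lag]; negative lags swap the roles."""
    if lag >= 0:
        return correlation(x[:len(x) - lag], y[lag:])
    return cross_correlation(y, x, -lag)

# y is x delayed by one step, so the cross correlation peaks at lag 1.
x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]
y = [0.0] + x[:-1]
peak = cross_correlation(x, y, 1)
```

Scanning `lag` from -maxDelay to +maxDelay corresponds to the 'Maximal Time Delay' configuration being applied in both time directions.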

#### Maximal Time Delay * *[integer]*

The maximal delay that will be applied to the time series. This can be anything from 1 to 1000. The delay will be used in both time directions. If the chosen delay is larger than the given dataset, the delay is automatically decreased to the number of rows minus one and a warning is shown to the user.

#### Sorting Column *[single column selection]*

If not empty, this column is used for sorting the input data in ascending order, otherwise we assume that the dataset is already sorted.

#### Category Column *[single column selection]*

If the data is available in a format like Row 1: value1 timestamp1 categoryA, Row 2: value2 timestamp1 categoryA, the category column can be chosen here. If a category column is chosen, the input must have 3 columns (the category, the timestamp, and the to-be-split column). These columns will then be transformed to rows looking like Row: value1 value2 timestamp1.

# Decision Tree Classification

**UUID:** 00000000-0000-0000-0037-000000000001

## Description

Processor creating a decision tree for classification out of the given data

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- Training Data

## Output(s)

*out.output*- Output

## Configurations

#### Dependent attribute * *[single column selection]*

Dependent attribute

#### Independent attributes *[multiple columns selection]*

Independent attributes

#### Maximum depth * *[integer]*

Maximum depth of tree.

#### Number of bins used when discretizing continuous features * *[integer]*

Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication. Note that the maxBins parameter must be at least the maximum number of categories M for any categorical feature.

#### Impurity * *[single enum selection]*

Measure of homogeneity of the labels.

# Decision Tree Classification Forecast

**UUID:** 00000000-0000-0000-0045-000000000001

## Description

Processor creating a decision tree for classification out of the given data

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- Training Data
*in.forecastData*- Forecast Data

## Output(s)

*out.output*- Output

## Configurations

#### Dependent attribute * *[single column selection]*

Dependent attribute

#### Independent attributes *[multiple columns selection]*

Independent attributes

#### Forecast Column Name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespace.

#### Maximum depth * *[integer]*

Maximum depth of tree.

#### Number of bins used when discretizing continuous features * *[integer]*

Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication. Note that the maxBins parameter must be at least the maximum number of categories M for any categorical feature.

#### Impurity * *[single enum selection]*

Measure of homogeneity of the labels.

#### Create Result * *[boolean]*

When switched off, the processor will not create a result and just do the forecast. Use this option when you are not interested in the model but only in the forecast. Will speed up execution time.

#### Clean forecast data * *[boolean]*

Clean the forecast data from values not present in the training dataset. Setting this option may remove rows from the forecast dataset. You may need to enable this flag to prevent an error like "Can only zip with RDD which has the same number of partitions" from occurring.

#### Handling of unseen categorical features * *[single enum selection]*

How to handle categorical features which were unseen during the training phase. KEEP creates one new category for all unseen values, ERROR fails if unseen values occur, SKIP ignores the unseen values.

# Decision Tree Regression

**UUID:** 00000000-0000-0000-0036-000000000001

## Description

Processor creating a decision tree for regression out of the given data

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- trainingData

## Output(s)

*out.output*- Forwarded Input

## Configurations

#### Dependent attribute * *[single column selection]*

Dependent attribute

#### Independent attributes *[multiple columns selection]*

Independent attributes

#### Maximum depth * *[integer]*

Maximum depth of tree.

#### Number of bins used when discretizing continuous features * *[integer]*

Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication. Note that the maxBins parameter must be at least the maximum number of categories M for any categorical feature.

# Decision Tree Regression Forecast

**UUID:** 00000000-0000-0000-0042-000000000001

## Description

Processor creating a decision tree for regression out of the given training data

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- trainingData
*in.forecastData*- forecastData

## Output(s)

*out.output*- Forecast Input with additional forecast column

## Configurations

#### Dependent attribute * *[single column selection]*

Dependent attribute

#### Independent attributes *[multiple columns selection]*

Independent attributes

#### Forecast Column Name * *[column name]*

Name of the Column that gets added to the data set and contains the forecast. Must not contain whitespaces!

#### Maximum depth * *[integer]*

Maximum depth of tree.

#### Number of bins used when discretizing continuous features * *[integer]*

Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication. Note that the maxBins parameter must be at least the maximum number of categories M for any categorical feature.

#### Clean forecast data * *[boolean]*

Clean the forecast data from values not present in the training dataset. Setting this option may remove rows from the forecast dataset. You may need to enable this flag to prevent an error like "Can only zip with RDD which has the same number of partitions" from occurring.

#### Create Result * *[boolean]*

When switched off, the processor will not create a result and just do the forecast. Use this option when you are not interested in the model but only in the forecast. Will speed up execution time.

#### Handling of unseen categorical features * *[single enum selection]*

How to handle categorical features which were unseen during the training phase. KEEP creates one new category for all unseen values, ERROR fails if unseen values occur, SKIP ignores the unseen values.

# Distinct Textual Format Extractor

**UUID:** 00000000-0000-0000-0251-000000000001

## Description

Creates summaries for every column that the statistics used are applicable for. Statistics include most frequent values, most frequent patterns (value formats, e.g. number, uppercase and lowercase combinations), amount of invalid rows (invalid value can be specified) and valid rows, amount of distinct values as well as minimum, mean and maximum value length (for textual representations). The statistics will be output of this processor.
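
The notion of a value's format pattern can be sketched in plain Python. The pattern alphabet shown here (digit → 9, uppercase → A, lowercase → a) is an assumption for illustration; the processor's actual encoding may differ:

```python
from collections import Counter

# Sketch of deriving a value's "format pattern": digits become 9,
# uppercase letters A, lowercase letters a, other characters are kept as-is.
# Hypothetical pattern alphabet; illustrative only.

def format_pattern(value):
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)

values = ["AB-1234", "CD-5678", "x9", "EF-9012"]
most_frequent = Counter(map(format_pattern, values)).most_common(1)
# most_frequent == [("AA-9999", 3)]
```

Counting pattern frequencies this way yields the "most frequent patterns" statistic described above.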

For more details, refer to the following **article**.

## Input(s)

*in.data*- Input

## Output(s)

*out.distincts*- Computed Textual Statistics

## Configurations

#### Distinct Cell formats to take *[integer]*

The number of most frequent distinct formats to output in the analysis. The values will be separated by a "|" token and can be found in the "Example" column. Must be positive.

#### Distinct Format examples to take *[integer]*

The amount of examples that are shown per found format pattern

#### Explicit Characters *[string]*

Characters that will be shown explicitly

# Forecast Method Selection

**UUID:** 00000000-0000-0000-0180-000000000001

## Description

The goal of this processor is the assessment of different forecast methods and the selection of the best forecasts (Winning Method) or blended forecasts (Mean Method or Weighted Mean Method) per forecast step and group. The processor requires forecast values from different methods (left input) and information about the methods applied, e.g. the method's name and the maximum number of forecast steps valid for each method (right input).
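
The three assessment methods can be sketched for a single forecast step in plain Python. The method names, values, and the inverse-residual weighting below are illustrative assumptions; the 'Residual Weight x' configurations control the actual weighting:

```python
# Sketch of the three assessment methods for one forecast step and group.
# Hypothetical methods, forecasts, and residuals; illustrative only.

forecasts = {"ets": 102.0, "arima": 98.0, "naive": 110.0}
residuals = {"ets": 2.0, "arima": 4.0, "naive": 10.0}  # past absolute errors

# Winning Method: take the forecast of the method with the smallest residual.
winner = min(residuals, key=residuals.get)
winning_forecast = forecasts[winner]

# Mean Method: plain average of all methods' forecasts.
mean_forecast = sum(forecasts.values()) / len(forecasts)

# Weighted Mean Method: weight each method, here by its inverse residual
# (assumed weighting for illustration).
weights = {m: 1.0 / r for m, r in residuals.items()}
total = sum(weights.values())
weighted_forecast = sum(forecasts[m] * w for m, w in weights.items()) / total
```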

For more details, refer to the following **article**.

## Input(s)

*in.forecasts*- Forecasts
*in.horizons*- Forecast Methods and Maximum Forecast Horizons

## Output(s)

*out.selectedFc*- Selected Forecasts
*out.metrics*- Metrics

## Configurations

#### Forecast Method Column * *[single column selection]*

Select the column that contains the forecast methods as a list (from right input)

#### Restriction Forecast Horizon Column *[single column selection]*

This column contains the maximum forecast horizon for which this method can be used (from right input). Must be of type int or a string representing an int value.

#### Value Column * *[single column selection]*

Select the column that contains the actual values to compare forecasts to

#### Time Key Index * *[single column selection]*

Select the column that represents the Time Key Index. Must be of type int or a string representing an int value

#### Current Time Index * *[integer]*

Define the Time Index from which the forecasts should start to be processed. Must be of type int.

#### Forecast Start Index Column * *[single column selection]*

Marks the last time key included in the training set for cross validation. Must be of type int or a string representing an int value.

#### Forecast Step Column * *[single column selection]*

Select the column that contains the forecast step information. The forecast step describes how many periods had been forecasted since the Forecast Start Index.

#### Group by Columns *[multiple columns selection]*

Select the columns that will be used for the group by in the aggregation.

#### Compute Residuals and Weights * *[boolean]*

When enabled, the residuals and weights per method will be returned in the second output.

#### Winning Method *[composed]*

Use Winning Method as assessment method for selecting a forecast

###### Winning Method > Number of Best Methods * *[integer]*

Define the number of methods that should be taken into consideration

###### Winning Method > Residual Weight x * *[double]*

Define a double value to weight the residuals {0 - inf}

###### Winning Method > Column Name * *[column name]*

Choose a name for the column containing the chosen forecast values.

#### Mean Method *[composed]*

Use Mean as assessment method for selecting a forecast

###### Mean Method > Column Name * *[column name]*

Choose a name for the column containing the chosen forecast values.

#### Weighted Mean Method *[composed]*

Use Weighted Mean as assessment method for selecting a forecast

###### Weighted Mean Method > Residual Weight x * *[double]*

Define a double value to weight the residuals {0 - inf}

###### Weighted Mean Method > Column Name * *[column name]*

Choose a name for the column containing the chosen forecast values.

# Frequent Pattern Mining

**UUID:** 00000000-0000-0000-0137-000000000001

## Description

Processor finding equal patterns within the columns

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- Input Data

## Configurations

#### Selected columns * *[multiple columns selection]*

Selected columns

#### Splitter string *[string]*

Splitter string

#### Min confidence *[double]*

Min confidence

#### Min support *[double]*

Min support

#### Remove duplicates in transactions * *[boolean]*

Cleans up the input data by enforcing distinct values inside transactions

# Frequent Sequences Mining

**UUID:** 00000000-0000-0000-0138-000000000001

## Description

Processor finding equal sequences within the columns.

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- trainingData

## Configurations

#### Independent attributes * *[multiple columns selection]*

Independent attributes

#### Maximal sequence length *[integer]*

The maximal length of sequences. Only shorter sequences will be mined.

#### minimal supported frequency *[double]*

The minimal frequency for a sequence to be mined.

#### Separator *[string]*

The separator used to isolate sequences.

# Gradient Boosting Classification Forecast

**UUID:** 00000000-0000-0000-0083-000000000001

## Description

Predicts the value of a binary {0,1} dependent variable using Gradient Boosted Trees (GBTs) in a classification setting. The independent variables can either be continuous or categorical.

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- Training dataset
*in.testData*- Forecast dataset

## Output(s)

*out.output*- Test dataset and forecast values
*out.outputFeatureImportance*- Feature Importance Output

## Configurations

#### Dependent attribute * *[single column selection]*

The binary dependent attribute which shall be explained using the independent variables. The dependent variable has to be of type integer and must only contain the levels {0,1}

#### Independent attributes *[multiple columns selection]*

The independent variables which are used to predict the dependent variable. The independent variables can be both continuous or categorical. All string variables are treated as categorical. If a string column has more distinct values than a non-string column, the non-string column is also treated as categorical.

#### Forecast Column Name * *[column name]*

Name of the column that contains the forecast values of the dependent attribute.

#### Tree Depth * *[integer]*

Specifies the maximum depth of the trees used in the Gradient Boosting Classification model.

#### Number of iterations * *[integer]*

Number of iterations used to train the GBT algorithm. In the current version of the processor, the maximum number of iterations is limited to 100.

#### Learning rate * *[double]*

Level of regularization applied in the tree models. A higher level of regularization reduces the number of variables used for forecasting the dependent variable and makes the model more sparse. The model selects the most influential independent variables automatically.

#### Subsampling rate * *[double]*

Rate of columns (features) which are used in the model training process of the Gradient Boosted Tree Models. The lower the rate, the fewer variables are selected.

#### Base name of Probability Columns *[column name]*

The name given here is the start of the column names for the two columns containing the probabilities for the classes 0 and 1

# Gradient Boosting Regression Forecast

**UUID:** 00000000-0000-0000-0083-000000000002

## Description

Predicts the value of a dependent variable using Gradient Boosted Trees (GBTs) in a regression setting. This implies that the dependent variable has to be numeric and continuous. The independent variables can either be continuous or categorical.

For more details, refer to the following **article**.

## Input(s)

*in.trainingData*- Training dataset
*in.testData*- Forecast dataset

## Output(s)

*out.output*- Forecast dataset and forecast values
*out.outputFeatureImportance*- Feature Importance Output

## Configurations

#### Dependent attribute * *[single column selection]*

The dependent attribute which shall be explained using the independent variables.

#### Independent attributes *[multiple columns selection]*

The independent variables which are used to predict the dependent variable. The independent variables can be either continuous or categorical variables. All string variables are treated as categorical. If a string column has more distinct values than a non-string column, the non-string column is also treated as categorical.

#### Forecast Column Name * *[column name]*

Name of the column that contains the prediction values of the dependent attribute.

#### Tree Depth * *[integer]*

Specifies the maximum depth of the trees used in the Gradient Boosting Regression model.

#### Number of iterations * *[integer]*

Number of iterations used to train the GBT algorithm. In the current version of the processor, the maximum number of iterations is limited to 100.

#### Learning rate * *[double]*

Level of regularization applied in the tree models. A higher level of regularization reduces the number of variables which are selected by the model and makes the model more sparse. The model selects the most influential independent variables automatically.

#### Subsampling rate * *[double]*

Rate of columns (features) which are used in the model training process of the Gradient Boosted Tree Models. The lower the rate, the fewer variables are selected.

#### Loss Strategy * *[single enum selection]*

The loss strategy to be applied in the process of training the GBT model (the single decision trees). Absolute error loss is especially suited when extreme values (outliers) are present in the data.

# Grouped Decision Tree

**UUID:** 00000000-0000-0000-0060-000000000001

## Description

Divides the input into several groups and, for every group of rows, computes and outputs a forecast based on a decision tree. Whether classification or regression is performed depends on the type of the dependent variable: Integer, Text and Date columns use classification; Numeric and Double columns use regression.

For more details, refer to the following **article**.

## Input(s)

*in.inputTraining*- Training Data Input
*in.inputForecast*- Forecast Data Input

## Output(s)

*out.forecast*- Decision Tree Forecast

## Configurations

#### Dependent Column * *[single column selection]*

Select the dependent column for the creation of the decision tree models. Whether classification or regression is performed depends on the type of the dependent variable: Integer, Text and Date columns use classification; Numeric and Double columns use regression.

#### Independent Columns * *[multiple columns selection]*

Select the independent columns for the creation of the decision tree models.

#### Name of the forecast column * *[column name]*

The name that should be used for the forecast column in the output. It has to be different from all available columns in the forecast dataset (if a column with the given name already exists, the name is adjusted until a valid one is found). The default name is "Forecast".

#### Group by Column * *[single column selection]*

Select the Column to group by. A decision tree model will be computed for every group.

#### Maximal number of leaves in the decision trees * *[integer]*

Select the maximal number of leaves the computed decision trees may have.

#### Number of training data partitions *[integer]*

The number of data partitions the training input should be arranged in. If set, data is re-arranged through re-partitioning by the rows' hash values before performing the 'group by' operation. Can improve performance.

#### Number of forecast data partitions *[integer]*

The number of data partitions the forecast input should be arranged in. If set, data is re-arranged through re-partitioning by the rows' hash values before performing the 'group by' operation. Can improve performance.

# Grouped FFT Computation

**UUID:** 00000000-0000-0000-0179-000000000001

## Description

Processor for computing fast-fourier transformations for data grouped by a given time-window size.

For more details, refer to the following **article**.

## Input(s)

*in.inputData*- Input Data

## Output(s)

*out.outputData*- Windowed-FFT computations per Sensor

## Configurations

#### Timestamps * *[single column selection]*

Timestamps for the sensor-data.

#### Columns with Sensor-Data *[multiple columns selection]*

Columns with Sensor-Data to be transformed. If not set, all columns with an appropriate type (Double, Int, Numeric) are taken as sensor data.

#### Size of Windows in seconds * *[integer]*

Size of the windows for which the FFT should be computed (in seconds).

#### Lower frequency bound for output (hz) *[double]*

Lower bound for frequencies that should be present in the output; all frequencies below the given one are excluded from the output. If this option is not set, no limit is enforced.

#### Upper frequency bound for output (hz) *[double]*

Upper bound for frequencies that should be present in the output; all frequencies above the given one are excluded from the output. If this option is not set, no limit is enforced.

#### Windowing that should be applied * *[single enum selection]*

Windowing that should be applied before the FFT computation.

#### Sample-filling strategy * *[single enum selection]*

If the number of samples in a given time window is not a power of 2, we have to enlarge the dataset. This can either be done by zero-padding or by interpolation.
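A minimal sketch of the zero-padding variant, assuming windows are padded up to the next power of 2 before a real-input FFT (numpy stands in for the processor's actual implementation; interpolation would resample instead):

```python
# Sample-filling by zero-padding: if a window's sample count is not a power
# of 2, pad with zeros up to the next power of 2, then run the FFT.
import numpy as np

def next_power_of_two(n: int) -> int:
    p = 1
    while p < n:
        p *= 2
    return p

def fft_with_zero_padding(samples):
    samples = np.asarray(samples, dtype=float)
    target = next_power_of_two(len(samples))
    padded = np.zeros(target)
    padded[: len(samples)] = samples   # original samples first, zeros after
    return np.fft.rfft(padded)

window = [0.0, 1.0, 0.0, -1.0, 0.5]   # 5 samples -> padded to 8
spectrum = fft_with_zero_padding(window)
print(len(spectrum))                  # rfft of 8 samples -> 5 frequency bins
```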

# Grouped Flow Analysis

**UUID:** 00000000-0000-0000-1136-000000000003

## Description

Processor that sorts columns and returns the values as a flow.

**For more details refer to the following ****article****. **

## Input(s)

*in.input*- Input data

## Configurations

#### Independent attributes * *[column tuple]*

The first column depicts the amount, the second column represents the order, the third and fourth columns are more detailed levels of the order column.

#### Group *[multiple columns selection]*

If one or more grouping columns are set, a separate Sankey diagram will be computed for each group.

# Grouped Forecast

**UUID:** 00000000-0000-0000-0046-000000000001

## Description

A processor for creating arbitrary forecasts on grouped data series.

**For more details refer to the following ****article****. **

## Input(s)

*in.data*- Data*in.mapping*- Mapping Table

## Output(s)

*out.forecast*- Forecasted Data

## Configurations

#### Dependent Column * *[single column selection]*

Select the dependent column for the creation of the models. Linear Regression and Arithmetic Forecasting can only be done with a numeric dependent column. Decision/Regression trees are automatically selected depending on the type of the dependent column: text, timestamp and integer columns lead to a decision tree, double and numeric columns lead to a regression tree.

#### Grouping Column * *[single column selection]*

Select the Column to group by. The models and forecasts will be computed for every group separately.

#### Training Signal Column * *[single column selection]*

Select the column which contains the signal indicating whether a row should be used for model creation or should only be forecasted with the created models.

#### Training Signal Content * *[string]*

If a cell in the training signal column contains the value given in this config element, the row is used for training the model. Otherwise, the row is only used for forecasting.

#### Decision Tree Forecasts *[composed]*

Decision/Regression trees are automatically selected depending on the type of the dependent column: text, timestamp and integer columns lead to a decision tree, double and numeric columns lead to a regression tree.

###### Decision Tree Forecasts > Forecast Column Name * *[column name]*

The name that should be used for the forecast column in the output. It has to be different from all available columns in the forecast dataset.

###### Decision Tree Forecasts > Independent Columns * *[multiple columns selection]*

Select the independent columns for the creation of the decision tree models.

###### Decision Tree Forecasts > Maximal number of leafs in the decision trees * *[integer]*

Select the maximum number of leafs the computed decision trees may have.

#### Linear Regression Forecasts *[composed]*

Linear regression with ordinary least squares.

###### Linear Regression Forecasts > Forecast Column Name * *[column name]*

The name that should be used for the forecast column in the output. It has to be different from all available columns in the forecast dataset.

###### Linear Regression Forecasts > Independent Columns * *[multiple columns selection]*

Select the independent columns for the creation of the linear regression models.

#### Arithmetic Forecasts *[composed]*

Computes statistical metrics over the given training sets and uses these measures as forecast values, e.g. by using "Average" as forecast function, all to-be-forecasted data will have the average of the training dataset as forecasted value.
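The "Average" case described above can be sketched in a few lines (the function name and flat-list representation are illustrative, not the processor's API):

```python
# Arithmetic forecast sketch: every to-be-forecasted row of a group receives
# the statistical metric (here: the mean) computed over that group's
# training values.

def arithmetic_forecast(training_values, n_forecast_rows):
    forecast_value = sum(training_values) / len(training_values)
    return [forecast_value] * n_forecast_rows

# Training set with average 20.0 -> both forecast rows get 20.0.
print(arithmetic_forecast([10.0, 20.0, 30.0], 2))  # [20.0, 20.0]
```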

###### Arithmetic Forecasts > Forecast Column Name * *[column name]*

The name that should be used for the forecast column in the output. It has to be different from all available columns in the forecast dataset.

###### Arithmetic Forecasts > Arithmetic Forecast Function * *[single enum selection]*

The mathematical operation that should be used as forecast value.

#### ARIMAX Forecasts *[composed]*

Autoregressive Integrated Moving Average with eXogenous regressors.

###### ARIMAX Forecasts > Forecast Column Name * *[column name]*

###### ARIMAX Forecasts > Ordering Column * *[single column selection]*

Specify a column by which each group should be sorted in ascending order before forecasting.

###### ARIMAX Forecasts > Exogenous Regressors *[multiple columns selection]*

Select the exogenous regressors for the creation of the ARIMAX models.

###### ARIMAX Forecasts > Maximum value of p * *[integer]*

Maximum number of autoregressive lags.

###### ARIMAX Forecasts > Maximum value of d * *[integer]*

Maximum value for the degree of differencing.

###### ARIMAX Forecasts > Maximum value of q * *[integer]*

Maximum number of moving average lags.

###### ARIMAX Forecasts > Maximum value of Xlag *[integer]*

Maximum number of exogenous regressor lags.

###### ARIMAX Forecasts > Forecast missing exogenous regressors * *[boolean]*

If selected, exogenous regressors with trailing zeros are assumed to be incomplete and are forecasted with an ARIMA model.

#### Gradient Boosting Forecasts *[composed]*

Gradient boosting with decision trees.

###### Gradient Boosting Forecasts > Forecast Column Name * *[column name]*

###### Gradient Boosting Forecasts > Independent Columns * *[multiple columns selection]*

Select the independent columns for the creation of the decision tree models.

###### Gradient Boosting Forecasts > Tree Depth * *[integer]*

Specifies the maximum depth of the trees used in the Gradient Boosting Regression model.

###### Gradient Boosting Forecasts > Number of iterations * *[integer]*

Number of iterations used to train the GBT algorithm. In the current version of the processor, the maximum number of iterations is limited to 100.

###### Gradient Boosting Forecasts > Learning rate * *[double]*

Level of regularization applied in the tree models. A higher level of regularization reduces the number of variables which are selected by the model and makes the model more sparse. The model selects the most influential independent variables automatically.

###### Gradient Boosting Forecasts > Subsampling rate * *[double]*

Rate of columns (features) which are used in the model training process of the Gradient Boosted Tree models. The lower the rate, the fewer variables are selected.

###### Gradient Boosting Forecasts > Loss Strategy for regression * *[single enum selection]*

The loss strategy to be applied in the process of training the GBT model (the single decision trees). Absolute error loss is especially suited when extreme values (outliers) are present in the data.

#### Support Vector Machine Forecasts *[composed]*

Regression forecasting using Support Vector Machine.

###### Support Vector Machine Forecasts > Forecast Column Name * *[column name]*

###### Support Vector Machine Forecasts > Independent Columns * *[multiple columns selection]*

Select the independent columns for the creation of the support vector machine model.

###### Support Vector Machine Forecasts > Error threshold *[double]*

The loss function error threshold. This parameter is used only for regression forecasting.

###### Support Vector Machine Forecasts > Margin Penalty * *[double]*

The soft margin penalty.

###### Support Vector Machine Forecasts > Tolerance * *[double]*

The tolerance of the convergence test.

###### Support Vector Machine Forecasts > Classification multi-class strategy *[single enum selection]*

The multi-class strategy used only for classification forecasting.

###### Support Vector Machine Forecasts > Kernel function * *[single enum selection]*

The kernel function used to exploit the kernel trick: implicitly mapping data to a high-dimensional feature space in which a linear algorithm that works exclusively with inner products is applied.

###### Support Vector Machine Forecasts > Gaussian Kernel sigma *[double]*

The smooth / width parameter of Gaussian kernel. If not specified default value of 1.0 will be used.

###### Support Vector Machine Forecasts > Polynomial Kernel degree *[integer]*

The degree used by Polynomial kernel. If not specified default value of 3 will be used.

###### Support Vector Machine Forecasts > Polynomial Kernel scale *[double]*

The scale used by Polynomial kernel. If not specified default value of 1.0 will be used.

###### Support Vector Machine Forecasts > Polynomial Kernel offset *[double]*

The offset used by Polynomial kernel. If not specified default value of 1.0 will be used.

#### Hidden Markov Models *[composed]*

Uses first order Hidden Markov Models (HMM) for sequence labeling. Each group determined by the grouping column is assumed to have multiple sequences of which some are marked with the training signal. The transition, emission and initial probabilities are estimated using maximum likelihood estimation.

###### Hidden Markov Models > Forecast Column Name * *[column name]*

###### Hidden Markov Models > Observations Column * *[single column selection]*

Select a column which contains the symbols output by the hidden Markov process.

###### Hidden Markov Models > Column for Sequence Grouping * *[single column selection]*

Select a column whose values identify the sequences within a group that should be trained on or labeled.

#### Number of data partitions *[integer]*

The number of data partitions the data should be arranged in. If set, data is re-arranged through re-partitioning by the rows' hash values before performing the 'group by' operation. Can improve performance.

#### Output Forecast Only * *[boolean]*

When this toggle is checked, the output will only contain the forecasted test data. Otherwise, the forecasts are also done on the training data (in-sample).

# Grouped Linear Regression

**UUID:** 00000000-0000-0000-0061-000000000001

## Description

Divides the input into several groups, computes a linear regression model for every group, and outputs each model as a row in its output. Also outputs a group-optimized data structure identical to the input but suitable for performant re-grouping in similar processors.

**For more details refer to the following ****article****. **

## Input(s)

*in.input*- Input

## Output(s)

*out.models*- Linear Regression Models*out.data*- Group Optimized Input

## Configurations

#### Dependent Column * *[single column selection]*

Select the dependent column for the creation of the linear regression models.

#### Independent Columns * *[multiple columns selection]*

Select the independent columns for the creation of the linear regression models.

#### Group by Column * *[single column selection]*

Select the Column to group by. A linear regression model will be computed for every group.

#### Number of data partitions *[integer]*

The number of data partitions the input should be arranged in. If set, data is re-arranged through re-partitioning by the rows' hash values before performing the 'group by' operation. Can improve performance.

#### Enable warning messages in the server log * *[boolean]*

In some cases this processor may flood the server log with warnings, so emitting these warning messages can be disabled here.

# Grouped Root Mean Square

**UUID:** 00000000-0000-0000-0178-000000000001

## Description

Processor for computing the root mean square for data grouped by a time-window size.

**For more details refer to the following ****article****. **

## Input(s)

*in.inputData*- Input Data

## Output(s)

*out.outputData*- RMS per chosen Column

## Configurations

#### Timestamps * *[single column selection]*

Timestamps for the sensor-data.

#### Columns with Sensor-Data *[multiple columns selection]*

Columns for which the RMS should be computed. If not set, all columns with an appropriate type (Double, Int, Numeric) are used.

#### Size of Windows in seconds * *[integer]*

Size of the windows for which the RMS should be computed (in seconds).
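The windowed RMS computation can be sketched as follows (timestamps are assumed to be plain seconds here, and the bucketing-by-window step is a simplification of the processor's grouping):

```python
# Grouped RMS sketch: bucket samples into fixed-size time windows, then
# compute the root mean square per window.
import math

def rms_per_window(timestamps, values, window_seconds):
    buckets = {}
    for t, v in zip(timestamps, values):
        # Integer window index: which window this sample falls into.
        buckets.setdefault(int(t // window_seconds), []).append(v)
    return {w: math.sqrt(sum(v * v for v in vs) / len(vs))
            for w, vs in sorted(buckets.items())}

ts = [0.0, 0.5, 1.0, 1.5]
vals = [3.0, 4.0, 0.0, 2.0]
# Window 0 holds [3, 4] -> sqrt(12.5); window 1 holds [0, 2] -> sqrt(2).
print(rms_per_window(ts, vals, 1))
```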

# Improved Linear Regression (Forecasting)

**UUID:** 00000000-0000-0000-0054-000000000001

## Description

Performs a linear regression with one dependent and one or multiple independent variables. If you want to use this processor without forecasting, simply use a replication processor and add the same input twice. The learning objective is to minimize the squared error, with regularization. The specific squared-error loss function used is L = 1/(2n) ||A · coefficients - y||². Multiple types of regularization are supported:

- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)

The default, with regularization 0.0 and elastic-net 0.0, is equal to OLS.
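The objective can be written out numerically as a check on the description (the Spark-style parameterization with an overall strength and an elastic-net mixing parameter is an assumption based on the configuration options below):

```python
# Regularized squared-error objective sketch:
#   L = 1/(2n) ||A w - y||^2 + reg_param * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2)
# With reg_param = 0.0 this reduces to the plain OLS loss.
import numpy as np

def loss(A, y, w, reg_param=0.0, alpha=0.0):
    n = len(y)
    squared_error = 0.5 / n * np.sum((A @ w - y) ** 2)
    l1 = np.sum(np.abs(w))            # Lasso term
    l2 = 0.5 * np.sum(w ** 2)         # ridge term
    return squared_error + reg_param * (alpha * l1 + (1.0 - alpha) * l2)

A = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])              # perfect fit: A w == y
print(loss(A, y, w))                                  # 0.0 (pure OLS, no error)
print(loss(A, y, w, reg_param=1.0, alpha=1.0))        # 3.0 (pure L1 penalty)
```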

**For more details refer to the following ****article****. **

## Input(s)

*in.input1*- Training Input Values*in.input2*- Input Values for forecast prediction

## Output(s)

*out.out_forecast*- Output Forecast*out.out_regression_results*- Output Linear Regression Results

## Configurations

#### Dependent variable * *[single column selection]*

Select the dependent variable.

#### Independent variables *[multiple columns selection]*

Select the independent explanatory variables. All selected variables MUST be contained in the forecasting data set as well.

#### New column name for the computed results. * *[column name]*

The name which should be used for the new column, where the results of the linear regression forecasting are stored.

#### Max Iteration *[integer]*

Set the maximum number of iterations. Default is 100.

#### Regularization Param *[double]*

The regularization parameter. For more information check the link. Default is 0.0 (in combination with elastic net being 0.0, this means that OLS is used).

#### ElasticNet mixing parameter *[double]*

This number between 0.0 and 1.0 sets the mix between the L2 and L1 penalties (ridge regression vs. LASSO): 0.0 means the penalty is only L2 (ridge regression) and 1.0 means the penalty is only L1 (LASSO). Default is 0.0.

#### Convergence tolerance of iterations *[double]*

Set the convergence tolerance of iterations. A smaller value will lead to higher accuracy at the cost of more iterations. Default is 1E-6.

#### Solver name *[single enum selection]*

Set the solver algorithm used for optimization. In case of linear regression, this can be "l-bfgs" (a limited-memory quasi-Newton optimization method), "normal" (normal equation as an analytical solution to the linear regression problem, using weighted least squares) or "auto" (the solver algorithm is selected automatically). Default is "auto".

#### Should the intercept be fitted * *[boolean]*

Set if the intercept should be fitted to the model. Default is true.

# Inverted Filter

**UUID:** 00000000-0000-0000-0021-000000000001

## Description

Discards entries in a dataset that appear in another dataset.

**For more details refer to the following ****article****. **

## Input(s)

*in.filters*- Filters*in.data*- Target Data

## Output(s)

*out.output*- Filtered Output

## Configurations

#### Filter column * *[single column selection]*

Column that contains the filter values to exclude from the other dataset.

#### Match column * *[single column selection]*

Column in the target dataset that contains the values to match the filter
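The filter semantics amount to an anti-join on the match column, which can be sketched as (rows are simplified to dicts; names are illustrative):

```python
# Inverted filter sketch: drop rows from the target dataset whose match
# value appears in the filter dataset.

def inverted_filter(filter_values, target_rows, match_key):
    excluded = set(filter_values)   # set lookup keeps this O(1) per row
    return [row for row in target_rows if row[match_key] not in excluded]

filters = ["A", "C"]
data = [{"id": "A", "v": 1}, {"id": "B", "v": 2}, {"id": "C", "v": 3}]
print(inverted_filter(filters, data, "id"))  # [{'id': 'B', 'v': 2}]
```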

# Join Pattern Mining

**UUID:** 00000000-0000-0000-0150-000000000002

## Description

Processor that finds equal patterns within the columns.

**For more details refer to the following ****article****. **

## Input(s)

*in.firstTable*- First Input Data*in.secondTable*- Second Input Data

## Output(s)

*out.rules*- Examples_With_Join_Rules

## Configurations

#### Selected columns first table * *[multiple columns selection]*

Selected columns first table

#### Selected columns second table * *[multiple columns selection]*

Selected columns second table

#### String Length * *[integer]*

String Length

# Lexical Columization

**UUID:** 00000000-0000-0000-0020-000000000001

## Description

Generates columns from a column containing multiple values and writes a value from another column into the corresponding one.

**For more details refer to the following ****article****. **

## Input(s)

*in.input*- Lexical binarization input

## Output(s)

*out.output*- Lexical binarization output*out.outputSummary*- Distinct Summary Output

## Configurations

#### Selected column * *[single column selection]*

Column with multiple values to columize.

#### Value column * *[single column selection]*

Column containing values that will be written into the matching column.

#### Separator *[string]*

Specify the pattern that serves as separator.

#### Prefix *[string]*

Choose a prefix for the new binarized columns.

#### Maximum number of keywords *[integer]*

Maximum number of keywords to be chosen for binarization. The keywords with the most occurrences are taken. Defaults to 20 if left blank.

#### Case Sensitive * *[boolean]*

Choose whether binarization should be case sensitive
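The split-and-spread behaviour described above can be sketched as follows (column names, prefix and the `None` fill value are illustrative assumptions):

```python
# Lexical columization sketch: split the selected column on the separator,
# create one column per distinct keyword, and write the value column's
# content into each matching generated column.

def columize(rows, separator=";", prefix="kw_"):
    keywords = sorted({kw for r in rows for kw in r["keywords"].split(separator)})
    out = []
    for r in rows:
        present = set(r["keywords"].split(separator))
        out.append({prefix + kw: (r["value"] if kw in present else None)
                    for kw in keywords})
    return out

rows = [{"keywords": "red;blue", "value": 7}, {"keywords": "blue", "value": 3}]
print(columize(rows))
# [{'kw_blue': 7, 'kw_red': 7}, {'kw_blue': 3, 'kw_red': None}]
```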

# Linear Support Vector Machine

**UUID:** 00000000-0000-0000-0142-000000000002

## Description

The linear SVM is a standard method for large-scale classification tasks.

**For more details refer to the following ****article****. **

## Input(s)

*in.trainingData*- Training Data*in.forecastData*- Forecast Data

## Output(s)

*out.out_forecast*- Forecast

## Configurations

#### Dependent attribute * *[single column selection]*

Select the dependent attribute from the given training set columns. It must contain at most two distinct values (0 and 1); it may also contain only one of them.

#### Independent attributes *[multiple columns selection]*

Select independent attributes from the given training set columns

#### New forecast Column * *[column name]*

Name of the Column that gets added to the data set and contains the forecast

#### Threshold *[double]*

Choose the threshold of your model

#### Iterations *[integer]*

Choose the maximum number of iterations of your model

#### Step Size *[double]*

Choose the step size of your model

#### Regularization parameter *[double]*

Choose the Regularization parameter of your model

# Model Application

**UUID:** 00000000-0000-0000-0084-000000000000

## Description

Can load and apply models generated by other Processors.

**For more details refer to the following ****article****. **

## Input(s)

*in.fcInput*- Forecast Input

## Output(s)

*out.fcOutput*- Forecast Output

## Configurations

#### Model * *[model application]*

The model which should be used for forecasting the input data.

# Moving Average

**UUID:** 00000000-0000-0000-0080-000000000001

## Description

Processor for analyzing data by creating series of averages of different subsets of the data set. At least one column has to be defined for which the moving average will be calculated. Three modes are provided to calculate the moving average: simple, triangular and weighted (see their definitions for detailed descriptions). Additionally, it is possible to define a left and a right window size to set the frame which is used for calculating the average.

**For more details refer to the following ****article****. **

## Input(s)

*in.inputData*- Input Data

## Output(s)

*out.movingAverageOutput*- Moving Window output

## Configurations

#### Columns for analysis * *[multiple columns selection]*

Columns which will be analysed using the chosen moving average type.

#### Moving Average Type * *[single enum selection]*

Moving Average Types: simple (The simple algorithm calculates the average of values in one window.), triangular (The triangular algorithm calculates the average of all previously calculated simple moving average values in one window.) and weighted (The weighted algorithm calculates the average of all values in one window, but puts weights to the values depending on the distance to the position of the actual line).
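The simple and weighted variants can be sketched as follows for a symmetric window of one element on each side (edge handling is omitted; the distance-based weighting scheme is an assumption about how "weighted" is defined):

```python
# Moving average sketch over a window of `left` elements before and `right`
# elements after the current one. Edge rows without a full window are skipped.

def simple_ma(values, left=1, right=1):
    out = []
    for i in range(left, len(values) - right):
        window = values[i - left : i + right + 1]
        out.append(sum(window) / len(window))
    return out

def weighted_ma(values, left=1, right=1):
    out = []
    for i in range(left, len(values) - right):
        # Weight shrinks with distance from the current element.
        pairs = [(values[j], left + right + 1 - abs(i - j))
                 for j in range(i - left, i + right + 1)]
        total_weight = sum(w for _, w in pairs)
        out.append(sum(v * w for v, w in pairs) / total_weight)
    return out

values = [1.0, 2.0, 3.0, 4.0]
print(simple_ma(values))    # [2.0, 3.0]
print(weighted_ma(values))  # [2.0, 3.0] -- equal here because the data is linear
```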

#### Columns for grouping *[multiple columns selection]*

Columns which are used for grouping the data before any moving average calculations are done. If this is set, it is mandatory to also specify at least one sorting column.

#### Enable sorting *[composed]*

Sorts the data by the specified columns prior to calculation.

###### Enable sorting > Column to sort by * *[single column selection]*

###### Enable sorting > Sort order for column * *[single enum selection]*

#### Size of left Window * *[integer]*

Maximum size of the left window from the current element. The left and right window will sum up to the window, which is used.

#### Size of right Window * *[integer]*

Maximum size of the right window from the current element. The left and right window will sum up to the window, which is used.

#### Smoothness for Windows enabled * *[boolean]*

If smoothness is enabled, the window shrinks at the edges to its biggest possible size. E.g. if the moving average has to be calculated for the second element with a symmetric window, a window bigger than 3 elements is not possible; the window therefore adjusts to a size of 3.

#### Padding for edge rows *[single enum selection]*

Specifies how edge rows for which no moving average can be calculated are treated in the result.

#### Result Prefix * *[string]*

Result prefix, which is added to the original names of the columns that have been used for the moving average.

# Multiclass Linear Support Vector Machine

**UUID:** 00000000-0000-0000-0142-000000000003

## Description

The Multiclass SVM is a standard method for large-scale classification tasks of several classes based on the One-Versus-One method.

**For more details refer to the following ****article****. **

## Input(s)

*in.trainingData*- Training Data*in.forecastData*- Forecast Data

## Output(s)

*out.outForecast*- Forecast

## Configurations

#### Dependent variable * *[single column selection]*

Select dependent variable from the given training set columns

#### Independent variables * *[multiple columns selection]*

Select independent variables from the given training set columns

#### New forecast column * *[column name]*

Name of the column that gets added to the data set and contains the forecast

#### Iterations *[integer]*

Choose the number of times the optimization function runs to find the extremum. A high number of iterations can lead to better results but may also lead to overfitting.

#### Step Size *[double]*

Choose the step size of the optimization function of the generated models. The default value is 0.1. Smaller values lead to better results but may get stuck in a local extremum; bigger values may miss the exact extremum. The ideal value of this variable also depends on the size of your dataset.

#### Regularization parameter *[double]*

Choose the regularization parameter of your model in order to penalize misclassified values and to simplify the model. Choose small values when the data has high variance. It should be between 0 and 1.

# Network Rule Generation

**UUID:** 00000000-0000-0000-0157-000000000001

## Description

This processor generates association rules with a single item as left hand side and a single item as right hand side. It derives a network between the different items by using their likeness regarding transactions. An edge is formed whenever the likeness reaches "Min Similarity". The edges then are interpreted as association rules. Their confidence depends on the original data. As input it requires a list containing single-item transactions (one item to one user in each row). The generated rules can be applied using the Association Rule Generation.

**For more details refer to the following ****article****. **

## Input(s)

*in.trainingData*- Longlist of Transactions

## Output(s)

*out.rules*- Rules*out.edgesOfNetwork*- Edges of the Network

## Configurations

#### Users * *[single column selection]*

Column containing the User IDs.

#### Items * *[single column selection]*

Column containing the Item IDs.

#### Min similarity *[double]*

The minimal similarity between two items to form an edge in the network.

#### Min confidence *[double]*

The minimal confidence for the association rules.

#### Value for missing transactions * *[double]*

Value with which missing transactions are weighted when computing similarity. (Existing transactions are weighted 1.0.)
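The item "likeness" idea can be sketched as follows: each item becomes a vector over users (1.0 where a transaction exists, the configured missing-transaction value otherwise), and two items form an edge when their similarity reaches "Min similarity". Cosine similarity is used here as an illustrative measure; the processor's exact similarity function is an assumption.

```python
# Item similarity sketch for the network rule generation.
import math

def item_vector(item, transactions, users, missing=0.0):
    # 1.0 for an existing (user, item) transaction, `missing` otherwise.
    return [1.0 if (u, item) in transactions else missing for u in users]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

transactions = {(1, "X"), (2, "X"), (2, "Y")}
users = [1, 2]
sim = cosine(item_vector("X", transactions, users),
             item_vector("Y", transactions, users))
print(round(sim, 3))  # 0.707 -- X=[1,1] vs Y=[0,1]
```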

# No Production Period Detector

**UUID:** 00000000-0000-0000-0075-000000000001

## Description

The three power consumptions and the output counter of a packing machine are provided as a function of time at a sampling frequency of 4 Hz. This processor identifies the periods of time in which the machine is not producing due to an unplanned stop and allows to compute the Overall Equipment Effectiveness parameter.

**For more details refer to the following ****article****. **

## Input(s)

*in.input*- Input Values

## Output(s)

*out.out_all*- Output All Rows*out.out_stops*- Output Stops

## Configurations

#### P1 * *[single column selection]*

Select the column P1

#### P2 * *[single column selection]*

Select the column P2

#### P3 * *[single column selection]*

Select the column P3

#### Index * *[single column selection]*

Select the index column

#### Production Rate * *[single column selection]*

Select the column with the production rate

#### No Production Period Detection - Sample Rate *[integer]*

Set the sample rate. Default is 4.

#### Wavelength computation - Coefficient for difference Threshold *[double]*

Coefficient for multiplication with the sigma during wavelength computation. This is for checking if the wavelength should be set to 0 at a certain timepoint or not. Default is 0.66.

#### Maximum wavelength used during wavelength measurements *[integer]*

Maximum wavelength used during wavelength measurements. Default is 12.

#### Maximal Production Rate *[integer]*

Set the maximal production rate; higher values will be set to 0. Default is 10.

#### SlowDown - Minimal increasing trend *[double]*

trend_min defines the minimal inclination of the linear fit (B) of the smoothed period to detect a significant increase approaching the beginning of the STOP phase (default value 0.008, for 4 Hz sampling).

#### SlowDown - Smooth length *[integer]*

The parameter smooth_len (A) allows to tune the smoothing of the input period signal. This parameter is expressed in “pixels” and the default value is 11 pixels.

#### SlowDown - Number of frames to look before and after stops *[integer]*

Number of frames to look before and after stops for pattern detection. Default is 200.

#### SlowDown - Minimal Wavelength before the stop *[double]*

The parameter wl_end_min (C) allows to set a lower limit on the smoothed period right before the beginning of the STOP phase (default value 5, for 4 Hz sampling).

#### 2Steps - Minimal decreasing trend *[double]*

trend_min defines the minimal inclination of the linear fit (B) of the smoothed period to detect a significant decrease after the end of the STOP phase (default value 0.01, for 4 Hz sampling).

#### 2Steps - Smooth length *[integer]*

The parameter smooth_len (A) allows to tune the smoothing of the input period signal. This parameter is expressed in “pixels” and the default value is 11 pixels.

#### 2Steps - Number of frames to look before and after stops *[integer]*

Number of frames to look before and after stops for pattern detection. Default is 200.

#### 2Steps - Initial minimal Wavelength *[double]*

The parameter wl_ini_min (C) allows to set a lower limit on the smoothed period right after the end of the STOP phase (default value 6, for 4Hz sampling).

#### PowerSave - Difference threshold *[double]*

The parameter n_sig (C) allows to define the level of significance of step-like changes in the data. The default value is 6 to ensure that only strong drops are considered.

#### PowerSave - Minimal length *[integer]*

The parameter ps_len_min (A) defines the lower bound of the time window after the beginning of the STOP in which the three signals simultaneously show a significant drop. The default value is 28 s ∙ 4 Hz = 112 pixels.

#### PowerSave - Maximal length *[integer]*

The parameter ps_len_max (B) defines the upper bound of the time window after the beginning of the STOP in which the three signals simultaneously show a significant drop. The default value is 34 s ∙ 4 Hz = 136 pixels.

#### BOXall - Level of significance for step-like changes *[double]*

The parameter n_sig_step (A) allows to define the level of significance of step-like changes (both up-ward and down-ward) in the data. The default value is 10 to ensure that only strong steps are considered.

#### BOXall - Maximal differential change *[double]*

The parameter n_sig_diff_max (B) allows to define the maximal differential change allowed across the three signals (how different they can be). The default value is 8, while smaller values would require stronger similarities across the three signals to detect this pattern

#### BOXall - Minimal number of steps *[integer]*

The parameter n_step_min allows to define the minimal number of steps (both up-ward and down-ward) that the three signals must show in order to detect this pattern. The default value is set to 10. A single peak counts as 2 steps (one up-ward followed by one down-ward), as does a box-shaped profile.

#### BOXdiff - Difference threshold *[double]*

The parameter n_sig (C) allows to define the level of significance of step-like changes in the data. The default value is 20 to ensure that only strong drops are considered.

#### BOXdiff - Minimum length *[integer]*

The parameters box_len_min (A) and box_len_max (B) define the range of the time length in which two of the three signals show an off-set to greater values. The default values are 2.5 s ∙ 4 Hz = 10 pixels and 3.5 s ∙ 4 Hz = 14 pixels, respectively.

#### BOXdiff - Maximum length *[integer]*

The parameters box_len_min (A) and box_len_max (B) define the range of the time length in which two of the three signals show an off-set to greater values. The default values are 2.5 s ∙ 4 Hz = 10 pixels and 3.5 s ∙ 4 Hz = 14 pixels, respectively.

#### Church - Coefficient for the difference threshold *[double]*

The parameter n_sig (D) allows to define the level of significance of step-like changes in the data. The default value is 6 to ensure that only strong drops are considered.

#### Church - Coefficient for the difference threshold of the two church sides *[double]*

The parameter step_min (C) allows to define the minimal difference between the value at the feet of the tower (on the left side) and the value at the beginning of the roof of the church (right side of the tower). The default value is set to 3.

#### Church - Maximum length *[integer]*

The parameter tower_len_min (A) allows to define the maximal time length of the feature recognisable as the tower of the church (default value 3 pixels, for 4Hz sampling).

#### Church - Minimal length of the roof *[integer]*

The parameter roof_len_min (B) allows to define the minimal time length of the feature recognisable as the roof of the church (default value 6 pixels, for 4Hz sampling).

# Python Script Data Generator

**UUID:** 00000000-0000-0000-0065-000000000001

## Description

Processor that executes a Python Script to produce data.

**For more details refer to the following article.**

## Output(s)

*out.output*- Result of Python Script

## Configurations

#### Python Script * *[string]*

Python Script to execute. Script should return at least one result of any of the supported outputs - datasets, plot images or models.
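
As a minimal sketch of what such a generator script might look like: the script builds a small table of rows. How the result is registered as the processor's dataset output is platform-specific and therefore omitted here; the `columns`, `rows`, and `result` names are purely illustrative.

```python
# Hypothetical generator script body: builds a small table of rows.
# The platform-specific call that registers `result` as the dataset
# output is intentionally not shown.
import random

random.seed(42)  # deterministic output for reproducibility

columns = ["id", "value"]
rows = [(i, random.random()) for i in range(5)]

result = {"columns": columns, "rows": rows}
```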

## Python Environment: 'Python'

*Python environment preinstalled with packages commonly used by data scientists.*

__Python 3.7.3__

- absl-py-0.8.0
- asn1crypto-0.24.0
- astor-0.8.0
- attrs-19.1.0
- Automat-0.7.0
- avro-python3-1.7.7
- constantly-15.1.0
- cryptography-2.6.1
- cssselect-1.1.0
- cycler-0.10.0
- entrypoints-0.3
- fastavro-0.22.5
- gast-0.3.2
- google-pasta-0.1.7
- grpcio-1.23.0
- h5py-2.10.0
- hyperlink-19.0.0
- idna-2.8
- incremental-17.5.0
- joblib-0.13.2
- Keras-2.3.0
- Keras-Applications-1.0.8
- Keras-Preprocessing-1.1.0
- keyring-17.1.1
- keyrings.alt-3.1.1
- kiwisolver-1.1.0
- lxml-4.4.1
- Markdown-3.1.1
- matplotlib-3.1.1
- nltk-3.4.5
- numpy-1.17.2
- od-python-framework-0.2.0
- pandas-0.25.1
- parsel-1.5.2
- patsy-0.5.1
- Pillow-6.1.0
- plotly-4.1.1
- protobuf-3.9.2
- pyasn1-0.4.7
- pyasn1-modules-0.2.6
- pycrypto-2.6.1
- PyDispatcher-2.0.5
- PyGObject-3.30.4
- PyHamcrest-1.9.0
- pyOpenSSL-19.0.0
- pyparsing-2.4.2
- python-dateutil-2.8.0
- pytz-2019.2
- pyxdg-0.25
- PyYAML-5.1.2
- queuelib-1.5.0
- retrying-1.3.3
- scikit-learn-0.21.3
- scipy-1.3.1
- Scrapy-1.7.3
- seaborn-0.9.0
- SecretStorage-2.3.1
- service-identity-18.1.0
- six-1.12.0
- statsmodels-0.10.1
- tensorboard-1.14.0
- tensorflow-1.14.0
- tensorflow-estimator-1.14.0
- termcolor-1.1.0
- Twisted-19.7.0
- w3lib-1.21.0
- Werkzeug-0.16.0
- wrapt-1.11.2
- xgboost-0.90
- zope.interface-4.6.0

#### Timeout for Python Script execution * *[integer]*

Time (in seconds) to wait for the Python Service to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted. The timer starts when the processor submits the Python script and the data to the Python Service.

#### Generate Empty Dataset Output * *[boolean]*

By default, the processor requires a single dataset to be registered as output in the Python script. If the Python script does not generate and register any dataset output (e.g. the script is only used to generate plots/images), this toggle can be used to generate an empty dataset output so that the processor does not raise an error. If this processor is not the last one in the workflow and an empty dataset is generated, succeeding processors may raise validation errors because of the empty dataset on their input.

#### Add these columns *[manual column specification]*

The column information will be forwarded to the output of the processor.

# Python-Script Dual Input

**UUID:** 00000000-0000-0000-0065-000000000003

## Description

Processor that executes a Python script on its inputs.

**For more details refer to the following article.**

## Input(s)

*in.dataframe1*- First Input*in.dataframe2*- Second Input

## Output(s)

*out.output*- Result of Python Script

## Configurations

#### Input name for first input in Script * *[string]*

Name used in the Python Script to reference the input data frame present at First Input.

#### Input name for second input in Script * *[string]*

Name used in the Python Script to reference the input data frame present at Second Input.

#### Python Script * *[string]*

Python Script to execute. Script should return at least one result of any of the supported outputs - datasets, plot images or models.
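
As a minimal sketch of a dual-input script: the configured "Input name" options determine the variable names under which the two inputs appear inside the script. The names `customers` and `orders` below are hypothetical configured values, and plain dicts stand in for whatever data-frame type the platform actually provides; the output registration call is omitted.

```python
# Hypothetical script body for a dual-input processor; assume the two
# "Input name" options were set to `customers` and `orders`.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bob"}]
orders = [{"customer_id": 1, "amount": 10.0}, {"customer_id": 1, "amount": 5.0}]

# Join the two inputs on customer_id, keeping only matching orders.
by_id = {c["customer_id"]: c["name"] for c in customers}
result = [
    {"name": by_id[o["customer_id"]], "amount": o["amount"]}
    for o in orders
    if o["customer_id"] in by_id
]
```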

## Python Environment: 'Python'

*Python environment preinstalled with packages commonly used by data scientists.*

__Python 3.7.3__

- absl-py-0.8.0
- asn1crypto-0.24.0
- astor-0.8.0
- attrs-19.1.0
- Automat-0.7.0
- avro-python3-1.7.7
- constantly-15.1.0
- cryptography-2.6.1
- cssselect-1.1.0
- cycler-0.10.0
- entrypoints-0.3
- fastavro-0.22.5
- gast-0.3.2
- google-pasta-0.1.7
- grpcio-1.23.0
- h5py-2.10.0
- hyperlink-19.0.0
- idna-2.8
- incremental-17.5.0
- joblib-0.13.2
- Keras-2.3.0
- Keras-Applications-1.0.8
- Keras-Preprocessing-1.1.0
- keyring-17.1.1
- keyrings.alt-3.1.1
- kiwisolver-1.1.0
- lxml-4.4.1
- Markdown-3.1.1
- matplotlib-3.1.1
- nltk-3.4.5
- numpy-1.17.2
- od-python-framework-0.2.0
- pandas-0.25.1
- parsel-1.5.2
- patsy-0.5.1
- Pillow-6.1.0
- plotly-4.1.1
- protobuf-3.9.2
- pyasn1-0.4.7
- pyasn1-modules-0.2.6
- pycrypto-2.6.1
- PyDispatcher-2.0.5
- PyGObject-3.30.4
- PyHamcrest-1.9.0
- pyOpenSSL-19.0.0
- pyparsing-2.4.2
- python-dateutil-2.8.0
- pytz-2019.2
- pyxdg-0.25
- PyYAML-5.1.2
- queuelib-1.5.0
- retrying-1.3.3
- scikit-learn-0.21.3
- scipy-1.3.1
- Scrapy-1.7.3
- seaborn-0.9.0
- SecretStorage-2.3.1
- service-identity-18.1.0
- six-1.12.0
- statsmodels-0.10.1
- tensorboard-1.14.0
- tensorflow-1.14.0
- tensorflow-estimator-1.14.0
- termcolor-1.1.0
- Twisted-19.7.0
- w3lib-1.21.0
- Werkzeug-0.16.0
- wrapt-1.11.2
- xgboost-0.90
- zope.interface-4.6.0

#### Timeout for Python Script execution * *[integer]*

Time (in seconds) to wait for the Python Service to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted. The timer starts when the processor submits the Python script and the data to the Python Service.

#### Generate Empty Dataset Output * *[boolean]*

By default, the processor requires a single dataset to be registered as output in the Python script. If the Python script does not generate and register any dataset output (e.g. the script is only used to generate plots/images), this toggle can be used to generate an empty dataset output so that the processor does not raise an error. If this processor is not the last one in the workflow and an empty dataset is generated, succeeding processors may raise validation errors because of the empty dataset on their input.

#### Add these columns *[manual column specification]*

The column information will be forwarded to the output of the processor.

# Python-Script Single Input

**UUID:** 00000000-0000-0000-0065-000000000002

## Description

Processor that executes a Python Script on its input.

**For more details refer to the following article.**

## Input(s)

*in.dataframe*- Input

## Output(s)

*out.output*- Result of Python Script

## Configurations

#### Input name in Script * *[string]*

Name used in the Python Script to reference the input data frame.

#### Python Script * *[string]*

Python Script to execute. Script should return at least one result of any of the supported outputs - datasets, plot images or models.
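
As a minimal sketch of a single-input script: assume the "Input name in Script" option was set to `df` (a hypothetical value), with a plain list of dicts standing in for the platform's data-frame type. The script derives a new column and filters rows; the output registration call is omitted.

```python
# Hypothetical single-input script body; `df` is the configured input name.
df = [{"city": "Berlin", "temp_c": 21.0}, {"city": "Oslo", "temp_c": -3.5}]

# Derive a Fahrenheit column and keep only above-freezing rows.
result = [
    dict(row, temp_f=row["temp_c"] * 9 / 5 + 32)
    for row in df
    if row["temp_c"] > 0
]
```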

## Python Environment: 'Python'

*Python environment preinstalled with packages commonly used by data scientists.*

__Python 3.7.3__

- absl-py-0.8.0
- asn1crypto-0.24.0
- astor-0.8.0
- attrs-19.1.0
- Automat-0.7.0
- avro-python3-1.7.7
- constantly-15.1.0
- cryptography-2.6.1
- cssselect-1.1.0
- cycler-0.10.0
- entrypoints-0.3
- fastavro-0.22.5
- gast-0.3.2
- google-pasta-0.1.7
- grpcio-1.23.0
- h5py-2.10.0
- hyperlink-19.0.0
- idna-2.8
- incremental-17.5.0
- joblib-0.13.2
- Keras-2.3.0
- Keras-Applications-1.0.8
- Keras-Preprocessing-1.1.0
- keyring-17.1.1
- keyrings.alt-3.1.1
- kiwisolver-1.1.0
- lxml-4.4.1
- Markdown-3.1.1
- matplotlib-3.1.1
- nltk-3.4.5
- numpy-1.17.2
- od-python-framework-0.2.0
- pandas-0.25.1
- parsel-1.5.2
- patsy-0.5.1
- Pillow-6.1.0
- plotly-4.1.1
- protobuf-3.9.2
- pyasn1-0.4.7
- pyasn1-modules-0.2.6
- pycrypto-2.6.1
- PyDispatcher-2.0.5
- PyGObject-3.30.4
- PyHamcrest-1.9.0
- pyOpenSSL-19.0.0
- pyparsing-2.4.2
- python-dateutil-2.8.0
- pytz-2019.2
- pyxdg-0.25
- PyYAML-5.1.2
- queuelib-1.5.0
- retrying-1.3.3
- scikit-learn-0.21.3
- scipy-1.3.1
- Scrapy-1.7.3
- seaborn-0.9.0
- SecretStorage-2.3.1
- service-identity-18.1.0
- six-1.12.0
- statsmodels-0.10.1
- tensorboard-1.14.0
- tensorflow-1.14.0
- tensorflow-estimator-1.14.0
- termcolor-1.1.0
- Twisted-19.7.0
- w3lib-1.21.0
- Werkzeug-0.16.0
- wrapt-1.11.2
- xgboost-0.90
- zope.interface-4.6.0

#### Timeout for Python Script execution * *[integer]*

Time (in seconds) to wait for the Python Service to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted. The timer starts when the processor submits the Python script and the data to the Python Service.

#### Generate Empty Dataset Output * *[boolean]*

By default, the processor requires a single dataset to be registered as output in the Python script. If the Python script does not generate and register any dataset output (e.g. the script is only used to generate plots/images), this toggle can be used to generate an empty dataset output so that the processor does not raise an error. If this processor is not the last one in the workflow and an empty dataset is generated, succeeding processors may raise validation errors because of the empty dataset on their input.

#### Add these columns *[manual column specification]*

The column information will be forwarded to the output of the processor.

# R-Script Data Generator

**UUID:** 00000000-0000-0000-0030-000000000001

## Description

Processor that executes an R Script to produce data.

**For more details refer to the following article.**

## Output(s)

*out.output*- Result of R Script

## Configurations

#### R Script * *[string]*

R Script to execute. The script must have a return() statement. It will be wrapped inside a function call with no parameters.

#### Timeout (s) for R Script execution * *[integer]*

Time (in seconds) to wait for the R Server to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted and the connection of this Processor to the R Server is released. The timeout starts when the Processor submits the R script and the data to the R Server.

#### Add these columns *[manual column specification]*

The column information will be forwarded to the output of the processor.

# R-Script Dual Input

**UUID:** 00000000-0000-0000-0030-000000000003

## Description

Processor that executes an R Script on its inputs.

**For more details refer to the following article.**

## Input(s)

*in.dataframe1*- First Input*in.dataframe2*- Second Input

## Output(s)

*out.output*- Result of R Script

## Configurations

#### Input name for first input in Script * *[string]*

Name used in the R Script to reference the input data.frame present at First Input.

#### Input name for second input in Script * *[string]*

Name used in the R Script to reference the input data.frame present at Second Input.

#### R Script * *[string]*

R Script to execute. The script must have a return() statement. It will be wrapped inside a function call with no parameters.

#### Timeout (s) for R Script execution * *[integer]*

Time (in seconds) to wait for the R Server to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted and the connection of this Processor to the R Server is released. The timeout starts when the Processor submits the R script and the data to the R Server.

#### Add these columns *[manual column specification]*

The column information will be forwarded to the output of the processor.

# R-Script Single Input

**UUID:** 00000000-0000-0000-0030-000000000002

## Description

Processor that executes an R Script on its input.

**For more details refer to the following article.**

## Input(s)

*in.dataframe*- Input

## Output(s)

*out.output*- Result of R Script

## Configurations

#### Input name in Script * *[string]*

Name used in the R Script to reference the input data.frame.

#### R Script * *[string]*

R Script to execute. The script must have a return() statement. It will be wrapped inside a function call with no parameters.

#### Timeout (s) for R Script execution * *[integer]*

Time (in seconds) to wait for the R Server to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted and the connection of this Processor to the R Server is released. The timeout starts when the Processor submits the R script and the data to the R Server.

#### Add these columns *[manual column specification]*

The column information will be forwarded to the output of the processor.

# Random Forest Classification Forecast

**UUID:** 00000000-0000-0000-0048-000000000001

## Description

Processor forecasting a classification out of the given data by modelling a random forest.

**For more details refer to the following article.**

## Input(s)

*in.trainingData*- Training Data*in.forecastData*- Forecast Data

## Output(s)

*out.outputData*- Forecast Output*out.outputDebug*- Debug Output*out.outputMapping*- Mapping Output*out.outputFeatureImportance*- Feature Importance Output

## Configurations

#### Dependent attribute * *[single column selection]*

Dependent attribute

#### Independent attributes * *[multiple columns selection]*

Independent attributes

#### Forecast column name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespace!

#### Maximum number of trees *[integer]*

The maximum number of trees to be generated

#### Maximum number of trees shown in the result section (sorted by weight) *[integer]*

The maximum number of trees shown in the result section, sorted by weight. Default is 10.

#### Maximum depth of each tree *[integer]*

The maximum depth each tree may have. If not set, the default is 5.

#### Maximum number of bins *[integer]*

Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be >= 2 and >= the number of categories in any categorical feature. If not set, the maximum of 32 and the number of distinct category values is chosen. If the chosen number is too small, the validation error will contain a hint about the minimal value that may be entered here.
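
To illustrate why more bins give higher granularity, here is a sketch of equal-width discretization; it is purely illustrative, and the processor's actual binning strategy (Spark MLlib's) may differ:

```python
def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins.

    Illustrative only: shows how the bin count controls how finely a
    continuous feature is resolved when choosing split points.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

vals = [0.0, 1.0, 2.0, 3.0, 4.0]
```

With 2 bins the five values collapse into two groups; with 4 bins almost every value gets its own bin, giving the splitter more candidate thresholds.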

#### Impurity * *[single enum selection]*

Measure of homogeneity of the labels.

#### Always use the Threshold for Minimal Occurrences of Categories * *[boolean]*

The threshold specified in another option can be set to always be used, instead of only when there are differing categories in the training/test input.

#### Minimal Occurrence Threshold for Categories *[double]*

This option is only used if categorical independents occur in the forecast/test set which are not in the training set, or if it is explicitly enabled with the option "useMinCategoryThreshold". When a given independent category has values which do not occur very often, they can be merged into a fake category; the default threshold is 0.01 (1%). In case there is no category with less than 1%, the smallest category is used as the fake category; if there are several with an equal number of occurrences, all of them are taken.
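
A sketch of this merging behavior (details of the real implementation may differ):

```python
from collections import Counter

def merge_rare_categories(values, threshold=0.01, fallback="FALLBACK"):
    """Replace rarely occurring categories with a fallback category.

    Categories whose relative frequency is below `threshold` are merged
    into `fallback`; if no category falls below the threshold, the least
    frequent one(s) are merged instead.
    """
    counts = Counter(values)
    total = len(values)
    rare = {c for c, n in counts.items() if n / total < threshold}
    if not rare:  # no category below the threshold: take the smallest one(s)
        smallest = min(counts.values())
        rare = {c for c, n in counts.items() if n == smallest}
    return [fallback if v in rare else v for v in values]
```

For example, with a 2% threshold, categories seen only once in 100 rows are replaced by the fallback value.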

#### Name for the additional (fake) fallback category *[string]*

Name for the additional (fake) fallback category. Default is "FALLBACK".

#### Replace independent categories with fallback value in output * *[boolean]*

Replace independent categories which were replaced by the fallback value for the computation, also in the output of the Processor.

#### Output debug data * *[boolean]*

If enabled, this processor will output debug data.

# Random Forest Regression Forecast

**UUID:** 00000000-0000-0000-0048-000000000002

## Description

Processor forecasting a regression out of the given data by modelling a random forest. If you have less than 4 different feature columns please consider using a processor from the decision tree family. The random forest regression processor does not guarantee to yield correct results with less than 4 features.

**For more details refer to the following article.**

## Input(s)

*in.trainingData*- Training Data*in.forecastData*- Forecast Data

## Output(s)

*out.outputData*- Data Output*out.outputMapping*- Mapping Output*out.outputFeatureImportance*- Feature Importance Output

## Configurations

#### Dependent attribute * *[single column selection]*

Dependent attribute

#### Independent attributes * *[multiple columns selection]*

Independent attributes

#### Forecast column name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespace!

#### Maximum number of trees *[integer]*

The maximum number of trees to be generated. Default is 10.

#### Maximum number of trees shown in the result section (sorted by weight) *[integer]*

The maximum number of trees shown in the result section, sorted by weight. Default is 10.

#### Maximum depth of each tree *[integer]*

The maximum depth each tree may have. If not set, the default is 5.

#### Maximum number of bins *[integer]*

Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be >= 2 and >= the number of categories in any categorical feature. If not set, the maximum of 32 and the number of distinct category values is chosen. If the chosen number is too small, the validation error will contain a hint about the minimal value that may be entered here.

#### Always use the Threshold for Minimal Occurrences of Categories * *[boolean]*

The threshold specified in another option can be set to always be used, instead of only when there are differing categories in the training/test input.

#### Minimal Occurrence Threshold for Categories *[double]*

This option is only used if categorical independents occur in the forecast/test set which are not in the training set, or if it is explicitly enabled with the option "useMinCategoryThreshold". When a given independent category has values which do not occur very often, they can be merged into a fake category; the default threshold is 0.01 (1%). In case there is no category with less than 1%, the smallest category is used as the fake category; if there are several with an equal number of occurrences, all of them are taken.

#### Name for the additional (fake) fallback category *[string]*

Name for the additional (fake) fallback category. Default is "FALLBACK".

#### Replace independent categories with fallback value in output * *[boolean]*

Replace independent categories which were replaced by the fallback value for the computation, also in the output of the Processor.

# Sequence Analysis

**UUID:** 00000000-0000-0000-0136-000000000001

## Description

Processor sorting columns and returning the sequence as one column

**For more details refer to the following article.**

## Input(s)

*in.trainingData*- trainingData

## Output(s)

*out.output*- Sequence Input with additional Sequence column*out.output_columns*- Sequence Input with additional Sequence column

## Configurations

#### Independent attributes * *[multiple columns selection]*

Independent attributes

#### Sequence Column Name * *[column name]*

Name of the column that gets added to the data set and contains the sequence. Must not contain whitespace!

#### Trim sequence graph * *[double]*

Values >= 1 mean no trimming.

#### Frequent sequence threshold * *[double]*

Values >= 1 mean no trimming.

# Venn Diagram

**UUID:** 00000000-0000-0000-0081-000000000001

## Description

Processor creating the two outputs required for Venn diagrams. The first output (Report output) extends the input data by an additional column, which contains all distinct strings from the selected columns. The original columns (Selected Columns) contain a 1 if the corresponding string in the new column is contained in them and a 0 otherwise. The second output (User output) has two additional columns: non-aggregated and aggregated. Additionally, the original columns contain a 1 or 0 depending on whether they are part of a combination or not. Non-aggregated counts the letters which are in the current combination (in every element of the combination) but not part of any other column. Aggregated counts the letters which are present in every column of the current combination (only distinct). Example: Column1 consists of all letters of the name 'Stephanie' and Column2 consists of all letters of the name 'Sandra'. Additionally, there are Column3 (letters of the name 'Rainer'), Column4 (letters of the name 'Michael') and Column5 (letters of the name 'Manuel'). Assume that columns 1 and 2 are selected. As a result, the Report output contains an additional column which covers all distinct string values of the Selected Columns, and the original columns contain a 1 or a 0 depending on whether a string is contained in the new column.
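
The aggregated and non-aggregated counts can be sketched on the article's own letter example using plain set operations (a sketch of the described logic, not the processor's implementation):

```python
# Each column holds the distinct letters of a name, per the example.
columns = {
    "Column1": set("stephanie"),
    "Column2": set("sandra"),
    "Column3": set("rainer"),
    "Column4": set("michael"),
    "Column5": set("manuel"),
}
combination = ["Column1", "Column2"]  # the selected columns

# Aggregated: distinct letters present in every column of the combination.
aggregated = set.intersection(*(columns[c] for c in combination))

# Non-aggregated: those letters that are additionally absent from all
# other (non-selected) columns.
others = set.union(*(v for c, v in columns.items() if c not in combination))
non_aggregated = aggregated - others
```

Here 's', 'a', and 'n' appear in both 'Stephanie' and 'Sandra', but only 's' is absent from the three remaining names.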

**For more details refer to the following article.**

## Input(s)

*in.inputData*- Input Data

## Output(s)

*out.reportOutput*- Report output*out.userOutput*- User output

## Configurations

#### Selected Columns *[multiple columns selection]*

Columns which will be used to calculate all information for a Venn diagram. Only 2-5 columns may be selected (hence, it is not allowed to select only one column). If no column is selected at all, all suitable (text) columns are used; these also need to number between 2 and 5, otherwise the processor does not work.

#### Name for Report Output * *[column name]*

The name for the column, which is added in the Report Output.

# Word2Vec

**UUID:** 00000000-0000-0000-0166-000000000001

## Description

Computes a Word2Vec model based on an input (text) corpus. Word2Vec is a neural-network-based approach to create word embeddings (high-dimensional feature vectors) from a given input corpus. It relies on the distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings. More algorithmic details can be found in "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al., 2013.

**For more details refer to the following article.**

## Input(s)

*in.trainingData*- Training Data

## Output(s)

*out.outputModel*- Word2Vec Model

## Configurations

#### Column/-s *[multiple columns selection]*

Select the columns whose content should be used for training. Columns not selected are excluded.

#### Dimensions *[integer]*

Choose the number of vector dimensions of your model (a value between 1 and 1000). More dimensions mean longer training time and more accurate vector embeddings on large data sets. Many dimensions on small data sets can lead to overfitting.

#### Min Count *[integer]*

Choose the minimum number of times a word must occur (a value between 0 and 1000). The algorithm ignores all words with a total frequency lower than this.

#### Iterations *[integer]*

Choose the maximum number of iterations of your model (a value between 1 and 50). This is the number of iterations (epochs) over the corpus. We recommend a single iteration if the corpus is large enough (e.g. the English Wikipedia).

#### Window Size *[integer]*

Choose the window size (a value between 1 and 50). The window size is the maximum distance between the current and the predicted word within a sentence.
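
What the window size controls can be sketched by generating the (center, context) training pairs for a token sequence; this is an illustration of Word2Vec-style pair extraction, and the processor's internal implementation may differ:

```python
def context_pairs(tokens, window):
    """Return (center, context) training pairs within the given window.

    Each token is paired with every other token at most `window`
    positions away in the same sentence.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "down"]
```

A larger window produces more pairs per sentence, pulling in broader context at the cost of training time.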

#### Learning Rate *[double]*

Choose the learning rate of your model (a value between 0.001 and 0.5). The learning rate describes the weight adaptation margin during an update step. This parameter should not be changed unless you know what you are doing.

# Integrated Workflow Execution (3 Ports)

**UUID:** 00000000-0000-0000-1108-000000000001

## Description

The selected workflow is executed in place of this processor.

**The detailed article for this processor will be created.**

## Input(s)

*in.in1*- Input 1*in.in2*- Input 2*in.in3*- Input 3

## Output(s)

*out.out1*- Output 1*out.out2*- Output 2*out.out3*- Output 3

## Configurations

#### Workflow ID * *[string]*

The ID of the workflow that should be executed with the given inputs. The inputs are mapped to the microservice inputs in the given workflow (based on the IDs in the other config fields). The outputs are mapped back in the same way. The variables of the workflow containing this processor are also used in the selected workflow (the standard variables of the selected workflow are ignored).

#### Workflow Version *[integer]*

The version of the workflow that should be integrated. If nothing is selected, the latest version is used.

#### ID of Microservice Input for Input 1 * *[string]*

The content of input 1 of this processor is forwarded to the microservice input (from the integrated workflow) with the given ID.

#### ID of Microservice Input for Input 2 * *[string]*

The content of input 2 of this processor is forwarded to the microservice input (from the integrated workflow) with the given ID.

#### ID of Microservice Input for Input 3 * *[string]*

The content of input 3 of this processor is forwarded to the microservice input (from the integrated workflow) with the given ID.

#### ID of Microservice Output for Output 1 * *[string]*

The results of the mentioned microservice output (from the integrated workflow) are forwarded to output 1 of this processor.

#### ID of Microservice Output for Output 2 * *[string]*

The results of the mentioned microservice output (from the integrated workflow) are forwarded to output 2 of this processor.

#### ID of Microservice Output for Output 3 * *[string]*

The results of the mentioned microservice output (from the integrated workflow) are forwarded to output 3 of this processor.

# Collaborative Filtering (Deprecated!)

**UUID:** 00000000-0000-0000-0062-000000000001
**Deprecated**: *Please use the ALS Recommender processor instead.*
**Replaced by:** *ALS Recommender*
**Removed:** *true*

## Description

This processor generates product recommendations for users and user recommendations for products using the (spark.mllib.recommendation) ALS algorithm with default settings.

## Input(s)

*in.training*- Longlist of Ratings

## Output(s)

*out.output_product*- Product to User recommendations*out.output_user*- User to Product recommendations

## Configurations

#### User Column * *[single column selection]*

Column containing user IDs

#### Product Column * *[single column selection]*

Column containing product IDs

#### Rating Column * *[single column selection]*

Column containing ratings

#### Recommendation count * *[integer]*

Requested number of recommendations

#### Seed *[integer]*

Random Seed for model training

# Flow Analysis (Deprecated!)

**UUID:** 00000000-0000-0000-1136-000000000001
**Deprecated**: *Please use the Grouped Flow Analysis processor instead.*
**Replaced by:** *Grouped Flow Analysis*
**Removed:** *true*

## Description

Processor sorting columns and returning the values as flow

## Input(s)

*in.trainingData*- trainingData

## Output(s)

*out.output*- Sequence Input with additional Sequence column

## Configurations

#### Independent attributes * *[multiple columns selection]*

Independent attributes

#### Sequence Column Name * *[column name]*

Name of the column that gets added to the data set and contains the sequence. Must not contain whitespace!

# Improved Flow Analysis (Deprecated!)

**UUID:** 00000000-0000-0000-1136-000000000002
**Deprecated**: *Please use the Grouped Flow Analysis processor instead.*
**Replaced by:** *Grouped Flow Analysis*
**Removed:** *true*

## Description

Processor sorting columns and returning the values as flow

## Input(s)

*in.input*- Input data

## Configurations

#### Independent attributes * *[column tuple]*

Independent attributes: the first column is the amount, the second column is the order, and the third and fourth columns are more detailed levels of the order column.

# Linear Regression (Deprecated!)

**UUID:** 00000000-0000-0000-0010-000000000001
**Deprecated**: *Please use the Improved Linear Regression (Forecasting) processor instead.*
**Replaced by:** *Improved Linear Regression (Forecasting)*
**Removed:** *true*

## Description

Performs a linear regression with one dependent and one or multiple independent variables based on (l)BFGS optimization and uses the computed regression model to forecast on a second data set.

## Input(s)

*in.data*- Input

## Configurations

#### Dependent variable * *[single column selection]*

Select the dependent variable y.

#### Independent variables *[multiple columns selection]*

Select the independent explanatory variables. All selected variables MUST be contained in the forecasting data set as well.

# Linear Regression Forecast (Deprecated!)

**UUID:** 00000000-0000-0000-0010-000000000002
**Deprecated**: *Please use the Improved Linear Regression (Forecasting) processor instead.*
**Replaced by:** *Improved Linear Regression (Forecasting)*
**Removed:** *true*

## Description

Performs a linear regression with one dependent and one or multiple independent variables based on (l)BFGS optimization and uses the computed regression model to forecast on a second data set.

## Input(s)

*in.regression_data*- Regression Data*in.forecast_data*- Forecast Data

## Output(s)

*out.out*- Forecast

## Configurations

#### Dependent variable * *[single column selection]*

Select the dependent variable y.

#### Independent variables *[multiple columns selection]*

Select the independent explanatory variables. All selected variables MUST be contained in the forecasting data set as well.

#### Forecast Column Name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespace! If the column name is already present, the column is overwritten with the forecast values.

#### Create Result * *[boolean]*

When switched off, the processor will not create a result and will only do the forecast. Use this option when you are not interested in the model but only in the forecast; it will speed up execution time.

# Old Repartitioning Filesystem Save (Deprecated!)

**UUID:** 00000000-0000-0000-0005-000000000002
**Deprecated**: *An improved version is available which can also save Parquet-based files! Also, please use the Manipulate Partitions processor to perform coalesce or repartitioning tasks!*
**Replaced by:** *Dataset Save*
**Removed:** *true*

## Description

Saves the input data set to a .csv file. Performs a repartitioning to reduce the file split count on the file system.

## Input(s)

*in.data*- Input

## Configurations

#### Name of Data Set (mandatory for NEW data sets) *[string]*

Symbolic name for the data set to be saved by this processor. Will be shown in the "data sets" interface.

#### Existing Data Set (mandatory for APPEND and REPLACE) *[data selection]*

Specifies the data set, having a matching schema, which is appended to the input data.

#### Save procedure * *[single enum selection]*

Indicates whether to save the input as an independent new data set, append it to an existing data set, or replace an existing data set.
