Motivation
With ONE DATA, there are many ways to process data within a workflow using different processors. Sometimes, however, it is necessary or simply easier to use custom computation methods. An R-Script, for instance, is a good way to customize the processing of data because the R language is designed for statistical computing and data analysis. More information can be found on the official site of the R-Project.
Since R-Scripts are a very useful tool for data science tasks, they can be included in ONE DATA workflows with the R-Script processors. In this article, we will focus on the R-Script Single Input Processor.
Overview
The R-Script Single Input Processor takes a dataset as input, executes an R-Script on it and forwards the result to ONE DATA. An overview of how R-Scripts are processed in ONE DATA and useful tips on how to install R packages can be found here.
Input
The processor can operate on any valid dataset produced by ONE DATA. Column names in the input dataset must not contain special characters and must not consist only of digits. Otherwise, ONE DATA will not recognize the input dataset.
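Problematic column names can be spotted in R before the data is loaded into ONE DATA. The following is a minimal sketch and not part of ONE DATA: the helper name `find_invalid_names` is hypothetical, and the rule it applies (only letters, digits and underscores are allowed, and a name must not consist only of digits) is an assumption about what counts as a special character.

```r
# Hypothetical helper: returns the column names that are likely to be rejected.
# Assumed rule: only letters, digits and underscores are allowed, and a name
# must not consist only of digits.
find_invalid_names <- function(df) {
  nms <- names(df)
  has_special <- grepl("[^A-Za-z0-9_]", nms)
  only_digits <- grepl("^[0-9]+$", nms)
  nms[has_special | only_digits]
}

# check.names = FALSE keeps the column names exactly as written
books <- data.frame(
  bookID = 1:2,
  `#_num_pages` = c(652, 870),
  `2020` = c(10, 20),
  check.names = FALSE
)

invalid <- find_invalid_names(books)
print(invalid)
```

Columns reported by such a check can be renamed before the dataset is passed to the processor.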
If the input is valid, the column names are displayed at the bottom of the editor in the processor configuration. The R-Script itself is also entered in this editor.
In the "Input Name In Script" text field, it is possible to specify the name under which the input dataset is available in the R-Script. The default value is "input". The dataset can be saved into an R data frame using:
dataset <- input
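Inside the script, the input can be inspected like any other data frame. The variable `input` only exists within ONE DATA, so this sketch creates a small stand-in data frame with made-up values to show the idea.

```r
# Stand-in for the data frame that ONE DATA provides under the configured
# input name (default "input"); the values here are made up.
input <- data.frame(id = 1:3, value = c(2.5, 4.0, 3.5))

dataset <- input                 # refer to the input under a descriptive name
column_names <- names(dataset)   # column names as seen by the script
row_count <- nrow(dataset)       # number of rows received

str(dataset)                     # prints column types and the first values
```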
Configuration
The processor configuration offers additional options that control how the data is processed and how the script is executed.
Timeout For Script Execution
Time (in seconds) to wait for the R Server to return the results of the script calculation. The timeout starts as soon as the Processor submits the R script and the data to the R Server. If the timeout is exceeded, the calculation is interrupted, the connection of this Processor to the R Server is released, and a ProcessorExecutionError is thrown.
The default value is 300 seconds.
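To get a rough idea of whether a script fits within the timeout, its runtime can be measured locally with base R's `system.time()`. This is only a sketch: `run_script` is a hypothetical stand-in for the actual script body, and a local measurement does not include the time needed to transfer the data to the R Server.

```r
# Stand-in for the actual script body; Sys.sleep simulates a computation.
run_script <- function() {
  Sys.sleep(0.2)
  data.frame(result = 1)
}

timeout_seconds <- 300   # default value of the processor option

# system.time() evaluates the expression and reports the elapsed wall time
elapsed <- system.time(output <- run_script())[["elapsed"]]
within_budget <- elapsed < timeout_seconds

cat("Elapsed seconds:", elapsed, "- within timeout:", within_budget, "\n")
```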
Manual TI
With this configuration option, it is possible to specify the scale type and representation type of each column of the output dataset, so that ONE DATA infers the correct types.
Possible scale types: nominal, interval, ordinal, ratio. Further information on scale types can be found here.
Possible representation types: string, int, double, datetime, numeric
If the values of a column cannot be converted to the specified representation type, the processor falls back to the type that fits the values best. If the types still do not fit the purpose, it is recommended to use the Data type Conversion Processor.
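Another option is to coerce the columns explicitly in the R-Script before returning them. The sketch below uses only base R; the data frame and its column names are made up, and how ONE DATA maps R classes to its representation types is not covered here.

```r
# Made-up result in which every column arrived as text
result <- data.frame(
  title = c("A", "B"),
  pages = c("652", "870"),
  rating = c("4.56", "4.49"),
  stringsAsFactors = FALSE
)

# Coerce each column to the type it should have in the output
result$pages <- as.integer(result$pages)
result$rating <- as.numeric(result$rating)

column_classes <- sapply(result, class)
print(column_classes)
```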
Output
The output of the R-Script Processor is the dataset that was produced by the R-Script in the configuration. There are two things to note on how the output has to be specified:
- The output of the R-Script must be an R data frame. If the result has a different type, convert it with as.data.frame().
- The last statement executed in the R-Script must be a return() call that returns the data as a data frame.
return (as.data.frame( "insert name of output data here" ))
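The two rules can be sketched as follows. In plain R, `return()` is only valid inside a function, so to make the snippet runnable outside ONE DATA the script body is wrapped in a hypothetical function `run_script`; inside the processor, only the body would be entered. The input data is made up.

```r
# Hypothetical wrapper so the snippet also runs outside ONE DATA
run_script <- function(input) {
  means <- colMeans(input)   # named numeric vector, not a data frame
  # Build a data frame from the intermediate result before returning it
  output <- data.frame(column = names(means), mean = unname(means))
  return(as.data.frame(output))
}

input <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
output <- run_script(input)
print(output)
```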
Example
In this example, we want to get the best-rated books out of a books dataset.
Example Input
The following table represents a snippet of the dataset that we will use.
bookID | title | authors | average_rating | isbn | isbn13 | language_code | #_num_pages | ratings_count | text_reviews_count |
1 | Harry Potter and the Half-Blood Prince (Harry Potter #6) | J.K. Rowling | 4.56 | 0439785960 | 9780439785969 | eng | 652 | 1944099 | 26249 |
2 | Harry Potter and the Order of the Phoenix (Harry Potter #5) | J.K. Rowling | 4.49 | 0439358078 | 9780439358071 | eng | 870 | 1996446 | 27613 |
3 | Harry Potter and the Sorcerer's Stone (Harry Potter #1) | J.K. Rowling | 4.47 | 0439554934 | 9780439554930 | eng | 320 | 5629932 | 70390 |
Example Script and Configuration
From the input, we want to extract all books that have a rating greater than or equal to 4.5 and keep only the columns "authors", "title", "average_rating" and "ratings_count". This is the corresponding R-Script:
output <- subset(books, average_rating >= 4.5, select = c("authors", "title", "average_rating", "ratings_count"))
return(output)
In the processor configuration we choose "books" as name for the input dataset. We select the default timeout and do not configure Manual TI.
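The script can be tried out locally before it is placed in the processor. The sketch below rebuilds the three sample rows from the example input, reduced to the columns the script uses. Because `return()` needs an enclosing function in plain R, the script body is wrapped in a hypothetical function `run_script`, which is not needed inside ONE DATA.

```r
# The three rows from the example input, reduced to the columns used here
books <- data.frame(
  title = c(
    "Harry Potter and the Half-Blood Prince (Harry Potter #6)",
    "Harry Potter and the Order of the Phoenix (Harry Potter #5)",
    "Harry Potter and the Sorcerer's Stone (Harry Potter #1)"
  ),
  authors = "J.K. Rowling",
  average_rating = c(4.56, 4.49, 4.47),
  ratings_count = c(1944099, 1996446, 5629932),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the script body from this article
run_script <- function(books) {
  output <- subset(books, average_rating >= 4.5,
                   select = c("authors", "title", "average_rating", "ratings_count"))
  return(output)
}

result <- run_script(books)
print(result)
```

Of the three sample rows, only the first has a rating of at least 4.5, so the local result contains a single row with the four selected columns.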
Workflow
The workflow loads the books dataset with a Data Table Load Processor, passes it to the R-Script Single Input Processor and saves the output to a Result Table.
Result
Here is a snippet of the results: