This Processor is similar to the Flexible ETL Load Processor (a deprecated Processor), with the difference that it does not require a Data Table to be created from a Connection beforehand. The Connection itself is used directly.
TABLE OF CONTENTS
- Overview
- Input
- Configuration
- Output
- Example
- Troubleshooting
- Handling Oracle Binary Columns
- Hive Connection Performance Improvement
Overview
The Database Connection Load Processor operates on a special type of dataset, namely a pre-defined ETL (Extract, Transform, Load) source. The ETL source can be re-used within this Processor, so sensitive information (e.g., user name and password) does not have to be entered repeatedly. It loads the requested data for use in the Workflow.
Input
The Processor does not have an input port.
Configuration
- Connection: Instructions on how to create a Connection can be found here.
- Table Name or custom SQL: Be aware that the remote table name is not necessarily identical to the ONE DATA Data Table name. A custom SQL query is sketched below.
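For illustration, a custom SQL query could look like the following. The "PRODUCTS" table is taken from the example further below; the column names are placeholders:
-- PRODUCT_ID, PRODUCT_NAME and PRICE are placeholder column names
SELECT PRODUCT_ID, PRODUCT_NAME, PRICE
FROM PRODUCTS
WHERE PRICE > 100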
Output
The loaded Data Table can be used as input in further Processors.
Example
Workflow
Example Configuration
The "oracle19" is the Connection name of a ORACLE 19 database Connection inside ONE DATA. Within this Connection, we have access to a "PRODUCTS" table.
The configuration of additional parameters is optional. However, in the following example we can see how different parameters interact with each other:
- fetchSize: the number of rows retrieved per fetch
- configuration of partitioning for parallel reads via numberPartitions, lowerBound, upperBound, and partitionColumn (see the sketch below)
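Conceptually, the partitioning parameters split the read into parallel range queries on the partition column. As a minimal sketch, assuming partitionColumn = PRODUCT_ID, lowerBound = 0, upperBound = 1000000, and numberPartitions = 4 (values chosen purely for illustration), the queries issued in parallel against the database would look roughly like this:
-- the first partition is unbounded below, the last partition unbounded above
SELECT * FROM PRODUCTS WHERE PRODUCT_ID < 250000
SELECT * FROM PRODUCTS WHERE PRODUCT_ID >= 250000 AND PRODUCT_ID < 500000
SELECT * FROM PRODUCTS WHERE PRODUCT_ID >= 500000 AND PRODUCT_ID < 750000
SELECT * FROM PRODUCTS WHERE PRODUCT_ID >= 750000
Note that lowerBound and upperBound only control how the ranges are split; rows outside the bounds are still read by the first and last partition.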
Result
Troubleshooting
It might happen that the database returns certain column types that are not supported by ONE DATA, and the following error message occurs: "Please cast the problematic column manually within the provided select query."
In this case it is important to determine the problematic column and to cast it within the SELECT statement to a type that ONE DATA understands, e.g. STRING.
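As an illustration, assuming PROBLEM_COL on the PRODUCTS table is the unsupported column (both names are placeholders), the cast on an ORACLE database could look like this:
-- TO_CHAR converts the problematic column to a string representation
SELECT PRODUCT_ID, TO_CHAR(PROBLEM_COL) AS PROBLEM_COL_STR
FROM PRODUCTS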
Some problematic ORACLE column types can be found here under "Additional Information for ORACLE".
Handling Oracle Binary Columns
ORACLE supports several binary column types, e.g. RAW, LONG_RAW, and BLOB.
Their content can be properly read via the Data Table Load Processor as seen in the following example.
However, when reading such columns via the Database Connection Load Processor, the configured select query is executed directly on the target database and the result shows exactly what the database returns. For binary columns, ONE DATA shows their STRING value, which might be a "base64"-encoded representation. The following figure shows the result of the previous example as returned by the ORACLE database.
In order to display the decoded binary content, the following options are available:
- Set the "Auto Cast to String" toggle to TRUE (while keeping the original select query). In this case, all returned columns are properly cast to STRING and this also includes a proper decoding.
- Decoding and casting of the binary column via an SQL statement, e.g.
SELECT UTL_RAW.cast_to_varchar2(RAW_COL), ... FROM TABLE_NAME
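Casting two binary columns at once might look like this; RAW_COL and RAW_COL2 are placeholder names for RAW columns:
-- UTL_RAW.CAST_TO_VARCHAR2 decodes the binary content into a readable string
SELECT UTL_RAW.CAST_TO_VARCHAR2(RAW_COL) AS RAW_COL_DECODED,
       UTL_RAW.CAST_TO_VARCHAR2(RAW_COL2) AS RAW_COL2_DECODED
FROM TABLE_NAME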
The result can be seen in the following figure (casting of two columns).
Hive Connection Performance Improvement
Connections to Hive metastores have been adapted to improve their performance. To increase the performance of reading data from Hive tables, ONE DATA uses the Hive support in Spark SQL, which reads the Hive table's location from the Hive metastore and then uses it to read the data in parallel directly from MapR-FS.
To show the effect of this change, the following use cases were tested:

Use Case | New Time | Old Time
---|---|---
Simple row count on 2 billion rows | less than a minute | 19 minutes
Replicate a big Hive table (2 billion rows) by saving it as Parquet in ONE DATA | 20 minutes | 2-3 hours

Queries with a "String contains" match were also faster than before.
Prerequisites
Before Spark SQL can be used in the Database Connection Load Processor with Hive Connections, it needs to be enabled via the instance configuration. More information on this can be found here.
Restriction
Be aware that this improvement can only be enabled for one Hive metastore Connection (a Spark limitation). If more than one Hive Connection exists, the others will be connected via JDBC as before.