Release 3.6.0 (40) - File Connection Handling (46.24.0 // 1.166.0)

Modified on Thu, 30 Jul 2020 at 09:58 AM

What is new:

  • Implementation of ORC Support
  • Whitelisting of filesystem directories for filesystem connections
  • Handling of files in connections
    • Merge Rules for filesystem connections
    • Showing samples in filesystem connection datasets
  • Improvements
    • Minor Improvements regarding Debug Mode
    • Showing dynamic name in the title of the tab
    • Showing name of results in Integrated Workflow Processor
    • Optimization of breadcrumbs in combination with logo
    • Improved updating of statistics in Dataset detail view


Features

The following lists all features that are (partly) included in this release. A feature is listed as finished if no open parts (stories) remain; otherwise it is listed as ongoing.

Finished Features


Finalize implementation of ORC support
Goal
  • Unit and integration coverage
  • Issues with ORC column schema limitations addressed
  • Possible issues with encoding of string cells analyzed and addressed (if necessary)
Finished parts in the release

Optional: enable ORC support in ONE DATA

ORC is far more limited than Parquet with regard to column names: only alphanumeric characters, dots and underscores are allowed.

It needs to be evaluated whether a substitution similar to the one used for Parquet is feasible, and if so, it is then implemented. Alternatively, only a forward substitution is implemented.
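As an illustration of what a forward-only substitution could look like, the following is a minimal sketch and not the ONE DATA implementation. The function name `forward_substitute` and the choice of `_` as the replacement character are assumptions; the allowed character set is taken from the note above.

```python
import re

# Characters outside this set are rejected according to the note above
# (alphanumeric, dot and underscore are allowed).
_DISALLOWED = re.compile(r"[^A-Za-z0-9._]")


def forward_substitute(column_name: str, replacement: str = "_") -> str:
    """Replace every disallowed character (hypothetical helper).

    The mapping is one-way: different inputs can map to the same output,
    so the original name cannot be restored from the result.
    """
    return _DISALLOWED.sub(replacement, column_name)


print(forward_substitute("order id (EUR)"))
```

Because the mapping is lossy, two different source columns can end up with the same name, which is one reason a reversible substitution would need separate evaluation.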


Read .csv from HDFS
Goal

A connection can read .csv files from a local filesystem.

It is possible to create a dataset from a connection. The dataset can be used properly in a Dataset Load Processor.

Finished parts in the release

Filesystem Whitelist per Domain

Allow the superadmin to configure which filesystem directories are allowed to be used via filesystem connections per domain.
The current whitelisting implementation (configured via the server's config.properties file) only allows whitelisting directories globally for all domains.
The new per-domain whitelists should work alongside the global whitelist.
A user is allowed to use a specific filesystem directory if at least one of the following two statements is true:

  1. The directory is whitelisted via the global config.properties setting
  2. The directory is whitelisted by the superadmin for a domain x and the user is a member of x.
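The two conditions above can be sketched as a single check. This is only an illustration under stated assumptions: the function names, the data shapes (a list for the global whitelist, a dict of lists per domain) and the rule that a whitelisted directory also covers its subdirectories are all hypothetical and not taken from the product.

```python
from pathlib import PurePosixPath


def is_under(directory: str, whitelisted: str) -> bool:
    """True if directory equals the whitelisted directory or lies below it.

    Assumption: paths are already normalized absolute paths (no "..").
    """
    d = PurePosixPath(directory)
    w = PurePosixPath(whitelisted)
    return d == w or w in d.parents


def may_use_directory(directory, global_whitelist, domain_whitelists, user_domains):
    # Statement 1: whitelisted globally (config.properties in the real system).
    if any(is_under(directory, w) for w in global_whitelist):
        return True
    # Statement 2: whitelisted for a domain x of which the user is a member.
    return any(
        is_under(directory, w)
        for domain in user_domains
        for w in domain_whitelists.get(domain, ())
    )


global_wl = ["/data/shared"]
domain_wl = {"x": ["/data/x"]}
print(may_use_directory("/data/x", global_wl, domain_wl, {"x"}))
print(may_use_directory("/data/x", global_wl, domain_wl, {"y"}))
```

The global whitelist is checked first and independently of domain membership, which matches the requirement that both mechanisms work alongside each other.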


Handling files in connections
Goal

Connection can handle multiple files using the same schema.
It is possible to create a dataset from the merged file and use it in the Dataset Load Processor.

Finished parts in the release

Merge rules for file system connections

MVP with the ability to merge different filesystem files into a single file and load it as a dataset.

Show samples in filesystem connection datasets
  • Show a sample of the dataset in the dataset details page with at least 5 rows
Include Merge Rules into File system connections
  • Merge rules will be created in the connection settings that define which files will be merged and what the name of the resulting single file will be
  • In the connection's file list page, the merged files and the original files are shown
  • The connection's file list can be filtered to show only actual files, only merged files, or both
  • It should be possible to merge 2 to n files into one file
    • No warning is shown if the merge rule matches 0 or 1 files
  • It should be possible to create a Data Table from the merged file
  • The merged file can be used with the Data Table Load Processor; the usage should be the same as with a Data Table from a single unmerged file
  • The split files must have the same schema to be merged into a single Data Table
  • Merging should work with file sizes of at least 50 GB (for example: 2 files with 25 GB each or 1000 files with 50 MB each)
  • An error message should be shown if files with different schemas are to be merged
  • For all files, an extra column containing the relative file path will be added.
    • The name of the column has not been defined yet
    • In case of a name collision, a prefix/suffix has to be added
  • The regex/merge rule is applied on the relative file path
  • Search for a regex string is available.
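The merge requirements above can be sketched in a few lines. This is a simplified in-memory illustration, not the product's implementation: the function names, the column name `source_path`, the use of a `_` suffix on collision, and the choice of `re.search` (rather than a full match) against the relative path are all assumptions.

```python
import re


def select_files(merge_rule: str, relative_paths):
    """Return the relative paths matched by the merge rule (a regex)."""
    pattern = re.compile(merge_rule)
    return [p for p in relative_paths if pattern.search(p)]


def path_column_name(header, base="source_path"):
    """Pick a name for the extra column, adding a suffix on collision."""
    name = base
    while name in header:
        name += "_"
    return name


def merge_tables(tables):
    """Merge tables given as {relative_path: (header, rows)}.

    Raises ValueError if the matched files do not share one schema.
    """
    if not tables:
        return [], []
    headers = {tuple(header) for header, _ in tables.values()}
    if len(headers) > 1:
        raise ValueError("files matched by the merge rule have different schemas")
    header = list(next(iter(headers)))
    extra = path_column_name(header)
    # Append the relative file path to every row of every file.
    rows = [row + [path] for path, (_, file_rows) in tables.items() for row in file_rows]
    return header + [extra], rows


paths = ["sales/2020-01.csv", "sales/2020-02.csv", "other/readme.txt"]
print(select_files(r"^sales/.*\.csv$", paths))
```

The sketch holds all rows in memory, so it only shows the rules themselves; handling inputs in the 50 GB range would require a streaming or distributed approach.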

Ongoing Features


Deal with technical debts and workarounds in Debug-Mode
Goal

Deal with technical debts and workarounds in Debug-Mode

Finished parts in the release

Bugs and Defects related to Debug Mode

Improvements


Rework graph implementation
Goal

Provide an alternative representation of the graph using node and edge lists instead of the existing representation (each node contains the list of its successors), because the existing representation makes the substitution algorithm unnecessarily complex.
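To make the difference between the two representations concrete, here is a minimal sketch that converts a successor-list graph into separate node and edge lists. The function name and the use of plain strings as node identifiers are assumptions for illustration; the actual graph classes are not described in these notes.

```python
def to_node_edge_lists(successors):
    """Convert {node: [successor, ...]} into (nodes, edges).

    nodes keeps first-seen order; edges is a list of (source, target) pairs.
    """
    nodes = []
    seen = set()

    def add(node):
        if node not in seen:
            seen.add(node)
            nodes.append(node)

    edges = []
    for source, targets in successors.items():
        add(source)
        for target in targets:
            add(target)
            edges.append((source, target))
    return nodes, edges


workflow = {"load": ["filter"], "filter": ["result"], "result": []}
nodes, edges = to_node_edge_lists(workflow)
print(nodes)
print(edges)
```

With an explicit edge list, replacing a node only means rewriting the matching endpoints in one flat list, rather than searching every node's successor list for references.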



[UX-UI] Show meaningful (dynamic) HTML title instead of ONE DATA
Goal

Instead of showing "ONE DATA" in the title of a tab, the name of the workflow, report, dataset, model, or production line is shown.


[UX-UI] Showing name of Result Table processors in Integrated Workflow Execution processor
Goal

Add a section in the Integrated Workflow Execution processor that shows the names of the different Result Table processors as a list.


[UX-UI] In case of having a long breadcrumb text, it will overlap the logo
Goal

The branding (= logo in the header) is shown on the project list page. Inside a project the logo shall not be shown.


[UX-UI] Dataset DetailView: Statistics are not updated when changing data types
Goal

In the Data Table detail view, when changing the data types of a Data Table under the section DATA TABLE SAMPLES, the dataset statistics are updated automatically.

