Optimizing the performance of SyncSets

Modified on Sun, 10 Apr 2022 at 11:06 PM

This article is outdated, you can access the most current version here: https://doc.onedata.de/apps/apps-docs/odml-documentation/AppBuilder/sync-sets/SyncSets.html#optimizing-the-performance-of-syncsets

Article Content

  1. Introduction
  2. Technical Approaches for Computing SyncSets
    1. Client-Side Filtering
    2. Server-Side Filtering
  3. Optimizing Performance With Server-Side Filtering
    1. Create Database Indexes with Curl
    2. Create Database Indexes with Postman
    3. Summary


Introduction

SyncSets are a way to propagate filters between different datasources and elements. They allow you and the users of your applications to slice and dice individual subsets of the datasets and take a closer look at the data that matters.

If you want to know how to integrate SyncSets to your App, have a look a the SyncSets article. Here, we will show how you can optimize the performance of your app when using SyncSets. 

Therefore the next subsection will give an introduction to the different technical approaches for SyncSets supported by ONE DATA Apps. The subsequent sections will show how the performance of the SyncSet can be tweaked by using server-side filtering with appropriate database indices.


Technical Approaches for Computing SyncSets

A SyncSet links two or more datasources by a number of columns that share a common value range. The target datasource is then filtered by the values present in the columns of the source datasource. ONE DATA Apps supports two approaches to compute the filtered results of a SyncSet, which will be explained in the following sections.


Client-Side Filtering

By default the SyncSet computation is done on the client-side. This means that the distinct values of the filtered datasources are computed on the ONE DATA server and then sent to the Apps client running in the users web browser. Then the Apps Client requests filtering of the target datasource by these distinct values from the ONE DATA server according to the received values. This communication of distinct values via the users web browser is depicted in the following figure:




The main advantage of this approach is that it supports a wide variety of datasource types. You could for example connect a Parquet-based data table with a data table that is based on a connection to a PostgreSQL database. While this gives some flexibility when building your app, using client-side filtering also has some drawbacks.


It relies on filtering based on distinct values as described above. Thus it is limited by the maximum number of distinct values that the ONE DATA server can compute. Every ONE DATA instance has a specific value configured that defines the maximum number of distinct values that can be retrieved. The default value for this configuration option is 120. Therefore you should only use client-side filtering if you know that the number of distinct values in the columns specified in the joins-section of your SyncSets will be small. Apart from that, using client-side filtering can negatively impact the overall loading speed of your app when using complex or cascading SyncSets, as the datasources need to be filtered one after another.


Advantages of client-side filteringLimitations of client side filtering
  • Can be used with datatables and connections of all storage types.
  • Is limited by the distinct value limit configured for the ONE DATA server.
  • May negatively impact performance especially when using complex SyncSets.


In summary client-side filtering is a good choice when using data from various sources. This can be especially helpful in a prototyping-phase. It is not the right choice if you expect to have lots of different values in the columns of the SyncSet or want to cascade your SyncSets.


Server-Side Filtering

In order to mitigate the shortcomings of client-side filtering, you can use server-side filtering in your app. It can be enabled by setting the following flag in the global-section of your app configuration:

  "global" {
        "serverSideFiltering": {
          "enabled": true
        }
  } 


When using server-side filtering the evaluation of the SyncSets is done on the ONE DATA server allowing for improved performance. As the computation is done at the backend, the SyncSets are not limited by the distinct value limit. Thus it is possible to filter based on columns with large amounts of distinct values. Apart from that it is not affected by the dependencies between your datasources imposed by the SyncSets. Thus the data can be computed in parallel speeding up the overall loading time of your app.
The main limitation for using server-side filtering is that it can only be used with data tables that have the PostgreSQL storage type. Apart from that it is necessary that the ONE DATA instance is configured to support execution of data requests using native SQL instead of Spark. Please contact your ONE DATA server administrator to enable this option if it is not available.


Advantages of server-side filteringLimitations of server-side filtering
  • Suitable for SyncSets based on large datasources.
  • Increased loading times of your app due to parallel filtering.
  • It is not affected by the distinct value limit of the ONE DATA instance.
  • Can only be used if all your datasources are stored with the PostgreSQL storage type.


In summary, using server-side filtering can improve the overall performance of your app and is not affected by the limitations of client-side filtering. It requires that all your data tables are based on tables in a PostgreSQL database. If this is possible in your use-case, server-side filtering is the right choice. To get the best results, please read the next section on optimizing your data storage for server-side filtering.


Optimizing Performance With Server-Side Filtering

When using SyncSets with server-side filtering, the filter conditions imposed by the SyncSets are mapped to a SQL-query executed on the relational database where the data is stored. This requires that the distinct values of the columns defined in joins of your SyncSets are computed by the database management system (DBMS). This is a very expensive operation - especially if your dataset contains a large number of rows. Thus it can slow down the performance of your app and may also impact other apps that rely on data from the same database.
By using appropriate database indices this performance bottleneck can be mitigated. This section will describe how you can define indices on your data tables in order to speed up SyncSet evaluation.

An index is a data structure maintained by the DBMS storing information about the data and where it is located in order to allow fast access to it. When an index is defined, the distinct values can be read directly from it. Thus it is not necessary to load the complete data and compute the distinct values from it. As the number of distinct values is likely rather small compared to the complete data, this leads to a significant speed-up of the query execution.


If a SyncSet is used with server-side filtering and no index is defined on the join-columns, a warning similar to the following one is shown in the App Builders error console:


In this case it is recommended to define an index on the columns listed in the warning. If you have access to the PostgreSQL database used by your ONE DATA instance, you can connect to it with a tool like DBeaver and create an index with the following command. More information about how to create database indices in PostgreSQL can be found in the official documentation.

CREATE INDEX index_name ON your_schema.table_name (column1, column2);  


If you do not have access to the database, you can create an index using the ONE DATA API.

On order to do this, you need the following information:

  • Your ONE DATA access token. It can be retrieved following the description in this article.
  • The ONE DATA API root of your instance. You can find this information by clicking on any request in the developer console (e.g. me) and copy the URL until /api. Please note that there can be also some path after the domain name that is also part of the API root, e.g. /onedata.
    The following figure shows where you can find the API root URL in the Network tab of the Google Chrome developer console (open it by pressing F12).


  • The id of the data table you want to create the indexes on. You can find it in the data table information header when opening the data table in the Data Hub. You will get redirected to the details view of the table, where the id is listed in the "Data Table Information" section (example: f6f54798-0674-42fd-a5bf-18530514ff54).
  • The column names you want to create the indexes for.


Having gathered this information, you can now request the ONE DATA API to create the indexes. Either you do it via curl in the command line, or you use a graphical tool like Postman if you prefer that. In the following two sections we will explain how to do that using both tools.


Create Database Indexes with Curl

An example of a curl-command creating an index on two columns is shown in the following snippet:

curl --location --request PUT 'https://internal.onedata.de/api/v1/data/content/db-index?dataId=f6f54798-0674-42fd-a5bf-18530514ff54' \
--header 'Authorization: <your ONE DATA access token>' \  
--header 'Content-Type: application/json' \  
--verbose \  
--data-raw '{  
   "unique": false, 
   "columns": [ 
     {"name": "country"}, 
     {"name": "region"} 
   ]
 }'  


The data table ID that needs to be set after "?dataId=" in the command.

The ONE DATA access token is the one you extracted following the steps in the above mentioned article (with the leading Bearer).
Finally you need to define the columns you want to create the index on. To do this you can add {"name":"yourColumnName"} under columns. In the snippet above an index is created on the columns country and region. You can specify a single column for the index or multiple ones - depending on the columns listed in the warning shown in the App Builders error console. Once you executed the command it will output similar to this:

...  
< HTTP/1.1 200  
< Access-Control-Allow-Origin: *  
< Vary: Origin  
< Access-Control-Max-Age: 31536000  
< Access-Control-Allow-Credentials: true  
< Access-Control-Expose-Headers: Authorization  
< Access-Control-Expose-Headers: x-auth-token  
< Authorization: Bearer eyJhbGciO...  
...  


If the output says HTTP/1.1 200 then the index was successfully created and will be used by server-side filtering for improved performance. Also the warning in the error console should disappear after a short time.


Create Database Indexes with Postman

The same goal can be achieved using Postman. Basically it works the same way, the only difference is that you have a user interface to enter the information.

First, create a new request clicking the "New" button in the upper left corner. Then you can fill in the required information.

As request method, "PUT" needs to be selected. In the textbox we specify the API root URL, followed by the path "/data/content/db-index". And in the "Params" menu below, we add the data table id as parameter.

Note that as soon as the parameter is inserted, it will be added to the request URL above.

In the "Authorization" tab, we choose Bearer Token as type and enter the token in the textbox at right hand side (without the "Bearer" prefix).


Last but not least, within the "Body" tab we need to specifiy the columns to be indexed. We choose tho body type "raw" in combination with JSON format and enter the columns JSON object in the text field below like that:


Then you can click "Send" to send the index creation request to ONE DATA. If you receive a response with status code 200 the request succeeded. If not, check which error message you received from the server and act accordingly.


Summary

This section showed how you can improve and optimize your apps performance by using server-side filtering. By defining appropriate database indices the performance of your app can be speed up even further.


Was this article helpful?

That’s Great!

Thank you for your feedback

Sorry! We couldn't be helpful

Thank you for your feedback

Let us know how can we improve this article!

Select atleast one of the reasons

Feedback sent

We appreciate your effort and will try to fix the article