One Data Python Framework (ODPF)

Modified on Thu, 30 Jun 2022 at 01:33 PM

Content

  1. Overview
  2. Usage of Global Variables Within a Python Script
    1. OD Variables
    2. Examples
  3. Input
  4. Output
    1. Dataset
    2. Image
    3. Model
  5. Dataset Structure
    1. Available Operations
    2. Restrictions
  6. Example


Overview

This article covers the functionality of the ONE DATA Python Framework that is available for the Python Scripts within the ONE DATA (OD) Python Processors. 

It can be used to:

  • access datasets passed from OD
  • use OD Variables inside scripts
  • push any of the following structures to OD:
    • data tables
    • images
    • Models

Additionally, any output to Python's stdout (e.g. print() calls) and stderr (e.g. raised Exceptions) is also passed back to the OD server.


Usage of Global Variables Within a Python Script

The OD Python Framework registers the following global variables which are accessible from Python scripts:

  • od_input: Dict structure with keys being dataset names (as defined in the OD Python Processor) and values being instances of the framework's dataset structure.
  • od_output: Contains a set of methods for adding supported output structures.

In the next sections, the listed possibilities are explained more in-depth.


More information about Variables in ONE DATA can be found here.



OD Variables


ONE DATA Variables can be used in the Python script by directly using @variable_name@, where "variable_name" is the technical name of the Variable as defined in OD. 

When the Python Processor runs, @variable_name@ is replaced by the Variable's value in plain text form.


Example:

Assume "variable_name" to be an OD Variable of type string with value "abc". Then:

variable_value = @variable_name@

Would be equivalent to:

variable_value = abc
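
The replacement can be pictured as a plain text substitution on the script source before it runs. The sketch below imitates this with a hypothetical helper, substitute_variables, which is not part of ODPF; the real replacement is done by ONE DATA. It only illustrates why a string Variable needs quotes around the placeholder while a numeric one does not.

```python
def substitute_variables(script_source, variables):
    """Replace every @name@ placeholder with the Variable's plain-text value.

    Hypothetical helper for illustration only; the real replacement is
    performed by ONE DATA before the script is executed.
    """
    for name, value in variables.items():
        script_source = script_source.replace("@" + name + "@", str(value))
    return script_source


variables = {"stringVariable": "abc", "intVariable": 100}

# Without quotes the string value ends up as a bare identifier.
unquoted = substitute_variables("value = @stringVariable@", variables)
# With quotes it becomes a valid Python string literal.
quoted = substitute_variables('value = "@stringVariable@"', variables)
# Numeric values are already valid Python literals.
number = substitute_variables("value = @intVariable@", variables)

print(unquoted)  # value = abc
print(quoted)    # value = "abc"
print(number)    # value = 100
```

Executing the unquoted line would make Python look up a name called abc, which is usually not what is intended.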


This means that OD Variables may have to be converted to Python values before they can be used in a script. Below are examples of how to do this.



Examples

String OD Variable

Name | Technical Name | Data Type | Value
String Variable | stringVariable | string | "string"
variable_value = "@stringVariable@" # "string": str


Integer OD Variable

Name | Technical Name | Data Type | Value
Integer Variable | intVariable | int | 100
variable_value = @intVariable@ # 100: int, no conversion necessary


Double OD Variable

Name | Technical Name | Data Type | Value
Double Variable | doubleVariable | double | 100.0
variable_value = @doubleVariable@ # 100.0: float, no conversion necessary


Boolean OD Variable

Name | Technical Name | Data Type | Value
Boolean Variable | booleanVariable | boolean | true
variable_value = "@booleanVariable@" == "true" # True: bool

DateTime OD Variable 


Converting datetime Variables differs based on the format of the Variable. All of the following examples require this import:

from datetime import datetime # additional timezone import required for long format


Format | Name | Technical Name | Data Type | Value
string | Date Variable | dateVariable | datetime | 09/18/2018 00:00:00.0
variable_value = datetime.strptime("@dateVariable@", "%m/%d/%Y %H:%M:%S.%f") # 2018-09-18 00:00:00 : datetime (explicit %H:%M:%S avoids the locale-dependent %X)


Format | Name | Technical Name | Data Type | Value
long | Date Variable | dateVariable | datetime | 1537228800000
variable_value = datetime.fromtimestamp(@dateVariable@/1000, timezone.utc) # 2018-09-18 00:00:00+00:00 : datetime, requires additional timezone import from datetime


Format | Name | Technical Name | Data Type | Value
Spark_Timestamp | Date Variable | dateVariable | datetime | 2018-09-18 00:00:00
variable_value = datetime.fromisoformat("@dateVariable@") # 2018-09-18 00:00:00 : datetime


Format | Name | Technical Name | Data Type | Value
Oracle_Date | Date Variable | dateVariable | datetime | TO_DATE('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS')
variable_value = datetime.fromisoformat("@dateVariable@"[9:-26]) # 2018-09-18 00:00:00 : datetime


Format | Name | Technical Name | Data Type | Value
Oracle_Timestamp | Date Variable | dateVariable | datetime | TO_TIMESTAMP('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS')
variable_value = datetime.fromisoformat("@dateVariable@"[14:-26]) # 2018-09-18 00:00:00 : datetime
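
The slice indexes used for the two Oracle formats depend on the exact length of the surrounding TO_DATE(...) or TO_TIMESTAMP(...) text. As an alternative, the date part could be extracted with a regular expression. The sketch below uses a hypothetical helper, parse_oracle_datetime, and assumes that the embedded format is always 'YYYY-MM-DD HH24:MI:SS' as in the examples above.

```python
import re
from datetime import datetime

# Matches TO_DATE('...','...') and TO_TIMESTAMP('...','...') and captures
# the first quoted argument, i.e. the date string itself.
_ORACLE_CALL = re.compile(r"^TO_(?:DATE|TIMESTAMP)\('([^']*)','[^']*'\)$")


def parse_oracle_datetime(text):
    """Hypothetical helper: extract and parse the date from an Oracle-style value."""
    match = _ORACLE_CALL.match(text)
    if match is None:
        raise ValueError("unexpected Oracle date format: " + text)
    # Assumption: the date part always follows YYYY-MM-DD HH24:MI:SS.
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


# Inside a Python Processor the argument would be "@dateVariable@".
as_date = parse_oracle_datetime(
    "TO_DATE('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS')")
as_timestamp = parse_oracle_datetime(
    "TO_TIMESTAMP('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS')")

print(as_date)       # 2018-09-18 00:00:00
print(as_timestamp)  # 2018-09-18 00:00:00
```

Unlike the slice-based version, this raises a ValueError when the value does not have the expected shape instead of silently parsing the wrong substring.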


Datatable OD Variable

NameTechnical NameData TypeValue
Datatable Variable | datasetVariable | dataset | UUID of a dataset (e.g. b9edb619-3a26-4113-be5d-74241d1fa0f6)
import uuid

variable_value = str("@datasetVariable@") # "b9edb619-3a26-4113-be5d-74241d1fa0f6": str
variable_value = uuid.UUID("@datasetVariable@") # UUID('b9edb619-3a26-4113-be5d-74241d1fa0f6'): UUID

Input

Only datasets can be passed from OD to Python scripts. To get access to a registered dataset use:

dataset = od_input['dataset-name']

where "dataset-name" is a name registered in the OD Python Processor configuration for a specific input dataset.

The returned value is an instance of the framework's dataset structure.


Output

Dataset

To pass a dataset created in a Python script back to OD, use:

od_output.add_data("dataset-name", dataset, col_names)

where

  • "dataset-name" is the name under which it will be available in an OD Processor (e.g. for registering with the Processor's output)
  • "dataset" is one of the following:
    • an ODPF dataset structure
    • 2D matrix of data as list of rows where each row is a list of column values
    • Pandas DataFrame
  • "col_names" is an optional list of column names (required for 2D matrix content)
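
When the data exists as a list of dicts, it first has to be brought into the 2D matrix form with a matching list of column names. The sketch below shows one way to do this with a hypothetical helper, records_to_matrix, which is not part of ODPF. The call to od_output.add_data is left as a comment because that global only exists inside a Python Processor.

```python
def records_to_matrix(records):
    """Turn a list of dicts into (rows, col_names) for the 2D matrix form.

    Hypothetical helper, not part of ODPF. Column order follows the keys
    of the first record; every record must have exactly the same keys.
    """
    if not records:
        raise ValueError("at least one record is needed to derive column names")
    col_names = list(records[0].keys())
    rows = []
    for record in records:
        if set(record.keys()) != set(col_names):
            raise ValueError("all records must share the same keys")
        rows.append([record[name] for name in col_names])
    return rows, col_names


records = [
    {"id": 1, "score": 2.0, "label": "test"},
    {"id": 2, "score": 3.0, "label": "sample"},
]
rows, col_names = records_to_matrix(records)

print(col_names)  # ['id', 'score', 'label']
print(rows)       # [[1, 2.0, 'test'], [2, 3.0, 'sample']]

# Inside a Python Processor the result could then be passed on:
# od_output.add_data("dataset-name", rows, col_names)
```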


Image

To pass an image from a Python script back to OD, use:

from onelogic.odpf import ImageType
image_type = ImageType.PNG # or ImageType.JPG
od_output.add_image("image-name", image_type, image_content)

where

  • "image-name" is the name under which the image will be available in an OD Processor (e.g. for adding to a report)
  • "image-type" is the type of the image (one of the values in onelogic.odpf.ImageType: JPG or PNG)
  • "image-content" is one of the following:
    • a byte array with the image content
import io
from PIL import Image

roi_img = Image.new('RGB', (60, 30), color='red')
image_bytearr = io.BytesIO()
roi_img.save(image_bytearr, format='PNG')
image_content = image_bytearr.getvalue()


    • a matplotlib Figure object as a result of plot creation
import pandas as pd      

df = pd.DataFrame({'lab':['A', 'B', 'C'], 'val':[10, 30, 20]})
ax = df.plot.bar(x='lab', y='val', rot=0)
image_content = ax.get_figure()


Model

To pass Model data from a Python script to OD, use:

od_output.add_model("model-name", model_content)

where

  • "model-name" is the name under which the Model will be available in an OD Processor (e.g. for exporting / saving it for later use)
  • "model-content" is the content of the Model (as string)


Model Groups

To pass Model Group data from a Python script to OD, use:

od_output.add_model("model-name", "model-content", "model-group-name")

where

  • "model-name" is the name under which the Model will be available in an OD Processor (e.g. for exporting / saving it for later use)
  • "model-content" is the content of the Model (as string)
  • "model-group-name" is the name of the Model Group to which the Model will be assigned.


For now, a Model in the OD Python Framework can be an arbitrary string. This may change in future versions!
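
Since the Model content has to be a string, a Python object needs to be encoded as text before it is passed on. The sketch below shows one possible approach using pickle and Base64 from the standard library. The helper names model_to_string and model_from_string are hypothetical and not part of ODPF, and a plain dict stands in for a trained model.

```python
import base64
import pickle


def model_to_string(model):
    """Pickle a Python object and encode the bytes as ASCII text."""
    return base64.b64encode(pickle.dumps(model)).decode("ascii")


def model_from_string(text):
    """Inverse of model_to_string. Only unpickle content from a trusted source."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


# A plain dict stands in for a trained model here.
model = {"coefficients": [0.5, -1.25], "intercept": 2.0}
model_content = model_to_string(model)
restored = model_from_string(model_content)

print(type(model_content).__name__)  # str
print(restored == model)             # True

# Inside a Python Processor the string could then be passed on:
# od_output.add_model("model-name", model_content)
```

For models that consist only of numbers and strings, json.dumps would be a more readable alternative to pickle.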


Dataset Structure

The Input and Output datasets are represented as onelogic.odpf.ODDataset.

Within Python scripts, datasets can use any available Python / Numpy / Pandas data type. As OD does not support all of these types, datasets used for OD input / output are deserialized / serialized in the following manner:

  • od_input
OD Type | ODDataset Column Type
INT | np.int64 / np.float64 (if None values present)
DOUBLE / Numeric | np.float64
DATETIME | np.datetime64
STRING | np.object
  • od_output
ODDataset Column Type | OD Type
any integer type | INT
any floating point type | DOUBLE
any datetime type | DATETIME
np.object / string type | STRING
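
One consequence of the od_input mapping is that an INT column containing missing values arrives as floating point numbers. The sketch below shows how such values could be turned back into integers. It assumes that missing entries show up as NaN, which is the usual Numpy / Pandas convention but is not stated in this article; to_int_or_none is a hypothetical helper.

```python
import math


def to_int_or_none(value):
    """Convert a float from a nullable INT column back to int, or None if missing."""
    # Assumption: missing values are represented as NaN (or None).
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return int(value)


# Values as they might appear in a float64 column that was INT in OD.
column = [1.0, float("nan"), 3.0]
restored = [to_int_or_none(v) for v in column]

print(restored)  # [1, None, 3]
```

The isinstance check also covers np.float64 values, because np.float64 is a subclass of Python's float.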


Available Operations

Create new dataset

A new dataset can be created in two ways:

  • 2D matrix representation of data (list of rows where each row is a list of column values) and list of column names
from onelogic.odpf.common import ODDataset
from datetime import datetime

dataset = ODDataset(
    [[1, 2.0, "test", datetime.now()],
     [2, 3.0, "sample", datetime.now()]],
    ["int_col", "double_col", "str_col", "timestamp_col"])


  • Pandas DataFrame
from onelogic.odpf.common import ODDataset
from datetime import datetime
from pandas import DataFrame

d = {'int_col': [1, 2], 'double_col': [2.0, 3.0], 'str_col': ['test', 'sample'],
     'timestamp_col': [datetime.now(), datetime.now()]}
dataset = ODDataset(DataFrame(data=d))


Current restrictions:

  • If content is passed as a 2D matrix, column names must be specified, and the list of names must have the same length as each row
  • Data types in columns must be of a supported type
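
The first restriction can be checked before the dataset is constructed. The following sketch uses a hypothetical pre-check, check_matrix, which is not part of ODPF. It only verifies the shape of the data, not the column types.

```python
def check_matrix(rows, col_names):
    """Raise ValueError if a row's length differs from the number of column names.

    Hypothetical pre-check, not part of ODPF.
    """
    if not col_names:
        raise ValueError("column names must be specified for 2D matrix content")
    for index, row in enumerate(rows):
        if len(row) != len(col_names):
            raise ValueError(
                "row %d has %d values, expected %d"
                % (index, len(row), len(col_names)))


# A well-formed matrix passes silently.
check_matrix([[1, 2.0], [2, 3.0]], ["int_col", "double_col"])

# A row that is too short is reported with its index.
try:
    check_matrix([[1, 2.0], [2]], ["int_col", "double_col"])
except ValueError as error:
    message = str(error)
    print(message)  # row 1 has 1 values, expected 2
```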


Get list of column names

To retrieve a list of the dataset's column names call:

column_names = dataset.column_names()


Get dataset as 2D matrix

To retrieve values of the dataset as a list of rows where each row is a list of column values (in same order as column names), call:

matrix = dataset.get_as_matrix()
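
Because the rows come back as plain lists, the position of a column has to be looked up via column_names(). The sketch below pairs both into one dict per row. StubDataset is a stand-in defined only so that the snippet runs outside ONE DATA; it mimics just the two methods described above and is not the real ODDataset.

```python
class StubDataset:
    """Stand-in offering only column_names() and get_as_matrix()."""

    def __init__(self, rows, col_names):
        self._rows = rows
        self._col_names = col_names

    def column_names(self):
        return list(self._col_names)

    def get_as_matrix(self):
        return [list(row) for row in self._rows]


def rows_as_dicts(dataset):
    """Pair each row's values with the column names."""
    names = dataset.column_names()
    return [dict(zip(names, row)) for row in dataset.get_as_matrix()]


# Inside a Python Processor, od_input['dataset-name'] would take the stub's place.
dataset = StubDataset([[1, "test"], [2, "sample"]], ["int_col", "str_col"])
records = rows_as_dicts(dataset)

print(records[1]["str_col"])  # sample
```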


Get dataset as Pandas DataFrame

To retrieve the values of a dataset as a Pandas DataFrame, call:

df = dataset.get_as_pandas()


Important: Columns in the returned Pandas DataFrame are in arbitrary order, so to access values of specific columns, the column name should be used instead of indexes.


Restrictions

get_as_* Operations

Calling any of the get_as_* operations on an ODDataset returns a copy of the dataset's actual state. After this, any changes done to a 2D matrix or a DataFrame version of the dataset are not synchronized with the original!


A copy of the inner ODDataset representation is created only if necessary. Once a copy is created, it is stored separately from the original inner representation within the ODDataset. Below is the table of get_as_* operations behaviour based on the ODDataset origin:


ODDataset origin | Get as 2D matrix | Get as Pandas DataFrame
ODDataset constructed with 2D matrix input | no copy | copy
ODDataset constructed with DataFrame | copy | no copy
ODDataset from od_input | copy | no copy


Example

The following Python script is a simple example that shows how the input / output global variables are used to pass data between the OD server and the script.

import io
from PIL import Image
from onelogic.odpf import ImageType
from onelogic.odpf.common import ODDataset
from datetime import datetime
import pandas as pd

print("Hello, world!")

# print input dataset as list of lists
print(od_input['input'].get_as_matrix())

# print input dataset as Pandas DataFrame
print(od_input['input'].get_as_pandas())

od_output.add_model("model", "This is a Model content.")

df = pd.DataFrame({'lab':['A', 'B', 'C'], 'val':[10, 30, 20]})
ax = df.plot.bar(x='lab', y='val', rot=0)
od_output.add_image("plot", ImageType.JPG, ax.get_figure())

roi_img = Image.new('RGB', (60, 30), color='red')
image_bytearr = io.BytesIO()
roi_img.save(image_bytearr, format='PNG')
image_bytearr = image_bytearr.getvalue()
od_output.add_image("image", ImageType.PNG, image_bytearr)

od_output.add_data("dataset", ODDataset(
    [[1, 2.0, "test", datetime.now()],
     [2, 3.0, "sample", datetime.now()]],
    ["int_col", "double_col", "str_col", "timestamp_col"]))

The following happens during / after script execution:

  1. The string "Hello, world!" is printed out.
  2. The input dataset "input" is printed to stdout as a 2D matrix and then as a Pandas DataFrame.
  3. The Model "model" with content "This is a Model content." is added to the output of the script.
  4. A JPG image containing a bar plot of a sample dataset is added to the script's output under the name "plot".
  5. A PNG image with a 60 x 30 red rectangle is added to the output under the name "image".
  6. A new dataset (2 rows; 4 columns) with all supported data types is added to the script's output as "dataset".
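
The globals od_input and od_output only exist inside a Python Processor. To try out parts of a script locally, simple stand-ins could be defined first. The sketch below is one hypothetical way to do this: StubOutput merely records calls, the parameter names col_names and model_group_name are assumptions based on the descriptions in this article, and a plain list replaces the ODDataset that the real od_input would hold.

```python
class StubOutput:
    """Records calls instead of sending anything to ONE DATA."""

    def __init__(self):
        self.data = {}
        self.images = {}
        self.models = {}

    def add_data(self, name, dataset, col_names=None):
        self.data[name] = (dataset, col_names)

    def add_image(self, name, image_type, image_content):
        self.images[name] = (image_type, image_content)

    def add_model(self, name, model_content, model_group_name=None):
        self.models[name] = (model_content, model_group_name)


# Hypothetical local setup: the names match the globals ODPF registers.
od_input = {"input": [[1, "a"], [2, "b"]]}  # plain lists instead of ODDataset
od_output = StubOutput()

# A fragment of script logic that can now run without the server.
od_output.add_model("model", "This is a Model content.")
od_output.add_data("dataset", od_input["input"], ["id", "label"])

print(sorted(od_output.models))     # ['model']
print(od_output.data["dataset"][1])  # ['id', 'label']
```

The stand-ins must not be included when the script runs in ONE DATA, since they would overwrite the globals provided by the framework.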

