Overview
This article covers the functionality of the ONE DATA Python Framework that is available for the Python Scripts within the ONE DATA (OD) Python Processors.
It can be used to:
- access datasets passed from OD
- use OD Variables
- push any of the following structures to OD:
- data tables
- images
- Models
Additionally, any output to Python's stdout (e.g. print() calls) and stderr (e.g. raised Exceptions) is also passed back to the OD server.
Usage of Global Variables Within a Python Script
The OD Python Framework registers the following global variables which are accessible from Python scripts:
- od_input: Dict structure with keys being dataset names (as defined in the OD Python Processors) and values being instances of the framework's dataset structure.
- od_output: Contains a set of methods for adding supported output structures.
In the next sections, the listed possibilities are explained more in-depth.
More information about Variables in ONE DATA can be found here.
OD Variables
ONE DATA Variables can be used in the Python script by directly using @variable_name@, where "variable_name" is the technical name of the Variable as defined in OD.
When the Python Processor runs, @variable_name@ is replaced by the Variable's value in plain text form.
Example:
Assume "variable_name" to be an OD Variable of type string with value "abc". Then:
variable_value = @variable_name@
Would be equivalent to:
variable_value = abc
This means that OD Variables may have to be converted to Python values before they can be used, for example by wrapping a string value in quotes. Below are examples of how to do this.
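The substitution can be pictured as a plain text replacement applied to the script before it runs. The sketch below imitates that step with a hypothetical `substitute_variables` helper (it is not part of the framework, and the actual mechanism inside OD is an assumption) to show why a string Variable needs surrounding quotes while a numeric one does not.

```python
import re

def substitute_variables(script: str, variables: dict) -> str:
    """Replace every @name@ placeholder with the Variable's plain-text value.

    Hypothetical stand-in for the substitution OD performs; unknown
    placeholders are left untouched.
    """
    def replace(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"@(\w+)@", replace, script)

variables = {"stringVariable": "abc", "intVariable": 100}

unquoted = substitute_variables("value = @stringVariable@", variables)
quoted = substitute_variables('value = "@stringVariable@"', variables)
number = substitute_variables("value = @intVariable@", variables)

print(unquoted)  # value = abc    (abc would be read as a Python name)
print(quoted)    # value = "abc"  (a valid string literal)
print(number)    # value = 100    (a valid int literal)
```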
Examples
String OD Variable
Name | Technical Name | Data Type | Value |
String Variable | stringVariable | string | "string" |
variable_value = "@stringVariable@" # "string": str
Integer OD Variable
Name | Technical Name | Data Type | Value |
Integer Variable | intVariable | int | 100 |
variable_value = @intVariable@ # 100: int, no conversion necessary
Double OD Variable
Name | Technical Name | Data Type | Value |
Double Variable | doubleVariable | double | 100.0 |
variable_value = @doubleVariable@ # 100.0: float, no conversion necessary
Boolean OD Variable
Name | Technical Name | Data Type | Value |
Boolean Variable | booleanVariable | boolean | true |
variable_value = "@booleanVariable@" == "true" # True: bool
DateTime OD Variable
Converting datetime Variables differs based on the format of the Variable. All of the following examples require:
from datetime import datetime, timezone # timezone is only needed for the long format
Format | Name | Technical Name | Data Type | Value |
string | Date Variable | dateVariable | datetime | 09/18/2018 00:00:00.0 |
variable_value = datetime.strptime("@dateVariable@", "%m/%d/%Y %H:%M:%S.%f") # 2018-09-18 00:00:00 : datetime
Format | Name | Technical Name | Data Type | Value |
long | Date Variable | dateVariable | datetime | 1537228800000 |
variable_value = datetime.fromtimestamp(@dateVariable@/1000, timezone.utc) # 2018-09-18 00:00:00+00:00 : datetime, requires additional timezone import from datetime
Format | Name | Technical Name | Data Type | Value |
Spark_Timestamp | Date Variable | dateVariable | datetime | 2018-09-18 00:00:00 |
variable_value = datetime.fromisoformat("@dateVariable@") # 2018-09-18 00:00:00 : datetime
Format | Name | Technical Name | Data Type | Value |
Oracle_Date | Date Variable | dateVariable | datetime | TO_DATE('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS') |
variable_value = datetime.fromisoformat("@dateVariable@"[9:-26]) # 2018-09-18 00:00:00 : datetime
Format | Name | Technical Name | Data Type | Value |
Oracle_Timestamp | Date Variable | dateVariable | datetime | TO_TIMESTAMP('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS') |
variable_value = datetime.fromisoformat("@dateVariable@"[14:-26]) # 2018-09-18 00:00:00 : datetime
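The slice indices in the two Oracle examples depend on the exact length of the wrapper text. As an alternative sketch, the timestamp can be pulled out with a regular expression. `parse_oracle_literal` is a hypothetical helper name, and the literal is written out by hand here where a script would use "@dateVariable@".

```python
import re
from datetime import datetime

def parse_oracle_literal(literal: str) -> datetime:
    """Extract the first quoted argument of TO_DATE(...) / TO_TIMESTAMP(...)."""
    match = re.match(r"TO_(?:DATE|TIMESTAMP)\('([^']*)'", literal)
    if match is None:
        raise ValueError(f"not an Oracle date literal: {literal!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")

# In a Python Processor this string would come from "@dateVariable@".
literal = "TO_TIMESTAMP('2018-09-18 00:00:00','YYYY-MM-DD HH24:MI:SS')"
parsed = parse_oracle_literal(literal)
print(parsed)  # 2018-09-18 00:00:00
```

The same helper handles both wrappers, so a change from Oracle_Date to Oracle_Timestamp does not require new slice offsets.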
Datatable OD Variable
Name | Technical Name | Data Type | Value |
Datatable Variable | datasetVariable | dataset | The value is the UUID of a dataset (e.g. b9edb619-3a26-4113-be5d-74241d1fa0f6) |
import uuid
variable_value = str("@datasetVariable@") # "b9edb619-3a26-4113-be5d-74241d1fa0f6": str
variable_value = uuid.UUID("@datasetVariable@") # UUID('b9edb619-3a26-4113-be5d-74241d1fa0f6'): UUID
Input
Only datasets can be passed from OD to Python scripts. To get access to a registered dataset use:
dataset = od_input['dataset-name']
where "dataset-name" is a name registered in the OD Python Processor configuration for a specific input dataset.
The returned value is an instance of the framework's dataset structure.
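Because od_input behaves like a dict, a mistyped dataset name surfaces as a bare KeyError. The following sketch wraps the lookup so that the error lists the registered names. `get_input_dataset` and `fake_od_input` are hypothetical names used for illustration; inside a Python Processor the real od_input global would be passed instead.

```python
def get_input_dataset(inputs: dict, name: str):
    """Return inputs[name], or raise a KeyError that lists the registered names."""
    try:
        return inputs[name]
    except KeyError:
        available = ", ".join(sorted(inputs)) or "<none>"
        raise KeyError(f"no input dataset {name!r}; available: {available}") from None

# Stand-in for the od_input global, which only exists inside a Python Processor.
fake_od_input = {"sales": object(), "customers": object()}

dataset = get_input_dataset(fake_od_input, "sales")
```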
Output
Dataset
To pass a dataset created in a Python script back to OD, use:
od_output.add_data("dataset-name", dataset, col_names)
where
- "dataset-name" is the name under which it will be available in an OD Processor (e.g. for registering with a Processor's output)
- "dataset" is one of the following:
- an ODPF dataset structure
- 2D matrix of data as list of rows where each row is a list of column values
- Pandas DataFrame
- "col_names" is an optional list of column names (required for 2D matrix content)
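When the data starts out as a list of records, it first has to be brought into the 2D matrix form plus a matching list of column names. A minimal sketch, assuming every record has the same keys; `records_to_matrix` is a hypothetical helper, not a framework function.

```python
from datetime import datetime

def records_to_matrix(records):
    """Turn a list of dicts into (rows, col_names) with a fixed column order."""
    if not records:
        return [], []
    col_names = list(records[0])  # column order taken from the first record
    rows = [[record[name] for name in col_names] for record in records]
    return rows, col_names

records = [
    {"id": 1, "score": 2.0, "label": "test", "seen": datetime(2018, 9, 18)},
    {"id": 2, "score": 3.0, "label": "sample", "seen": datetime(2018, 9, 19)},
]
rows, col_names = records_to_matrix(records)

# Inside a Python Processor the result could then be passed on:
# od_output.add_data("dataset-name", rows, col_names)
```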
Image
To pass an image from a Python script back to OD, use:
from onelogic.odpf import ImageType
image_type = ImageType.PNG # or ImageType.JPG
od_output.add_image("image-name", image_type, image_content)
where
- "image-name" is the name under which the image will be available in an OD Processor (e.g. for adding to a report)
- "image-type" is the type of the image (one of the values in onelogic.odpf.ImageType: JPG or PNG)
- "image-content" is one of the following:
- a byte array with the image content
import io
from PIL import Image
roi_img = Image.new('RGB', (60, 30), color='red')
image_bytearr = io.BytesIO()
roi_img.save(image_bytearr, format='PNG')
image_content = image_bytearr.getvalue()
- a matplotlib Figure object as the result of creating a plot
import pandas as pd
df = pd.DataFrame({'lab':['A', 'B', 'C'], 'val':[10, 30, 20]})
ax = df.plot.bar(x='lab', y='val', rot=0)
image_content = ax.get_figure()
Model
To pass Model data from a Python script to OD, use:
od_output.add_model("model-name", model_content)
where
- "model-name" is the name under which the Model will be available in an OD Processor (e.g. for exporting / saving it for later use)
- "model-content" is the content of the Model (as string)
Model Groups
To pass Model Group data from a Python script to OD, use:
od_output.add_model("model-name", "model-content", "model-group-name")
where
- "model-name" is the name under which the Model will be available in an OD Processor (e.g. for exporting / saving it for later use)
- "model-content" is the content of the Model (as string)
- "model-group-name" is the name of the Model Group to which the Model will be assigned.
For now, a Model in the OD Python Framework can be an arbitrary string. This may change in future versions!
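Since the Model content is a string, a Python object has to be turned into text before it can be passed on. One possible approach, shown here only as a sketch, is to pickle the object and Base64-encode the bytes. `model_to_string` and `model_from_string` are hypothetical helper names, and the dict below stands in for a real fitted model.

```python
import base64
import pickle

def model_to_string(model) -> str:
    """Pickle an object and encode the bytes as ASCII text."""
    return base64.b64encode(pickle.dumps(model)).decode("ascii")

def model_from_string(text: str):
    """Inverse of model_to_string. Only unpickle content from a trusted source."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))

# Hypothetical "model": a dict of fitted coefficients.
model = {"intercept": 0.5, "coefficients": [1.25, -3.0]}
model_content = model_to_string(model)

# Inside a Python Processor:
# od_output.add_model("model-name", model_content)
restored = model_from_string(model_content)
```

JSON would be a safer choice for models that consist only of plain numbers, lists and dicts, because unpickling executes code embedded in the payload.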
Dataset Structure
The Input and Output datasets are represented as onelogic.odpf.ODDataset.
Within Python scripts, datasets can use any available Python / Numpy / Pandas data type. As OD does not support all of these types, datasets used for OD input / output are deserialized / serialized in the following manner:
- od_input
OD Type | ODDataset Column Type |
INT | np.int64 / np.float64 (if None values present) |
DOUBLE / Numeric | np.float64 |
DATETIME | np.datetime64 |
STRING | np.object |
- od_output
ODDataset Column Type | OD Type |
any integer type | INT |
any floating point type | DOUBLE |
any datetime type | DATETIME |
np.object / string type | STRING |
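The od_output table can be read as a simple rule per value type. The sketch below applies that rule to plain Python scalars. It is a simplified illustration only: the framework works on column types, `od_type_for` is a hypothetical name, and rejecting bool is an assumption because the table does not list it.

```python
from datetime import datetime

def od_type_for(value) -> str:
    """Map a Python scalar to the OD type named in the od_output table."""
    if isinstance(value, bool):
        # bool is not listed in the table; rejecting it is an assumption made here.
        raise TypeError("bool has no OD type in the table above")
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, datetime):
        return "DATETIME"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"unsupported value type: {type(value).__name__}")

row = [1, 2.0, "test", datetime(2018, 9, 18)]
od_types = [od_type_for(value) for value in row]
print(od_types)  # ['INT', 'DOUBLE', 'STRING', 'DATETIME']
```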
Available Operations
Create new dataset
A new dataset can be created in two ways:
- 2D matrix representation of data (list of rows where each row is a list of column values) and list of column names
from onelogic.odpf.common import ODDataset
from datetime import datetime
dataset = ODDataset([[1, 2.0, "test", datetime.now()],
                     [2, 3.0, "sample", datetime.now()]],
                    ["int_col", "double_col", "str_col", "timestamp_col"])
- Pandas DataFrame
from onelogic.odpf.common import ODDataset
from datetime import datetime
from pandas import DataFrame
d = {'int_col': [1, 2], 'double_col': [2.0, 3.0], 'str_col': ['test', 'sample'],
'timestamp_col': [datetime.now(), datetime.now()]}
dataset = ODDataset(DataFrame(data=d))
Current restrictions:
- If content is passed as a 2D matrix, column names must be specified, and the number of column names must match the length of each row
- Data types in columns must be of supported type
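The first restriction can be checked before a dataset is constructed. A minimal sketch with a hypothetical `check_matrix` helper (not part of the framework), which reports the first row whose length does not match the column names:

```python
def check_matrix(rows, col_names):
    """Raise ValueError if any row's length differs from the number of columns."""
    if not col_names:
        raise ValueError("column names must be specified for 2D matrix content")
    for index, row in enumerate(rows):
        if len(row) != len(col_names):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(col_names)}"
            )

col_names = ["int_col", "double_col", "str_col"]
good_rows = [[1, 2.0, "test"], [2, 3.0, "sample"]]
bad_rows = [[1, 2.0, "test"], [2, 3.0]]

check_matrix(good_rows, col_names)  # passes silently

try:
    check_matrix(bad_rows, col_names)
except ValueError as error:
    message = str(error)
    print(message)  # row 1 has 2 values, expected 3
```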
Get list of column names
To retrieve a list of the dataset's column names call:
column_names = dataset.column_names()
Get dataset as 2D matrix
To retrieve values of the dataset as a list of rows where each row is a list of column values (in same order as column names), call:
matrix = dataset.get_as_matrix()
Get dataset as Pandas DataFrame
To retrieve the values of a dataset as a Pandas DataFrame, call:
df = dataset.get_as_pandas()
Important: Columns in the returned Pandas DataFrame are in arbitrary order, so to access values of specific columns, the column name should be used instead of indexes.
Restrictions
get_as_* Operations
Calling any of the get_as_* operations on an ODDataset returns a copy of the dataset's actual state. After this, any changes done to a 2D matrix or a DataFrame version of the dataset are not synchronized with the original!
A copy of the inner ODDataset representation is created only if necessary. Once a copy is created, it is stored separately from the original inner representation within the ODDataset. The table below shows how the get_as_* operations behave depending on the origin of the ODDataset:
ODDataset origin | Get as 2D matrix | Get as Pandas DataFrame |
ODDataset constructed with 2D matrix input | no copy | copy |
ODDataset constructed with DataFrame | copy | no copy |
ODDataset from od_input | copy | no copy |
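The practical consequence of working on a copy can be shown with plain Python lists. In this sketch `original` merely stands in for a dataset's inner state and `copy.deepcopy` stands in for a copying get_as_* call; neither reflects how the framework is implemented.

```python
import copy

original = [[1, "a"], [2, "b"]]   # stand-in for the dataset's inner state
matrix = copy.deepcopy(original)  # what a copying get_as_* call hands back

matrix[0][1] = "changed"          # edit the retrieved copy only

print(original[0][1])  # a
print(matrix[0][1])    # changed
```

Edits made to a retrieved matrix or DataFrame therefore only reach OD if the modified structure is passed back explicitly, for example with od_output.add_data.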
Example
The following Python script is a simple example that uses the input / output global variables to pass data between the OD server and the script.
import io
from PIL import Image
from onelogic.odpf import ImageType
from onelogic.odpf.common import ODDataset
from datetime import datetime
import pandas as pd
print("Hello, world!")
# print input dataset as list of lists
print(od_input['input'].get_as_matrix())
# print input dataset as Pandas DataFrame
print(od_input['input'].get_as_pandas())
od_output.add_model("model", "This is a Model content.")
df = pd.DataFrame({'lab':['A', 'B', 'C'], 'val':[10, 30, 20]})
ax = df.plot.bar(x='lab', y='val', rot=0)
od_output.add_image("plot", ImageType.JPG, ax.get_figure())
roi_img = Image.new('RGB', (60, 30), color='red')
image_bytearr = io.BytesIO()
roi_img.save(image_bytearr, format='PNG')
image_bytearr = image_bytearr.getvalue()
od_output.add_image("image", ImageType.PNG, image_bytearr)
od_output.add_data("dataset", ODDataset([[1, 2.0, "test", datetime.now()],
                                         [2, 3.0, "sample", datetime.now()]],
                                        ["int_col", "double_col", "str_col", "timestamp_col"]))
The following happens during / after script execution:
- The string "Hello, world!" is printed out.
- The input dataset "input" is printed to stdout as a 2D matrix and then as a Pandas DataFrame.
- The Model "model" with content "This is a Model content." is added to the output of the script.
- A JPG image containing a bar plot of a sample dataset is added to the script's output under the name "plot".
- A PNG image with a 60 x 30 red rectangle is added to the output under the name "image".
- A new dataset (2 rows, 4 columns) with all supported data types is added to the script's output as "dataset".