Programming ETL

TROCCO
- TROCCO
- COMETA
English

Programming ETL

Article summary

Did you find this summary helpful?

Thank you for your feedback!

summary

This help page explains DataSetting>Programming ETL inSTEP 2 of****ETL Configuration.
Programming ETL allows you to execute a program you have written against data retrieved from the Data Source.
Programming ETL allows for flexible conversion processes that cannot be performed with Template ETL.

Plan Restrictions

Programming ETL is available only for accounts subscribed to the Advanced plan or higher.

programming-etl-2024-08-29-16-6-0

Languages Supported

Python 3.9
Ruby 3.2
Ruby 2.7

About external libraries

By default, only the standard library is available.
If you have an external library you would like to use, please contact our Customer Success with the following information

Official URL of the external library
Version to be used

For standard libraries, there is no need to make a new request for additional libraries.
Before requesting the addition of a library, please check from the official documentation below whether the library in question is included in the standard library.

constraints

Only conversion processes can be written in single line units.
- It is not possible to carry over the results of one line of processing to a subsequent line.
- Therefore, it is not possible to describe aggregate processing that spans multiple lines. Please understand that we are not responsible for any loss or damage to your property.
The memory limit for program execution is approximately 1 GB.
The generation of the schema when creating or editing ETL Configuration is timed out after 10 minutes of execution.
On the other hand, there is no timeout setting for Programming ETL during ETL Job Setting.
- A processing time of about 1 minute per line will run without problems.

Processing Order of Data Setting during Job Setting

Template ETL processing is performed first on the Data Destination data.
Programming ETL is then processed.

Setup procedure

Programs written in Programming ETL have two roles.

When creating or editing ETL Configuration, the schema is used to create the schema.
When an ETL Job is executed, processing is performed on the data acquired from the Data Source.

The following describes the procedure for creating a schema when creating or editing an ETL Configuration.

1. input data generation

Click Generate Input Data.
Data with Template ETL settings reflected will be generated in JSON format in the form of input data for schema generation.
For details, see Format of Input Data for Schema Generation below.

Save input data

Clicking Save Input Data saves the input data at that point.
After editing input data, you may want to leave the screen and continue editing again.

2. source code description

Write the program in the source code.
For details, see the role of each element in the source code below.

3. schema generation

Click Generate Schema.
The program is executed on the input data generated in step 1.
When execution is complete, data in JSON format is displayed in the output data during schema generation.
The schema will also be updated when Programming ETL is enabled.

Schema generation timeout

Schema generation is designed to time out after 10 minutes of execution.
In the unlikely event of a timeout, edit the source code so that the execution time is within 10 minutes.

Note that this timeout specification is set only for the schema generation process.
There is no time-out setting for Programming ETL during ETL Job Setting.

Editing the schema

Edit the updated Programming ETL enabled schema accordingly.

If you added a column for date/time string in the source code

If you add a date/time string column in the source code, it is recognized as a string type by default.
If you want the date to be recognized as a timestamp type, select the timestamp type in the schema when Programming ETL is enabled, and enter the date and time format as appropriate.

5. schema storage

Click to preview or confirm****changes.
Whichever you click, your Programming ETL settings will be saved.
If you want to see the data with Programming ETL executed, click Preview Changes.

Schema when Programming ETL is enabled

If Programming ETL is disabled, transfers are based on the schema set in the Column Setting.
On the other hand, if Programming ETL is enabled, the transfer is based on the schema when Programming ETL is enabled.
After editing the schema when Programming ETL is enabled, if you have changed the settings regarding columns on the Template ETL side, please perform the configuration procedure again.

Format of input data for schema generation

Input data is stored in a variable with an array (list) structure called rows.
Each element of rows corresponds to the data in each row, and each element is a hash (dictionary) structure with columns as keys.

rows : [
  {<row1_columnA>: <value>, 
   <row1_columnB>: <value>, 
   <row1_columnC>: <value>
  },
  {<row2_columnA>: <value>, 
   <row2_columnB>: <value>, 
   <row2_columnC>: <value>
  },
  {<row3_columnA>: <value>, 
   <row3_columnB>: <value>, 
   <row3_columnC>: <value>
  }
]

Role of each element in the source code

In the following, Python is used as an example.

from etl_base import EtlBase

class Etl(EtlBase):
  def __init__(self):
    pass

  def transform_row(self, row):
    row["columnA"] = "hoge"    # Existing column's value is changed.
    row["new_column"] = "fuga" # New column is added.
    del row["columnC"]         # Existing column is deleted.
    return row

  def before_action(self):
    pass

`EtlBase` and `Etl` classes

The EtlBase and Etl classes are the necessary elements for executing Programming ETL.
Do not delete or rename.

`transform_row` method

Please describe the process in this method as appropriate.
When the program is executed, each element of the rows is passed to the transform_row method one row at a time.
By default, each element of rows is passed to row. The following description assumes that row data is passed to row.

Rewriting the value of a key that exists in a row changes the value of the corresponding column.
Adding a new key to row adds a new column with the key name as the column name.
Deleting a key that exists in a row deletes the corresponding column.

Note that the schema is created and transferred based on the return value of the transform_row method.
Therefore, RETURN the processed ROW.
If a NULL value is returned, the transfer of the corresponding line is skipped. ( None for Python and nil for Ruby)

`before_action` method

It is executed only once, before the transform_row method.
Within the before_action method, instance variables can be assigned.
The value can also be used within the transform_row method.
Use this function when you want to perform arbitrary arithmetic operations before sequential processing of rows.

def before_action(self):
  self.hoge = "hoge"

supplementary information

Manual editing of input data for schema generation

Input data for schema generation can also be edited manually.
By editing the input data for schema generation as needed and generating a schema, it is possible to test whether the written program is properly executed.

However, when the****schema is generated, it is also updated when Programming ETL is enabled.
If the structure of the data retrieved from the Data Source at the time the transfer job is executed differs from the schema used when Programming ETL is enabled, the transfer will fail.
Therefore, manual editing of input data for schema generation should be used only as a test.

Standard output/Error output

If an output function/method such as print is written in the source code, its output result will be displayed in the standard output/error output.
Note that error messages will also be displayed in the standard output/error output if the schema generation fails.

Was this article helpful?

What's Next

About input file format settings

Table of contents

summary
Languages Supported
constraints
Setup procedure
Specifications related to programming

Programming ETL

summary

Languages Supported

constraints

Setup procedure

1. input data generation

2. source code description

3. schema generation

Editing the schema

5. schema storage

Specifications related to programming

Format of input data for schema generation

Role of each element in the source code

EtlBase and Etl classes

transform_row method

before_action method

supplementary information

Manual editing of input data for schema generation

Standard output/Error output

What's Next

`EtlBase` and `Etl` classes

`transform_row` method

`before_action` method