Programming ETL
  • 17 Jul 2024

This help page describes Data Settings > Programming ETL in STEP 2 of the transfer settings.
Programming ETL allows you to execute programs you have written on the data retrieved from the transfer source.
This enables flexible transformations that cannot be performed with template ETL.

Plan Restrictions

Programming ETL is available only for contracted accounts on the Advanced plan and above.


Languages Supported

  • Python 3.9
  • Ruby 3.2
  • Ruby 2.7

About external libraries

By default, only the standard library is available.
If there is an external library you would like to use, please contact our Customer Success with the following information:

  • Official URL of the external library
  • Version to be used

Libraries included in the standard library do not require a request.
Before requesting the addition of a library, please check each language's official documentation to see whether the library is already part of the standard library.

Constraints

  • Transformations can only be written on a row-by-row basis.
    • The result of processing one row cannot be carried over to a subsequent row.
    • Aggregate processing that spans multiple rows therefore cannot be written.
  • The memory limit for program execution is approximately 1 GB.
  • Schema generation when creating or editing transfer settings times out 10 minutes after execution starts.
  • In contrast, there is no timeout for programming ETL processing during transfer job execution.
    • Processing that takes about one minute per row will run without problems.
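The row-by-row constraint can be sketched as follows; the column names and the tax-rate calculation are hypothetical, not part of the product:

```python
# A minimal sketch of the row-by-row constraint: each row is transformed
# independently, so no state can be carried between rows.

def transform_row(row):
    # Allowed: the result depends only on the current row.
    row["price_with_tax"] = int(row["price"] * 1.1)
    return row

rows = [{"price": 100}, {"price": 200}]
transformed = [transform_row(dict(r)) for r in rows]
# A running total across rows (aggregate processing) cannot be expressed,
# because transform_row never sees more than one row at a time.
```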
Processing order of data settings at job execution

Template ETL is applied first to the data retrieved from the transfer source; programming ETL is applied afterward.

Setup procedure

Programs written in programming ETL have two roles.

  • When creating or editing transfer settings, the program is used to generate the schema.
  • When a transfer job is executed, processing is performed on the data acquired from the transfer source.

The following describes the procedure for creating a schema when creating or editing transfer settings.

1. Input data generation

Click Generate Input Data.
Data with the template ETL settings applied is generated in JSON format as the input data for schema generation.
For details, see Format of Input Data for Schema Generation below.

Save input data

Clicking Save Input Data saves the input data at that point.
This is useful when you want to leave the screen after editing the input data and resume editing later.

2. Writing the source code

Write the program in the source code.
For details, see the role of each element in the source code below.

3. Schema generation

Click Generate Schema.
The program is executed on the input data generated in step 1.
When execution is complete, the result is displayed in JSON format as the output data for schema generation.
The schema will also be updated when programming ETL is enabled.

Schema generation timeout

Schema generation is designed to time out after 10 minutes of execution.
In the unlikely event of a timeout, edit the source code so that the execution time is within 10 minutes.

Note that this timeout specification is set only for the schema generation process.
No timeout settings are provided for programming ETL processing during transfer job execution.

4. Editing the schema

Edit the updated schema when programming ETL is enabled as needed.

If you added a date/time string column in the source code

If you add a date/time string column in the source code, it is recognized as string type by default.
To have it recognized as timestamp type, select timestamp type in the schema when programming ETL is enabled and enter the corresponding date/time format.
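To illustrate what it means for a format to match a date/time string, here is a small sketch; the column value is hypothetical, and whether the schema UI accepts strftime-style directives is an assumption, not stated by this page:

```python
from datetime import datetime

# Hypothetical date/time string added as a column in the source code.
value = "2024-07-17 09:30:00"

# A format that describes that string, using strftime-style directives
# (assumption: the schema's date/time format field works the same way).
parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
```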

5. Saving the schema

Click Preview Changes or Confirm Changes.
Whichever you click, the programming ETL settings are saved.
To see the data with programming ETL applied, click Preview Changes.

Schema when programming ETL is enabled

If programming ETL is disabled, transfers are based on the schema set in the column definition.
On the other hand, if programming ETL is enabled, the transfer is based on the schema when programming ETL is enabled.
After editing the schema when programming ETL is enabled, if you have changed the settings related to columns on the template ETL side, please perform the configuration procedure again.

Format of input data for schema generation

  • Input data is stored in a variable with an array (list) structure called rows.
  • Each element of rows corresponds to the data in each row, and each element is a hash (dictionary) structure with columns as keys.
rows = [
  {<row1_columnA>: <value>, 
   <row1_columnB>: <value>, 
   <row1_columnC>: <value>
  },
  {<row2_columnA>: <value>, 
   <row2_columnB>: <value>, 
   <row2_columnC>: <value>
  },
  {<row3_columnA>: <value>, 
   <row3_columnB>: <value>, 
   <row3_columnC>: <value>
  }
]
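As a concrete illustration, input data with three rows and hypothetical columns id, name, and signup would look like this as a Python structure:

```python
# Hypothetical input data: rows is a list (array) of dicts (hashes),
# one element per row, keyed by column name.
rows = [
    {"id": 1, "name": "Alice", "signup": "2024-07-17"},
    {"id": 2, "name": "Bob", "signup": "2024-07-18"},
    {"id": 3, "name": "Carol", "signup": "2024-07-19"},
]
```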

Role of each element in the source code

In the following, Python is used as an example.

from etl_base import EtlBase

class Etl(EtlBase):
  def __init__(self):
    pass

  def transform_row(self, row):
    row["columnA"] = "hoge"     # Change the value of an existing column
    row["new_column"] = "fuga"  # Add a new column
    del row["columnC"]          # Delete an existing column
    return row

  def before_action(self):
    pass

EtlBase and Etl classes

The EtlBase and Etl classes are the necessary elements to perform programming ETL.
Do not delete or rename.

transform_row method

Please describe the process in this method as appropriate.
When the program is executed, each element of the rows is passed to the transform_row method one row at a time.
Each element of rows is passed as row. The following description assumes that one row of data has been passed as row.

  • Rewriting the value of a key that exists in a row changes the value of the corresponding column.
  • Adding a new key to row adds a new column with the key name as the column name.
  • Deleting a key that exists in a row deletes the corresponding column.

Note that the schema is created and transferred based on the return value of the transform_row method.
Therefore, be sure to return the processed row.
If a null value (None in Python, nil in Ruby) is returned, the transfer of the corresponding row is skipped.
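A sketch of skipping rows this way; EtlBase is omitted so the example runs standalone, and the status column and its values are hypothetical:

```python
class Etl:  # EtlBase omitted so this sketch runs standalone
    def transform_row(self, row):
        if row["status"] == "deleted":
            return None  # this row is skipped and not transferred
        return row

etl = Etl()
rows = [{"id": 1, "status": "active"}, {"id": 2, "status": "deleted"}]
kept = [r for r in (etl.transform_row(row) for row in rows) if r is not None]
```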

before_action method

It is executed only once, before any transform_row calls.
Within the before_action method, you can assign instance variables.
These values can then be used within the transform_row method.
Use this method when you want to perform computation once before the rows are processed sequentially.

def before_action(self):
  self.hoge = "hoge"
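Putting the two methods together, here is a standalone sketch (EtlBase omitted, column names and the rate value hypothetical) of using a value set in before_action from transform_row:

```python
class Etl:  # EtlBase omitted so this sketch runs standalone
    def before_action(self):
        # Runs once, before any row is processed.
        self.rate = 1.1

    def transform_row(self, row):
        # The instance variable set in before_action is available here.
        row["price_with_tax"] = round(row["price"] * self.rate)
        return row

etl = Etl()
etl.before_action()
out = [etl.transform_row({"price": p}) for p in (100, 250)]
```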

Supplementary information

Manual editing of input data for schema generation

Input data for schema generation can also be edited manually.
By editing the input data for schema generation as needed and generating a schema, it is possible to test whether the written program is properly executed.

However, generating the schema also updates the schema when programming ETL is enabled.
If the structure of the data retrieved from the transfer source at transfer job execution differs from the schema when programming ETL is enabled, the transfer will fail.
Therefore, manual editing of the input data for schema generation should be used only for testing.

Standard output/Error output

If an output function/method such as print is written in the source code, its output result will be displayed in the standard output/error output.
Note that error messages will also be displayed in the standard output/error output if the schema generation fails.

