About input file format settings

Prev Next

summary

In the Data Source File/Storage Connector, the input file format can be set in STEP 1 of ETL Configuration, and detailed settings for the input file can be configured in Input Option in STEP 2.
This page describes the input file formats and Connectors supported by TROCCO, as well as the various Input Option setting items.

Input file format

In STEP 1 of ETL Configuration editing, you can choose from the following

  • CSV/TSV
  • JSON Lines
  • JSONPath
  • LTSV
  • Microsoft Excel (xlsx or xls file)
  • XML (beta version)
About XML file transfer

The XML file format transfer function is available as a beta version.
Specifications are subject to change when the official version is released.

Connectors covered in this article

Setting items

Setting items vary depending on the input file format.

CSV/TSV

Embulk's csv-parser-plugin is used.

STEP2 Input Option

item name default value Description.
delimiter , CSV Data Setting
quotation marks " You can set quotation marks.
escape character \ Escape characters can be set.
String to be replaced by NULL Not set You can set the string to be replaced by NULL.
If you select "Set", you can enter the string to be replaced.
Number of header rows to skip 0 Sets the number of rows to skip.
For example, if the first line is the item name and you do not want the item name line to be included in the ETL Configuration Data Setting, set 1.
Remove white space from value if no quotation marks are present No, sir. You can set whether or not whitespace should be removed from the value if it is not quoted.
Processing method for irregular quarts ACCEPT_ONLY_RFC4180_ESCAPED Sets how to handle the presence of irregular quotes in quoted fields.
See quotes_in_quoted_fields incsv-parser-pluginfor details.
comment line marker - If the character set here is at the beginning of a line, skip that line.
Processing rows with fewer columns Treat as invalid record If you choose to process as invalid records, when there are records with insufficient columns, the processing of the corresponding rows will be skipped.
If NULL completion of missing columns is selected, when a record with missing columns exists, NULL values are completed and processing continues.
Processing rows with more columns Treat as invalid record If you choose to process as invalid records, when a record with an excess number of columns exists, the processing of the corresponding row is skipped.
If you choose to ignore columns, when there is a record with an excess number of columns, the process continues ignoring the overflowing columns.
Maximum amount of data that can be enclosed in quotation marks 131072 Maximum amount of Data Setting (in bytes) that can be quoted.
If there is a larger amount of data than the value set here, the processing of that row is skipped.
Whether or not to abort the transfer if an invalid record is found. Abort transfer If you choose to abort the transfer, the transfer will be aborted when an invalid record is found.
If you choose to continue processing, when an invalid record is found, the NULL value is completed and processing continues.
Default time zone UTC Time zone can be set to be used in a timestamp type column when the read Data itself has no information about time zone.
Date Initialization 1970-01-01 In the date column, you can set a default value if the date is not recognized.
new line CRLF Set the rules for line breaks among CRLF, LF, and CR.
character encoding - You can set the character encoding method.
If not entered, it will be automatically guessed during Automatic Data Setting.

JSON Lines

embulk-parser-jsonl.

STEP2 Input Option

item name default value Description.
Whether or not to abort the transfer if an invalid record is found. Abort transfer If you choose to abort the transfer, the transfer will be aborted when an invalid record is found.
If you choose to continue processing, when an invalid record is found, the NULL value is completed and processing continues.
Default time zone UTC Time zone can be set to be used in a timestamp type column when the read Data itself has no information about time zone.
new line CRLF You can set rules for line breaks among CRLF, LF, and CR.
character encoding - You can set the character encoding method.
If not entered, it will be automatically guessed during Automatic Data Setting.

JSONPath

embulk-parser-jsonpath.

STEP1 Basic settings

item name default value Description.
JSONPath - See Operators for information on how to write JSONPath.
*To specify all, specify $. *If you want to specify all, specify $.

STEP2 Input Option

item name default value Description.
root - This is the same setting as "JSONPath" in STEP1 Basic Settings.
Default time zone UTC Time zone can be set to be used in a timestamp type column when the read Data itself has no information about time zone.

LTSV

STEP2 Input Option

item name default value Description.
new line CRLF You can set rules for line breaks among CRLF, LF, and CR.
character encoding - You can set the character encoding method.
If not entered, it will be automatically guessed during Automatic Data Setting.

Microsoft Excel

embulk-parser-poi_excel.

STEP1 Basic settings

item name default value Description.
Sheet name - Enter the name of the sheet to be transferred.
Number of header rows to skip 1 Sets the number of rows to skip.
For example, if the first line is the item name and you do not want the item name line to be included in the ETL Configuration Data Setting, set 1.
Column Setting - Set the column name, column type, and method of retrieving values.
How to get the value,
  • If you choose to use cash
    • The transfer process is accelerated because the results of calculations are used at the time the Excel Files are saved locally.
  • When recalculation on transfer is selected
    • Recalculate formulas in cells during transfer.
Time zone for date and time columns Asia/Tokyo You can set the time zone to be used for timestamp type columns.
Function with different calculation result each time

If you select "Use Cache" in the column configuration, ETL Configuration will use the values at the time the Excel Files are saved locally for ETL Configuration.
Therefore, functions that change the result of calculation each time, such as date/time functions (e.g., TODAY function) and random number generation functions (e.g., RAND function), are not recalculated when the transfer is executed. Please note

Select Recalculate on Transfer if the formulas in the cells need to be recalculated when the transfer is executed.

XML (beta version)

embulk-parser-xml.

STEP1 Basic settings

item name default value Description.
Path of the root (XPath format) - Enter the root path to the node on which the Xpath search will be based.
If you want all nodes in the XML file to be searched, enter /.
Column Setting - Enter the path of the node you wish to retrieve in Xpath format.
For XPath notation, see XPath Syntax.
The column type can be selected from the following
  • string
  • long
  • timestamp
  • double
  • boolean
  • json
  • If timestamp is selected, also enter the format and time zone.
    About XML namespace

    You cannot specify a path in combination with a prefix defined in the XML namespace.