About input file format settings
  • 17 Jul 2024
  • PDF

About input file format settings

  • PDF

Article summary

summary

In the source file/storage connector, the input file format can be set in STEP 1 of the transfer settings, and detailed settings for the input file can be configured in the input options in STEP 2.
This page describes the input file formats and connectors supported by TROCCO, as well as various input option setting items.

Input file format

In STEP 1 of the Edit Transfer Settings, you can choose from the following
* CSV/TSV
* JSON Lines
* JSONPath
* LTSV
* Microsoft Excel (xlsx or xls file)
* XML (beta version)

About XML file transfer

The XML file format transfer function is available as a beta version.
Specifications are subject to change when the official version is released.

Connectors covered in this article

Setting items

Setting items vary depending on the input file format.

CSV/TSV

Embulk's csv-parser-plugin is used.

STEP2 Input Options

item namedefault valueDescription.
delimiter,Sets the delimiter character for CSV data.
quotation marks"You can set quotation marks.
escape character\blowEscape characters can be set.
String to be replaced by NULLNot setYou can set the string to be replaced by NULL.
If you select " Set", you can enter the string to be replaced.
Number of header lines to skip0Sets the number of rows to skip.
For example, if the first line is the item name and you do not want the item name line to be included in the transfer data, set 1.
Remove white space from value if no quotation marks are presentNo, sir.You can set whether or not whitespace should be removed from the value if it is not quoted.
Processing method for irregular quartsACCEPT_ONLY_RFC4180_ESCAPEDSets how to handle the presence of irregular quotes in quoted fields.
See quotes_in_quoted_fields incsv-parser-plugin for details.
comment line marker-If the character set here is at the beginning of a line, skip that line.
Processing rows with fewer columnsTreat as invalid recordIf you choose to process as invalid records, when there are records with insufficient columns, the processing of the corresponding rows will be skipped.
If NULL completion of missing columns is selected, when a record with missing columns exists, NULL values are completed and processing continues.
Processing rows with more columnsTreat as invalid recordIf you choose to process as invalid records, when a record exists that has an excess number of columns, the processing of the corresponding row is skipped.
If you choose to ignore columns, when there is a record with an excess number of columns, the process continues ignoring the overflowing columns.
Maximum amount of data that can be enclosed in quotation marks131072Sets the maximum amount of data (in bytes) that can be quoted.
If there is a larger amount of data than the value set here, the processing of that row is skipped.
Whether or not to abort the transfer if an invalid record is found.Abort transferIf you choose to abort the transfer, the transfer will be aborted when an invalid record is found.
If you choose to continue processing, when an invalid record is found, the NULL value is completed and processing continues.
Default time zoneUTCYou can set the time zone to be used in a timestamp-type column when the read data itself has no information about the time zone.
Date Initialization1970-01-01In the date column, you can set a default value if the date is not recognized.
new lineCRLFSet the rules for line breaks among CRLF, LF, and CR.
character encoding-You can set the character encoding method.
If not entered, it will be automatically guessed during automatic data setup.

JSON Lines

embulk-parser-jsonl.

STEP2 Input Options

item namedefault valueDescription.
Whether or not to abort the transfer if an invalid record is found.Abort transferIf you choose to abort the transfer, the transfer will be aborted when an invalid record is found.
If you choose to continue processing, when an invalid record is found, the NULL value is completed and processing continues.
Default time zoneUTCYou can set the time zone to be used in a timestamp-type column when the read data itself has no information about the time zone.
new lineCRLFYou can set rules for line breaks among CRLF, LF, and CR.
character encoding-You can set the character encoding method.
If not entered, it will be automatically guessed during automatic data setup.

JSONPath

embulk-parser-jsonpath.

STEP1 Basic settings

item namedefault valueDescription.
JSONPath-See Operators for information on how to write JSONPath.
*To specify all, specify $. *If you want to specify all, specify $.

STEP2 Input Options

item namedefault valueDescription.
root-This is the same setting as "JSONPath" in STEP1 Basic Settings.
Default time zoneUTCYou can set the time zone to be used in a timestamp-type column when the read data itself has no information about the time zone.

LTSV

STEP2 Input Options

item namedefault valueDescription.
new lineCRLFYou can set rules for line breaks among CRLF, LF, and CR.
character encoding-You can set the character encoding method.
If not entered, it will be automatically guessed during automatic data setup.

Microsoft Excel

embulk-parser-poi_excel.

STEP1 Basic settings

item namedefault valueDescription.
Sheet name-Enter the name of the sheet to be transferred.
Number of header lines to skip1Sets the number of rows to skip.
For example, if the first line is the item name and you do not want the item name line to be included in the transfer data, set 1.
Column Setting-Set the column name, column type, and method of retrieving values.
How to get the value,
  • If you choose to use cash
    • The transfer process is accelerated because the calculation results are used at the time the Excel file is saved locally.
  • When recalculation on transfer is selected
    • Recalculate formulas in cells during transfer.
Time zone for date and time columnsAsia/TokyoYou can set the time zone to be used for timestamp type columns.
Function with different calculation result each time

If " Use cache" is selected in the column settings, the values at the time the Excel file is saved locally will be used for the transfer.
Therefore, functions that change the result of calculation each time, such as date/time functions (e.g., TODAY function) and random number generation functions (e.g., RAND function), are not recalculated when the transfer is executed. Please note

Select Recalculate on Transfer if the formulas in the cells need to be recalculated when the transfer is executed.

XML (beta version)

embulk-parser-xml.

STEP1 Basic settings

item namedefault valueDescription.
Path of the root (XPath format)-Enter the root path to the node on which the Xpath search will be based.
If you want all nodes in the XML file to be searched, enter /.
Column Setting-Enter the path of the node you wish to retrieve in Xpath format.
For XPath notation, see XPath Syntax.
The column type can be selected from the following
  • string
  • long
  • timestamp
  • double
  • boolean
  • json
  • If timestamp is selected, also enter the format and time zone.
    About XML namespace

    You cannot specify a path in combination with a prefix defined in the XML namespace.


    Was this article helpful?