About input file format settings
    • PDF

    About input file format settings

    • PDF

    Article summary

    summary

    In the Data Source File/Storage Connector, the input file format can be set in STEP 1 of ETL Configuration, and detailed settings for the input file can be configured in Input Option in STEP 2.
    This page describes the input file formats and Connectors supported by TROCCO, as well as the various Input Option setting items.

    Input file format

    In STEP 1 of ETL Configuration editing, you can choose from the following

    • CSV/TSV
    • JSON Lines
    • JSONPath
    • LTSV
    • Microsoft Excel (xlsx or xls file)
    • XML (beta version)
    About XML file transfer

    The XML file format transfer function is available as a beta version.
    Specifications are subject to change when the official version is released.

    Connectors covered in this article

    Setting items

    Setting items vary depending on the input file format.

    CSV/TSV

    Embulk's csv-parser-plugin is used.

    STEP2 Input Option

    item namedefault valueDescription.
    delimiter,CSV Data Setting
    quotation marks"You can set quotation marks.
    escape character\Escape characters can be set.
    String to be replaced by NULLNot setYou can set the string to be replaced by NULL.
    If you select "Set", you can enter the string to be replaced.
    Number of header rows to skip0Sets the number of rows to skip.
    For example, if the first line is the item name and you do not want the item name line to be included in the ETL Configuration Data Setting, set 1.
    Remove white space from value if no quotation marks are presentNo, sir.You can set whether or not whitespace should be removed from the value if it is not quoted.
    Processing method for irregular quartsACCEPT_ONLY_RFC4180_ESCAPEDSets how to handle the presence of irregular quotes in quoted fields.
    See quotes_in_quoted_fields incsv-parser-pluginfor details.
    comment line marker-If the character set here is at the beginning of a line, skip that line.
    Processing rows with fewer columnsTreat as invalid recordIf you choose to process as invalid records, when there are records with insufficient columns, the processing of the corresponding rows will be skipped.
    If NULL completion of missing columns is selected, when a record with missing columns exists, NULL values are completed and processing continues.
    Processing rows with more columnsTreat as invalid recordIf you choose to process as invalid records, when a record with an excess number of columns exists, the processing of the corresponding row is skipped.
    If you choose to ignore columns, when there is a record with an excess number of columns, the process continues ignoring the overflowing columns.
    Maximum amount of data that can be enclosed in quotation marks131072Maximum amount of Data Setting (in bytes) that can be quoted.
    If there is a larger amount of data than the value set here, the processing of that row is skipped.
    Whether or not to abort the transfer if an invalid record is found.Abort transferIf you choose to abort the transfer, the transfer will be aborted when an invalid record is found.
    If you choose to continue processing, when an invalid record is found, the NULL value is completed and processing continues.
    Default time zoneUTCTime zone can be set to be used in a timestamp type column when the read Data itself has no information about time zone.
    Date Initialization1970-01-01In the date column, you can set a default value if the date is not recognized.
    new lineCRLFSet the rules for line breaks among CRLF, LF, and CR.
    character encoding-You can set the character encoding method.
    If not entered, it will be automatically guessed during Automatic Data Setting.

    JSON Lines

    embulk-parser-jsonl.

    STEP2 Input Option

    item namedefault valueDescription.
    Whether or not to abort the transfer if an invalid record is found.Abort transferIf you choose to abort the transfer, the transfer will be aborted when an invalid record is found.
    If you choose to continue processing, when an invalid record is found, the NULL value is completed and processing continues.
    Default time zoneUTCTime zone can be set to be used in a timestamp type column when the read Data itself has no information about time zone.
    new lineCRLFYou can set rules for line breaks among CRLF, LF, and CR.
    character encoding-You can set the character encoding method.
    If not entered, it will be automatically guessed during Automatic Data Setting.

    JSONPath

    embulk-parser-jsonpath.

    STEP1 Basic settings

    item namedefault valueDescription.
    JSONPath-See Operators for information on how to write JSONPath.
    *To specify all, specify $. *If you want to specify all, specify $.

    STEP2 Input Option

    item namedefault valueDescription.
    root-This is the same setting as "JSONPath" in STEP1 Basic Settings.
    Default time zoneUTCTime zone can be set to be used in a timestamp type column when the read Data itself has no information about time zone.

    LTSV

    STEP2 Input Option

    item namedefault valueDescription.
    new lineCRLFYou can set rules for line breaks among CRLF, LF, and CR.
    character encoding-You can set the character encoding method.
    If not entered, it will be automatically guessed during Automatic Data Setting.

    Microsoft Excel

    embulk-parser-poi_excel.

    STEP1 Basic settings

    item namedefault valueDescription.
    Sheet name-Enter the name of the sheet to be transferred.
    Number of header rows to skip1Sets the number of rows to skip.
    For example, if the first line is the item name and you do not want the item name line to be included in the ETL Configuration Data Setting, set 1.
    Column Setting-Set the column name, column type, and method of retrieving values.
    How to get the value,
    • If you choose to use cash
      • The transfer process is accelerated because the results of calculations are used at the time the Excel Files are saved locally.
    • When recalculation on transfer is selected
      • Recalculate formulas in cells during transfer.
    Time zone for date and time columnsAsia/TokyoYou can set the time zone to be used for timestamp type columns.
    Function with different calculation result each time

    If you select "Use Cache" in the column configuration, ETL Configuration will use the values at the time the Excel Files are saved locally for ETL Configuration.
    Therefore, functions that change the result of calculation each time, such as date/time functions (e.g., TODAY function) and random number generation functions (e.g., RAND function), are not recalculated when the transfer is executed. Please note

    Select Recalculate on Transfer if the formulas in the cells need to be recalculated when the transfer is executed.

    XML (beta version)

    embulk-parser-xml.

    STEP1 Basic settings

    item namedefault valueDescription.
    Path of the root (XPath format)-Enter the root path to the node on which the Xpath search will be based.
    If you want all nodes in the XML file to be searched, enter /.
    Column Setting-Enter the path of the node you wish to retrieve in Xpath format.
    For XPath notation, see XPath Syntax.
    The column type can be selected from the following
  • string
  • long
  • timestamp
  • double
  • boolean
  • json
  • If timestamp is selected, also enter the format and time zone.
    About XML namespace

    You cannot specify a path in combination with a prefix defined in the XML namespace.


    Was this article helpful?