MySQL to BigQuery Transfer (CDC)

Prev Next

summary

Only data that has changed can be identified and transferred to BigQuery using the CDC (Change Data Capture) method from MySQL's binlog.
This will result in reduced transfer volume and faster transfers.
Full Data Transfer must be done on the first run. Subsequently, in Incremental Data Transfer, only the Incremental Data Transfer is linked using the CDC method.

essential condition

MySQL Support Chart

Master Read Replica Supported Versions
Amazon Aurora MySQL RDS x 5.6.2+
Amazon MySQL RDS 5.6.2+
Google CloudSQL MySQL x 5.6.2+
MariaDB 5.6.2+
MySQL 5.6.2+

MySQL Parameter Settings

[mysqld]
binlog_format=ROW
binlog_row_image=FULL
expire_logs_days=7
binlog_expire_logs_seconds=604800
log_bin=mysql-binlog
log_slave_updates=1

The supplemental explanation is as follows

  • It may be necessary to restart the DB for the binlog settings to take effect.

  • Column: binlog_expire_logs_seconds
    MySQL 8.0.1 or later must be specified.
    Seven days is specified, but can be shortened.
    It must be sufficiently longer than the frequency of each transfer.

  • Column: expire_logs_days
    Set if MySQL 8.0.0 or lower.
    Seven days is specified, but can be shortened.
    It must be sufficiently longer than the frequency of each transfer.

  • Column: log_bin
    You are free to change it as you wish.

  • Column: log_slave_updates
    Set when retrieving binlog from Read Replica.

  • RDS MySQL
    The expire_logs_days should be set using the mysql.rds_set_configuration procedure.
    Reference: https://docs.aws.amazon.com/ja_jp/AmazonRDS/latest/UserGuide/USER_LogAccess.Concepts.MySQL.html

  • Cloud SQL
    Enable point-in-time recovery.
    Reference: https://cloud.google.com/sql/docs/mysql/backup-recovery/backups?authuser=2#tips-pitr

Authority of connecting users

The following three authorizations are required

SELECT

  • Get Data
  • Table schema information reference

Example: The

GRANT SELECT ON `<database_name>`.`<table_name>` to `<username>``@'%'`;

REPLICATION CLIENT、REPLICATION SLAVE

Example: The

GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO `<username>`@'%';

Primary Key in Data Source table

The Primary Key must be set in the ETL Configuration for the Data Source table.
Multiple Primary Keys (Composite Primary Key) are also supported.

Conversion function during transfer

MySQL to BigQuery CDC transfer supports the following

  • Column Drop
  • Column Data Conversion
  • Change column name

Data transformation of columns is CAST in BigQuery. So we follow BigQuery's CAST rules.
For more information, please see below.
https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules?hl=ja

metadata

A column named _trocco_fetched_at is added and linked to the DWH.
This column has as its value the time the transfer was initiated by TROCCO's ETL Job.
The string _trocco_fetched_at cannot be used as the column name due to the convenience of adding this column.

Data Source MySQL SQL support

  • INSERT
  • UPDATE
  • DELETE
    • Deleted data will be logically deleted on the DWH side.

SQL with limitations

  • ALTER TABLE ... RENAME COLUMN old_col_name TO new_col_name
    If schema tracking is enabled, data will be linked with the changed name.
    Columns that are already linked will not be deleted and will have a value of NULL.
  • ALTER TABLE ... ADD col_name date_type
    Only when schema tracking is enabled, the transfer is automatic.
    When adding columns, please configure them in the admin panel before adding MySQL columns.
    If the columns are already linked after they are added, data may be missing.
  • ALTER TABLE ... DROP col_name
    Data linked to the DWH will not be deleted and will be NULL after deletion.
  • ALTER TABLE ... MODIFY col_name column_definition
    If the changed type is type-changeable, it can continue to be transferred.
    If an error occurs, a Full Data Transfer must be performed.

Unsupported SQL

The following SQL cannot be caught by binlog, so the data will not be transferred correctly to the DWH.

  • TRUNCATE
  • DROP
  • LOAD
  • RENAME
  • CASCADING DELETES
  • CASCADING UPDATES
  • ALTER TABLE ... SET DEFAULT
    If the data is already linked and is NULL, it will remain NULL and will not be updated.

MySQL and BigQuery schema types

The data type conversion function allows you to convert data types that can be changed to the type you specify.
The default response is as follows Binary data is converted to hexadecimal and treated as a STRING.

MySQL Data Types BigQuery Data Types
BINARY
BLOB
LONGBLOB
MEDIUMBLOB
TINYBLOB
VARBINARY
CHAR
ENUM
LONGTEXT
MEDIUMTEXT
SET
TIME
TINYTEXT
VARCHAR
STRING
BIT
TINYINT
BOOLEAN
BIGINT
INT
MEDIUMINT
SMALLINT
YEAR
INTEGER
DECIMAL
NUMERIC
DOUBLE
FLOAT
FLOAT
DATE DATE
TIMESTAMP
DATETIME
TIMESTAMP
GEOMETRY
GEOMETRYCOLLECTION
JSON
LINESTRING
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
POINT
POLYGON
STRING (STRUCT is currently not supported.)
CURVE
MULTICURVE
MULTISURFACE
SURFACE Not supported.

Configuration Flow

1. create ETL Configuration

Select Data Source MySQL binlog and Data Destination BigQuery and create ETL Configuration.
image.png

*Be sure to press "Save BigQuery Column Settings" in "STEP2 Data Preview and Advanced Settings" to save the settings.

2. initial Full Data Transfer

For the first time only, Full Data Transfer is performed to transfer all the current MySQL tables to BigQuery.
On the Run New Job screen, check the Do Full Data Transfer checkbox and run the job.
image.png

3. Incremental Data Transfer Schedule Configuration

Please register your schedule from the Schedule Setup screen.
A Job will be started at the Job Setting frequency and the differences will be captured.

If Full Data Transfer is required

Full Data Transfer is required in the following cases

  • Initial Transfer
  • Runtime error on type conversion on BigQuery.
  • I want to erase all data, not logical deletion.
    Delete the transferred table before executing.
  • If you execute the query described in "Data Source MySQL SQL Response

technical overview

Overview of the technical aspects of CDC
target table: the table where the final data will be placed.

  1. TROCCO reads the MySQL binlog and transfers the binlog of the configured table onto the BQ.
    image.png

  2. With binlog data, there are duplicates, so we create a set of the most recent records for each record.
    image.png

  3. Merge the tables created in step 2 with the tables already linked.
    image.png

  4. Replace the table merged in 3 with the target table.
    image.png