PostgreSQL to BigQuery Transfer (CDC)


Summary

Using the Change Data Capture (CDC) method based on PostgreSQL's WAL (Write-Ahead Log), only data that has changed is identified and transferred to BigQuery. This reduces transfer volume and speeds up transfers.
A Full Data Transfer must be performed on the first run. After that, Incremental Data Transfers load only the changed data using the CDC method.

When using PostgreSQL CDC

  • You must create a replication slot yourself for each ETL Configuration (i.e., for each table transferred via CDC) by following the instructions below.

    • Automatic creation and deletion of replication slots is not supported at this time.
    • Replication slots are not deleted automatically when an ETL Configuration is deleted. Be sure to delete the slot yourself.
    • If you forget to delete a replication slot, retained WAL will fill up storage and may cause the database to crash.
  • When using PostgreSQL CDC, make sure there is enough free database storage.

    • Monitoring storage size is recommended.
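
Since slots are never cleaned up automatically, the slot belonging to a deleted ETL Configuration should be dropped by hand. A minimal sketch, assuming the slot was created with the trocco_<table_name> naming convention described below:

```sql
-- List existing replication slots to find the one to remove.
SELECT slot_name, active FROM pg_replication_slots;

-- Drop the slot that is no longer needed (run as SUPERUSER).
SELECT pg_drop_replication_slot('trocco_<table_name>');
```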

Prerequisites

PostgreSQL Support Table

Database                      Master  Read Replica  Supported Versions
Amazon Aurora PostgreSQL RDS  ✓       x             10.6+
Amazon PostgreSQL RDS         ✓       x             9.4+
Google CloudSQL PostgreSQL    x       x             -
Heroku                        x       x             -
PostgreSQL                    ✓       x             9.4+

PostgreSQL Parameter Settings

rds.logical_replication = 1   # Set only on RDS.
wal_level = logical           # Set on non-RDS instances.
max_replication_slots = 5     # Set to the number of tables to be linked.
max_wal_senders = 5           # Set to the number of tables to be linked.

*A database restart may be required for the WAL settings to take effect.
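
The current values can be verified without a restart; a minimal check via the standard pg_settings view:

```sql
-- Verify the replication-related parameters before creating slots.
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders');
```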

Slot Creation

  • A separate slot must be created for each table to be linked (i.e., for each ETL Configuration).
    • After logging in as a SUPERUSER, create a slot using the SQL below.
      *Currently, only wal2json is supported.
SELECT *
FROM pg_create_logical_replication_slot('trocco_<table_name>', 'wal2json');
  • The following SQL can be used to check if data can be retrieved from the Slot.
SELECT COUNT(*)
FROM pg_logical_slot_peek_changes('<replication_slot_name>', null, null);
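
To keep an eye on WAL retention (see the storage warning above), the following query, assuming PostgreSQL 10 or later, shows how much WAL each slot is holding back:

```sql
-- Amount of WAL retained by each replication slot (PostgreSQL 10+).
-- On 9.4-9.6, use pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) instead.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```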

Permissions for the connecting user

  • The following permissions are required:

CONNECT

Example:
GRANT CONNECT ON DATABASE <database_name> TO <username>;

USAGE

Example:
GRANT USAGE ON SCHEMA <schema_name> TO <username>;

SELECT

Example:
GRANT SELECT ON <schema_name>.<table_name> TO <username>;

rds_replication (RDS only)

Example:
GRANT rds_replication TO <username>;

ALTER DEFAULT PRIVILEGES
Grants SELECT on tables created in the schema after this statement runs, so newly added tables remain accessible.

Example:
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <username>;
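
Putting the grants above together, a consolidated setup script might look like this (database, schema, table, and user names are placeholders):

```sql
-- Minimum privileges for the connecting user (placeholder names).
GRANT CONNECT ON DATABASE <database_name> TO <username>;
GRANT USAGE ON SCHEMA <schema_name> TO <username>;
GRANT SELECT ON <schema_name>.<table_name> TO <username>;
GRANT rds_replication TO <username>;  -- RDS only
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <username>;
```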

Primary Key in Data Source table

  • The Primary Key must be set in the ETL Configuration for the Data Source table.
    • Composite (multi-column) primary keys are also supported.

Conversion function during transfer

PostgreSQL to BigQuery CDC transfers support the following:

  • Column Drop
  • Column Data Conversion
  • Change column name

Column data conversion is performed with CAST in BigQuery, so BigQuery's CAST rules apply.
For more information, see:
https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules?hl=ja
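
As an illustration of those CAST rules (not TROCCO-specific), a conversion that fails at runtime in BigQuery raises an error unless SAFE_CAST is used:

```sql
-- BigQuery CAST follows the conversion rules linked above.
SELECT CAST('123' AS INT64);       -- succeeds: 123
SELECT SAFE_CAST('abc' AS INT64);  -- returns NULL instead of raising an error
```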

Metadata

  • The following column is added to the data linked to the DWH, so it cannot be used as a column name.

Column name         Description
_trocco_fetched_at  Time the transfer was initiated by TROCCO's ETL Job.

  • The following columns also cannot be used as column names, because they are used when linking data to temporary tables.

Column name      Description
_trocco_deleted  Logical deletion flag. When true, the data has been deleted.
_trocco_seq      Guarantees the order of WAL records during the transfer.
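
Because deletions are logical, downstream queries usually need to filter on _trocco_deleted. A hypothetical BigQuery query (the table name is a placeholder):

```sql
-- Exclude logically deleted rows when reading the transferred table.
SELECT *
FROM `<project>.<dataset>.<table>`
WHERE _trocco_deleted = FALSE;
```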

Data Source PostgreSQL SQL support

  • INSERT
  • UPDATE
  • DELETE

Schema change (ALTER) behavior

Column addition
  SQL example: ALTER TABLE <table_name> ADD <column_name> <data_type>
  When schema tracking is enabled: The column is added to BigQuery. Its type is determined by TROCCO. Note that if the column has a default value, perform a Full Data Transfer, since the default value is not reflected in BigQuery.
  When schema tracking is disabled: The column is not added to BigQuery. Adding the target column in the ETL Configuration edit screen in TROCCO adds the column on the BigQuery side. Only records updated after the column is configured will have their values captured; a Full Data Transfer is required to capture all values.

Column deletion
  SQL example: ALTER TABLE <table_name> DROP <column_name>
  When schema tracking is enabled: Deleted columns are not deleted from BigQuery.
  When schema tracking is disabled: Same behavior as when schema tracking is enabled.

Column type change
  SQL example: ALTER TABLE <table_name> ALTER <column_name> TYPE <data_type>
  When schema tracking is enabled: If the new type is convertible, transfers continue as before. If an error occurs, a Full Data Transfer must be performed.
  When schema tracking is disabled: Same behavior as when schema tracking is enabled.

Column rename
  SQL example: ALTER TABLE <table_name> RENAME <old_column_name> TO <new_column_name>
  When schema tracking is enabled: A column with the new name is added to BigQuery. Only records updated after the rename receive values in the new column; values from before the rename are not carried over, so a Full Data Transfer must be performed to capture everything. The old column is not deleted from BigQuery.
  When schema tracking is disabled: The renamed column is not added on BigQuery. For updated records, the old column is set to NULL.

Unsupported SQL

  • The following SQL cannot be captured from the WAL, so data will not be transferred to the DWH correctly.
    • TRUNCATE
    • DROP
    • LOAD
    • RENAME
    • Cascading DELETEs
    • Cascading UPDATEs
    • ALTER TABLE ... SET DEFAULT
      Data that has already been linked with a NULL value remains NULL and is not updated.
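
For example, since TRUNCATE never appears in the logical WAL stream, removing all rows in a way that does reach BigQuery requires row-level DELETEs, which are supported and arrive as _trocco_deleted = true:

```sql
-- TRUNCATE TABLE <table_name>;  -- NOT captured: BigQuery keeps the rows.
DELETE FROM <table_name>;        -- Captured: rows are flagged as logically deleted.
```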

PostgreSQL and BigQuery schema types

  • With the data type conversion function, convertible data types can be converted to a specified type.
    • The default mapping is as follows.
PostgreSQL Data Types → BigQuery Data Types
CHARACTER VARYING, CHARACTER, CIDR, CITEXT, MACADDR, TEXT, TIME WITH TIMEZONE, TIME WITHOUT TIMEZONE, UUID, TIME, INTERVAL → STRING
BOOLEAN → BOOLEAN
BIGINT, BIGSERIAL, INTEGER, SERIAL, NUMERIC, DECIMAL, SMALLINT, SMALLSERIAL → INTEGER
DOUBLE PRECISION, REAL, MONEY → FLOAT
DATE → DATE
TIMESTAMP → DATETIME
TIMESTAMP WITH TIMEZONE, TIMESTAMP WITHOUT TIMEZONE → TIMESTAMP
GEOGRAPHY, GEOMETRY, HSTORE, JSON, JSONB, POINT, TSRANGE, TSTZRANGE, BOX, PATH → STRING (STRUCT is currently not supported.)

Configuration Flow

1. Create an ETL Configuration

Select Data Source PostgreSQL WAL and Data Destination BigQuery, then create the ETL Configuration.

2. Perform the initial Full Data Transfer

For the first run only, a Full Data Transfer is performed to copy the current contents of the PostgreSQL table to BigQuery.
On the Run New Job screen, check the Do Full Data Transfer checkbox and run the job.

3. Schedule the ETL Configuration for Incremental Data Transfer

Register a schedule on the Schedule Setup screen.
A Job will start at the configured frequency and capture the differences.

If Full Data Transfer is required

Full Data Transfer is required in the following cases:

  • On the initial transfer
  • When a runtime error occurs during type conversion on BigQuery
  • When you want to physically delete all data, rather than rely on logical deletion
    Delete the transferred table before executing the Full Data Transfer.
  • When SQL listed under "Unsupported SQL" has been executed