PostgreSQL to BigQuery Transfer (CDC)
  • 17 Jul 2024

Summary

Using the Change Data Capture (CDC) method based on PostgreSQL's WAL, only data that has changed is identified and transferred to BigQuery. This reduces transfer volume and speeds up transfers.
A full transfer of all records is required on the first run; after that, only the differences are transferred using the CDC method.

When using PostgreSQL CDC

  • You must create a replication slot yourself for each transfer setting (i.e., for each table transferred by CDC), following the procedure below.

    • Automatic creation and deletion of replication slots is not supported at this time.
    • Replication slots are not deleted automatically when a transfer setting is deleted. Be sure to delete the slot yourself.
    • If you forget to delete a replication slot, the WAL will accumulate, exhaust storage, and may cause the database to crash.
  • When using PostgreSQL CDC, make sure the database has sufficient free storage.

    • Monitoring storage size is recommended.

Prerequisites

PostgreSQL Support Table

Database                      Master         Read Replica   Supported Versions
Amazon Aurora PostgreSQL RDS  Supported      Not supported  10.6+
Amazon PostgreSQL RDS         Supported      Not supported  9.4+
Google CloudSQL PostgreSQL    Not supported  Not supported  —
Heroku                        Not supported  Not supported  —
PostgreSQL                    Supported      Not supported  9.4+

PostgreSQL Parameter Settings

rds.logical_replication = 1  # RDS only.
wal_level = logical          # Non-RDS only.
max_replication_slots = 5    # Set to the number of tables to be transferred.
max_wal_senders = 5          # Set to the number of tables to be transferred.

*A database restart may be required for the WAL settings to take effect.
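The current values can be confirmed before and after a restart with PostgreSQL's standard SHOW command (parameter names match those above):

```sql
-- Check the current WAL-related settings.
SHOW wal_level;              -- should return 'logical' (non-RDS)
SHOW max_replication_slots;
SHOW max_wal_senders;
```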

Slot Creation

  • A separate slot must be created for each table to be transferred (i.e., for each transfer setting).
    • After logging in as a SUPERUSER, create a slot using the SQL below.
      *Currently, only the wal2json output plugin is supported.
SELECT *
FROM pg_create_logical_replication_slot('trocco_<table_name>', 'wal2json');
  • The following SQL can be used to check if data can be retrieved from the Slot.
SELECT COUNT(*)
FROM pg_logical_slot_peek_changes('<replication_slot_name>', null, null);
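As noted above, slots are not deleted automatically when a transfer setting is removed. The slot can be dropped manually with PostgreSQL's standard function:

```sql
-- Drop the replication slot belonging to a deleted transfer setting.
-- Replace <table_name> with the name used when the slot was created.
SELECT pg_drop_replication_slot('trocco_<table_name>');
```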

Permissions for the connecting user

  • The following permissions are required:

CONNECT

Example:
GRANT CONNECT ON DATABASE <database_name> TO <username>;

USAGE

Example:
GRANT USAGE ON SCHEMA <schema_name> TO <username>;

SELECT

Example:
GRANT SELECT ON <schema_name>.<table_name> TO <username>;

rds_replication (RDS only)

Example:
GRANT rds_replication TO <username>;

ALTER DEFAULT PRIVILEGES
Ensures that SELECT is also granted on tables created in the schema after this statement runs.

Example:
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <username>;

Primary Key of the source table

  • Primary Key must be set for the source table.
    • Multiple Primary Keys (Composite Primary Key) are also supported.
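For illustration, a hypothetical source table that satisfies this requirement with a composite primary key:

```sql
-- Hypothetical example: a source table with a composite (multi-column) primary key.
CREATE TABLE order_items (
    order_id BIGINT,
    item_no  INTEGER,
    quantity INTEGER,
    PRIMARY KEY (order_id, item_no)  -- composite primary keys are supported
);
```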

Conversion function during transfer

PostgreSQL to BigQuery CDC transfer supports the following:

  • Dropping columns
  • Converting column data types
  • Renaming columns

Column data conversion is performed with CAST in BigQuery, so BigQuery's CAST rules apply.
For more information, see:
https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules?hl=ja
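For example, conversions behave exactly as a CAST in BigQuery would (SAFE_CAST, which returns NULL instead of raising an error, is shown only for comparison):

```sql
-- BigQuery: conversions follow the CAST rules linked above.
SELECT CAST('123' AS INT64);       -- 123
SELECT CAST(3.7 AS INT64);         -- 4 (rounds, per BigQuery rules)
SELECT SAFE_CAST('abc' AS INT64);  -- NULL instead of an error
```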

Metadata

  • The following column is added to the data loaded into the DWH, so it cannot be used as a column name:

    Column name         Description
    _trocco_fetched_at  Time the transfer was initiated by TROCCO's transfer job

  • The following columns also cannot be used as column names, because they are used when linking data to temporary tables:

    Column name      Description
    _trocco_deleted  Logical-deletion flag. When true, the record has been deleted.
    _trocco_seq      Guarantees the order of WAL entries during the transfer
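For example, to query only live rows in the destination (the table name below is hypothetical), the logical-deletion flag can be filtered in BigQuery:

```sql
-- BigQuery: exclude logically deleted records using the metadata column.
SELECT *
FROM `my_project.my_dataset.my_table`  -- hypothetical destination table
WHERE _trocco_deleted IS NOT TRUE;
```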

SQL support for the source PostgreSQL

  • INSERT
  • UPDATE
  • DELETE

Schema change (ALTER) behavior

Column addition
  SQL example: ALTER TABLE <table_name> ADD <column_name> <data_type>
  When schema tracking is enabled:
    Columns are added to BigQuery. The column type is determined by TROCCO.
    Note that if the column has a default value, please run a full transfer, since the default value is not reflected in BigQuery.
  When schema tracking is disabled:
    Added columns are not added to BigQuery.
    By adding the target column in TROCCO's transfer settings editor, the column will be added on the BigQuery side.
    Only records updated after the new column is configured will have their values captured. To capture all values, run a full transfer.

Column deletion
  SQL example: ALTER TABLE <table_name> DROP <column_name>
  When schema tracking is enabled:
    Deleted columns are not deleted from BigQuery.
  When schema tracking is disabled:
    Same behavior as when schema tracking is enabled.

Column type change
  SQL example: ALTER TABLE <table_name> ALTER <column_name> TYPE <data_type>
  When schema tracking is enabled:
    If the new type is castable from the old type, transfers continue. If an error occurs, a full transfer is required.
  When schema tracking is disabled:
    Same behavior as when schema tracking is enabled.

Column rename
  SQL example: ALTER TABLE <table_name> RENAME <old_column_name> TO <new_column_name>
  When schema tracking is enabled:
    A new column with the new name is added to BigQuery.
    Only records updated after the rename will have values in the new column.
    Values of the old column are not carried over to the new column; to capture everything, run a full transfer.
    The old column is not deleted from BigQuery.
  When schema tracking is disabled:
    The renamed column is not added on BigQuery.
    For updated records, the old column's value becomes NULL.

Unsupported SQL

  • The following SQL cannot be captured from the WAL, so data will not be transferred to the DWH correctly:
    • TRUNCATE
    • DROP
    • LOAD
    • RENAME
    • Cascading DELETEs
    • Cascading UPDATEs
    • ALTER TABLE ... SET DEFAULT
      (Already-transferred data that is NULL remains NULL and is not updated.)

PostgreSQL and BigQuery schema types

  • The data type conversion function can convert convertible data types to a specified type.
    • The default mappings are as follows:

PostgreSQL Data Types                                            BigQuery Data Type
CHARACTER VARYING, CHARACTER, CIDR, CITEXT, MACADDR, TEXT,
TIME WITH TIMEZONE, TIME WITHOUT TIMEZONE, UUID, TIME, INTERVAL  STRING
BOOLEAN                                                          BOOLEAN
BIGINT, BIGSERIAL, INTEGER, SERIAL, NUMERIC, DECIMAL,
SMALLINT, SMALLSERIAL                                            INTEGER
DOUBLE PRECISION, REAL, MONEY                                    FLOAT
DATE                                                             DATE
TIMESTAMP                                                        DATETIME
TIMESTAMP WITH TIMEZONE, TIMESTAMP WITHOUT TIMEZONE              TIMESTAMP
GEOGRAPHY, GEOMETRY, HSTORE, JSON, JSONB, POINT, TSRANGE,
TSTZRANGE, BOX, PATH                                             STRING (STRUCT is currently not supported.)

Configuration Flow

1. Create transfer settings

Select PostgreSQL WAL as the source and BigQuery as the destination, and create the transfer settings.

2. Perform the first full transfer

For the first run only, a full transfer is performed, loading the entire current PostgreSQL table into BigQuery.
On the "Execute New Job" screen, check the "Transfer All" checkbox and execute the job.

3. Set up a schedule for differential transfer

Register a schedule from the Schedule Setup screen.
Jobs will then run at the configured frequency and transfer only the differences.

When a full transfer is required

A full transfer is required in the following cases:

  • The initial transfer
  • A runtime type-conversion error occurred on BigQuery
  • You want to physically erase all data rather than logically delete it
    (delete the transferred table before executing)
  • SQL listed in "Unsupported SQL" has been executed
