PostgreSQL to BigQuery Transfer (CDC)
  • 07 Dec 2022

Note

This is a machine-translated version of the original Japanese article.
Please understand that some of the information contained on this page may be inaccurate.

Summary

Using the CDC (Change Data Capture) method, only changed data identified from PostgreSQL's WAL is transferred to BigQuery. This reduces the transfer volume and speeds up transfers.
On the first run, you must transfer all records. After that, only the differences are transferred using the CDC method.

Using PostgreSQL CDC

  • For each transfer setting (that is, for each table transferred by CDC), you must create a dedicated replication slot by following the procedure below.

    • Automatic creation and deletion of replication slots is currently not supported.
    • If you delete a transfer setting, its replication slot is not deleted automatically. Do not forget to delete it yourself.
    • If you forget to delete a replication slot, the retained WAL can fill up the storage and cause the DB to go down.
  • When using PostgreSQL CDC, make sure the DB has sufficient storage.

    • We recommend monitoring your storage size.
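As a sketch of such monitoring (on PostgreSQL 10 or later; the slot name is a placeholder), the WAL retained by each replication slot can be checked from `pg_replication_slots`, and a slot that is no longer needed can be dropped:

```sql
-- How much WAL each replication slot is holding back (PostgreSQL 10+).
-- On 9.4-9.6, use pg_current_xlog_location() / pg_xlog_location_diff() instead.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;

-- After deleting a transfer setting, drop its slot so WAL stops accumulating.
SELECT pg_drop_replication_slot('trocco_<table_name>');
```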

Prerequisites

Supported PostgreSQL environments


Service | Master | Read Replica | Supported Versions
Amazon Aurora PostgreSQL RDS | ✓ | ✗ | 10.6+
Amazon PostgreSQL RDS | ✓ | ✗ | 9.4+
Google CloudSQL PostgreSQL | ✗ | ✗ | –
Heroku | ✗ | ✗ | –
PostgreSQL | ✓ | ✗ | 9.4+

PostgreSQL parameter settings

rds.logical_replication = 1  # Set this only on RDS.
wal_level = logical          # Set this on non-RDS environments.
max_replication_slots = 5    # Set this to the number of tables to be transferred.
max_wal_senders = 5          # Set this to the number of tables to be transferred.

* A DB restart may be required for the WAL settings to take effect.
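To confirm the values that are actually in effect after a restart, the current settings can be read back from `pg_settings` (a minimal check; on RDS, `rds.logical_replication` is set via the parameter group instead):

```sql
-- Verify that the replication-related parameters took effect.
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders');
```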

Creating a Slot

  • You need to create a separate slot for each table to be transferred (for each transfer setting).
    • Log in as a SUPERUSER, then create the slot with the following SQL.
      * Currently, only wal2json is supported.
SELECT *
FROM pg_create_logical_replication_slot('trocco_<table_name>', 'wal2json');
  • You can check whether data can be retrieved from Slot with the following SQL.
SELECT COUNT(*)
FROM pg_logical_slot_peek_changes('<replication_slot_name>', null, null);

Permissions for the connecting user

  • The following permissions are required:

CONNECT

Example:
GRANT CONNECT ON DATABASE <database_name> TO <username>;

USAGE

Example:
GRANT USAGE ON SCHEMA <schema_name> TO <username>;

SELECT

Example:
GRANT SELECT ON <schema_name>.<table_name> TO <username>;

rds_replication (RDS only)

Example:
GRANT rds_replication TO <username>;

ALTER DEFAULT PRIVILEGES
Ensures that tables created in the schema after the GRANT are also readable.

Example:
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <username>;
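Putting the grants above together, a minimal setup for the connecting user might look like the following sketch; `analytics`, `public`, `orders`, and `etl_user` are placeholder names, not values required by the product:

```sql
-- Placeholder names: analytics (database), public (schema),
-- orders (table), etl_user (connecting user).
GRANT CONNECT ON DATABASE analytics TO etl_user;
GRANT USAGE ON SCHEMA public TO etl_user;
GRANT SELECT ON public.orders TO etl_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO etl_user;
-- RDS only:
-- GRANT rds_replication TO etl_user;
```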

Primary Key of the source table

  • The source table must have a Primary Key.
    • Composite (multi-column) Primary Keys are also supported.
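To find tables that do not yet meet this requirement, a query along these lines lists the tables in a schema that lack a primary key (`public` is a placeholder schema name):

```sql
-- Tables in the schema that lack a PRIMARY KEY constraint.
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
  ON  c.table_schema    = t.table_schema
  AND c.table_name      = t.table_name
  AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'public'
  AND t.table_type   = 'BASE TABLE'
  AND c.constraint_name IS NULL;
```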

Conversion function during transfer

Metadata

  • The following column is added when loading into the DWH, so its name cannot be used as a source column name:

Column Name | Description
_trocco_fetched_at | The time the trocco transfer job started the transfer

  • In addition, the following columns are used for loading into a temporary table, so their names cannot be used either:

Column Name | Description
_trocco_deleted | Soft delete flag; true means the record has been deleted
_trocco_seq | Guarantees WAL ordering during the transfer
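Because deletes arrive on the BigQuery side as soft deletes, downstream queries usually need to filter them out. A minimal example, where `my_dataset.my_table` is a placeholder for the destination table:

```sql
-- BigQuery: read only rows that have not been soft-deleted.
SELECT *
FROM my_dataset.my_table
WHERE _trocco_deleted IS NOT TRUE;
```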

Supported source PostgreSQL SQL statements

  • INSERT
  • UPDATE
  • DELETE

Behavior during schema change (ALTER)

For each schema change, the SQL example and the behavior with schema tracking enabled or disabled are as follows.

Add column (ALTER TABLE ... ADD col_name data_type)

  • When schema tracking is enabled: The column is added to BigQuery, using the column type determined by trocco. If the column has a default value, the default is not reflected in BigQuery, so please transfer all records.
  • When schema tracking is disabled: The column is not added to BigQuery. By adding the target column in trocco's transfer-setting edit screen, the column will be added on the BigQuery side. Only records updated after the column setting was added are populated; to import all values, transfer all records.

Delete column (ALTER TABLE ... DROP col_name)

  • When schema tracking is enabled: Deleted columns are not removed from BigQuery.
  • When schema tracking is disabled: Same behavior as when schema tracking is enabled.

Change column type (ALTER TABLE ... ALTER COLUMN col_name TYPE data_type)

  • When schema tracking is enabled: If the new type is compatible with the old one, transfers continue to work. If an error occurs, you need to transfer all records.
  • When schema tracking is disabled: Same behavior as when schema tracking is enabled.

Rename column (ALTER TABLE ... RENAME COLUMN old_col_name TO new_col_name)

  • When schema tracking is enabled: A new column with the new name is added to BigQuery. Only records updated after the rename are populated with the new column's value; values from before the rename are not carried over to the new column. To import everything, transfer all records. The column with the old name is not deleted from BigQuery.
  • When schema tracking is disabled: The renamed column is not added to BigQuery. The old column has NULL for records updated after the rename.

Unsupported SQL

  • The SQL statements below cannot be captured from the WAL, so data changed by them is not transferred correctly to the DWH.
    ・TRUNCATE
    ・DROP
    ・LOAD
    ・RENAME
    ・CASCADING DELETES
    ・CASCADING UPDATES
    ・ALTER TABLE ... SET DEFAULT
      For this statement, if already-linked data is NULL, it remains NULL and is not updated with the new default.

PostgreSQL and BigQuery schema types

  • The data type conversion function can convert convertible columns to a specified type.
    • By default, the following mappings are used.

PostgreSQL data types | BigQuery data type
CHARACTER VARYING, CHARACTER, CIDR, CITEXT, MACADDR, TEXT, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE, UUID, TIME, INTERVAL | STRING
BOOLEAN | BOOLEAN
BIGINT, BIGSERIAL, INTEGER, SERIAL, NUMERIC, DECIMAL, SMALLINT, SMALLSERIAL | INTEGER
DOUBLE PRECISION, REAL, MONEY | FLOAT
DATE | DATE
TIMESTAMP | DATETIME
TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE | TIMESTAMP
GEOGRAPHY, GEOMETRY, HSTORE, JSON, JSONB, POINT, TSRANGE, TSTZRANGE, BOX, PATH | STRING (STRUCT is not currently supported.)

Setup flow

1. Create the transfer settings

Select PostgreSQL WAL as the source and BigQuery as the destination, and create the transfer settings.

2. Perform the first transfer of all records

For the first run only, transfer all records so that the current contents of the PostgreSQL tables are fully loaded into BigQuery.
On the New Job Execution screen, check the Transfer all records box and run the job.

3. Schedule the differential transfer

Register the schedule from the schedule setting screen.
The job runs at the frequency you set, and the differences are captured.

When a full transfer is required

A transfer of all records is required in the following cases. If you encounter any other errors, please contact Trocco Support.

  • The first transfer
  • A type conversion runtime error occurs in BigQuery
  • You want to physically erase data instead of using soft delete
    Delete the transferred table before running the job.
  • A query described in "Unsupported SQL" has been executed
