Skip to content

[Bug]: MySQL GTID mode results in null _metadata_log_file and _metadata_log_position in BigQuery, making record ordering impossible #3554

@chiakitakeda

Description

@chiakitakeda

Related Template(s)

Cloud Datastream to BigQuery (DataStream_to_BigQuery / job label: cloud_datastream_to_bigquery)

Template Version

2025-11-11-00_rc00

What happened?

Expected behavior

When using MySQL Datastream in GTID mode, _metadata_log_file and _metadata_log_position should be populated in BigQuery — the same as in binlog mode — so that the order of records can be determined for the MERGE operation.

Actual behavior

When MySQL Datastream operates in GTID mode, the Avro source_metadata schema differs from binlog mode. The template hardcodes "log_file" and "log_position" as the field names to look up (L150-151), but in GTID mode Avro, these fields are named "binlog_file" and "binlog_position". Because getSourceMetadata() returns null when the field is absent from the schema (L243-L251), both _metadata_log_file and _metadata_log_position are written as null in BigQuery.

Binlog mode Avro:

{
  "read_method": "mysql-cdc-binlog",
  "source_metadata": {
    "log_file": "mysql-bin-changelog.072901",
    "log_position": 33822
  }
}

GTID mode Avro:

{
  "read_method": "cdc",
  "source_metadata": {
    "gtid": "92e3e612-66e2-11f0-a5a7-0e004f023b83:22073",
    "binlog_file": "mysql-bin-changelog.073728",
    "binlog_position": 147862
  }
}

In other words, the ordering metadata is present in GTID-mode Avro, but the template does not extract it because it only supports the binlog-mode field names.

Impact

  • _metadata_log_file and _metadata_log_position are always null for records ingested via GTID mode
  • When multiple CDC changes for the same key occur within the same timestamp boundary, the template cannot use binlog ordering to deterministically rank them, which can break deduplication and upsert correctness in the generated MERGE logic.

Suggested fix

When source_metadata.log_file / source_metadata.log_position are absent, the template should also check for:

  • source_metadata.binlog_file
  • source_metadata.binlog_position

and map them to:

  • _metadata_log_file
  • _metadata_log_position

This would preserve ordering semantics for MySQL GTID-mode Datastream sources.

Note

Note: The related issue where GTID mode triggers the Oracle default sort branch (_metadata_scn) can be mitigated by setting the datastreamSourceType=mysql override parameter.
However, even with that workaround applied, _metadata_log_file and _metadata_log_position remain null. This report focuses on that remaining problem.

Relevant log output

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingp2

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions