Skip to content

deps(cilium): upgrade Cilium to v1.19.3#2148

Open
nddq wants to merge 1 commit intomainfrom
cilium/v1.19-rebase
Open

deps(cilium): upgrade Cilium to v1.19.3#2148
nddq wants to merge 1 commit intomainfrom
cilium/v1.19-rebase

Conversation

@nddq
Copy link
Copy Markdown
Member

@nddq nddq commented Mar 30, 2026

Description

Upgrade Cilium from v1.18.0-pre.1 to v1.19.3 and adapt the codebase for all breaking API changes.

This PR is now a single squashed commit covering the dependency upgrade and all required adaptations. The DNS and tcpretrans plugin rewrites that were previously part of this branch have already been merged separately (#2152, #2153).

Changes by area

Logging (logrus → slog)

  • Add zap-backed slog.Handler and logr.Logger bridges in pkg/log/zap.go with thread-safe init via sync.Once
  • Migrate all DI-injected loggers from logrus.FieldLogger to *slog.Logger across agent, operator, and Hubble components
  • InitializeMetrics() now takes *slog.Logger with nil-safety guard
  • Wire SetDefaultSlog() at startup so slog.Default() routes through zap → Application Insights

Hubble control plane

  • Adapt to v1.19 Hive cell changes: remove ServiceCache cell (replaced by frontends table), update InitK8sSubsystem signature, replace NoOpOrchestrator with fakedp.FakeOrchestrator
  • Set IdentityUpdater in IPCache config to prevent nil-panic on label injection paths
  • Remove skipUnknownCGroupIDs parameter from Hubble parser constructor (removed in v1.19)
  • DI cell broken into sub-cells for fault isolation

Operator

  • Migrate CRD helpers to v1.19 API (CreateUpdateCRD, GetPregeneratedCRD now take *slog.Logger)
  • Replace local resource constructors with imports from cilium/operator/k8s behind _linux build tag
  • Add in-memory kvstore client for identity GC cell
  • Remove obsolete KVstore flag registrations

Build & CI

  • Update golangci-lint tool path and CI workflow for v2
  • Remove orphaned ITracer mock (pkg/plugin/common/mocks/)

Related Issue

Fixes #1788

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...).
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Testing Completed

  • Deployed to AKS cluster (retinaTest-v119) with Hubble mode, all 3 agent pods + operator healthy (1/1)
  • Application Insights telemetry verified: 536+ traces ingested via slog → zap → zapai pipeline, all severity levels (INFO/WARN/ERROR) flowing with correct structured fields (module, plugin, appInsightsID)
  • Prometheus metrics endpoint serving on :10093/metrics
  • All eBPF plugins started successfully: dropreason, dns, tcpretrans, packetforward, packetparser, linuxutil
  • Hubble relay and UI running with quay.io/cilium/ v1.19.3 images
  • go build ./..., make lint (0 issues), unit tests all pass

@nddq nddq requested a review from a team as a code owner March 30, 2026 00:43
@nddq nddq requested review from jimassa and mainred March 30, 2026 00:43
@nddq nddq force-pushed the cilium/v1.19-rebase branch 27 times, most recently from 0df8cbd to f1bcf41 Compare March 30, 2026 23:48
Copy link
Copy Markdown
Member

@SRodi SRodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for driving this @nddq I agree the IG incompatibility forces us to move DNS/tcpretrans off inspektor-gadget as part of getting to Cilium v1.19.2.

One suggestion to make this easier to review/debug (without overcomplicating rollout): could we split this into two PRs, but keep the plugin names the same and do a straight replacement (no dual-plugin naming / feature flags)?

Proposed split:

  1. PR: Replace the existing dns + tcpretrans plugins with the ebpf-native implementations and remove the inspektor-gadget dependency (keep external behavior/plugin names the same).
  2. PR: Upgrade github.com/cilium/cilium to v1.19.2 (and any required Go/tooling + Hubble/operator adaptation changes).

Rationale: this isolates the “plugin implementation change” from the broader Cilium/control-plane dependency churn, making review and failure attribution simpler. Rollback is still straightforward at the Retina release level if needed (we can do a release straight after the plugins change), so I don’t think we need to introduce extra flags or temporary plugin naming.

@nddq
Copy link
Copy Markdown
Member Author

nddq commented Mar 31, 2026

@SRodi Sounds good! I'll make that change

@nddq
Copy link
Copy Markdown
Member Author

nddq commented Apr 1, 2026

@SRodi let's try and get this in first #2049 to give us the ebpf testing framework for the other two plugins

github-merge-queue bot pushed a commit that referenced this pull request Apr 9, 2026
# Description

**PR 2 of 2** for removing inspektor-gadget from retina plugins (split
from #2148).

Replaces the IG-based DNS tracer with a native eBPF socket filter and
adds comprehensive kernel-level and Go unit tests.

- **BPF program** (`dns.c`): Socket filter on AF_PACKET captures DNS
queries/responses (UDP + TCP, IPv4 + IPv6). Uses `bpf_skb_load_bytes`
helpers compatible with `BPF_PROG_TEST_RUN`. Query type extracted in BPF
via QNAME walking.
- **Per-interface dedup filter**: Socket binds to all interfaces
(`ifindex=0`) with a BPF-side filter that only captures `PACKET_HOST` on
non-default interfaces (veths) while capturing both directions on the
default interface (eth0). This avoids double-counting while preserving
visibility into same-node pod-to-CoreDNS and NodeLocal DNS traffic.
- **Go plugin** (`dns_linux.go`): Perf buffer event loop with gopacket
DNS parsing. Populates `dns_config` BPF map with default interface index
at Init. No runtime compilation — pre-compiled via `bpf2go@v0.18.0`.
- **Test infrastructure** (`ebpftest/packet.go`):
`TCPPacketOpts.Payload`, IPv6 `BuildUDPPacket`, DNS packet builders
(`BuildDNSQueryPacket`, `BuildDNSResponsePacket`,
`BuildDNSTCPQueryPacket`).
- **eBPF + unit tests**: 28 tests covering IPv4/IPv6, TCP/UDP, mDNS,
edge cases, counter deltas, mock enricher, and `parseDNSPayload`.

## Related Issue

Partial fix for #1788
Split from #2148

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [x] I signed and signed-off the commits (`git commit -S -s ...`).
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Testing Completed

All 28 tests pass (14 eBPF + 14 unit). Validated on AKS cluster
(`retinaTest-v119`, 3-node, k8s v1.33.6):

| Scenario | Result |
|---|---|
| Pod → CoreDNS same node | Captured, count=1 per query |
| Pod → CoreDNS different node | Captured, count=1 per query |
| NodeLocal DNS (169.254.20.10) | Both hops captured (pod→cache,
cache→CoreDNS) — expected |
| Multi-NIC | Not tested (single-NIC AKS nodes) |
| Double-counting | None detected |

Hubble and standard mode DNS metrics flowing with correct pod-level
labels.

## Additional Notes

- After both plugin PRs merge, inspektor-gadget can be fully removed in
the Cilium v1.19 upgrade PR
- The old IG tracer used `ifindex=0` with no dedup, which caused
double-counting on multi-interface paths. The new BPF dedup filter
resolves this while maintaining same-node and NodeLocal DNS visibility.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2026
# Description

**PR 1 of 2** for removing inspektor-gadget from retina plugins (split
from #2148 per review feedback)

Cilium v1.19.2 pins `cilium/ebpf` at a version that removed
`CollectionSpec.RewriteConstants()`, which inspektor-gadget relies on.
Upgrading IG is not viable — v0.42.0+ removed the built-in
`trace/tcpretrans` gadget entirely. This PR replaces the IG-based
implementation with a native cilium/ebpf one ahead of the Cilium
upgrade.

- **BPF program**: New `tracepoint/tcp/tcp_retransmit_skb` program
(kernel 5.8+ for `bpf_ktime_get_boot_ns`) with `BPF_CORE_READ` for CO-RE
field relocation. Handles the kernel 6.17 tracepoint struct rename via
CO-RE type flavors. TCP flags read from `tcp_skb_cb`. Supports IPv4 and
IPv6.
- **Go plugin**: Perf buffer event loop with worker goroutines,
replacing the IG tracer/gadget-context pattern. `SetupChannel` now wires
up the external channel (was a no-op under IG), enabling the advanced
metrics module to receive events directly.
- **Build**: Pre-compiled via `bpf2go@v0.18.0` with per-arch targets
(amd64 + arm64) and embedded in the binary — no runtime compilation.
`Compile()` and `Generate()` retained as no-ops for plugin manager
lifecycle compatibility.
- **`ToFlow` IPv6 labeling** (`pkg/utils/flow_utils.go`): now derives
`IpVersion` from `sourceIP.To4()` instead of hardcoding
`IPVersion_IPv4`, so IPv6 retransmission flows are labeled correctly.
Backward-compatible for every existing IPv4 caller.
- **No dependency changes**: Works with existing `cilium/ebpf` v0.18.0
in `go.mod`.

## Related Issue

Partial fix for #1788
Split from #2148
Related: #2162 (chart fix surfaced while validating this PR — latent
helm-upgrade reload bug for the operator ConfigMap)

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [x] I signed and signed-off the commits (`git commit -S -s ...`).
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Testing Completed

### Standard / Advanced mode (AKS cluster `retinaTest-v119`, 3-node, k8s
v1.33.6, kernel 5.15.0-1102-azure)

Image:
`acnpublic.azurecr.io/microsoft/retina/retina-agent:f4adc1e0-fix1-linux-amd64`

Plugin initialized and attached on all three agent pods:

```
level=info caller=common/common_linux.go:79 msg="perf reader created" Map=PerfEventArray(retina_tcpretrans_events)#100 PageSize=4096 BufferSize=65536
level=info caller=tcpretrans/tcpretrans_linux.go:106 msg="tcpretrans plugin initialized"
level=info caller=pluginmanager/pluginmanager.go:174 msg="starting plugin tcpretrans"
```

`BufferSize=65536` = 4096 × 16 pages, confirming the starting buffer
size is live (down from an initial oversized default).

Advanced metric flowing with real pod-level labels, counter incrementing
in real time:

```
networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.1.183",namespace="kube-system",podname="konnectivity-agent-5858c6b5d7-882m2"} 6 → 7 → 9
networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.2.41",namespace="kube-system",podname="metrics-server-b957f9d87-tg22p"} 1
```

Cross-checked against the node-level TCP stats from linuxutil on the
same pod:
`networkobservability_tcp_connection_stats{statistic_name="TCPLostRetransmit"}
= 55` — different data source (kernel TCP stats vs kernel tracepoint),
same order of magnitude.

### Advanced mode with remote context (source + destination labels)

With `remoteContext=true` and `destinationLabels` set on the
`tcp_retransmission_count` entry in the `MetricsConfiguration` CRD:

```
networkobservability_adv_tcpretrans_count{
  source_ip="10.224.1.183",source_namespace="kube-system",source_podname="konnectivity-agent-5858c6b5d7-882m2",
  destination_ip="51.143.116.92",destination_namespace="kubernetes-apiserver",destination_podname="kubernetes-apiserver",
  direction="EGRESS"
} 4
```

No plugin changes were needed for remote context — the rewrite already
populates both sides of the flow through `utils.ToFlow`, the enricher
handles both sides, and the metrics module is already wired to emit both
label sets when configured. Discovering this path is what led to filing
#2162.

### Build / vet / test

```bash
go build ./...                                                  # clean
go vet ./pkg/utils/... ./pkg/plugin/tcpretrans/...              # clean
gofumpt -l pkg/utils/flow_utils.go pkg/plugin/tcpretrans/       # clean
go test -tags=unit,dashboard ./pkg/plugin/... ./pkg/module/...  # all pass
```

All existing `utils.ToFlow` callers (dropreason, dns, packetparser,
latency_test.go) still pass — the `IpVersion` derivation is
backward-compatible for every IPv4 caller.

## Additional Notes

- PR 2 of 2 (#2153, `plugin/dns-ebpf`) rewrites the DNS plugin with the
same approach.
- After both merge, inspektor-gadget can be fully removed in the Cilium
v1.19 upgrade PR.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
@nddq nddq force-pushed the cilium/v1.19-rebase branch 6 times, most recently from 4be36a4 to bddee40 Compare April 14, 2026 18:08
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
@nddq nddq force-pushed the cilium/v1.19-rebase branch from bddee40 to 92771e9 Compare April 16, 2026 16:39
@nddq nddq changed the title deps(cilium): upgrade Cilium to v1.19.2 deps(cilium): upgrade Cilium to v1.19.3 Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upgrade cilium to stable 1.19

2 participants