Rasmus Olsson

3 workflows to diagnose a Kubernetes Pod

May 07, 2025

When something goes wrong in production (memory growth, CPU spikes, stuck threads, networking issues), you often need diagnostics from a running Kubernetes Pod.

In this post I will go through three workflows I see most teams end up with:

  1. Use the tools already inside the application container
  2. Inject an ephemeral debug container (break-glass toolbox)
  3. Use a planned diagnostics setup (sidecar or diagnostics profile)

I’ll keep this post Kubernetes-first and use .NET memory dumps as the example, but the same three approaches apply to other runtimes and tools too (Java, Node.js, Go, Python, native apps).

Why diagnostics is tricky in Kubernetes

Many production images are intentionally minimal:

  • no shell
  • no package manager
  • no debugging utilities

This is good for security and size, but it changes how you debug. The “VM approach” (SSH in and install tools) usually doesn’t apply.

Option 1: Use tools already inside the application container

This is the simplest workflow and also the fastest when it works.

Typical flow:

  1. kubectl exec into the running container
  2. Run the diagnostic tool (dump, profile, trace)
  3. Copy artifacts out

Example (using .NET diagnostic tooling):

kubectl exec -n <ns> -it <pod> -- sh

# then, inside the container:
ps aux
dotnet-dump collect --process-id 1 --output /tmp/app.dmp

When option 1 is realistic

  • Your image is not distroless and has a shell
  • The diagnostic tools are already included, or your environment allows installing them
  • You can write artifacts somewhere (for example /tmp or a mounted volume)

Common reasons it fails in production

  • Minimal images (no shell/tools)
  • Locked-down egress (you can’t download tooling at runtime)
  • Read-only filesystem
  • Copying files out may be harder than expected (some methods require tooling such as tar)

Option 1 is great if you already planned for it, but many orgs intentionally avoid shipping tooling inside app containers.

Option 2: Inject an ephemeral debug container (break-glass toolbox)

Ephemeral containers let you attach a temporary container to an already running pod. This is a good fit when:

  • your app container is minimal
  • you need tools now
  • you want to avoid rebuilding images or restarting the pod

Conceptually, you add a toolbox container that has the tools you need (shell, process tools, runtime tooling).

Example injection:

kubectl debug -n <ns> -it pod/<pod> \
  --image=busybox:1.36 \
  --target=<app-container> -- sh

Important note about installing tools at runtime

A common instinct is to inject a container (for example the .NET SDK image) and then run dotnet tool install -g dotnet-dump.

In my experience this will often not work in production, because outbound egress to NuGet is usually blocked, and incident workflows should not depend on live downloads.

Instead, the usual approach is to publish your own small diagnostics toolbox image to your container registry (with the tools already installed), and use that image when you inject the ephemeral container.
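As a sketch, such a toolbox image can be little more than a base image with the tools preinstalled. The base images and tool list below are examples; pick whatever your runtime and incident playbooks actually need:

```dockerfile
# Illustrative diagnostics toolbox image - adjust base images and tools to your needs
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
RUN dotnet tool install --tool-path /tools dotnet-dump && \
    dotnet tool install --tool-path /tools dotnet-gcdump && \
    dotnet tool install --tool-path /tools dotnet-trace

FROM mcr.microsoft.com/dotnet/runtime:8.0
COPY --from=build /tools /tools
ENV PATH="${PATH}:/tools"
```

Because the tools are baked in at build time, the image works even in clusters with no outbound egress, and you can version and scan it like any other image in your registry.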

Recommended approach: inject a prebuilt diagnostics toolbox image

Example injection (your own image in your registry):

kubectl debug -n <ns> -it pod/<pod> \
  --image=<your-registry>/dotnet-diagnostics:8.0 \
  --target=<app-container> -- sh

Inside the debug container you can then collect a diagnostic artifact:

ps aux
dotnet-dump collect --process-id 1 --output /tmp/app.dmp

Then copy it out:

kubectl cp -n <ns> <pod>:/tmp/app.dmp ./app.dmp

Fallback: use an SDK image (only if your cluster allows downloads)

If your cluster allows outbound egress, you can use an SDK image and install tools at runtime:

kubectl debug -n <ns> -it pod/<pod> \
  --image=mcr.microsoft.com/dotnet/sdk:8.0 \
  --target=<app-container> -- bash

# then, inside the debug container:
dotnet tool install -g dotnet-dump
export PATH="$PATH:/root/.dotnet/tools"
dotnet-dump collect --process-id 1 --output /tmp/app.dmp

Pros

  • Works even if the application image is minimal
  • No rebuild required
  • Very flexible for incident response

Cons

  • Requires permissions (RBAC) to create ephemeral containers
  • Some orgs treat it as a security-sensitive “break-glass” action
  • The debug container uses resources (usually small, but it matters if the pod is already near limits)
  • If the workload is being OOM-killed quickly, you might not have enough time to collect a full dump

Option 2 is often the most practical approach when it is allowed, because it adapts well to minimal images.

Option 3: Planned diagnostics (self-service, auditable, automatable)

Option 2 is great when you have cluster permissions and you need to act fast, but it is still a manual workflow. Someone needs to inject a container, run tools, and move artifacts around.

Option 3 is different. The goal is to make diagnostics a standard product capability of the platform:

  • Repeatable workflow (same steps every time)
  • Controlled access (who can trigger, rate limits)
  • Auditable trail (who triggered what, when, why)
  • Safe artifact handling (storage, retention, encryption)
  • Self-service for developers, ideally without asking SRE to jump in

Instead of relying on kubectl debug rights, you build a controlled diagnostics path into the platform.

Building blocks

  1. A diagnostics mechanism inside the workload

    • Example: a sidecar like dotnet-monitor (or an equivalent tool for your runtime)
    • Alternative: a small “diagnostics container” that can collect artifacts from the app process
  2. A safe place for artifacts

    • A shared volume in the pod (for temporary storage)
    • Upload to controlled storage (S3/GCS/Azure Blob) with encryption and retention rules
  3. A trigger interface

    • A Kubernetes Job that performs the collection
    • A small internal "diagnostics controller" service
    • A developer-friendly entry point (CLI or Slack command)
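Putting building blocks 1 and 2 together, a dotnet-monitor sidecar with a shared volume might look roughly like this in a pod spec. Treat this as a sketch: the names, image tags, and environment variables are illustrative, and a real setup needs authentication and tighter security settings:

```yaml
# Illustrative sidecar snippet - not a complete, hardened pod spec
spec:
  containers:
    - name: app
      image: <your-registry>/payments-api:1.0
      env:
        # The runtime connects to the diagnostics socket on the shared volume
        - name: DOTNET_DiagnosticPorts
          value: /diag/port.sock
      volumeMounts:
        - name: diag
          mountPath: /diag
    - name: monitor
      image: mcr.microsoft.com/dotnet/monitor:8
      args: ["collect", "--urls", "http://localhost:52323", "--no-auth"]  # enable auth in real setups
      env:
        - name: DotnetMonitor_DiagnosticPort__ConnectionMode
          value: Listen
        - name: DotnetMonitor_DiagnosticPort__EndpointName
          value: /diag/port.sock
      volumeMounts:
        - name: diag
          mountPath: /diag
  volumes:
    - name: diag
      emptyDir: {}
```

With something like this in place, the trigger interface only needs to call the sidecar's HTTP API (dotnet-monitor exposes endpoints such as /dump and /gcdump) rather than exec into the application container.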

Pattern A: Diagnostics profile enabled via GitOps

This is a good fit for strict orgs because the diagnostics capability is enabled through a PR, giving you a clean audit trail.

  • Normal deployment has no diagnostics sidecar
  • A diagnostics overlay exists (Helm values or Kustomize overlay)
  • Developers (or on-call devs) can open a PR that enables diagnostics for a specific service and environment
  • After the dump is collected, diagnostics is disabled again via PR

Example (Helm values style):

# values-diagnostics.yaml
diagnostics:
  enabled: true  # example only
  tool: dotnet-monitor
  sharedVolume:
    type: emptyDir
    sizeLimit: 5Gi
  upload:
    enabled: true
    destination: blob://prod-diagnostics/<service>/
    retentionDays: 7

Execution flow:

  1. Developer opens PR: "Enable diagnostics profile for payments-api in prod"
  2. ArgoCD sync rolls pods with diagnostics enabled
  3. Developer triggers a dump using a standard command (see Pattern C below)
  4. Dump is uploaded, developer receives a link/artifact id
  5. Developer opens PR to disable diagnostics again

Pattern B: Always-on sidecar, locked down

If you need faster response, you can keep the diagnostics container always present, but make triggering and access strict:

  • NetworkPolicy blocks all access except a specific namespace or identity
  • RBAC limits who can port-forward or run the trigger job
  • Rate limits prevent repeated dump collection
  • Artifacts are uploaded to controlled storage and deleted from the node quickly
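The NetworkPolicy part of that lockdown could be sketched like this. The namespace, labels, and port are assumptions, and note that a policy selecting the pod also needs rules for normal application traffic, which are omitted here:

```yaml
# Illustrative policy: only pods in the "diagnostics" namespace may reach
# the diagnostics sidecar port. Rules for regular app traffic are omitted.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-diagnostics-only
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: diagnostics
      ports:
        - protocol: TCP
          port: 52323  # diagnostics sidecar port (example)
```

Combined with RBAC on port-forwarding and the trigger job, this keeps the always-on sidecar reachable only through the supported path.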

Pattern C: Developer self-service trigger (the important part)

To make option 3 truly self-service, you want developers to have a single supported way to request diagnostics without needing elevated kubectl access.

One approach that works well in stricter organisations is to let a CLI trigger a GitHub Actions workflow. That workflow can require an approval (for example via GitHub Environments) before it is allowed to run in prod.

The workflow then executes the diagnostics on behalf of the developer, using a controlled identity.

For example:

  • Developer runs a CLI command
  • CLI triggers a GitHub Action with inputs (service, env, dump type, incident id)
  • Someone approves (only for sensitive environments like prod)
  • The workflow creates a Kubernetes Job that:

    • talks to the diagnostics container (or runs collection logic)
    • stores the dump on a shared volume
    • compresses and uploads it to controlled storage
    • prints an artifact id or link in the logs
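A minimal sketch of such a GitHub Actions workflow follows. The workflow name, inputs, runner image, and cluster authentication are all assumptions; the important part is that `environment: prod` can be configured to require a reviewer approval before the job runs:

```yaml
# .github/workflows/diagnostics-collect-dump.yaml (illustrative)
name: Diagnostics - collect dump
on:
  workflow_dispatch:
    inputs:
      service:  { required: true, type: string }
      env:      { required: true, type: choice, options: [staging, prod] }
      type:     { required: true, type: choice, options: [gcdump, full] }
      incident: { required: true, type: string }

jobs:
  collect:
    runs-on: ubuntu-latest
    environment: ${{ inputs.env }}  # the "prod" environment requires approval
    steps:
      # Cluster authentication (kubeconfig or OIDC) is omitted here;
      # use a dedicated, narrowly scoped identity for diagnostics.
      - name: Create the collection job
        run: |
          kubectl -n diagnostics create job \
            "diag-dump-${{ inputs.service }}-${{ github.run_id }}" \
            --image=<your-registry>/diag-runner:1.0
```

The approval, the workflow inputs, and the job logs together give you the audit trail: who requested the dump, who approved it, and what was collected.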

Example:

# Developer triggers a dump through the CLI
diag dump payments-api --env prod --type full --incident INC-1234

# Under the hood the CLI triggers a GitHub Actions workflow, for example:
# - workflow: "Diagnostics - collect dump"
# - inputs: service=payments-api, env=prod, type=full, incident=INC-1234
# - requires approval for prod

# After approval, the GitHub Action runs and creates a job in the cluster (example)
kubectl -n diagnostics create job diag-dump-payments-api \
  --image=<your-registry>/diag-runner:1.0

What the developer gets back:

  • The GitHub Action output (or job logs) contains something like:

    • "Dump collected"
    • "Uploaded to blob://prod-diagnostics/payments-api/2026-01-07/app.dmp.gz"
    • "Retention: 7 days"
    • "Incident id: INC-1234"

You can also expose the same idea via Slack:

  • /diag dump payments-api prod full INC-1234
  • Bot triggers the GitHub Action (or calls the diagnostics controller)
  • Approval happens in the same place you already use for production changes
  • Bot returns the artifact link when it is done

The key point is that developers are not doing ad-hoc kubectl debugging. They are using a supported interface that enforces policy and leaves a clean audit trail (who requested, who approved, and what was collected).

Pros

  • Works even in locked-down clusters (no reliance on kubectl debug)
  • Good audit trail (request + approval + execution logs)
  • Makes artifact handling (encryption, retention, access logging) a solved platform problem
  • Self-service for developers without giving broad Kubernetes permissions

Cons

  • Requires up-front platform work (storage, RBAC, retention, the trigger mechanism)
  • Needs careful security design (dumps may contain sensitive data)
  • You still want guardrails (rate limits, require incident id in prod, prefer gcdump by default)

Conclusion

We've reviewed three workflows for diagnosing a Kubernetes pod.

I can see why option 3 is attractive. A structured workflow with approvals, an audit trail, and a predictable place for artifacts is a nice end state. At the same time, it does take some work to build and maintain. In environments where the cluster is less locked down, you’re a small team, and diagnosing a Kubernetes pod is unlikely to be a frequent task, option 1 or 2 might be a better fit, together with a wiki page describing the usual steps to follow.

Happy Coding!
