TechTrailCamp

DevOps Pipeline Troubleshooting Guide

A broken CI/CD pipeline doesn't just block one developer. It blocks the entire team. Deployments stall, PRs pile up, and developers start working around the pipeline instead of through it. If your team has ever said "just merge it and we'll fix the tests later," your pipeline has a trust problem. Here's how to diagnose and fix the most common pipeline failures.

Flaky Tests: The Silent Killer

A test that passes 90% of the time is worse than a test that always fails. Failing tests get fixed immediately. Flaky tests get retried, ignored, and eventually erode all confidence in the test suite.

How to diagnose: Track test pass rates over time. Most CI platforms (GitHub Actions, GitLab CI, Jenkins) let you see test history. Any test that both passes and fails over the last 30 runs, without a related code change, is flaky. A test that fails every time is simply broken.
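The pass-rate check can be sketched in a few lines. This is a minimal illustration, assuming you have already exported CI history as a list of runs, each mapping test name to pass/fail. The `find_flaky_tests` function and the sample data are hypothetical and not part of any CI platform's API. Tests that fail on every run are treated as broken rather than flaky.

```python
from collections import defaultdict


def find_flaky_tests(runs, window=30):
    """Return {test_name: pass_rate} for tests that both passed and failed
    within the last `window` runs (hypothetical helper)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for run in runs[-window:]:
        for name, ok in run.items():
            total[name] += 1
            if ok:
                passed[name] += 1

    flaky = {}
    for name, count in total.items():
        rate = passed[name] / count
        # 0% means consistently broken, 100% means healthy; anything between is flaky.
        if 0 < rate < 1:
            flaky[name] = rate
    return flaky


# Made-up history: one dict per CI run, True means the test passed.
history = [
    {"test_login": True, "test_checkout": True, "test_search": False},
    {"test_login": True, "test_checkout": False, "test_search": False},
    {"test_login": True, "test_checkout": True, "test_search": False},
    {"test_login": True, "test_checkout": True, "test_search": False},
]

print(find_flaky_tests(history))  # {'test_checkout': 0.75}
```

Here `test_search` is excluded because it never passes, which matches the distinction between failing tests and flaky tests made above.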

How to fix:

  • Quarantine flaky tests immediately — move them to a separate suite that doesn't block PRs
  • Common culprits: time-dependent assertions, shared mutable state between tests, race conditions in async tests, hard-coded ports or file paths
  • Run flaky tests in a loop locally (for i in {1..50}; do npm test; done) to reproduce the failure
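Time-dependent assertions, one of the culprits listed above, can often be made deterministic by passing the clock in rather than reading it inside the code under test. A minimal sketch, assuming a hypothetical `is_session_expired` helper that is not taken from any particular codebase:

```python
from datetime import datetime, timedelta, timezone


def is_session_expired(expires_at, now=None):
    """Return True once `now` has reached `expires_at`.

    `now` is injectable so tests do not depend on the wall clock.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at


def test_session_expiry():
    # A fixed instant makes the outcome identical on every run and every machine.
    fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    assert not is_session_expired(fixed_now + timedelta(seconds=1), now=fixed_now)
    assert is_session_expired(fixed_now, now=fixed_now)
    assert is_session_expired(fixed_now - timedelta(seconds=1), now=fixed_now)


test_session_expiry()
```

The same idea applies to the other culprits: give each test its own state, port, or temporary directory instead of sharing one.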

Dependency Resolution Failures

The build worked yesterday but fails today with a dependency error. Nobody changed anything. What happened?

Common causes:

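  • Unpinned dependencies: "lodash": "^4.17.0" accepts any 4.x release at or above 4.17.0, so a new minor release with a breaking change can land without anyone touching your code. Always commit your lockfile (package-lock.json, poetry.lock, or go.mod together with go.sum for Go) and install from it in CI, for example with npm ci.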
  • Registry outages — npm, PyPI, or Maven Central had a bad day. Use a private registry or cache (Artifactory, Nexus, CodeArtifact) as a fallback.
  • Docker base image changes: FROM node:18 pulled a new patch that broke your build. Pin to specific digests: FROM node:18.19.1@sha256:abc...
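Two of the causes above, range specifiers and floating base image tags, can be caught with a simple pre-merge check. The sketch below is a rough heuristic, not a full semver or Dockerfile parser. The function names are hypothetical, and the digest in the example is a placeholder rather than a real image digest.

```python
RANGE_PREFIXES = ("^", "~", ">", "<", "*")


def unpinned_npm_dependencies(dependencies):
    """Return dependency names whose version spec is a range, not an exact version.

    Heuristic only: it does not recognize every range syntax (for example "4.17.x").
    """
    return sorted(
        name
        for name, spec in dependencies.items()
        if spec.startswith(RANGE_PREFIXES) or spec in ("", "latest")
    )


def unpinned_base_images(dockerfile_text):
    """Return images named in FROM lines that are not pinned to a sha256 digest.

    Note: references to earlier build stages and `scratch` are also reported.
    """
    images = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64 that precede the image name.
        image = next((t for t in tokens[1:] if not t.startswith("--")), None)
        if image and "@sha256:" not in image:
            images.append(image)
    return images


deps = {"lodash": "^4.17.0", "express": "4.18.2", "left-pad": "~1.3.0"}
print(unpinned_npm_dependencies(deps))  # ['left-pad', 'lodash']

dockerfile = "FROM node:18\nRUN npm ci\n"
print(unpinned_base_images(dockerfile))  # ['node:18']

# Placeholder digest, for illustration only.
pinned = "FROM node:18.19.1@sha256:0123abcd\n"
print(unpinned_base_images(pinned))  # []
```

A committed lockfile still pins the installed versions even when package.json uses ranges, so treat the first check as a prompt to verify the lockfile is present and used in CI.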

Environment Drift

The code works locally and passes CI but fails in staging. Or it works in staging but fails in production. In both cases, the environment you deployed to isn't the same as the one you tested against.

How to fix:

  • Use identical Docker images across all environments — build once, deploy the same artifact everywhere
  • Externalize all environment-specific config through environment variables, not different config files per environment
  • Use infrastructure-as-code (Terraform, CloudFormation) to ensure environments are provisioned identically
  • Run integration tests against a staging environment that mirrors production's infrastructure
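Externalizing configuration can be as simple as reading every environment-specific value from environment variables at startup and failing fast when a required one is missing. A minimal sketch follows. The variable names (`DATABASE_URL`, `LOG_LEVEL`, `REQUEST_TIMEOUT_SECONDS`) and the `Settings` class are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    request_timeout_seconds: float


def load_settings(env=None):
    """Build Settings from a mapping of environment variables.

    Defaults to os.environ; accepting a mapping keeps the function easy to test.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("DATABASE_URL",) if key not in env]
    if missing:
        # Fail at startup rather than at the first database call.
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "INFO"),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )


# The same image runs everywhere; only the injected variables differ.
staging = load_settings({"DATABASE_URL": "postgres://staging-db/app"})
production = load_settings(
    {"DATABASE_URL": "postgres://prod-db/app", "LOG_LEVEL": "WARNING"}
)
print(staging.log_level, production.log_level)  # INFO WARNING
```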

Our Azure & DevOps training covers building environment parity with Azure DevOps pipelines and ARM templates.

Permission and Credential Errors

The pipeline fails with "access denied" or "unauthorized." These are some of the most frustrating failures to debug because the error message is often unhelpful.

Checklist:

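  • Are CI secrets (API keys, tokens) expired? Temporary AWS credentials and GitHub personal access tokens can expire, and Docker Hub tokens may be issued with an expiry date. Long-lived AWS IAM access keys do not expire on their own, but they can be deactivated or deleted.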
  • Did someone rotate credentials without updating the CI secrets?
  • Are IAM roles correctly assumed? Check the trust policy on your CI role.
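  • For GitHub Actions: did a workflow change from pull_request to pull_request_target? The security context changes significantly: pull_request_target runs in the context of the base repository and can access its secrets, while pull_request runs triggered from forks receive no secrets and a read-only token.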
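Expired secrets are easier to catch before they break a build. The sketch below assumes your team keeps its own inventory of secret names and expiry dates. That inventory, the secret names, and the `secrets_needing_rotation` function are hypothetical, since CI platforms generally do not expose secret expiry in a uniform way.

```python
from datetime import date, timedelta


def secrets_needing_rotation(expiry_by_name, today, warn_within_days=14):
    """Return (name, days_left) pairs for secrets that are expired or expire soon.

    A negative days_left means the secret has already expired.
    """
    cutoff = today + timedelta(days=warn_within_days)
    due = [
        (name, (expires - today).days)
        for name, expires in expiry_by_name.items()
        if expires <= cutoff
    ]
    return sorted(due, key=lambda item: item[1])


# Hypothetical, manually maintained inventory.
inventory = {
    "DOCKERHUB_TOKEN": date(2024, 3, 1),
    "GH_DEPLOY_TOKEN": date(2024, 3, 20),
    "NPM_PUBLISH_TOKEN": date(2024, 9, 1),
}

print(secrets_needing_rotation(inventory, today=date(2024, 3, 10)))
# [('DOCKERHUB_TOKEN', -9), ('GH_DEPLOY_TOKEN', 10)]
```

Running a check like this on a schedule turns a surprise "unauthorized" failure into a routine rotation task.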

Best Practices for Reliable Pipelines

  • Keep pipelines fast — under 10 minutes. Parallelize test suites, cache dependencies aggressively, and skip unnecessary steps on irrelevant changes.
  • Fail early — run linting and type checking before tests. Run unit tests before integration tests. Don't waste 20 minutes on a full test suite when the code doesn't even compile.
  • Version your pipeline config — treat .github/workflows/ and Jenkinsfile with the same rigor as application code. Review changes, test in branches.
  • Monitor your pipeline — track build times, failure rates, and flaky test counts. Set alerts when build times increase by more than 20%.
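The 20% build-time alert reduces to a small comparison. A minimal sketch, assuming build durations in seconds have already been exported from your CI system. Comparing medians rather than single builds is a design choice made here to dampen one-off slow runs, and `build_time_regressed` is a hypothetical helper.

```python
from statistics import median


def build_time_regressed(baseline_seconds, recent_seconds, threshold=0.20):
    """Compare median build times and report whether the relative increase
    exceeds `threshold` (0.20 means 20%).

    Returns (regressed, relative_increase).
    """
    baseline = median(baseline_seconds)
    current = median(recent_seconds)
    increase = (current - baseline) / baseline
    return increase > threshold, increase


# Made-up durations in seconds.
baseline = [410, 395, 420, 405, 400]  # median 405
recent = [500, 510, 495]              # median 500

regressed, increase = build_time_regressed(baseline, recent)
print(regressed, f"{increase:.1%}")  # True 23.5%
```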

If your team is spending more time fighting the pipeline than writing code, something needs to change. Our AWS & DevOps training teaches you how to build CI/CD pipelines that are fast, reliable, and self-healing. For teams that need immediate help fixing a broken pipeline, our DevOps work assistance provides hands-on support.
