
Running Kubernetes Jobs in CI/CD Made Easy

If you’ve tried to run K8s Jobs from within your CI/CD system, you know it’s tricky. You fire off a handful of kubectl commands and hope that does the job, but getting all the details right is challenging. I’m going to show you how to do it in an elegant and robust way.

Triggering Jobs and handling them correctly requires us to address several concerns:

- creating the Job (and cleaning up any previous run first)
- waiting for the Job’s Pod to start
- streaming the Job’s logs to the CI/CD output
- detecting whether the Job completed or failed, and cleaning up afterwards
- handling exit codes so the CI/CD system knows the result

Let me stress the importance of the last one (handling exit codes). Within a CI/CD environment, you’ll find that you want to run workflow steps conditionally, depending on the previous operation’s status. This enables deeper integration into your existing CI/CD flows.
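For instance, in GitHub Actions (used here purely as an illustration; the step names are made up, and other CI systems have equivalent conditionals), later steps can branch on whether the Job step succeeded:

steps:
  - name: Run K8s Job
    run: ./run-job.sh            # the script we'll build below
  - name: Continue pipeline
    if: success()                # runs only when the previous step exited 0
    run: echo "Job completed, continuing.."
  - name: Handle failure
    if: failure()                # runs when the Job step exited non-zero
    run: echo "Job failed, aborting.."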

Job and kubectl handling

First, let’s create a Job resource (job.yaml):

apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  backoffLimit: 0
  template:
    spec:
      containers:
      - name: myjob
        image: bash:latest
        command: ["/bin/sh", "-c"]
        args:
          - echo "Starting job..";
            sleep 1;
            echo "Working (1/3)..";
            sleep 1;
            echo "Working (2/3)..";
            sleep 1;
            echo "Working (3/3)..";
            sleep 1;
            echo "Done!";
      restartPolicy: Never

Note the backoffLimit: 0. This tells K8s to run the Job only once, with no retries. If you increase this value (the default is 6, so it’s non-zero by default), K8s will retry the Job’s Pod up to that many times until it succeeds. You may opt into retrying your Job depending on your use case.
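For example, if you did want up to three retries, a minimal tweak to job.yaml (the value 3 is arbitrary) would be:

spec:
  backoffLimit: 3   # retry the Job's Pod up to 3 times before marking the Job failed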

Now, let’s add the kubectl handling (run-job.sh):

#!/usr/bin/env bash

NS="mynamespace"
JOB="myjob"

# Delete the Job if it exists (runs could fail without cleanup)
kubectl delete job $JOB -n $NS || true
# Create the Job
kubectl apply -f job.yaml -n $NS
# Wait for the Job container creation
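# (assumes the Pod has already been created at this point; on a slow cluster this lookup may need a retry)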
kubectl wait --for=condition=ready -n $NS \
	$(kubectl get pod -l job-name=$JOB -n $NS -o name)
# Stream logs to STDOUT (with -f follow flag)
kubectl logs -f job/$JOB -n $NS

# Handling status (complete|failed)
# Wait for complete condition – push to bg and save PID
kubectl wait --for=condition=complete \
	job/$JOB -n $NS > /dev/null 2>&1 &
completion_pid=$!
# Wait for failed condition – push to bg and save PID
kubectl wait --for=condition=failed \
	job/$JOB -n $NS > /dev/null 2>&1 && exit 1 &
failure_pid=$!

# Wait until any of the waits complete
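# Note: wait -n with explicit PID arguments requires bash 5.1 or newer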
wait -n $completion_pid $failure_pid
exit_code=$?

# Display a friendly Job status message
if (( $exit_code == 0 )); then
  echo "Job completed"
else
  echo "Job failed with exit code ${exit_code}, exiting..."
fi

# Clean up the job afterwards
kubectl delete job $JOB -n $NS

# Exit with the Job's exit code
exit $exit_code

To test how this code handles errors, simply inject one (in job.yaml):

echo "Working (2/3)..";
sleep 1;
echo "ERROR!";
exit 1;
echo "Working (3/3)..";
sleep 1;

Caveat: compute minutes

Notice that using this method, you’re wasting compute. A CI/CD runner process triggers a K8s Job and then waits for it to complete. For longer Jobs, you will be blocking the runner for the whole Job duration, even though the runner’s compute load is close to zero. With self-hosted runners, this might be a non-issue, but if you pay for CI/CD minutes, this can quickly ramp up your bill.

Why bother with CI/CD

If it’s tricky to set up and might cost extra, why bother running Jobs this way? There are a number of valid reasons to do so:

CI/CD workflows run in a somewhat unique environment. They have triggers that can’t be reproduced otherwise:

Finally, the most popular CI/CD systems have a robust UI that makes managing jobs and workflows a breeze. You can view workflow runs, their logs, retry failed jobs, and do a lot more.

Many systems support manual workflow triggers. With a simple click of a button, a non-technical staff member can trigger powerful automation that manipulates K8s resources in a safe way. This can greatly simplify many complex RBAC & kubectl access patterns.
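As a minimal sketch, assuming GitHub Actions (the workflow name and repository layout are illustrative, and cluster authentication is omitted), a manually triggered workflow could look like this:

name: run-myjob
on:
  workflow_dispatch: {}          # adds a "Run workflow" button to the UI
jobs:
  run-job:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # cluster authentication (kubeconfig / cloud login) omitted for brevity
      - name: Run the K8s Job
        run: ./run-job.sh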

Example workflows: