Ansible Error Handling

In real infrastructure, things go wrong. A service fails to start, a file is not where you expected it, a third-party API times out. Without proper error handling, a single unexpected failure stops the run on the affected host and can leave it in a partially configured state. This lesson covers the error handling mechanisms Ansible provides, from simple failure tolerance to structured workflows in the style of try/catch/finally.

Default Error Behaviour

By default, if a task fails on a host, Ansible marks that host as failed, skips the remaining tasks for it, and excludes it from the rest of the playbook run. Other hosts continue running. If all hosts fail, the playbook exits immediately. This conservative default is correct for most cases: pressing on after an unexpected failure usually makes a half-configured host worse, not better.
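The default behaviour can be observed with a small play like the sketch below. It assumes a hypothetical inventory group webservers containing hosts named web1 and web2; those names are illustrative only.

```yaml
# Sketch: web1 fails the first task and is dropped; web2 carries on.
- name: Demonstrate default failure behaviour
  hosts: webservers          # hypothetical group with web1 and web2
  gather_facts: false
  tasks:
    - name: Fail on web1 only
      command: /bin/false
      when: inventory_hostname == "web1"

    - name: Runs only on hosts that have not failed
      debug:
        msg: "Still running on {{ inventory_hostname }}"
```

In the play recap, web1 should show one failed task, and the second task should report output for web2 only.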

ignore_errors: Simple Failure Tolerance

- name: Stop application (may not be running)
  service:
    name: myapp
    state: stopped
  ignore_errors: true   # Continue even if this fails

- name: Remove old PID file
  file:
    path: /var/run/myapp.pid
    state: absent
  ignore_errors: true

Use ignore_errors: true when a task failure is acceptable and the playbook should continue regardless. Common use cases: stopping a service that might not be installed, checking for files that may not exist, running cleanup operations during updates.

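Important: ignore_errors only prevents the host from being marked as failed. The task result still shows as failed in the output (followed by "...ignoring"). It also only applies when the task actually ran and returned a failed result; it does not suppress unreachable hosts, undefined variable errors, or syntax errors. For truly expected conditions, use failed_when instead.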
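Because the failure is still recorded in the task result, a later task can react to it. The following is a minimal sketch that registers the result of an ignored failure and tests it; the variable name stop_result is an assumption chosen for illustration.

```yaml
- name: Stop application (may not be running)
  service:
    name: myapp
    state: stopped
  register: stop_result      # illustrative variable name
  ignore_errors: true

- name: Report when the stop failed
  debug:
    msg: "Stop failed: {{ stop_result.msg | default('no message returned') }}"
  when: stop_result is failed
```

This keeps the playbook running while still making the ignored failure visible, which is usually preferable to silently discarding it.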

failed_when: Defining What Failure Means

- name: Check application health endpoint
  uri:
    url: http://localhost:8080/health
    return_content: true
  register: health_check
  failed_when: health_check.status != 200 or 'healthy' not in health_check.content

- name: Run database migration
  command: /opt/myapp/bin/migrate --check
  register: migration_output
  failed_when:
    - migration_output.rc != 0
    - '"No pending migrations" not in migration_output.stdout'

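failed_when replaces Ansible's default failure detection with your own condition. This is the clean way to handle tasks where "failure" has a domain-specific meaning. When failed_when is given a list, as in the migration example, the items are joined with an implicit and: the task fails only if every condition is true. Its companion, changed_when, controls the changed status in the same way; changed_when: false prevents a task from ever being marked as changed (useful for read-only commands that Ansible would otherwise mark as changed).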
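A list of failed_when conditions is combined with and, so a task that should fail when either condition holds needs a single expression using or. The sketch below assumes the migrate tool prints the string "ERROR" on problems; that marker is hypothetical and not taken from any real tool.

```yaml
- name: Fail when either condition holds
  command: /opt/myapp/bin/migrate --check
  register: migration_output
  changed_when: false        # a check should not report a change
  failed_when: >-
    migration_output.rc != 0 or
    "ERROR" in migration_output.stdout
```

The folded scalar (>-) is only there to split a long condition across lines; Ansible evaluates it as one expression.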

changed_when: Controlling the Changed Status

- name: Check if migration is needed
  command: /opt/myapp/bin/migrate --dry-run
  register: migrate_check
  changed_when: '"pending migrations" in migrate_check.stdout'

- name: Run schema check (read-only, never marks as changed)
  command: pg_dump --schema-only mydb
  changed_when: false
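changed_when and failed_when can be set on the same task, which is handy for commands whose exit code and output carry different information. This sketch assumes the migrate tool prints "Applied" when it changes something; that output string is an assumption for illustration.

```yaml
- name: Apply pending migrations
  command: /opt/myapp/bin/migrate up
  register: migrate_result
  # Report "changed" only when the tool says it applied something (assumed output)
  changed_when: '"Applied" in migrate_result.stdout'
  # Treat any non-zero exit code as a failure
  failed_when: migrate_result.rc != 0
```

An accurate changed status matters because handlers are notified only by tasks that report a change.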

block, rescue, and always: Structured Error Handling

The block/rescue/always construct is the most powerful error handling pattern in Ansible, roughly equivalent to try/catch/finally in a programming language:

- name: Application deployment with error handling
  block:
    # BLOCK: Tasks that might fail
    - name: Take application offline
      command: /opt/myapp/bin/maintenance on

    - name: Run database migrations
      command: /opt/myapp/bin/migrate up
      
    - name: Deploy new application code
      unarchive:
        src: /tmp/myapp-v2.tar.gz
        dest: /opt/myapp
        remote_src: true
      
    - name: Restart application
      service:
        name: myapp
        state: restarted
        
    - name: Bring application back online
      command: /opt/myapp/bin/maintenance off

  rescue:
    # RESCUE: Runs if any task in the block fails
    - name: Log deployment failure
      lineinfile:
        path: /var/log/deployments.log
        line: "FAILED: Deployment of v2 failed at {{ ansible_date_time.iso8601 }}"
        create: true   # lineinfile fails on a missing file unless create is set

    - name: Rollback to previous version
      command: /opt/myapp/bin/rollback
      
    - name: Take application back online after rollback
      command: /opt/myapp/bin/maintenance off
      
    - name: Notify deployment team
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deployment failed and rolled back on {{ inventory_hostname }}"

  always:
    # ALWAYS: Runs regardless of block success or failure
    - name: Record deployment attempt in audit log
      lineinfile:
        path: /var/log/deployment-audit.log
        line: "Deployment attempted at {{ ansible_date_time.iso8601 }} by {{ ansible_user_id }}"
        create: true   # lineinfile fails on a missing file unless create is set
    
    - name: Clear temporary deployment files
      file:
        path: /tmp/myapp-v2.tar.gz
        state: absent

How block/rescue/always Works

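If all block tasks succeed, rescue is skipped and always runs. If any block task fails, Ansible skips the remaining block tasks and jumps to rescue. After rescue completes, always runs. The host is not marked as failed after a successful rescue: the block/rescue combination is considered a success from the playbook's perspective. If a rescue task itself fails, always still runs, but the host ends up failed. Inside rescue, the variables ansible_failed_task and ansible_failed_result describe the task that triggered it.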
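Ansible exposes the failing task to the rescue section through ansible_failed_task and ansible_failed_result. The sketch below uses them to report what went wrong, then deliberately fails again so the host is still marked as failed after cleanup; whether to re-raise like this is a design choice, not a requirement.

```yaml
- name: Deployment step with failure details
  block:
    - name: Run database migrations
      command: /opt/myapp/bin/migrate up

  rescue:
    - name: Show which task failed and why
      debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed with:
          {{ ansible_failed_result.msg | default('no message') }}

    - name: Re-raise so the host is still reported as failed
      fail:
        msg: "Deployment failed on {{ inventory_hostname }}"

  always:
    - name: Runs even though rescue failed
      debug:
        msg: "Cleanup complete"
```

Without the fail task, a rescue that completes cleanly leaves the host counted as successful, which can hide a rolled-back deployment from whatever is watching the playbook's exit status.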

any_errors_fatal: Stopping Everything on First Failure

- name: Critical database migration
  hosts: databases
  any_errors_fatal: true   # Abort all hosts if any host fails
  tasks:
    - name: Run migration
      command: /opt/db/migrate up

By default, a failure on one host does not stop other hosts. any_errors_fatal: true changes this: if any host fails a task, Ansible finishes that task on the other hosts in the current batch and then stops the play for all hosts. Subsequent tasks and plays are not executed. Use this for operations where partial completion would be worse than no change (schema migrations, cluster upgrades).
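One common use is a pre-flight check placed ahead of the risky step, so that a problem on a single host prevents every host from proceeding. The following is a sketch only; /opt/db/preflight-check is a hypothetical script, not something defined in this lesson.

```yaml
- name: Critical database migration with a pre-flight check
  hosts: databases
  any_errors_fatal: true
  tasks:
    - name: Verify every host is ready before changing anything
      command: /opt/db/preflight-check   # hypothetical script
      changed_when: false

    - name: Run migration
      command: /opt/db/migrate up
```

If the check fails on any one host, no host reaches the migration task.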

max_fail_percentage: Canary Deployments

- name: Rolling deployment
  hosts: webservers
  max_fail_percentage: 20   # Abort if more than 20% of the hosts in a batch fail
  serial: 2                  # Deploy to 2 hosts at a time
  tasks:
    - name: Deploy application
      unarchive: ...

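max_fail_percentage allows a controlled share of failures before aborting, which is useful in rolling deployments where some failure rate is tolerable but widespread failures should stop the rollout. The percentage is evaluated per batch and must be exceeded, not merely reached. With serial: 2, a single failure is already 50% of the batch, so a 20% threshold only becomes meaningful with larger batches.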
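To make the threshold arithmetic concrete, here is a sketch with a larger batch size. The deploy command path is a hypothetical placeholder.

```yaml
- name: Rolling deployment in batches of 10
  hosts: webservers
  serial: 10
  max_fail_percentage: 20
  tasks:
    - name: Deploy application
      command: /opt/myapp/bin/deploy   # hypothetical deploy command
```

With these values, two failed hosts in a batch (20%) do not abort the play because the threshold is reached but not exceeded; a third failure (30%) does.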

Try This: Deployment with Rollback

Write a playbook that simulates an application deployment using block/rescue/always. In the block, simulate a failure by including a task that always fails (use command: /bin/false). In rescue, write a task that logs "ROLLBACK TRIGGERED" to a file. In always, write a task that logs "DEPLOYMENT ATTEMPT COMPLETE". Run the playbook and verify all three sections execute in the correct order and that the log file contains the expected entries.
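If you want a starting point, the sketch below is one possible skeleton. It runs against localhost and writes to /tmp/deploy-exercise.log; both choices, and the log_file variable name, are assumptions made for the exercise.

```yaml
- name: Simulated deployment with rollback
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    log_file: /tmp/deploy-exercise.log   # assumed path for the exercise
  tasks:
    - name: Deployment
      block:
        - name: Simulate a failing step
          command: /bin/false

      rescue:
        - name: Log rollback
          lineinfile:
            path: "{{ log_file }}"
            line: "ROLLBACK TRIGGERED"
            create: true

      always:
        - name: Log completion
          lineinfile:
            path: "{{ log_file }}"
            line: "DEPLOYMENT ATTEMPT COMPLETE"
            create: true
```

Try removing the failing task afterwards and confirm that rescue is skipped while always still runs.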

Summary

ignore_errors continues playbook execution after a task returns a failed result. failed_when defines custom failure conditions with domain-specific logic. changed_when controls the changed status, and changed_when: false prevents read-only tasks from being marked as changed. The block/rescue/always construct implements try/catch/finally semantics for grouped tasks, which makes it a natural place for rollback logic. any_errors_fatal stops the play on all hosts after the first failure, for critical operations. max_fail_percentage sets a per-batch failure threshold for rolling deployments.
