Ansible Error Handling
In real infrastructure, things go wrong. A service fails to start, a file is not where you expected it, a third-party API times out. Without proper error handling, a single unexpected failure stops the playbook on the affected host and leaves it in a partially configured state. This lesson covers the main error handling mechanisms Ansible provides, from simple failure tolerance to try/catch/finally-style workflows.
Default Error Behaviour
By default, if a task fails on a host, Ansible marks that host as failed and stops running tasks on it: the host is skipped for the rest of the current play and for any later plays in the playbook. Other hosts continue running. If all hosts fail, the playbook exits immediately. This conservative default is correct for most cases, because later tasks usually depend on earlier ones and carrying on after a failure tends to compound the damage.
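To see the default behaviour in isolation, a minimal sketch along these lines can help. The play name and messages are illustrative, and the playbook assumes an inventory with at least two hosts so that the effect on the surviving hosts is visible.

```yaml
# Hypothetical demo playbook: one host fails, the others carry on.
- name: Demonstrate default failure behaviour
  hosts: all
  gather_facts: false
  tasks:
    - name: Fail on the first host only
      fail:
        msg: "Simulated failure on {{ inventory_hostname }}"
      when: inventory_hostname == ansible_play_hosts_all[0]

    - name: Runs only on hosts that have not failed
      debug:
        msg: "Still running on {{ inventory_hostname }}"
```

In the play recap, the first host should show one failed task and no further activity, while the remaining hosts complete the second task.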
ignore_errors: Simple Failure Tolerance
- name: Stop application (may not be running)
  service:
    name: myapp
    state: stopped
  ignore_errors: true  # Continue even if this fails

- name: Remove old PID file
  file:
    path: /var/run/myapp.pid
    state: absent
  ignore_errors: true

Use ignore_errors: true when a task failure is acceptable and the playbook should continue regardless. Common use cases: stopping a service that might not be installed, checking for files that may not exist, running cleanup operations during updates.
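Important: ignore_errors only prevents the host from being marked as failed. The task result still shows as failed in the output, followed by an "ignoring" note. It also applies only to tasks that actually run and return a failed result; it does not cover unreachable hosts, undefined variables, or syntax errors. For truly expected conditions, use failed_when instead.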
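Because an ignored failure is still recorded in the task result, it can be registered and inspected by a later task. The following is a sketch only; the variable name stop_result is hypothetical, and the msg key is assumed to be present on failure, with a fallback in case it is not.

```yaml
- name: Stop application (may not be running)
  service:
    name: myapp
    state: stopped
  register: stop_result   # hypothetical variable name
  ignore_errors: true

- name: Report why the stop failed, if it did
  debug:
    msg: "Stop failed: {{ stop_result.msg | default('no message returned') }}"
  when: stop_result is failed
```

This keeps the playbook moving while still surfacing the reason for the failure, which a bare ignore_errors would leave buried in the output.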
failed_when: Defining What Failure Means
- name: Check application health endpoint
  uri:
    url: http://localhost:8080/health
    return_content: true
  register: health_check
  failed_when: health_check.status != 200 or 'healthy' not in health_check.content

- name: Run database migration
  command: /opt/myapp/bin/migrate --check
  register: migration_output
  failed_when:
    - migration_output.rc != 0
    - '"No pending migrations" not in migration_output.stdout'

failed_when replaces Ansible's default failure detection with your own condition. When the condition is written as a list, the items are joined with an implicit and, so the migration task above fails only if the return code is non-zero and the output lacks "No pending migrations". This is the clean way to handle tasks where "failure" has a domain-specific meaning. A related keyword, changed_when, controls whether a task is reported as changed; changed_when: false prevents a task from ever being marked as changed (useful for read-only commands that Ansible would otherwise mark as changed).
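A common case for a custom failure condition is a command whose non-zero exit code is not always an error. As an illustrative sketch, grep exits with 1 when it finds no matches and 2 on a real error, so only codes outside 0 and 1 should count as failure. The log path and the variable name error_scan are assumptions made for the example.

```yaml
- name: Count ERROR lines in the application log
  command: grep -c ERROR /var/log/myapp/app.log   # hypothetical log path
  register: error_scan
  failed_when: error_scan.rc not in [0, 1]   # rc 1 means "no matches", not an error
  changed_when: false                        # read-only command
```

Without the failed_when line, a clean log with no ERROR entries would be reported as a task failure.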
changed_when: Controlling the Changed Status
- name: Check if migration is needed
  command: /opt/myapp/bin/migrate --dry-run
  register: migrate_check
  changed_when: '"pending migrations" in migrate_check.stdout'

- name: Run schema check (read-only, never marks as changed)
  command: pg_dump --schema-only mydb
  changed_when: false
block, rescue, and always: Structured Error Handling
The block/rescue/always construct is the most powerful error handling pattern in Ansible — equivalent to try/catch/finally in a programming language:
- name: Application deployment with error handling
  block:
    # BLOCK: Tasks that might fail
    - name: Take application offline
      command: /opt/myapp/bin/maintenance on
    - name: Run database migrations
      command: /opt/myapp/bin/migrate up
    - name: Deploy new application code
      unarchive:
        src: /tmp/myapp-v2.tar.gz
        dest: /opt/myapp
        remote_src: true
    - name: Restart application
      service:
        name: myapp
        state: restarted
    - name: Bring application back online
      command: /opt/myapp/bin/maintenance off
  rescue:
    # RESCUE: Runs if any task in the block fails
    - name: Log deployment failure
      lineinfile:
        path: /var/log/deployments.log
        line: "FAILED: Deployment of v2 failed at {{ ansible_date_time.iso8601 }}"
    - name: Rollback to previous version
      command: /opt/myapp/bin/rollback
    - name: Take application back online after rollback
      command: /opt/myapp/bin/maintenance off
    - name: Notify deployment team
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deployment failed and rolled back on {{ inventory_hostname }}"
  always:
    # ALWAYS: Runs regardless of block success or failure
    - name: Record deployment attempt in audit log
      lineinfile:
        path: /var/log/deployment-audit.log
        line: "Deployment attempted at {{ ansible_date_time.iso8601 }} by {{ ansible_user_id }}"
    - name: Clear temporary deployment files
      file:
        path: /tmp/myapp-v2.tar.gz
        state: absent

How block/rescue/always Works
If all block tasks succeed, rescue is skipped and always runs. If any block task fails, Ansible skips the remaining block tasks and jumps to rescue. After rescue completes, always runs. The host is not marked as failed after a successful rescue: the block/rescue combination is considered a success from the playbook's perspective. If a rescue task itself fails, always still runs, but the host is then marked as failed.
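Inside rescue, Ansible exposes the variables ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. The fragment below is a minimal sketch of that flow, meant to sit in a play's tasks list; the task names are illustrative, and the rc key is assumed to exist because the failing task is a command.

```yaml
- name: Inspect the failure inside rescue
  block:
    - name: Step that fails on purpose
      command: /bin/false
  rescue:
    - name: Show which task failed and its return code
      debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed
          with rc={{ ansible_failed_result.rc | default('n/a') }}
  always:
    - name: Runs whether or not the block failed
      debug:
        msg: "Cleanup step reached"
```

Since the rescue section succeeds, the play recap should count the host as rescued rather than failed.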
any_errors_fatal: Stopping Everything on First Failure
- name: Critical database migration
  hosts: databases
  any_errors_fatal: true  # Abort all hosts if any host fails
  tasks:
    - name: Run migration
      command: /opt/db/migrate up

By default, a failure on one host does not stop other hosts. any_errors_fatal: true changes this: if any host fails a task, Ansible finishes that task on the other hosts in the current batch and then stops the play for all hosts. Use this for operations where partial completion would be worse than no change (schema migrations, cluster upgrades).
max_fail_percentage: Canary Deployments
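- name: Rolling deployment
  hosts: webservers
  max_fail_percentage: 20  # Abort if more than 20% of hosts in a batch fail
  serial: 2                # Deploy to 2 hosts at a time
  tasks:
    - name: Deploy application
      unarchive: ...

max_fail_percentage allows a controlled share of failures before aborting, which is useful in rolling deployments where some failure rate is tolerable but widespread failures should stop the deployment. When serial is set, the percentage is evaluated per batch, and it must be exceeded, not merely reached. With serial: 2, a single failed host is already 50% of the batch, so the 20% threshold above aborts on the first failure; the setting only has room to work with larger batches.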
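As a rough illustration of how the threshold interacts with batch size, the sketch below uses batches of ten hosts. The batch size and the placeholder debug task are assumptions for the example, not part of the deployment above.

```yaml
- name: Rolling deployment with room for failures
  hosts: webservers
  serial: 10                # 10 hosts per batch
  max_fail_percentage: 20   # 2 of 10 failing (20%) continues; 3 of 10 (30%) aborts
  tasks:
    - name: Placeholder deployment step
      debug:
        msg: "Deploying to {{ inventory_hostname }}"
```

Because the percentage has to be exceeded rather than equalled, two failures in a batch of ten are tolerated and the third one stops the play.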
Try This: Deployment with Rollback
Write a playbook that simulates an application deployment using block/rescue/always. In the block, simulate a failure by including a task that always fails (use command: /bin/false). In rescue, write a task that logs "ROLLBACK TRIGGERED" to a file. In always, write a task that logs "DEPLOYMENT ATTEMPT COMPLETE". Run the playbook and verify all three sections execute in the correct order and that the log file contains the expected entries.
Summary
ignore_errors continues playbook execution after task failures. failed_when defines custom failure conditions with domain-specific logic. changed_when: false prevents read-only tasks from being marked as changed. The block/rescue/always construct implements try/catch/finally semantics for grouped tasks with built-in rollback capability. any_errors_fatal stops all hosts on first failure for critical operations. max_fail_percentage implements canary deployment failure thresholds.
