Table of Contents
- Common Linux Backup Tools: A Primer
- Troubleshooting Methodology: A Structured Approach
- Permission Issues: The Silent Saboteur
- Storage-Related Failures: When Disks Let You Down
- Network Failures: Remote Backups Gone Wrong
- Tool-Specific Errors: Decoding Backup Tool Output
- Data Corruption: Silent Failures and How to Detect Them
- Scheduling Failures: When Cron and Systemd Drop the Ball
- Advanced Troubleshooting Techniques
- Prevention: Avoiding Failures Before They Happen
- Conclusion
- References
1. Common Linux Backup Tools: A Primer
Before diving into troubleshooting, it’s critical to understand the tools you’re working with. Linux offers a range of backup utilities, each with its own strengths, weaknesses, and failure modes:
| Tool | Use Case | Common Failure Points |
|---|---|---|
| rsync | Incremental file-level backups (local/remote) | Permission denied, network timeouts, vanished files |
| tar | Archive-based backups (local) | Corrupted archives, disk full, unreadable sources |
| borgbackup | Deduplicated, encrypted backups | Repository locks, checksum errors, disk I/O issues |
| restic | Secure, deduplicated backups (cloud/local) | Repository corruption, network latency |
| Amanda/Bacula | Enterprise-grade network backups | Daemon crashes, misconfigured clients, tape errors |
Most failures stem from misconfigurations, environmental issues (e.g., disk space), or tool-specific quirks. Let’s now explore how to diagnose these systematically.
2. Troubleshooting Methodology: A Structured Approach
Effective troubleshooting avoids guesswork. Follow this framework to isolate and resolve issues:
Step 1: Reproduce the Failure
- Run the backup command manually (outside of cron/systemd) to rule out scheduling issues.
- Note the exact error message (e.g., `rsync: permission denied (13)`).
Step 2: Check Logs
- Most tools log to `stdout`/`stderr` by default. Redirect output to a file (e.g., `backup.sh > backup.log 2>&1`) for analysis.
- System logs: Check `/var/log/syslog`, `journalctl` (for systemd services), or any log file the tool or its wrapper is configured to write (e.g., borgmatic).
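The log-capture step can be wrapped in a small helper so that every manual run leaves a trace. This is a minimal sketch, not a standard tool: `run_logged` is a hypothetical function name, and the timestamp format is an arbitrary choice.

```sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: appends the command's stdout and stderr to LOGFILE
# and returns the command's own exit status.
run_logged() (
    log=$1
    shift
    printf '%s START %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$log"
    "$@" >> "$log" 2>&1
    status=$?
    printf '%s END exit=%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$status" >> "$log"
    exit "$status"
)
```

A call such as `run_logged backup.log ./backup.sh` keeps the exit status available in `$?` while the full output lands in `backup.log`.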
Step 3: Verify Prerequisites
- Storage: Ensure the target filesystem has free space (`df -h`).
- Permissions: Confirm the backup user has read access to sources and write access to the target.
- Connectivity: For remote backups, verify network access (e.g., `ping`, `telnet` to the target port).
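The permission part of these checks can be scripted. The sketch below assumes a hypothetical helper named `preflight`; it only covers read access to the source and write access to the target, and should be run as the backup user.

```sh
# preflight SOURCE TARGET_DIR
# Hypothetical helper: checks that SOURCE is readable and that TARGET_DIR
# accepts new files, printing one line per problem found.
preflight() (
    src=$1
    dst=$2
    problems=0
    if [ ! -r "$src" ]; then
        echo "source not readable: $src"
        problems=1
    fi
    # Actually creating a file exercises the write path instead of only
    # inspecting permission bits.
    if probe=$(mktemp "$dst/.preflight.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
    else
        echo "target not writable: $dst"
        problems=1
    fi
    exit "$problems"
)
```

For example, `preflight /path/to/source /path/to/target || exit 1` at the top of a backup script stops the run before any data is touched.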
Step 4: Isolate Variables
- Test with a minimal dataset (e.g., back up a single small file) to rule out source-specific issues.
- Disable non-essential services (e.g., antivirus, firewalls) temporarily to check for interference.
Step 5: Test in a Controlled Environment
- Replicate the failure in a staging environment (e.g., a VM) to avoid disrupting production data.
3. Permission Issues: The Silent Saboteur
Linux’s strict permission model is a frequent culprit. Backups often fail because the user running the backup lacks access to source files or the target directory.
Common Causes
- The backup user (e.g., `backup-user`) lacks read permissions on source files (`chmod`/`chown` issues).
- ACLs (Access Control Lists) restrict access (check with `getfacl`).
- SELinux/AppArmor policies block file access (e.g., the audit log shows `avc: denied` entries).
Diagnosis
- Check source file permissions: `ls -l /path/to/source/file`.
- Verify target directory write access: `touch /path/to/target/testfile` (fails if no write perms).
- Check SELinux status: `sestatus` (enforcing mode may block backups).
- Review ACLs: `getfacl /path/to/source/directory`.
Resolution
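- Fix ownership/permissions: `chown -R backup-user:backup-group /source/dir`; `chmod -R u+rX /source/dir`. Changing ownership affects whoever else uses the data, so on shared or application data prefer the ACL approach below.
- Adjust ACLs: `setfacl -R -m u:backup-user:rX /source/dir`.
- SELinux: Temporarily set to permissive mode (`setenforce 0`) to test; if that resolves the failure, switch back to enforcing mode (`setenforce 1`) and update policies with `semanage fcontext` (followed by `restorecon`) or create a custom module.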
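Before changing any permissions, it helps to know exactly which paths are affected. The following sketch assumes GNU `find` and a hypothetical helper named `list_unreadable`; run it as the backup user so the result reflects that user's access.

```sh
# list_unreadable SOURCE_DIR
# Prints every path under SOURCE_DIR that the current user cannot read.
# Assumes GNU find, whose -readable test takes ACLs into account as well
# as the classic mode bits.
list_unreadable() {
    find "$1" ! -readable -print 2>/dev/null
}
```

An empty result means the current user can read everything under the directory; any path printed is a candidate for the ACL or ownership fixes above.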
4. Storage-Related Failures: When Disks Let You Down
Backups depend on healthy storage. Even with permissions sorted, storage issues like full disks, read-only filesystems, or failing hardware will cause failures.
Common Causes
- Target filesystem is full (`df -h` shows 100% usage).
- The target disk has bad sectors (physical damage).
- The filesystem is mounted read-only (e.g., due to errors during boot).
Diagnosis
- Check disk space: `df -h /path/to/target` (look for `Use%`).
- Verify mount status: `mount | grep /path/to/target` (ensure `rw` flag).
- Check for disk errors: `dmesg | grep -i error` (look for `I/O error`, `bad sector`).
- Test disk health: `smartctl -a /dev/sdX` (SMART data reveals pending failures).
Resolution
- Free space: Delete old backups or expand the filesystem (e.g., `lvextend` + `resize2fs` for LVM).
- Fix read-only mounts: Remount with `mount -o remount,rw /path/to/target`; if errors persist, run `fsck /dev/sdX` (unmount first!).
- Replace failing disks: Use `smartctl` to confirm failure, then restore data to a new disk.
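A free-space check can run before every backup rather than only after a failure. This sketch parses `df -P` output; `avail_kb` and `has_room` are hypothetical helper names, and the parsing assumes the filesystem's device name contains no spaces.

```sh
# avail_kb PATH
# Prints the free space, in 1 KiB blocks, on the filesystem holding PATH.
# -P forces the POSIX one-line-per-filesystem layout; column 4 is "Available".
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_room PATH REQUIRED_KB
# Succeeds only if at least REQUIRED_KB KiB are free under PATH.
has_room() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}
```

For instance, `has_room /path/to/target 1048576 || echo "less than 1 GiB free"` flags a nearly full target before the backup starts.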
5. Network Failures: Remote Backups Gone Wrong
Remote backups (e.g., rsync over SSH, restic to S3) often fail due to network instability or misconfiguration.
Common Causes
- Network latency/timeouts (e.g., slow WAN links).
- Firewall rules blocking ports (e.g., SSH port 22, rsync port 873).
- Authentication failures (e.g., SSH key mismatch, expired credentials).
- Packet loss (causes retransmissions, stalls, and dropped connections during transfer).
Diagnosis
- Test connectivity: `ping remote-host` (check latency/packet loss); `telnet remote-host 22` (SSH) or `nc -zv remote-host 873` (rsync).
- Verify authentication: `ssh -v remote-host` (verbose mode shows key exchange issues).
- Check firewall rules: `iptables -L` (local) or `ufw status`; on the remote host, ensure incoming ports are allowed.
- Monitor transfer speed: `rsync --progress /local/file remote-host:/remote/path` (identifies slow links).
Resolution
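- Tune timeouts and bandwidth: For `rsync`, use `--timeout=300` (5 minutes); for `borgbackup`, throttle with `--remote-ratelimit` (renamed `--upload-ratelimit` in Borg 1.2) so the backup does not saturate a slow link.
- Fix firewall rules: Allow SSH/rsync ports (e.g., `ufw allow 22/tcp`).
- Stabilize connections: Use a VPN for unreliable links; enable SSH keepalives (`ClientAliveInterval 60` in the server's `sshd_config`, or `ServerAliveInterval 60` in the client's `ssh_config`).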
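Transient network failures are often best handled by retrying with an increasing delay. The helper below is a sketch under that assumption; `retry` is a hypothetical name, and doubling the delay is one simple backoff policy among several.

```sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds or ATTEMPTS runs
# have been made, doubling the delay after each failure.
# Returns 0 on success, otherwise the last exit status of COMMAND.
retry() (
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@"
        status=$?
        [ "$status" -eq 0 ] && exit 0
        [ "$n" -ge "$attempts" ] && exit "$status"
        echo "attempt $n failed (exit $status); retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        n=$((n + 1))
    done
)
```

A call like `retry 3 30 rsync -a --partial --timeout=300 /source/ remote-host:/remote/path` makes up to three attempts, waiting 30 and then 60 seconds between them.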
6. Tool-Specific Errors: Decoding Backup Tool Output
Each backup tool has unique error messages. Let’s decode common ones and fix them.
Example 1: rsync Failures
- Error: `rsync: read error: Connection reset by peer (104)`
  - Cause: Network interruptions or remote host restart.
  - Fix: Use `rsync --partial` to keep partially transferred files so the next run can resume them.
- Error: `file has vanished: "/path/to/file"`
  - Cause: File was deleted/renamed during transfer (common with temp files).
  - Fix: rsync reports this case with exit code 24; treat that code as a warning in your wrapper script, or exclude volatile directories (`--exclude=/tmp`).
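rsync exits with status 24 when source files vanish during a transfer, which a wrapper script can downgrade to a warning. The sketch below shows one way to do that; `classify_rsync_exit` is a hypothetical helper, and treating 24 as non-fatal is a policy choice rather than rsync's own behavior.

```sh
# classify_rsync_exit CODE
# Maps an rsync exit status to "ok", "warning", or "error".
classify_rsync_exit() {
    case $1 in
        0)  echo ok ;;
        24) echo warning ;;   # partial transfer: source files vanished
        *)  echo error ;;
    esac
}

# Typical use in a wrapper script:
#   rsync -a /source/ /target/
#   result=$(classify_rsync_exit $?)
#   [ "$result" = error ] && exit 1
```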
Example 2: tar Failures
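- Error: `tar: Unexpected EOF in archive`
  - Cause: The archive is truncated, typically because the disk filled up while it was being written, a transfer was interrupted, or the media is corrupted.
  - Fix: Check free space with `df -h`; list the archive with `tar -tf archive.tar` to see where it breaks off, then re-create it from the source.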
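Listing an archive right after creating it catches truncation before the archive is needed for a restore. This is a minimal sketch for gzip-compressed archives; `archive_ok` is a hypothetical helper, and a successful listing shows the archive is structurally complete, not that the source data was correct.

```sh
# archive_ok ARCHIVE
# Succeeds if tar can list every entry of the gzip-compressed ARCHIVE.
# A truncated archive makes the listing fail with a non-zero status.
archive_ok() {
    tar -tzf "$1" > /dev/null 2>&1
}
```

For example, `archive_ok backup.tar.gz || echo "archive is damaged or incomplete"` can run as the last step of a backup script.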
Example 3: borgbackup Failures
- Error: `Repository lock held by PID XXXX on host YYYY`
  - Cause: A previous backup crashed, leaving a stale lock.
  - Fix: Confirm that no other borg process is still using the repository, then remove the lock: `borg break-lock /path/to/repo`.
Example 4: restic Failures
- Error: `repository master key and config already initialized`
  - Cause: Attempting to re-initialize an existing repo.
  - Fix: Skip `restic init` and verify that the repo path points at the intended repository. Stale locks are a separate problem, cleared with `restic unlock`.
7. Data Corruption: Silent Failures and How to Detect Them
Silent corruption (e.g., bit rot) is insidious: backups complete “successfully,” but data is corrupted.
Common Causes
- Failing storage (bad sectors corrupting written data).
- Memory errors (RAM issues cause data corruption during transfer).
- Software bugs (e.g., a backup tool miscalculates checksums).
Diagnosis
- Verify checksums: Compare source and backup hashes (`md5sum /source/file /backup/file`).
- Use tool-specific integrity checks:
  - `borg check /path/to/repo`
  - `restic check --read-data` (verifies all data in the repo).
- Test restore: Restore a file and compare with the source (`diff /restored/file /source/file`).
Resolution
- Restore from a previous known-good backup.
- Replace faulty hardware (e.g., RAM tested with `memtest86+`, failing disks with `smartctl`).
- Update backup tools (bugs are often fixed in newer versions).
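Comparing hashes file by file does not scale to a whole tree, so a checksum manifest is a common alternative. The sketch below assumes GNU coreutils `sha256sum` and uses two hypothetical helpers, `make_manifest` and `verify_manifest`; SHA-256 is swapped in for MD5 here as a matter of preference.

```sh
# make_manifest DIR
# Prints SHA-256 checksums for every file under DIR, with paths relative
# to DIR so the manifest can be checked against a copy stored elsewhere.
make_manifest() (
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | sort -k 2
)

# verify_manifest DIR MANIFEST
# Succeeds only if every file listed in MANIFEST exists under DIR with a
# matching checksum. --quiet suppresses the per-file "OK" lines.
verify_manifest() (
    manifest=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
    cd "$1" || exit 1
    sha256sum --check --quiet "$manifest"
)
```

One possible workflow is `make_manifest /source > manifest.sha256` at backup time, followed by `verify_manifest /restored manifest.sha256` after a test restore.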
8. Scheduling Failures: When Cron and Systemd Drop the Ball
Backups scheduled via cron or systemd timers often fail silently due to environment or timing issues.
Common Causes
- Cron jobs run with a minimal `PATH` (e.g., `restic` is not found because it lives in `/usr/local/bin` while cron's default `PATH` is typically limited to `/usr/bin:/bin`).
- Systemd timers fail due to `Condition` checks (e.g., `ConditionPathExists` on a missing file).
- Logs are not captured, hiding errors (cron output is emailed to `root`, but email may be disabled).
Diagnosis
- Check cron logs: `grep CRON /var/log/syslog` (look for `(CRON) error (grandchild #XXXX failed with exit status 1)`).
- Inspect systemd timer status: `systemctl status backup.timer` and `journalctl -u backup.service`.
- Test cron commands manually: Run the exact cron command as the cron user (e.g., `sudo -u backup-user /path/to/backup.sh`).
Resolution
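- Use absolute paths in cron/systemd: Replace `rsync` with `/usr/bin/rsync`, or set `PATH` explicitly at the top of the crontab.
- Redirect output to a log file: In cron, `0 2 * * * /backup.sh >> /var/log/backup.log 2>&1` (runs daily at 02:00 and appends instead of overwriting the log).
- Fix systemd conditions: Ensure `ConditionPathExists` points to a valid file; in the timer unit's `[Install]` section, set `WantedBy=timers.target` and enable it with `systemctl enable --now backup.timer`.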
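Running a script from an interactive shell hides environment problems, because the shell's `PATH` and variables are much richer than cron's. The sketch below approximates a cron environment; `run_like_cron` is a hypothetical helper, and the `PATH` value is an assumption that should be checked against your cron implementation's documented default.

```sh
# run_like_cron COMMAND_STRING
# Runs COMMAND_STRING under /bin/sh with an almost empty environment,
# approximating what a cron job sees. env -i discards every inherited
# variable except the ones listed explicitly.
run_like_cron() {
    env -i HOME="${HOME:-/}" SHELL=/bin/sh PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}
```

If `run_like_cron '/path/to/backup.sh'` fails with "command not found" while a normal run succeeds, the script depends on a tool outside that `PATH`.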
9. Advanced Troubleshooting Techniques
For stubborn failures, use these deep-dive tools:
Log Analysis
- `journalctl -u backup.service --since "1 hour ago"`: Filter systemd service logs.
- `grep -i error /var/log/backup.log`: Search backup-specific logs for errors.
Trace System Calls with strace
- Trace file access issues: `strace -f -e trace=file rsync /source /target` (shows which files cause `EACCES` errors).
Check Open Files with lsof
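- Identify processes holding files open: `lsof +D /path/to/backup/repo` lists every process with an open file under the repository. Stop a stuck process with `kill <PID>`, resorting to `kill -9 <PID>` only if it ignores the default signal.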
Debug Flags
- Most tools offer verbose/debug modes:
  - `rsync -vvv`: Ultra-verbose output.
  - `tar --warning=all -cvvzf backup.tar.gz /source`: Extra-verbose listing with all GNU tar warnings enabled.
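strace output gets long quickly, so it is usually easier to write it to a file with `-o` and filter it afterwards. The sketch below assumes strace's default line format, in which the path is the first double-quoted string on each line; `denied_paths` is a hypothetical helper.

```sh
# denied_paths STRACE_LOG
# Prints the unique paths from system calls that failed with EACCES in a
# log written by `strace -o`. Takes the first double-quoted string on each
# matching line as the path.
denied_paths() {
    grep 'EACCES' "$1" | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' | sort -u
}
```

A trace captured with `strace -f -e trace=file -o backup.strace rsync /source /target` can then be summarized with `denied_paths backup.strace`.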
10. Prevention: Avoiding Failures Before They Happen
An ounce of prevention is worth a pound of cure:
- Test Backups Regularly: Restore a random file monthly to verify integrity.
- Monitor Backups: Use tools like Nagios, Prometheus, or simple scripts to alert on failure (e.g., `if ! /backup.sh; then echo "Backup failed" | mail -s "ALERT" admin@example.com; fi`).
- Version Control Backup Scripts: Track changes to `backup.sh` with Git to revert bad edits.
- Redundancy: Use multiple backup targets (local + cloud).
- Document Everything: Log tool versions, command flags, and recovery steps.
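Alerting on a failed exit status misses the case where the backup never ran at all. A freshness check, sketched below, covers that gap; `backup_is_fresh` is a hypothetical helper and assumes each run writes at least one new file into the backup directory.

```sh
# backup_is_fresh DIR MAX_AGE_MINUTES
# Succeeds if DIR contains at least one regular file modified within the
# last MAX_AGE_MINUTES. Relies on find's -mmin test (GNU and BSD find).
backup_is_fresh() {
    [ -n "$(find "$1" -type f -mmin "-$2" -print 2>/dev/null | head -n 1)" ]
}
```

For a daily backup, `backup_is_fresh /path/to/target 1560 || echo "no backup in the last 26 hours"` leaves two hours of slack for slow runs.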
11. Conclusion
Troubleshooting Linux backup failures requires a mix of systematic diagnosis, tool-specific knowledge, and attention to detail. By following the methodology outlined here—checking permissions, storage, network, and scheduling—you can resolve most issues quickly. For silent failures, leverage checksums and tool-specific integrity checks. Remember: the best backup is one that’s tested regularly. With proactive monitoring and prevention, you’ll ensure your data remains safe when disaster strikes.
12. References
- `rsync` Man Page: https://man7.org/linux/man-pages/man1/rsync.1.html
- BorgBackup Documentation: https://borgbackup.readthedocs.io/
- Restic Documentation: https://restic.readthedocs.io/
- Linux Permissions Guide: https://www.linux.com/learn/understanding-linux-file-permissions
- SELinux Wiki: https://wiki.gentoo.org/wiki/SELinux
- “Backup & Recovery” by W. Curtis Preston (O’Reilly Media).