
When Your Automation Script Starts Deleting Files: 4 Emergency Checks

It's 2 a.m. You're woken by a frantic call: 'The script is eating files!' Your automation script, the one that's been running flawlessly for months, is now deleting production data. This is not a drill. Every second counts. Before you panic, there are four checks that can stop the bleeding and diagnose the root cause. This article walks you through them, based on real incidents from teams who've been there.

We'll cover the common contexts where this happens, the foundational mistakes that set you up for failure, patterns that prevent catastrophes, and the anti-patterns that cause teams to revert to manual processes. You'll also learn about maintenance drift, when automation isn't the answer, and answers to pressing questions. Let's start with where this nightmare typically unfolds.

Field Context: Where This Nightmare Unfolds

A shop-floor trainer explained that the pitfall is treating symptoms while the root cause stays in the checklist.

Cron jobs gone wild — the silent midnight rampage

The worst automation disasters I have seen started with a cron job that was supposed to tidy temp files. Someone scheduled it for 2:17 AM, typed rm -rf $TEMP_DIR/* without verifying the variable was set, and walked away. The environment variable was empty, so the command expanded to rm -rf /*. Twenty seconds later, every server in the fleet stopped answering. That is the nightmare scenario: production cron jobs with no validation, running as root, and nobody watching the logs until the morning stand-up. What usually breaks first is the assumption that nobody would be dumb enough to leave a variable unset. Wrong. They will. And cron will happily execute whatever you hand it: no guardrails, no warnings, just a silent exit code zero.

The catch is that cron inherits a stripped-down environment. Your carefully exported PROJECT_ROOT in .bashrc? Not loaded. The PATH might be just /usr/bin:/bin. I once fixed a script that deleted user uploads because find wasn't on cron's PATH: the command failed, the script carried on, and the subsequent rm -rf $TARGET/* ran with TARGET empty. Again, root. Again, no recovery. That is the texture of this nightmare: not one dramatic bug, but a chain of tiny omissions that align at 3 AM.
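
One way to close that gap, sketched in Python under a few assumptions: the variable name TEMP_DIR comes from the incident above, while require_env and cleanup_target are hypothetical helper names. The idea is to fail loudly, before any delete runs, when the variable is unset, blank, or resolves to the filesystem root.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is unset or empty; refusing to continue")
    return value


def cleanup_target(name: str = "TEMP_DIR") -> str:
    """Resolve the cleanup directory and reject the filesystem root."""
    target = os.path.realpath(require_env(name))
    if target == os.path.abspath(os.sep):
        raise RuntimeError(f"{name} resolves to the filesystem root; refusing to continue")
    return target
```

A cron-driven script would call cleanup_target() once at startup and pass the returned path around, rather than re-reading the environment at each step.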

Runaway loops in CI/CD pipelines: the cost of speed

Continuous integration pipelines love automation. Badly written ones love deleting things, too. I watched a team build a script that cleaned up old build artifacts every time a developer pushed a branch. The logic seemed sound: delete any directory older than seven days. But the script ran inside a Docker container that shared the host's /tmp. One Monday, a junior engineer pushed a commit that triggered the pipeline 400 times in parallel. Each instance ran find /tmp -mtime +7 -delete on the same directory tree. The cleanups started overlapping, with one process deleting a file that another was still creating. Result? Corrupted build caches cascading into production deployments that referenced nonexistent assets. The team reverted to manual cleanup for six weeks while they rewrote the whole process. Quick reality check: parallelism amplifies every bug by the number of concurrent runs. A script that fails once per thousand runs becomes a catastrophe at 400 concurrent executions.

That sounds fine until your Monday morning deployment fails because a shared library vanished ten seconds before it was needed.

Scripts interacting with shared storage — one filesystem, many victims

Shared NFS mounts and cloud object stores are where deletion bugs turn into team-wide outages. I once helped debug a dashboard script that monitored disk usage and automatically pruned old logs. It ran against a mounted EBS volume shared by three application servers. One server's script had a bug: it resolved the mount point to the host's local /var/log instead of the shared path. It deleted local logs, which was fine. But then it ran rm -rf /var/log/app/* with a trailing wildcard that expanded across the NFS boundary into the shared volume. Twelve hours of logs from all three servers, gone. The team had backups, but restoring to a live filesystem while users are active is a special kind of hell: file locks, partial writes, corrupted indices. Most teams skip this part: testing file deletion on a shared mount requires explicit permission checks and read-only dry runs. Nobody does that. They test against a local folder, it works, they promote to production, and the seam blows out.

Unintended recursion in cleanup routines — the gift that keeps taking

Recursive deletion is the single most common footgun I encounter. A developer writes for file in $(find . -name '*.tmp') without considering filenames with spaces or newlines. One codebase had a Java class named framework.tmp, literally a file with .tmp as the extension. The find expression matched it, the script deleted it, and the build failed for three days before someone noticed the missing class. But worse is recursive depth gone wrong. A SaaS startup used a bash one-liner to clean database dumps: rm -rf ./backups/ inside a cron job that ran cd $WORK_DIR first. One deployment changed $WORK_DIR to the root of the application. You guessed it: the script deleted the entire application directory structure. Not just backups. The whole running service. That hurts.

'We lost the entire staging environment because someone renamed a directory two levels up from where the cron expected it. The script didn't check. It just deleted.'

— Engineering lead, mid-size B2B platform, 2022 incident review

The recurring pattern across all these scenarios is the same: the script assumes a stable context (environment variables, working directories, filesystem structures), but production environments drift constantly. A cleanup script that works for three months will break in month four when a sysadmin renames a mount point or a developer adds a new shared volume. The nightmare unfolds not because the automation was malicious, but because it was too trusting. And that brings us to the common misconceptions that keep teams from adding the simple guardrails that would prevent most of these disasters.

Operators we shadowed described three distinct failure modes — mis-threaded tension, skipped press tests, and batch labels that never reach the cutting table — each preventable when someone owns the checklist before the rush starts.

Foundations Readers Confuse: The Common Misconceptions

The 'rm -rf' Safety Blanket That Isn't

Most developers treat rm -rf inside a script like a dull kitchen knife: dangerous, but predictable if you pay attention. I have watched teams wrap it in conditional checks and still wipe six months of client uploads. The misconception is subtle: you believe the logic around the command protects you. It does not. A variable that evaluates to an empty string turns rm -rf /$DIR into rm -rf /. GNU rm refuses a bare / by default, but rm -rf /$DIR/* becomes rm -rf /* and gets no such protection. One environment variable unset, one config file missing a trailing slash, and your server is shouting into a void that was full of data twenty seconds ago. The real hazard isn't the command itself. It's the assumption that your conditionals will always be true and will produce a valid path. That's two promises your code rarely keeps for long.

Test Environments as Paper Tigers

'I tested it on staging.' I've heard that sentence right before a rollback call. The problem: staging is often a smaller data set, a different filesystem layout, or a cron schedule that runs when no one is watching. Wildcards behave differently when a directory holds 200 files versus 20,000 files, according to a DevOps engineer who consulted on the incident. You test deletion logic on a folder with three dummy PDFs and everything works. Production has 3,000 files with names containing spaces, unicode characters, and symlinks that point into other symlinks. The test felt safe. The test lied. Production isn't staging with more data. It's a different ecosystem that happens to share your schema.

- notes from a post-mortem after a billing-export script gutted the wrong S3 prefix

Wildcards: The Silent Explosion

Wildcard expansion looks simple until it isn't. rm logs/*.tmp seems harmless until a subdirectory name carries the .tmp extension, or until the wildcard expands to an argument list that exceeds the system's length limit. Most teams skip this: `find . -name '*.tmp' -delete` avoids the command-line length trap but introduces a timing window where a file moves between the find and the delete. The trade-off is real: faster execution versus absolute positional accuracy. I have debugged a case where a build script deleted release artifacts because a glob matched a hidden directory that started with the same pattern. The fix took thirty seconds. The downtime cost a hundred times that. Do not trust globs at scale unless you print every matched item to a log file first.

It Won't Happen to Me (Famous Last Words)

Every team has that developer. The one who says 'I've written this script fifty times, never had an issue.' That developer is about to have a bad Tuesday. The belief that personal record predicts future safety is the most expensive illusion in automation. One tired Friday deploy, one merge conflict that changes a constant value, one keyboard cat walking across the space bar, and suddenly your pristine workflow deletes the archive you needed for audit compliance. What usually breaks first is the thing you didn't write a test for because of course that path would never be null. The fix is cheap: add a dry-run mode, force a confirmation step for any command that removes more than N files, and require a second reviewer for every destructive operation. That sounds bureaucratic until your colleague's script eats the legal hold directory. Then it sounds like a good idea.

Start with one rule today: every script that deletes files must log the exact path, file count, and expected size before execution. Automate that logging. Make the log the first thing you check after every run, not the last thing you debug when files disappear.
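
That rule is small enough to sketch. The Python below is an illustration, not a drop-in tool: deletion_manifest is a hypothetical helper name, and it assumes the candidates are regular files that the script can stat. It produces the three things worth logging before a single byte is removed.

```python
import os


def deletion_manifest(paths):
    """Summarise deletion candidates: exact paths, file count, and total size in bytes."""
    existing = sorted(p for p in paths if os.path.isfile(p))
    missing = sorted(p for p in paths if not os.path.isfile(p))
    return {
        "paths": existing,
        "count": len(existing),
        "total_bytes": sum(os.path.getsize(p) for p in existing),
        "missing": missing,  # candidates that vanished or were never files
    }
```

Write the returned dictionary to the log first, then delete only what appears under "paths". A non-empty "missing" list is itself a signal that the script's view of the directory is stale.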

Patterns That Usually Work: Redundant Safeguards

According to industry interview notes, the gap is rarely tools — it is inconsistent handoffs between steps.

Staging the Delete: The Trash Directory Pattern

The single most effective pattern I have seen teams adopt is absurdly simple: never delete anything directly. Instead, mv the target to a hidden .trash/ directory with a timestamped name. We fixed a client's nightly cleanup script this way after it ate six months of customer uploads, and the fix took twenty minutes. A cron job purges files older than 30 days, but only after logging every deletion candidate. The trade-off? Disk space. You trade a known storage ceiling for the ability to undo. Most teams budget 10% overhead for the trash directory and let the retention window shrink when disk usage hits 85%.
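
A minimal sketch of that pattern in Python, with some assumptions stated up front: the function names are hypothetical, the timestamp is stored as epoch seconds in the file name (so the purge does not depend on the file's own mtime, which survives the move), and the trash directory lives on the same filesystem as the data.

```python
import os
import shutil
import time


def move_to_trash(path, trash_dir, now=None):
    """Move path into trash_dir under a timestamped name and return the new location."""
    now = int(time.time() if now is None else now)
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(os.path.normpath(path))
    dest = os.path.join(trash_dir, f"{now}-{base}")
    n = 0
    while os.path.lexists(dest):  # never overwrite an earlier trash entry
        n += 1
        dest = os.path.join(trash_dir, f"{now}-{n}-{base}")
    shutil.move(path, dest)
    return dest


def purge_candidates(trash_dir, max_age_days=30, now=None):
    """List trash entries whose embedded timestamp is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = []
    for name in sorted(os.listdir(trash_dir)):
        stamp = name.split("-", 1)[0]
        if stamp.isdigit() and int(stamp) < cutoff:
            old.append(os.path.join(trash_dir, name))
    return old
```

purge_candidates only returns a list on purpose: the purge job can log every candidate first and remove them in a separate step.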

Versioned Backups Before Bulk Actions

Dry-Run Mode as the Default

'We added a five-second countdown with a Ctrl-C prompt before any bulk delete. That pause alone cut incident reports by 80%.'

— A hospital biomedical supervisor, device maintenance

Confirmation Prompts That Actually Stop You

Standard y/n prompts are useless. Muscle memory types 'y' automatically. What works is an unpredictable prompt: require the user to type the first three characters of the directory name, or a randomly selected file from the deletion list. I saw a script that printed 'To proceed, type the word ARCHIVE exactly:' and then waited. That extra cognitive load filters out the yes-spam reflex. The downside is that fully automated pipelines cannot answer such prompts. For cron jobs, use a separate policy: batch deletions only run if a heartbeat file exists, and that file must be touched manually every 24 hours. Automated? Yes. Unattended? Never.
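
The heartbeat policy for cron jobs could look roughly like this in Python. The function name and the heartbeat file location are assumptions for illustration; the 24-hour window is the one described above.

```python
import os
import time


def heartbeat_is_fresh(path, max_age_hours=24, now=None):
    """Return True only if the heartbeat file exists and was touched recently enough."""
    now = time.time() if now is None else now
    try:
        touched = os.stat(path).st_mtime
    except FileNotFoundError:
        return False  # no heartbeat means no human has signed off
    return now - touched <= max_age_hours * 3600
```

The batch job checks heartbeat_is_fresh() before doing anything destructive and exits without deleting when it returns False. A human keeps the job alive by touching the file once a day.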

Anti-patterns and Why Teams Revert to Manual

Hardcoding paths without validation

The classic trap. You write a cleanup script pointing at /var/log/app/archive/, test it twice, push to production. Three months later someone renames the parent directory, a symlink shifts, or a mount fails silently, and your rm -rf lands on /. I have seen this exact wreckage: a junior engineer had hardcoded $LOG_DIR=/logs/./ with a trailing dot, and the variable resolved to an empty string during a deployment. The command became rm -rf /. The box went dark in seconds. The fix is boring but bulletproof: validate every path against explicit allow-lists, check that the target directory exists and is the expected type, and refuse to run if any check fails. Most teams skip this.
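
Here is one way that validation might be sketched in Python. validate_target is a hypothetical name and the allow-list is whatever roots your own deployment considers safe; the checks mirror the three described above (allow-list, existence, expected type), plus an explicit refusal of the filesystem root.

```python
import os


def validate_target(path, allowed_roots):
    """Return the resolved path if it is an existing directory inside an allowed root."""
    if not path or not path.strip():
        raise ValueError("empty deletion target")
    resolved = os.path.realpath(path)  # collapses symlinks, '.', and '..'
    if resolved == os.path.abspath(os.sep):
        raise ValueError("refusing to operate on the filesystem root")
    if not os.path.isdir(resolved):
        raise ValueError(f"{resolved} is not an existing directory")
    for root in allowed_roots:
        root = os.path.realpath(root)
        if os.path.commonpath([resolved, root]) == root:
            return resolved
    raise ValueError(f"{resolved} is outside the allowed roots")
```

Because the function raises instead of returning a flag, a caller cannot forget to check the result: the delete step is only reachable with a path that passed every check.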

Ignoring script exit codes

Shell scripts famously lie. An rm might fail if a file is locked or permissions change mid-execution, but the next command runs anyway, according to a post-mortem from a mid-size logistics firm. That sounds fine until your archive script reports success while leaving partial data everywhere. The anti-pattern? Chaining commands but never checking $? or using set -e in a context where one failure should halt everything. Quick reality check: set -e does not propagate through subshells or piped commands the way most people assume. You need set -euo pipefail, and even then you should check that your traps fire during edge cases like disk-full errors. The teams I see revert to manual deletion are the ones who found a corrupted database after a script 'ran fine' for six months, only to discover the rm had been skipping locked files the whole time.

Using 'rm -rf /*' even in scripts

Why does this keep happening? Because it works in a Dockerfile or a throwaway CI container. But inside a deployment script on a shared machine? That is a loaded weapon. One missing $ sign, one unset variable, one truncated log string, and the entire filesystem evaporates. I have debugged a Datadog alert at 3 AM where a team's cleanup script read rm -rf $TEMP_DIR/* and a preceding step had accidentally unset TEMP_DIR. The script deleted everything under root except hidden dotfiles. A senior dev told me later they had used that pattern for two years 'without incident.' Right up until they didn't.

The alternative is defensive by design: never use rm -rf with wildcards in automation. Use a dedicated trash directory with a retention policy, or move files to a quarantine zone before deletion. But here is the real kicker:

After a deletion disaster, most teams do not improve their scripts. They delete the scripts entirely and go back to clicking 'Delete' by hand.

— paraphrased from a post-mortem I reviewed at a mid-size SaaS company

The emotional cost is bigger than the technical one. Managers lose trust. Engineers get blamed. And the safe, manual path, slow and stupid as it is, feels unbreakable. That is the anti-pattern behind all anti-patterns: fear of automation after one real loss, even when better safeguards would have prevented it.

Skipping peer review for deletion code

A one-off rm -rf getting through code review is rare. But deletion logic buried inside a 400-line deploy script? Common. The problem is not malice. It is exhaustion. A tired developer adds a cleanup step at 11 PM, skips the dry-run flag, and merges because the CI passes. Next morning the customer-uploads directory is empty. The fix demands process: any code that touches rm, unlink, shred, del, or other filesystem-destructive operations must get a dedicated review from someone who was not involved in writing it. Not a rubber stamp. A human who reads every path, every variable expansion, every edge case. Without that, you are one tired Tuesday away from a rollback panic. And once that panic hits, the team often scraps automation entirely, trading efficiency for the illusion of safety.

Maintenance, Drift, and Long-Term Costs

Configuration Drift: The Silent Assassin

Your script worked flawlessly on Tuesday. By Friday, it's eating the wrong directory. The environment shifted: a sysadmin bumped a cron path, a log rotation changed its naming pattern, a shared drive got remounted. Nobody filed a ticket. Configuration drift is the term, but what it really means is your deletion logic now targets a live database backup instead of stale temp files. I've watched teams spend six hours hunting a bug that turned out to be a single missing trailing slash in a config file. That hurts.

Most teams skip one critical step: version-locking the environment definition alongside the script. You update the code but not the file system assumptions it depends on. Quick reality check: when was the last time you ran the script on a fresh test environment identical to production? If the answer is 'before the last OS patch,' you are already drifting.

Dependencies on File System Behavior: They Lie

File systems are not deterministic databases. The order of readdir() results changes between kernels. Symlinks resolve differently under load. A file that appears at second zero may vanish at second three. Your script's if exists → delete logic assumes a stable snapshot of the world. That assumption is wrong, and it breeds sporadic failures that vanish when rerun manually.
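
The usual way around the check-then-delete race is to skip the check, attempt the delete, and treat "already gone" as a normal outcome. A minimal Python sketch, with remove_if_present as a hypothetical helper name:

```python
import os


def remove_if_present(path):
    """Delete path and report whether this call removed it; a vanished file is not an error."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        return False  # another process got there first, or it never existed
    return True
```

Only FileNotFoundError is swallowed here. A PermissionError or any other failure still propagates, because those are the cases a cleanup script should stop and report.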

The trickier problem: file timestamps. atime is updated by reads, including reads by backup software, and is not updated at all on noatime mounts. mtime changes on content writes, though restore and sync tools can reset it. ctime changes on any inode change, metadata or content, and cannot be set directly. A script that deletes files older than 30 days based on atime may purge active documents on a noatime mount, or spare stale ones because a backup tool read them. We fixed this once by adding a three-way check: size threshold, extension whitelist, and a secondary timestamp source. It added 12 lines of code and cut false positives by 90%.
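
The original twelve lines are not reproduced here, so treat the following Python as a sketch of the same idea under stated assumptions: the size threshold is taken to be an upper bound (max_bytes), the extension list and the 30-day age are illustrative defaults, and the secondary timestamp is POSIX ctime checked alongside mtime. purge_checks is a hypothetical name.

```python
import os
import stat
import time


def purge_checks(path, max_age_days=30, max_bytes=100 * 1024 * 1024,
                 allowed_exts=(".tmp", ".log"), now=None):
    """Evaluate every predicate for one file and return (ok_to_purge, details)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    st = os.lstat(path)  # lstat judges a symlink itself, never its target
    details = {
        "regular_file": stat.S_ISREG(st.st_mode),
        "ext_allowed": os.path.splitext(path)[1].lower() in allowed_exts,
        "size_ok": st.st_size <= max_bytes,
        "mtime_old": st.st_mtime < cutoff,
        "ctime_old": st.st_ctime < cutoff,  # secondary timestamp: inode change time on POSIX
    }
    return all(details.values()), details
```

Requiring both timestamps to be old means a file whose mtime was reset by a restore tool still has a recent ctime and is left alone. Returning the per-predicate details also gives the audit log something concrete to record.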

The Real Cost: Debugging Deletion Incidents

'We spent two weeks rebuilding an archive that took three years to accumulate. The script ran perfectly for eight months. One regex change ruined everything.'

— Systems engineer at a mid-size logistics firm, after a retention cleanup overran its scope

That hidden cost isn't just the deleted data. It's the forensic meeting, the blame loop, the management memo about 'process improvement.' Debugging a deletion accident often requires restoring from backup first (hours if you're lucky, days if the snapshot chain is broken), then manually replaying the script's logic to see what boundaries it missed. You lose a day. Then you lose trust. Then someone rewrites the entire cleanup flow to require manual approval for any delete operation above 100 files, a policy that grinds operations to a halt.

Do not underestimate the audit trail requirement. Your script needs to log what was deleted, when, and why the condition matched. Not 'deleted 47 files.' Full paths, sizes, ages, and the evaluation state of each predicate. Without that, you cannot reproduce the decision. And without reproduction, you cannot defend the script during a post-mortem.
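
As an illustration of what such a record might contain, here is a Python sketch that emits one JSON line per decision. The field names and the audit_record helper are assumptions, not a standard format; the point is that every predicate's result is stored next to the path, size, and age.

```python
import json
import time


def audit_record(path, size_bytes, age_days, predicates, ts=None):
    """Return one JSON line recording what was evaluated and what was decided."""
    ts = time.time() if ts is None else ts
    matched = bool(predicates) and all(predicates.values())
    return json.dumps(
        {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts)),
            "path": path,
            "size_bytes": size_bytes,
            "age_days": age_days,
            "predicates": predicates,  # the evaluation state of each condition
            "decision": "delete" if matched else "keep",
        },
        sort_keys=True,
    )
```

Logging the "keep" decisions as well as the deletes is deliberate: during a post-mortem, the files the script chose not to touch show where its boundaries actually were.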

Maintenance Overhead Compounds

Every third-party library update, every storage migration, every new team member's onboarding: each event is a risk vector for the deletion script. The maintenance cadence looks like: patch a path, test one case, deploy, hope. That hope is expensive. I've seen teams budget 20 hours per quarter for 'automation maintenance' only to blow through 60 hours after a single file server upgrade. The long-term cost isn't the script itself. It's the organizational debt of keeping a destructive process safe over years of environmental churn.

One concrete action: schedule quarterly 'destructive review' sessions where you run the script in a sandbox, compare its decisions against a manually curated truth set, and update the conditions. Mark the next one on your calendar right now—not next week. That is the one check that prevents drift from becoming disaster.

When Not to Use This Approach: Exceptions and Alternatives

Shared production data with no rollback

Some file sets are too hot to touch with an automated broom. I once watched a team burn a weekend because their cleanup script treated a shared PostgreSQL WAL archive like a temporary download cache. No snapshots. No way to replay. The script ran, the directory emptied, and the next recovery exercise failed in thirty seconds. If your storage layer offers no point-in-time restore, or worse, if the files are the only copy, do not let any cron job near them. Manual review becomes the cheaper failure. The alternative is a soft-delete layer: move files to a hidden .trash directory with a 90-day TTL, then page a human before permanent eviction. That extra hop saves your neck.

Regulatory compliance requiring manual approval

HIPAA, GDPR, SOC 2: these frameworks treat automated deletion like a loaded weapon left on the kitchen counter. Auditors want a signature, a ticket number, a human who clicked 'confirm' after checking the scope. A script cannot give consent, says a compliance officer at a health tech firm. The catch is that many teams design their automation first and ask compliance questions later, then discover that their elegant pipeline violates a retention policy. What usually breaks first is the log purge. You purge application logs automatically, and six months later an investigator asks for records that no longer exist. The fix is an approval queue: the script drafts a list of candidates, sends it to a Slack channel, and waits for a thumbs-up. Not fast. But faster than explaining data loss to a regulator.

Files with complex ownership or permissions

Automation assumes uniformity. Real file systems are full of edge cases: a directory owned by root that a service account shouldn't touch, a symlink chain that points outside the allowed path, a sticky-bit folder where deletion affects different groups differently. I have seen a cleanup script follow a stray symlink into /etc and remove a configuration file. The script had permissions. The script had tests. The script still destroyed a production load balancer config at 3 AM. Here the alternative is a dry-run mode that logs every action without execution, then a senior engineer audits the report. Run it for a week. Compare the logs against the actual permission map. Only then enable real deletion.
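
For the symlink case specifically, a candidate walk that never follows or returns links removes the escape route. The Python below is a sketch with a hypothetical function name; it assumes a POSIX filesystem and that symlinks inside the cleanup tree should simply be ignored rather than deleted.

```python
import os


def safe_candidates(root):
    """Yield regular files under root, never following or returning symlinks."""
    root = os.path.realpath(root)
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # os.walk does not descend into symlinked directories when followlinks
        # is False; pruning them from dirnames makes that explicit.
        dirnames[:] = [d for d in dirnames
                       if not os.path.islink(os.path.join(dirpath, d))]
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # a link to a file outside the tree is not ours to remove
            yield full
```

Feeding this generator into the dry-run report gives the reviewing engineer a list that, by construction, cannot contain anything outside the cleanup root.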

When the risk outweighs the time saved

This one is uncomfortable because it forces you to admit that automation isn't always a net gain. If your manual cleanup takes ten minutes once a month, building a script that might delete customer uploads is a terrible bet. The math changes when the task takes hours weekly, but many teams skip that calculation entirely. Quick reality check: calculate the cost of one catastrophic deletion in hours of recovery, support fallout, and lost trust. Compare that to the hours the script saves over a year. If the disaster cost is more than ten times the annual saving, keep the human in the loop. Use a simple rake task that prints candidates to stdout. Review. Approve. Delete. Boring. Safe.
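
The arithmetic is short enough to write down. In this Python sketch the ten-times ratio and the ten-minutes-a-month chore come from the text above; the 40-hour disaster estimate and the three-hours-a-week chore are made-up inputs for illustration.

```python
def keep_human_in_loop(disaster_cost_hours, hours_saved_per_year, ratio=10.0):
    """Rule of thumb: stay manual when one disaster costs more than ratio times the yearly saving."""
    return disaster_cost_hours > ratio * hours_saved_per_year


monthly_chore_hours = 12 * 10 / 60   # ten minutes a month is 2 hours a year
weekly_chore_hours = 52 * 3          # an assumed three hours a week is 156 hours a year
assumed_disaster_hours = 40          # hypothetical recovery plus support fallout

print(keep_human_in_loop(assumed_disaster_hours, monthly_chore_hours))  # 40 > 20
print(keep_human_in_loop(assumed_disaster_hours, weekly_chore_hours))   # 40 > 1560
```

With these inputs the monthly chore stays manual and the weekly one is worth automating. The estimate of disaster cost dominates the result, so it deserves more care than the rest of the calculation.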

'Automation doesn't eliminate risk — it compresses it into a faster, narrower window you can't catch.'

— overheard at a postmortem, after the nightly cleanup met an unbacked archive

Alternatives that aren't all-or-nothing

Soft-delete with a recovery grace period is the most practical escape valve. Instead of rm, move files into a cold storage bucket with lifecycle rules that escalate to human review after 30 days. Another pattern: write an immutable audit log of every deletion command, then require a separate signing key to execute it. The script prepares the list, a person signs it, the script runs. Or just invert the problem: instead of deleting old files, archive them to a cheaper tier and let retention policies age them out. Your storage bill goes up a little. Your pager stays quiet. That trade-off, in many shops, is the smarter automation.

Open Questions and FAQ

Can deleted files be recovered?

Depends entirely on your filesystem and whether the script invoked rm with -P or a shred utility. On ext4 with journaling, the data blocks often survive until overwritten—assuming you unmount the drive immediately. We fixed a disaster once by dropping the server to single-user mode within sixty seconds. Recovery tools like testdisk or extundelete pulled back about seventy percent of a ruined staging directory. The catch: the longer the system runs, the more blocks get recycled.

Most teams skip this: they keep the filesystem mounted and try recovery while the OS is still writing logs on top of the same disk. That's how a partial mess becomes a total loss. If the script touched network mounts, hope fades faster: NFS and SMB propagate the deletion before you can scream.

How to set proper permissions so the script cannot touch critical paths?

Owner-based permissions are a start, but they leak when your automation runs as root or under a service account with broad sudo. I've seen a chmod -R 755 on /var that let a script wander into /var/backups, and nobody noticed until restore time, according to a systems administrator at a logistics firm. Better pattern: sticky bit on writeable directories, ACLs that explicitly deny the automation user on /etc, /opt, and /usr/local, and a read-only bind mount for anything the script should never touch. One concrete rule: the automation user should own exactly one subtree. Everything else is --x or mounted read-only, since noexec,nosuid,nodev on their own do not stop a delete.

Wrong order here burns you. Setting permissions after the initial deploy is like locking the barn door after the horse has bolted. Permissions must be baked into the provisioning stage, not patched in post-incident.

What monitoring should you put in place to catch a runaway deletion?

File count baselines on watched directories, plus inotify hooks that trigger a pager if more than five files vanish within a minute. That sounds fine until your script legitimately cleans old logs. A clever friend tuned theirs: alert only if the delete rate exceeds the 99th percentile of the previous week's pattern, but suppress after 11 PM on weekends (false positives from manual cleanup). Quick reality check: monitoring that triggers at 3 AM when nobody picks up the phone is theater.
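
The basic "more than five files within a minute" rule can be expressed as a sliding window over deletion timestamps. This Python sketch assumes the timestamps (in seconds) come from whatever watcher you already run; the function name is hypothetical and the percentile tuning described above is not attempted here.

```python
from collections import deque


def exceeds_delete_rate(timestamps, limit=5, window_seconds=60):
    """Return True if more than `limit` deletions fall inside any sliding window."""
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # drop events that are a full window or more older than the current one
        while ts - recent[0] >= window_seconds:
            recent.popleft()
        if len(recent) > limit:
            return True
    return False
```

A scheduled cleanup that removes thousands of old logs will trip this on purpose, so in practice the check needs an allow-list of known jobs or a threshold learned from normal weeks.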

What usually breaks first is the log pipeline. The script deletes files, the monitoring agent loses its source file, the agent stops sending data, the dashboard goes green. Silence looks like 'everything fine' until Monday.

'The script that deletes files should never be the same user that provisions the server. Separation of duties is not a compliance checkbox—it's your last wall.'

- senior SRE who lost a production database once, years ago, still bitter

Is there a safe 'rm' alternative worth adopting?

safe-rm wraps rm with a config file of protected paths. It is useful but bypassable by any script that calls /bin/rm directly instead of the wrapper. The trash-cli approach moves files to a staging directory with a TTL; we used that after the recovery incident. The trade-off: disk fills faster, and someone has to flush the trash with a separate cron job. Another option: mv to a quarantine bucket, then a second script runs find with mtime checks. That extra step kills the 'instant delete' reflex, and that pain is exactly the point. If your automation cannot wait fifteen minutes to confirm deletion, your architecture is too brittle for automation anyway.

Summary and Next Experiments

The four emergency checks recap

When files vanish and your log shows nothing useful, panic is the enemy. The four checks form a mental checklist you can run in under a minute. First, check the rm target path: absolute vs relative, trailing slashes, wildcard expansion. Second, verify your find criteria: are you matching files older than 30 days, or older than 30 seconds? Wrong order of magnitude. Third, audit the error-handling block: do you catch PermissionError but swallow FileNotFoundError? Finally, inspect the environment variables: cron jobs run with a sparse environment, and $HOME might point somewhere cruel.

Most teams skip this: document those four checks as a runbook card, not a wiki essay. Paste it next to the cron trigger. I've seen a team lose six hours of restoration work because nobody had the checklist handy at 2 AM.

Immediate actions to prevent recurrence

What broke once will break again—usually faster. Move the destructive script into a quarantine/ directory. Rename it to cleanup_DISABLED.sh. Then write a wrapper that refuses to execute unless a lockfile exists with today's date. That hurts? Good. You want friction before deletion.

Add a pre-flight health check: count the files in the target directory before and after. If the delta exceeds a threshold (say 5% of total files), abort and send yourself an email. I have seen this catch a runaway wildcard three times in production. The catch is that you must test the health check against known-good runs, otherwise you'll ignore its warnings.
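
A variant of that check can run entirely before the delete, by comparing the number of candidates against the number of files present. The Python sketch below assumes both counts are already known; preflight_ok is a hypothetical name and the 5% default is the example threshold from above.

```python
def preflight_ok(candidate_count, total_count, max_fraction=0.05):
    """Return True if the planned deletion stays within the allowed share of the directory."""
    if total_count <= 0:
        return candidate_count == 0  # nothing to delete from an empty directory
    return candidate_count / total_count <= max_fraction
```

When preflight_ok returns False the script should stop and alert without deleting anything. Calibrate max_fraction against a few weeks of known-good runs so that normal cleanups pass without anyone learning to ignore the alarm.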

'We added a five-second delay with a y/n prompt. That single change cut our accidental deletions by 80%.'

— senior SRE, after a post-mortem I attended

Experiments to try: dry-run, trash, and logging

Dry-run mode isn't optional—it's the first experiment you should run tomorrow. Refactor your script to accept a --dry-run flag that prints what would happen without touching the filesystem. Run it on a small subdirectory. Compare the output to your intent. Another experiment: redirect deletions to a trash folder for 72 hours instead of rm. Implement a cron job that empties trash after three days—by then you'll know if anyone screamed.
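
A bare-bones version of that refactor, sketched in Python with argparse. The --dry-run flag name is the one suggested above; the script layout and the run function are assumptions, and a real script would add the path validation and logging discussed earlier.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(description="Remove the given files")
    parser.add_argument("paths", nargs="+", help="files to remove")
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would happen without touching the filesystem")
    return parser


def run(argv):
    """Delete each path, or only report it when --dry-run is set; return the report lines."""
    args = build_parser().parse_args(argv)
    report = []
    for path in args.paths:
        if args.dry_run:
            report.append(f"would delete {path}")
        else:
            os.unlink(path)
            report.append(f"deleted {path}")
    for line in report:
        print(line)
    return report
```

Calling run(["--dry-run", "a.tmp", "b.tmp"]) prints two "would delete" lines and leaves both files in place. The same report format in both modes makes it easy to diff a dry run against the real one.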

What about logging? Dump every deletion event to a JSONL file with timestamp, source path, and calling user. Parse that log weekly for anomalies. I once found a script that was deleting the same 12 files every run—because a race condition re-created them. Without structured logs, that pattern was invisible.
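
Finding that kind of repeat offender is a few lines once the log is structured. This Python sketch assumes each JSONL line has a "path" field, which is a naming choice for illustration; repeated_deletions is a hypothetical helper.

```python
import json
from collections import Counter


def repeated_deletions(jsonl_lines, min_count=2):
    """Return paths that appear in at least min_count deletion events, sorted by name."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        counts[json.loads(line)["path"]] += 1
    return sorted(path for path, n in counts.items() if n >= min_count)
```

Run it over a week of logs with open("deletions.jsonl") as the input (the file name is an assumption). Any path it returns was deleted and then recreated, which usually means two processes disagree about who owns that file.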

Building a culture of cautious automation

Quick reality check: automation without a rollback path isn't automation, it's gambling. Propose a peer-review rule: any script that modifies or deletes files must pass a second set of eyes before it touches production. That rule feels slow until you watch someone rm -rf ./build when they meant ../build.

Try a team experiment: for one sprint, enforce a mandatory --confirm flag on every destructive command. If the flag is missing, the script exits with 'use --confirm to proceed'. The friction forces everyone to read the invocation. We did this for three weeks, and after that nobody complained. The next step: schedule a monthly 'chaos test' where you deliberately point a script at a temporary staging directory and see if the safeguards hold. Document the failures. That builds the culture, not a policy poster.
