I was working on a project recently where I needed to coordinate the execution of bash scripts across multiple machines. It took a little research and some trial and error, so I thought I would share my experiences and findings with you so you can implement this in your own projects.

Before we get into it, let’s talk about a recent ticket I’ve been working on as a case study. The project called for 3 virtual machines as part of an auto-scaling group (we’re working on AWS here, but the same logic would apply to any similar deployment). The machines are not formally clustered but need to operate in a coordinated manner; in this case they act as a yum repository mirror. What’s important here is that we have a centralised data store, in this case EFS, which all the machines can access, and they needed to sync a number of repositories both on startup and on a regular schedule. We wanted to avoid race conditions and multiple nodes duplicating work by trying to sync the same repos at the same time. So we needed a way to tell all the other nodes that one node was currently syncing a set of repos, so they could step over to another sync action or quit if there was nothing else to do. We combined the techniques discussed here with a SystemD Unit to manage the execution of the script.

Enter lock files…

What is a lock file?

A lock file is, in itself, nothing special at all. It’s just a file that exists on the filesystem. The magic comes from the fact that it’s used to signal to other processes that a particular process is currently running. The TL;DR, if you’re in a hurry: check whether the file exists. If it does, that signals another process is running and you should stop trying to process the same thing. If it doesn’t, create it and carry on with the rest of the script logic.

Implementing lock files

OK, that seems easy enough, so how do we implement it in bash? It’s actually quite simple. Here’s an example for you:

#!/bin/bash

LOCKFILE=/tmp/mylockfile # Set the lock file location to whatever path you need

# If the lock file already exists, another process holds the lock, so bail out
if [ -f "$LOCKFILE" ]; then
    echo "Lock file exists, exiting"
    exit 1
fi

# Take the lock by creating the file
touch "$LOCKFILE"

# Do some work here

# Release the lock when the work is done
rm "$LOCKFILE"

It really is as simple as that!

In my case we placed the lock file on the same EFS volume that the repo data was being synced to. This way all the nodes could see the same lock file and know whether another node was currently syncing the repos.
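As a rough illustration, assuming the EFS volume is mounted at /mnt/efs and using a placeholder repo set name (neither is the real path from the project), the only real change from the script above is where the lock file lives:

#!/bin/bash

# Hypothetical paths for illustration: /mnt/efs is an assumed EFS mount point
# and "example-repo" is a placeholder repo set name
EFS_MOUNT=/mnt/efs
REPO_SET=example-repo

# A lock file on the shared volume is visible to every node in the ASG,
# so a lock taken by one node can be seen by all the others
LOCKFILE="${EFS_MOUNT}/locks/${REPO_SET}.lock"
mkdir -p "${EFS_MOUNT}/locks"

if [ -f "$LOCKFILE" ]; then
    echo "Another node is already syncing ${REPO_SET}, exiting"
    exit 1
fi

touch "$LOCKFILE"

# Sync work for this repo set would go here

rm "$LOCKFILE"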


What if the script fails?

This is a good question. If the script fails for any reason, the lock file will remain in place, and the next time the script runs it will see the lock file and exit. This is not ideal, and in fact we hit this exact issue during my project. There are a few things you can do depending on your desired outcome. In my case we were troubleshooting some bad repo definitions and decided that we wanted to know when the lock file became a blocker for other processes, so we did a couple of things:

  1. We wrote into the lock file which node had established the lock and the time it was established. This way we could see which node was causing the issue.
  2. For any subsequent node that found the lock file, we appended a message saying which node had found it and when. This way we could see which nodes were trying to run the script and when they were trying to run it.

This might look something like this:

#!/bin/bash

LOCKFILE=/tmp/mylockfile # Set the lock file location to whatever path you need

if [ -f "$LOCKFILE" ]; then
    echo "Lock file exists, exiting"
    # Record who ran into the lock and when, for troubleshooting
    echo "Lock file found by $(hostname) at $(date)" >> "$LOCKFILE"
    exit 1
fi

touch "$LOCKFILE"
# Record who took the lock and when
echo "Lock file established by $(hostname) at $(date)" >> "$LOCKFILE"

# Do some work here

rm "$LOCKFILE"

This way we could see which node was causing the issue and which nodes were trying to run the script. This helped us to identify the issue and resolve it.

Another strategy, and perhaps one that makes more sense in a production environment, is to use the trap builtin to catch the exit signal and remove the lock file. This way, if the script fails for any reason, the lock file will be removed and the next run of the script will be able to proceed.

#!/bin/bash

LOCKFILE=/tmp/mylockfile # Set the lock file location to whatever path you need

if [ -f "$LOCKFILE" ]; then
    echo "Lock file exists, exiting"
    exit 1
fi

touch "$LOCKFILE"

# Register the cleanup only after we have taken the lock, so that exiting
# early (because another process holds the lock) doesn't delete its lock file
trap 'rm -f "$LOCKFILE"' EXIT

# Do some work here

In this case we don’t need to explicitly remove the lock file, as the trap will catch the exit signal and remove it for us. Note that the trap is registered only after we create the lock file; if it were set before the check, exiting because another process holds the lock would delete that process’s lock file.

If you’re using trap then I would strongly recommend logging your script output somewhere else, so that you have a way to find and trace an issue if it arises.
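A minimal way to do that, assuming you’re happy appending to a plain log file (the path below is made up for the example), is to redirect the script’s output near the top:

#!/bin/bash

# Hypothetical log location, for illustration only
LOGFILE=/var/log/reposync.log

# Append stdout and stderr to the log so there is still a trace of what
# happened even after the EXIT trap has removed the lock file
exec >> "$LOGFILE" 2>&1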


Conclusion

Lock files are a simple but effective way to manage the execution of scripts across multiple nodes. They can also be used to manage processes on a single machine when you need to ensure that only one instance runs at a time. The best bit? They can be implemented in a few lines of code, as we’ve seen here.

In our case we had a typical ASG deployment of 3 nodes and 3 collections of repos to be synced. By using lock files we ensured that only one node was syncing a given set of repos at a time and that the other nodes moved on to process other repo sets. We avoided race conditions and duplicated work, and reduced the overall sync time by nearly two thirds compared to the previous design, because the lock files let us split the work across the nodes.
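As a rough sketch of how that work-splitting looks (the repo set names and lock directory below are invented for illustration, not the real ones from the project), each node simply walks the list of repo sets and skips any set another node has already locked:

#!/bin/bash

# Invented names for illustration: three placeholder repo sets and a lock
# directory on the shared EFS volume
REPO_SETS="base updates extras"
LOCK_DIR=/mnt/efs/locks

mkdir -p "$LOCK_DIR"

for REPO_SET in $REPO_SETS; do
    LOCKFILE="${LOCK_DIR}/${REPO_SET}.lock"

    # Another node is already on this set, so move on to the next one
    if [ -f "$LOCKFILE" ]; then
        echo "${REPO_SET} is locked by another node, skipping"
        continue
    fi

    touch "$LOCKFILE"
    echo "Lock established by $(hostname) at $(date)" >> "$LOCKFILE"

    # The sync work for this repo set would go here

    rm "$LOCKFILE"
done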

If you haven’t already, check out my post SystemD Unit for a great partner tool for managing and running your scripts and tools on your Linux machines.


If this article helped or inspired you, please consider sharing it with your friends and colleagues, or let me know via LinkedIn or X / Twitter. If you have any ideas for further content you might like to see, please let me know too.
