Jobs and pipes

I just watched one of Gary Bernhardt’s excellent screencasts, Tar, Fork and the Tar Pipe. It’s a succinct and powerful overview of key Unix concepts and I highly recommend watching it. If you’re new to Unix, it can be an eye-opening, refreshing look at computing. If you’re seasoned, Bernhardt’s fine execution is still a delight to watch.

While on that topic, I thought I could share some toy code that I’ve had lying around since last year. I’ll take the time to explain the process, hoping that in the end the result will itself seem refreshingly simple.

The problem

Let’s say you need to host a high-traffic website. You’ll need to be able to accommodate that traffic. For the purpose of this exercise, assume that the way to address this is to use a load balancer so that each incoming page request is randomly sent to one of two tiny HTTP servers.

You need to run the two servers concurrently, but for ease of administration they should “feel” as if they’re just a single server. So, you decide to monitor their activity by merging their real-time logs in one place. You want it to be clear which server is reporting each logged line. Finally, you want to be able to terminate both servers at once.

Servers?

We can fake these basic HTTP servers by writing a program that continuously outputs a log of incoming requests:

Listening on port 80…
GET /index.html HTTP/1.1 (200)
GET /foo HTTP/1.1 (404)
GET /private.html HTTP/1.1 (403)

For this, we can write a shell script. I’ll be using Bash, which is readily available on most systems and is easy to start with.

#!/bin/bash

main() {
  # Pick a random number between 0 and 2
  number=$RANDOM
  let "number %= 3"
  case $number in
    0) echo "GET /index.html HTTP/1.1 (200)";;
    1) echo "GET /foo HTTP/1.1 (404)";;
    2) echo "GET /private HTTP/1.1 (403)";;
  esac
  main
}

main

It works! The function main randomly outputs one of three possible lines, then naïvely calls itself ad infinitum. We may want to throttle it, though, and for that we’ll later use the sleep command. The program can be terminated by pressing Ctrl-C.
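
Purely as an illustration (this throttled version is my own tweak, not part of the listing above), a one-second pause before the recursive call is all it would take:

main() {
  # … pick and echo a random request line, as above …
  sleep 1   # throttle: roughly one fake request per second
  main
}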

The first line in the file is enough information for the shell and the text editor to know that it’s a Bash script, so we can save the program as listen and forget about the .sh file extension. Don’t forget to make it executable by calling chmod +x listen. We could just as easily not make it executable and instead always call bash listen, but there is a point I’d like to make at the end of the exercise.

Concurrency

Now that we have a pretend server, we need to run two server instances in one place. This place, for now, will be our shell session. Each instance should run in a way that doesn’t block the other. The answer is the control operator &, which tells the shell to run any given command in the background.

./listen &
./listen &

The above will spawn two background processes, known as jobs, which will run independently. With two of our frantic servers running simultaneously, the result will be an even busier and more confusing access log that is not very useful.

You’ll also notice that Ctrl-C no longer terminates the servers! This is because that signal is only delivered to the foreground process group (here, the shell itself) and not to jobs running in the background. Fortunately, the built-in jobs command will output a list of jobs attached to the shell. We can combine it with the kill command to get rid of any rogue servers.

kill `jobs -p`
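
To see what that does (the process IDs below are made up), jobs -p prints one PID per background job, and the backticks — or the equivalent, more modern $( ) form — hand them to kill:

$ jobs -p
4201
4202
$ kill $(jobs -p)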

Before we proceed, this is what our new run program looks like:

#!/bin/bash

echo "Spawning fake servers…"
./listen 8000 &
./listen 8001 &

8000 and 8001 are the port numbers we’d like our fake servers to listen on.
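
As written, listen simply ignores those arguments — they’re only there to make the pretence more convincing. If you’d like the script to acknowledge its port, as in the sample log at the top, a hypothetical extra line near the top of listen would do it:

echo "Listening on port ${1:-80}…"   # $1 is the port argument; default to 80 if it’s missing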

Managing noise

The next step is to make it clear which requests are coming into which servers. We could say that we want to namespace our access log. Let us then create a namespace program which takes each incoming line and outputs it colour-coded and prefixed with a given label.

This program amounts to 3 lines of Bash which I’ll share later, but their analysis is beyond the scope of this article. They touch on a number of concepts and interfaces—from how read parses input to ANSI control sequences—that I’ll happily talk about in another post if you ask nicely.
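
In the meantime, here’s a rough sketch of the idea — not the final three-line version, and with the details glossed over — assuming the colour variables hold ANSI escape sequences:

#!/bin/bash

# namespace <label> <colour>
# Prefix every line of standard input with a coloured label.
while read -r line; do
  echo -e "${2}[${1}]\033[0m ${line}"
done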

With the namespace tool we can revisit run:

#!/bin/bash

echo "Spawning fake servers…" | ./namespace runner $RED
./listen 8000 | ./namespace web_01 $GREEN &
./listen 8001 | ./namespace web_02 $BLUE &

The pipe operator | feeds the output of the left-hand command directly into the right-hand command. In other words, each “server” has its output rerouted through namespace so that the final output is neatly marked. Because we’ve kept the control operator & at the end of each line, these piped processes remain independent.

The result is suddenly much nicer.
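
Stripped of its colours, the merged log now reads something like this (the exact lines will vary, since the requests are random):

[runner] Spawning fake servers…
[web_01] GET /index.html HTTP/1.1 (200)
[web_02] GET /foo HTTP/1.1 (404)
[web_01] GET /private HTTP/1.1 (403)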

Something isn’t right, though. Pressing Ctrl-C doesn’t stop the output, and you may feel as though your shell session has been hijacked. To clean up, run the following from any shell session (-f tells pkill to match against each process’s full command line):

pkill -fI namespace

Job control

Waiting

If you try this out, you’ll notice the shell’s prompt ($) reappears almost immediately: run terminates as soon as it spawns the last server. Since the servers are run in the background, run doesn’t assume it should wait for them. But we want run to control both the beginning and end of our servers, and the first step for that is to tell run to wait. This is easier than you might assume:

./listen 8000 | ./namespace web_01 $GREEN &
./listen 8001 | ./namespace web_02 $BLUE &
wait

The execution of run will now pause until all of the jobs it has spawned have terminated. This can be observed if we add a fourth possible outcome to our listen implementation in which the server fails and exits (remembering to bump the modulo in let "number %= 3" up to 4 so that the new branch can actually be reached):

…
2) echo "GET /private HTTP/1.1 (403)";;
3) echo "FATAL ERROR. Terminating."; exit 1;;

It works, but it’s not enough: run will indeed keep running as long as its servers are running, but pressing Ctrl-C will not terminate the servers. It will merely kill the parent, run, and leave the servers orphaned and running on their own forever. This might be fine for Sophocles, but we’d like Ctrl-C to just put an end to the whole tragedy.

Setting up traps

run needs to be able to intervene before being terminated. Pressing Ctrl-C sends the script an interrupt signal, and Bash’s built-in trap command lets us register a handler that runs before the script exits:

# Kill spawned jobs before exiting
trap 'kill `jobs -p`' EXIT

./listen 8000 | ./namespace web_01 $GREEN &
./listen 8001 | ./namespace web_02 $BLUE &
wait

The above means that whenever run exits (including after being interrupted with Ctrl-C), it is still allowed to run kill on its jobs before itself terminating.
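
A quick way to convince yourself the cleanup works (my own check, not part of the original scripts): start ./run, press Ctrl-C, then confirm nothing was left behind — assuming nothing else on your machine has listen in its command line:

pgrep -f listen || echo "No stray servers left running."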

One final touch: run should let the user know when it finishes. For that, our trap handler will also output something for the user.

trap '(echo Terminating | ./namespace runner $RED) && kill `jobs -p`' EXIT

The result is a satisfying one that handles both expected and unexpected termination.

Coda

Here’s the final source code. I’ve added a couple of bits here and there, and included a file that exports environment variables that allow the colour coding performed in namespace.
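
To give a flavour of that file (a sketch — the real one lives in the linked source), it only needs to export a few ANSI escape sequences for run to pass along to namespace:

#!/bin/bash

# colours — ANSI escape sequences for the log labels (illustrative values)
export RED="\033[0;31m"
export GREEN="\033[0;32m"
export BLUE="\033[0;34m"

run can then source it near the top, for example with . ./colours, before spawning the servers.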

At the end of the exercise, this to me is what stands out:

  • The file doing the actual orchestration, run, is incredibly tiny, yet it manages background processes, redirects their outputs, filters them through other programs and combines them, and does all this without resorting to local variables or any other kind of buffer.
  • Though it can seem intimidating to a newcomer, the compactness of namespace hints at the expressive power of shell-scripting languages for certain tasks.
  • Pipes (|) are not a language construct, but a feature of the operating system that the shell exposes. By using pipes, we get good data throughput and relative safety for free.
  • In fact, all of the inter-process communication across our three programs (aside from the trap signal) is based on stitching together standard input (stdin) and standard output (stdout), which are just special files that can be thought of as streams. Piping becomes analogous to function composition, and is extremely powerful in systems programming.
  • Think of everything that we didn’t do: allocate resources, create sockets or other channels of communication, deal with complicated system interfaces. Finally, notice how we didn’t even pull dependencies! That should always be the case for such trivial examples, but the reality is that more often than not new runtimes require that everything be required() because the basic interfaces are insufficient. In our project, you could say the framework is the system: Bash provides very convenient syntax to perform system calls, and any extra functionality we need is just a command invocation away. With standards such as POSIX, everyday tasks rarely need dependencies that the system can’t provide, and our software tends to be quite portable.
  • The reason why all of those tools can work together and be combined in new scripts is that they all follow the same interface. Each program has an input and an output (and stderr, which we will disregard). Once again alluding to functional programming, each program can be thought of as a black box—not because it doesn’t have side effects, but because its input, output and effects are well specified. That’s why earlier I recommended saving our scripts with no file extension: it doesn’t matter that they are all written in Bash, it only matters that they honour the standard interfaces.

I write about this subject for the same reason I write about Unix in general: despite its flaws, Unix has been a highly influential system, mostly with good reason. Software engineering is a relatively young field, but not as young as it likes to think. Despite all the history it packs, trends in the field tend to overlook that history quite spectacularly. And so each decade we reinvent the wheel, we create problems already solved.

As in philosophy, if we take a step back the song remains the same. The big topics we return to include data processing, dependency management and composition, process management and concurrency, state management, abstraction and machine. Unlike philosophy, I think we have a shot at finding satisfying answers to these in less than 3000 years.

Author: Miguel Fonseca

Engineer at Automattic. Linguist, cyclist, Lindy Hopper, tree climber, and headbanger.

4 thoughts on “Jobs and pipes”

  1. Thanks for the pointer to the tar pipe screencast — Gary Bernhardt’s Destroy All Software is one of my favorite formats for dense and deep, but useful programming insights.

    I’d love to see what you, Miguel, would come up with if you tried to cram similarly compressed insights related to developing for Gutenberg into a screencast of your own. 🤞

    And so each decade we reinvent the wheel, we create problems already solved.

    Amen! I heartily agree with that sentiment.

    At the same time, though, we need to acknowledge that the scale and dynamics of tight-knit development teams and especially loosely coupled development communities have changed quite a bit in recent decades. I believe the changes in those social structures also made some level of reinvention — and, with it, evolution — necessary.


    1. I’d love to see what you, Miguel, would come up with if you tried to cram similarly compressed insights related to developing for Gutenberg into a screencast of your own.

      Oh, thanks for the vote of confidence! There’s an implied exercise in focus and concision in there. I’d love to execute that well. I might just give it a go at some point.


    2. At the same time, though, we need to acknowledge that the scale and dynamics of tight-knit development teams and especially loosely coupled development communities have changed quite a bit in recent decades. I believe the changes in those social structures also made some level of reinvention — and, with it, evolution — necessary.

      That’s a good point. Are there particular reinventions that you’d judge necessary?


      1. Nothing is necessary if you don’t specify what for. 🙂

        I feel like, for example, the npm ecosystem has solved problems that weren’t there before, and of course is also finding out about new problems that weren’t problems previously.

        Distributed version control is a nice evolution of version control.

        The web-based pull request is a nice evolution of the email-based pull request.

        Continuous Integration as a SaaS is a useful evolution of having to run and maintain your own CI server somewhere.

        Many of these “merely” made things vastly more accessible to more people, but often to a degree that made a qualitative difference in how software development happens.

