Being a good UNIX neighbour

Posted on April 29, 2023 in cli, unix

The UNIX philosophy is a set of design principles that has had a huge impact on the development of software systems. In essence, the UNIX philosophy stresses the importance of keeping things simple and modular. You should think of the shell as a programming language of its own! Take this made-up example:

curl -s 'https://www.example.com/query?symbol=GOOG' | jq '.price' |
sqlite3 stocks.db "UPDATE portfolio SET price = $(cat), ..." &&
sqlite3 stocks.db "SELECT price FROM portfolio WHERE price > 9000 AND ..." |
xargs -I {} curl -X POST -H "Content-Type: application/json" -d '{"symbol": "GOOG", "price": "'{}'"}' https://example.com/api/sell

In this (absolutely non-functional) condensed program, the following happens:

  1. We download the current quote for the GOOG stock as JSON.
  2. Extract the current price.
  3. Update the price of the GOOG stock in our database.
  4. If the price is over 9000, we send a notification to our API to sell stocks.

Getting back to the UNIX philosophy, this means writing small programs that do one thing well and can be combined with other programs to achieve more complex functionality. Other key principles of the UNIX philosophy include using plain text as a universal interface, favoring simple implementations over more complex ones, and using pipelines to combine simple programs into powerful workflows.

If you're writing command line tools, it's important to consider how they fit into the UNIX ecosystem. As part of this, it's helpful to ensure your tools can be easily integrated into pipelines with other tools. A key way to achieve this is by allowing your tool to accept input from other tools through standard input and output.

Example

Let's take a look at an example of a fictitious tool called do_x. In this example, we define a click argument parser that allows the user to specify an input file or to use standard input by default. We also provide an option to output the results in JSON format. After processing the input data, we output the results either as a string or as a JSON object, depending on the user's choice.

Starter code

Here we already assume that we are using the click CLI library. If you're not familiar with it, you can check out the documentation. To install it:
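
$ pip install click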

Let's start off with an example of what I might have:

import click


def do_stuff(input_data):
    """Do stuff."""
    return {'result': input_data}


@click.command()
@click.argument('filename')
def main(filename):
    with open(filename, 'r') as f:
        input_data = f.read()

    # Do something with input data and print it
    output_data = do_stuff(input_data)
    # Use `echo` for better compatibility
    click.echo(output_data)


if __name__ == '__main__':
    main()

Some notes: we're using click to parse the command line arguments, and its echo function instead of the stdlib print because it deals with encodings and platform quirks more gracefully. We're also using click's argument decorator to specify that the user must provide a filename as an argument.

Otherwise this code is straightforward: this tool is meant to be called from the command line with a simple filename argument. It reads the file, does something with the data, and prints the result. We're using standard out for the output, which is fine in this case. It's a good start, but it's not very flexible. What if we want to use this tool in a pipeline? What if we want to use it with standard input? What if we want to use it in a script?
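
For reference, assuming the script is saved as do_x.py, a run looks something like this (the output is just the repr of the returned dict):

$ python do_x.py input.txt
{'result': '...the contents of input.txt...'}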

Using standard input

The first order of business is to make this tool more flexible. A good tool accepts data from stdin and can also read a file passed as an argument; your tool's API should be flexible enough to handle both. In our case, we can implement this by making the filename argument optional and defaulting it to stdin. This allows the tool to be called with or without a filename argument, and it will read from stdin if no filename is provided.

In general, the UNIX philosophy favors passing data via pipes and standard input, as it allows for a more flexible and composable toolchain. This is because a tool can be designed to read data from standard input, process it, and then output the results to standard output, which can be used as input to another tool in the pipeline.
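
In its most stripped-down form, such a tool is a simple filter. Here's a minimal sketch, with upper-casing standing in for the actual processing:

import sys

# A classic UNIX filter: read standard input, transform each line,
# and write the result to standard output
for line in sys.stdin:
    sys.stdout.write(line.upper())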

However, there are situations where it may be more appropriate to pass filepaths as arguments instead of reading data from standard input. For example, if a tool needs to process multiple files, it may be more convenient to pass the filepaths as arguments instead of requiring the user to redirect the contents of each file to standard input.

A good tool can handle both cases, allowing the user to pass either filepaths or data via standard input, depending on their preference or the specific use case. This can be achieved by designing the tool to first check if a filepath argument was passed, and if not, to read from standard input.
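
Python's standard library even ships this exact pattern in the fileinput module, which iterates over any files named on the command line and falls back to standard input when none are given:

import fileinput

# Reads the files named in sys.argv[1:], or standard input if there
# are none (or if a filename is '-')
for line in fileinput.input():
    print(line, end='')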

In our example in particular, the click package handles this for us with the click.File type, which

  1. Opens stdin when the argument is set to -, and
  2. Passes our function an already-open text stream (a subclass of io.TextIOBase) in either case.

This is handy, as it gives us the best-practice behaviour by default:

import click


def do_stuff(input_data):
    """Do stuff."""
    return {'result': input_data}


@click.command()
@click.argument(
    'filename',
    type=click.File('r'),
    default=click.get_text_stream('stdin'),
)
def main(filename):
    if filename.name == '<stdin>':
        # Let the user know why we're waiting for input
        click.echo('Reading from STDIN')

    # `filename` is already an open text stream, so this works
    # the same for a regular file and for STDIN
    input_data = filename.read()

    # Do something with the input data and print it
    output_data = do_stuff(input_data)
    click.echo(output_data)


if __name__ == '__main__':
    main()

By allowing the user to specify an input file or to use standard input, our tool can be easily integrated into pipelines with other tools. For example, say we have a file called input.txt containing some data that we want to process with our tool (let's call it cli) and then pipe the results into another tool called do_y:

$ cat input.txt | cli | do_y

By default, cli will read from standard input, allowing us to pipe the contents of input.txt into it. cli will then output the results to standard output, which can be piped into do_y. This allows us to easily chain together multiple tools to create powerful pipelines.

Again, our toy tool's output is simple enough that writing it to standard out is fine. If we wanted to send the results to a file instead, we could add an output option, which would also make the tool easier to use in scripts.
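
As a sketch of what that could look like (the --output/-o option is an assumption, and upper-casing again stands in for real processing), click.File in write mode treats - as standard output, mirroring the input side:

import click


@click.command()
@click.argument('filename', type=click.File('r'), default='-')
@click.option(
    '--output',
    '-o',
    type=click.File('w'),
    default='-',  # in write mode, '-' means standard output
    help='Write results to this file instead of standard output',
)
def main(filename, output):
    # click.echo accepts a file argument, so the rest of the code
    # doesn't need to care where the output ends up
    click.echo(filename.read().upper(), file=output)


if __name__ == '__main__':
    main()

Called as cli input.txt -o results.txt it writes to the file; without -o it behaves exactly as before.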

Adding JSON support

Another important aspect of being a good UNIX neighbour is ensuring that the fields in your output are standard enough that they can be easily parsed by other tools. This means using a standard delimiter, such as a tab or a comma, and avoiding special characters that may trip up other tools parsing your output.
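
For example, if our tool printed a (hypothetical) tab-separated symbol and price, a downstream cut could pick out fields reliably:

$ cli input.txt
GOOG	2745.32
$ cli input.txt | cut -f2
2745.32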

Finally, if possible, it is also helpful to provide the option to output the results in JSON format. JSON is a standard data format that can be easily parsed and processed by many programming languages, making it a great option for interoperability between tools. This can be achieved by adding a flag to your tool that allows the user to specify the output format. Depending on the API you strive to provide, you may also want it to become the default.

Here too, since the output of the tool is meant to be consumed by other tools, it is important to write only the useful results to standard output. This means avoiding printing any additional information, such as status messages or warnings, to standard output. Print these messages to standard error instead: they stay out of the pipe, and can still be redirected to a file if needed.
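
To see what this separation buys us with the final version of the tool below:

$ cat input.txt | cli | do_y              # 'Reading from STDIN' shows on the terminal, not in the pipe
$ cat input.txt | cli 2>/dev/null | do_y  # ...and can be silenced independently

Putting it all together: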

import json

import click


def do_stuff(input_data):
    """Do stuff."""
    return {'result': input_data}


@click.command()
@click.argument(
    'filename',
    type=click.File('r'),
    default=click.get_text_stream('stdin'),
)
@click.option(
    '--json',
    '-j',
    'as_json',  # name the parameter so it doesn't shadow the json module
    is_flag=True,
    help='Output results in JSON format',
)
def main(filename, as_json):
    if filename.name == '<stdin>':
        # Notice here that we're now printing to STDERR
        click.echo('Reading from STDIN', err=True)

    # `filename` is already an open text stream, for files and STDIN alike
    input_data = filename.read()

    output_data = do_stuff(input_data)

    if as_json:
        # Output results as JSON
        click.echo(json.dumps(output_data))

    else:
        # Output results as a string
        click.echo(output_data)


if __name__ == '__main__':
    main()

which makes it possible to use this tool in a pipeline with, for example, jq:

$ cat input.txt | cli --json | jq '.result' | do_y

and not have to resort to string-manipulation contortions like cut -f2 (or, God forbid, awk) to extract the result.

Conclusion

In conclusion, by designing our command line tools to be good UNIX neighbours, we can create powerful pipelines that allow us to efficiently process and analyze data. This means allowing our tools to accept input from other tools through standard input and output, using standard delimiters in our output fields, and providing the option to output results in JSON format. By following these principles, we can create tools that work well with others and promote the UNIX philosophy of modularity and simplicity.