Fail Fast

Failure is fashionable. Making is much easier than thinking, and failure is no longer a stigma. Let's take this idea to our code.

TL;DR: Fail fast. Don't hide your mistakes under the rug.

In the 1950s, programming failures had dire consequences. Machine time was very expensive, and going from punch cards to the compiler and then to execution could take hours or even days.

Luckily, those times are long gone. Or are they?

A methodological step back

By the 1980s, punch cards were no longer in use. Code was written in a text editor, then compiled and linked to produce the executable of a typical desktop application.

This process was slow and tedious.

Chasing an error meant writing logs to a file, including parts of the execution stack, to try to isolate the cause of the defect. Then you tried a fix, recompiled, linked, and repeated the cycle.

With the arrival of interpreted languages, we began to believe in the magic of editing code 'on the fly' in a debugger, with access to the program state.

However, in the late 1990s, with the rise of web systems, we went back several steps. Except where we could simulate the system on a local server, we went back to putting logs in the code to debug our integrated software remotely. On top of that, the misuse of invalid abstractions means our software reports errors far away from the actual failure and its root cause.

This is made worse by invalid representations that allow null values. They generate unpredictable failures, and we end up hunting for the origin of a null many function calls later.
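As a minimal sketch of that effect, assume a hypothetical in-memory user store. The first lookup returns `None` for an unknown id, so the failure only shows up later, inside an unrelated function; the second lookup fails fast and names the offending id.

```python
# Hypothetical in-memory user store, for illustration only.
USERS = {"alice": "Alice Liddell"}


def find_name_or_none(user_id):
    # Hides the problem: an unknown id silently becomes None.
    return USERS.get(user_id)


def find_name(user_id):
    # Fails fast: an unknown id is reported right where it is detected.
    try:
        return USERS[user_id]
    except KeyError:
        raise LookupError(f"unknown user id: {user_id!r}") from None


def greeting(name):
    # Far from its origin, a None finally explodes here with an
    # AttributeError that says nothing about the missing user.
    return "Hello, " + name.upper()
```

Calling `greeting(find_name_or_none("bob"))` fails inside `greeting`, while `find_name("bob")` fails at the lookup itself, where the cause is obvious.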

Defensive programming

The rise of autonomous cars lets us learn about driver behavior. Initially, the cars strictly followed the traffic rules, but this caused accidents with cars driven by human beings.

The solution was to train autonomous cars to drive defensively.

As in many of our solutions, we are going to reverse the burden of proof.

Let's assume the preconditions might not be met and, if they are not, fail quickly.
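A minimal sketch of reversing the burden of proof, using a hypothetical `average` function: rather than returning 0 or NaN for an empty input and letting the caller carry on, it checks its precondition first and raises immediately.

```python
def average(values):
    """Return the arithmetic mean, failing fast if the precondition is broken."""
    # Precondition: there must be at least one value to average.
    if not values:
        raise ValueError("average() requires at least one value")
    return sum(values) / len(values)
```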

The argument against this kind of inline control is always the same: the code becomes slightly more complex and potentially less performant.

As always, to the laziness argument we reply that we favor robust code; to the performance argument we reply by requesting concrete evidence: a benchmark that shows what the true penalty really is.

As we saw in the article about object immutability, if an invalid date is created, we must report the problem immediately.

This way, we fail very close to where the fault occurs, and we can take action. Most "modern" languages sweep the dirt under the carpet and let execution "continue as if nothing happened", so we have to debug the problem with logs, carrying out a forensic analysis to find a root cause that lies far away.
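The contrast can be sketched with Python's standard `datetime.date`, which raises `ValueError` for impossible dates such as February 30. The `lenient_date` helper is hypothetical and mimics the "forgiving" style: it rolls overflowing days into the next month, so the mistake travels on disguised as a perfectly valid-looking date.

```python
from datetime import date, timedelta


def lenient_date(year, month, day):
    """Hypothetical 'forgiving' constructor: overflowing days roll into the next month."""
    return date(year, month, 1) + timedelta(days=day - 1)


def strict_date(year, month, day):
    """Fails fast: datetime.date rejects impossible dates at construction."""
    return date(year, month, day)


if __name__ == "__main__":
    print(lenient_date(2021, 2, 30))  # prints 2021-03-02: the bug moves on silently
    try:
        strict_date(2021, 2, 30)
    except ValueError as error:
        print("rejected:", error)
```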

Representation is always important

The best way to fail fast is to properly represent objects while respecting our only design rule:

Bijection with the real world.

A geographic coordinate misrepresented as an array of two integers will not know how to "defend" itself against invalid situations.

For example, we can represent latitude 1000° and longitude 2000° on a map as follows, and this will generate errors when some component that uses this coordinate tries to calculate distances (probably doing some kind of modulus magic and getting very cheap tickets).
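A sketch of that misrepresentation, assuming a hypothetical `naive_distance_km` helper based on the standard haversine formula. Because trigonometric functions are periodic, it silently treats 1000° as -80° and 2000° as -160° and returns a plausible-looking distance instead of failing.

```python
import math

# Misrepresentation: a coordinate as a bare list of two numbers.
# Nothing prevents building an impossible point on Earth.
nowhere = [1000, 2000]  # latitude 1000°, longitude 2000°


def naive_distance_km(a, b, earth_radius_km=6371.0):
    """Haversine distance between [lat, lon] pairs; trusts its input blindly."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))
```

`naive_distance_km([0, 0], nowhere)` happily returns the same distance as for the point `[-80, -160]`: the "modulus magic" at work.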

This is solved with good representations and with small objects that respect the bijection of both valid and invalid behaviors and states.

The bijection is straightforward: a coordinate is not an array, and not all arrays are coordinates.

This would be the first iteration. The coordinate should check that the latitude is within a valid range. But that would couple the coordinate to the rules of latitudes, violating the bijection rule. A latitude is not an integer, and an integer is not a latitude.
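To make the objection concrete, here is a sketch of a coordinate that performs its own range checks, assuming latitudes in [-90, 90] and longitudes in [-180, 180] degrees. It is no longer an array and it fails fast, but it now knows the rules of latitudes and longitudes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeographicCoordinate:
    """A coordinate is its own object, not an array of two numbers."""
    latitude: float
    longitude: float

    def __post_init__(self):
        # These checks work, but they couple the coordinate to the rules
        # of latitudes and longitudes, which belong to other objects.
        if not -90 <= self.latitude <= 90:
            raise ValueError(f"invalid latitude: {self.latitude}")
        if not -180 <= self.longitude <= 180:
            raise ValueError(f"invalid longitude: {self.longitude}")
```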

Let's be extreme:
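A minimal sketch of the extreme version, using the same assumed ranges and hypothetical class names: each part validates itself when it is built, so the coordinate needs no checks of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latitude:
    degrees: float

    def __post_init__(self):
        # Construction invariant: an invalid latitude can never exist.
        if not -90 <= self.degrees <= 90:
            raise ValueError(f"latitude out of range: {self.degrees}")


@dataclass(frozen=True)
class Longitude:
    degrees: float

    def __post_init__(self):
        # Construction invariant: an invalid longitude can never exist.
        if not -180 <= self.degrees <= 180:
            raise ValueError(f"longitude out of range: {self.degrees}")


@dataclass(frozen=True)
class GeographicCoordinate:
    # No range checks here: both parts are already valid by construction.
    latitude: Latitude
    longitude: Longitude
```

Python does not enforce type annotations at runtime, so `GeographicCoordinate(1000, 2000)` would still be accepted by this sketch; a static checker such as mypy would flag it.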

With this solution, we do not need any checks when building geographic coordinates, because the latitude is valid by its construction invariant and correctly models its real-world counterpart.

As a last iteration, we should think about what a degree is. An integer? A float? A degree clearly exists in reality, so we have to model it. There is no escape.
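Taking that last step, here is a sketch of a hypothetical `Degree` object. It only accepts finite numbers, and `Latitude` now builds on it instead of on a bare float. Exactly which rules belong to a degree is a modeling assumption of this sketch, not a prescription.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Degree:
    """Hypothetical model of an angle in degrees, rather than a bare number."""
    value: float

    def __post_init__(self):
        # A degree is a finite real number: reject non-numbers (and bools),
        # NaN, and infinities.
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise TypeError(f"a degree must be a number, got {self.value!r}")
        if not math.isfinite(self.value):
            raise ValueError(f"a degree must be finite, got {self.value!r}")

    def is_between(self, low, high):
        return low <= self.value <= high


@dataclass(frozen=True)
class Latitude:
    degrees: Degree

    def __post_init__(self):
        # Latitude rules live in Latitude; degree rules live in Degree.
        if not self.degrees.is_between(-90, 90):
            raise ValueError(f"latitude out of range: {self.degrees.value}")
```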

By now, performance purists are usually outraged and raise the following objection:

It is much easier and more readable to create a coordinate as an array than to go through all that indirection of creating degrees, latitudes, longitudes, and coordinates.

To make this decision, we always have to analyze performance, maintainability, reliability, and the root causes of our failures. Depending on our desired quality attributes, we will favor one over another. In my personal experience, good and precise models survive requirement changes and ripple effects much better, but that depends on each particular case.

Let's go back to space

As a last example, let's go back to the loss of the Mars Climate Orbiter spacecraft mentioned in the article:

The spacecraft's software was developed by two teams using different systems of units: one produced thrust data in pound-force seconds, while the other expected newton-seconds. The example below is a simplified scenario.
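Here is such a simplified scenario, sketched under stated assumptions: `Impulse` and `plan_trajectory_correction` are hypothetical names and the string units are illustrative, not the actual flight software. The value always travels with its unit, so a metric-only consumer can reject pound-force seconds immediately, and a producer can convert explicitly instead of passing a bare number along.

```python
from dataclasses import dataclass

# 1 pound-force = 4.4482216152605 newtons (exact by definition).
NEWTONS_PER_POUND_FORCE = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical impulse value that always carries its unit."""
    amount: float
    unit: str  # "N*s" or "lbf*s" in this sketch

    def __post_init__(self):
        if self.unit not in ("N*s", "lbf*s"):
            raise ValueError(f"unknown impulse unit: {self.unit!r}")

    def to_newton_seconds(self):
        # Explicit conversion instead of silently reusing the raw number.
        if self.unit == "N*s":
            return self
        return Impulse(self.amount * NEWTONS_PER_POUND_FORCE, "N*s")


def plan_trajectory_correction(impulse):
    """Hypothetical metric-only consumer: fails fast on the wrong unit."""
    if impulse.unit != "N*s":
        raise ValueError(f"expected N*s, got {impulse.unit}")
    return impulse.amount
```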

Instead of failing early and being handled by a self-healing routine, the error spread and destroyed the spacecraft.

A simple check of units would have detected the error and, potentially, triggered some corrective action.

The exception is the rule

Our code must always be defensive and controlled by its invariants at all times, as Bertrand Meyer indicates. It is not enough to turn software assertions on and off.

These assertions must always be on in production environments. Once again, when faced with doubts about performance penalties, the firm response must be to demand hard evidence of significant degradation.
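In Python, for instance, `assert` statements are removed when the interpreter runs with the `-O` flag, so they cannot be the only guard in production. A minimal sketch, with a hypothetical `check` helper and `Account` class, keeps the invariant checks as explicit exceptions that are always on.

```python
class InvariantViolation(Exception):
    """Raised when an object would break one of its invariants."""


def check(condition, message):
    # Unlike `assert`, this is not stripped by `python -O`,
    # so the check stays active in production.
    if not condition:
        raise InvariantViolation(message)


class Account:
    """Hypothetical account whose balance may never become negative."""

    def __init__(self, balance=0):
        check(balance >= 0, f"negative opening balance: {balance}")
        self._balance = balance

    def withdraw(self, amount):
        check(amount > 0, f"withdrawal must be positive: {amount}")
        check(amount <= self._balance, "insufficient funds")
        self._balance -= amount
        return self._balance
```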

Exceptions must occur at all levels. If a movement is created with an invalid date, the exception must be reported when the date is created. If the date is valid but incompatible with some business rule (for example, you cannot settle movements in the past), this must also be controlled.
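A sketch of both levels, assuming a hypothetical `Movement` class built on Python's `datetime.date`. The first level comes for free because `date` refuses impossible dates; the second is an explicit business rule. The sketch deliberately calls the global `date.today()`, which depends on the local clock and time zone of the machine running it.

```python
from datetime import date


class Movement:
    """Hypothetical financial movement with a settlement date."""

    def __init__(self, amount, settlement_date):
        # Level 1: an impossible date never reaches this point, because
        # building something like date(2021, 2, 30) already raises ValueError.
        if not isinstance(settlement_date, date):
            raise TypeError("settlement_date must be a datetime.date")
        # Level 2: a valid date can still break a business rule.
        # Coupling: date.today() is a global, static call tied to the local clock.
        if settlement_date < date.today():
            raise ValueError("cannot settle movements in the past")
        self.amount = amount
        self.settlement_date = settlement_date
```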

The solution is robust, but it couples the movement to the date and to a static method of a global class: one of the worst possible couplings for a system that could run in multiple time zones.

To solve this problem we have several options:

  1. Leave the coupling to the class.
  2. Pass a date validator as a parameter, so it can validate the date using double dispatch.
  3. Remove date validation responsibility from the movement.

When in doubt about our design decisions, we can always go back to our bijection and ask our business expert whose responsibility this is.

It is clear that by taking the third option we could potentially create movements with invalid dates. But the validity (or not) of the date is not a movement's responsibility and does not belong to its representation invariants.

The case would be different if a movement had an agreement date, a creation date, and a settlement date with clear business constraints among them. But then we would be facing an object with very low cohesion.

As always, design decisions involve continuous trade-offs.

Conclusions

Whenever we suspect an invalid situation, we must throw an exception. When in doubt, we should do it as early as possible.

We should never hide errors. Masking a problem couples us to a decision that belongs to whoever uses our code, who is the one that has to understand the situation.

We must strictly follow the bijection rule, creating the necessary abstractions that can defend themselves.


Part of the objective of this series of articles is to generate spaces for debate and discussion on software design.

We look forward to comments and suggestions on this article.

This article was published at the same time in Spanish here.