Deep Learning as Engineering

In the absence of a rigorous unified theory explaining why deep learning works, the field leans heavily on empirical experiments to demonstrate results. This means new deep learning applications often require a non-trivial amount of software engineering to "make things work". Unfortunately, it also makes these experiments more vulnerable to bugs and other issues common in large software projects. OpenAI Five explicitly calls this out on its description page (see graph below of pre/post bug performance). From experience with HAI, I can confirm that simple software bugs can cause significant regressions in model performance. Even worse, models often "work around" bugs and still perform reasonably well, which can make it incredibly difficult to find a bug (or even know that one exists!).

OpenAI Five performance before and after bugfixes. Source: OpenAI

Software engineering uses a variety of tools to address these problems. Although not all such tools immediately transfer to deep learning programs, several do (or can be adapted slightly). This post explores ideas and tools from software engineering that can help researchers manage the software in deep learning research projects.

Note: The tools are chosen for effectiveness when using Python and Tensorflow. They may or may not generalize to other environments.

General Tips

I'll start with some general tips from software engineering that I've found helpful.

  • Build Incrementally: While it's tempting to build a complicated product immediately, progress often comes faster when one begins with a simple product and incrementally adds complexity. This happens because adding features to a working piece of code means one only has to reason about and debug the new feature. In contrast, when one writes every feature before testing any of them, one must reason about and debug everything at once, and confounded hypotheses abound. Starting simple tightens the feedback loop for what features may or may not work. Often one need only build a fraction of the whole project to derisk it or discover that the original plan needs to be adjusted. This is essentially what The Lean Startup recommends, applied to a research project.

    For example, HAI went through several intermediate stages before reaching its current state. Initially I forked the OpenAI Universe Starter Agent and modified it to play a very simple homegrown Hearthstone variant. I then incrementally built it up: replacing the simulator with Spellsource, increasing the difficulty level of its opponents, etc.

  • Fail Fast: If an unexpected or unsupported event occurs, fail immediately (or have a good reason not to). Model training and evaluation are not high assurance tasks. Given how difficult neural networks are to reason about, it's better to quickly identify and fix an issue than to have it potentially masked by your neural network "learning around it".

    It's particularly important to deliberately fail fast with Tensorflow and Python because the ecosystem tends to silently mask errors or propagate them far enough from their source to make debugging difficult. For example, tf.one_hot silently returns a row filled with the off value (all zeros by default) for any input index that is out of range. In my experience, numpy is even worse, probably due to its incredibly flexible interfaces.

  • Make Errors Actionable: When errors occur, try to make them actionable. This can be through descriptive error messages, for example reporting the entire proto that contains an invalid field. In the case of rare or difficult-to-reproduce errors, it's good to record enough context to be able to recreate the exact conditions in which the error arose. For example, to track down the cause of some Infs/NaNs during HAI training I saved the entire graph and offending minibatch on the iteration when Inf/NaN first occurred so that I could reload the problematic state later and inspect it in a debugger.
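
As a minimal sketch of the last two tips, the helper below validates class indices before one-hot encoding them and raises an error that names every offending position. It is plain Python so that it runs anywhere; `checked_one_hot` and its parameters are hypothetical names chosen for illustration, not part of Tensorflow or numpy.

```python
def checked_one_hot(indices, depth):
    """One-hot encode `indices`, failing fast on out-of-range values.

    Hypothetical helper for illustration; not a Tensorflow or numpy API.
    """
    bad = [(pos, idx) for pos, idx in enumerate(indices)
           if not 0 <= idx < depth]
    if bad:
        # Report every offending (position, index) pair so the error is actionable.
        raise ValueError(
            f"one-hot indices must be in [0, {depth}); "
            f"got out-of-range (position, index) pairs: {bad}"
        )
    return [[1.0 if col == idx else 0.0 for col in range(depth)]
            for idx in indices]
```

Compared with silently emitting an all-zero row, the exception stops the run at the point where the bad index enters the pipeline, which is usually much closer to the actual bug.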


Testing

In a dynamically typed language like Python, unit tests offer the first defense against bugs. This holds true for deep learning programs using Tensorflow too, although the lack of interpretability of even simple neural networks makes traditional unit testing difficult. How can one make assertions about the outputs of a neural network if one can't reason about what those outputs should be?

Although it's true that predicting exact output values of a neural network is difficult, I've had success testing them by applying simple property checks. That is, instead of checking that a network has an exact output when supplied a particular input, one can check that perturbations of a network's inputs change (or don't change) its outputs. With Tensorflow eager execution one can debug these tests using traditional Python tools like pdb!
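
Here is a rough sketch of what such property checks can look like. To keep it runnable without Tensorflow, a tiny hand-written function (`masked_mean`, a hypothetical stand-in) plays the role of the network; the same pattern should carry over to a real model called in eager mode.

```python
import random
import unittest


def masked_mean(values, mask):
    """Stand-in 'model': average the values whose mask entry is 1."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


class MaskedMeanPropertyTest(unittest.TestCase):
    def setUp(self):
        rng = random.Random(0)  # fixed seed keeps the test reproducible
        self.values = [rng.uniform(-1.0, 1.0) for _ in range(8)]
        self.mask = [1, 1, 1, 1, 0, 0, 0, 0]

    def test_masked_inputs_do_not_affect_output(self):
        perturbed = list(self.values)
        perturbed[5] += 100.0  # position 5 is masked out
        self.assertEqual(masked_mean(self.values, self.mask),
                         masked_mean(perturbed, self.mask))

    def test_unmasked_inputs_do_affect_output(self):
        perturbed = list(self.values)
        perturbed[2] += 100.0  # position 2 is visible to the model
        self.assertNotEqual(masked_mean(self.values, self.mask),
                            masked_mean(perturbed, self.mask))
```

Neither test needs to know the correct output value. They only assert which inputs the output is allowed to depend on, which is exactly the kind of property a masking or padding bug would violate.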

Having larger integration tests (potentially entire training and evaluation cycles) run regularly to detect regressions can also help find bugs more quickly. Most large projects use Continuous Integration to automatically run tests and flag regressions. For smaller projects, running the most recent model with random parameters every night can serve a similar role. If one already tracks experiment runs, then a simple script running randomly parameterized experiments overnight shouldn't be hard to write. Alternatively, one can use a hyperparameter tuner like Katib to drive these random experiments.
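
A minimal sketch of such an overnight script might look like the following. Everything here is an assumption made for illustration: the `SEARCH_SPACE` entries are made up, and `run_experiment` stands for whatever entry point launches a training and evaluation cycle in one's own project.

```python
import random

# Hypothetical search space: each entry maps a parameter name to a sampler.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -2),
    "batch_size": lambda rng: rng.choice([32, 64, 128]),
    "num_layers": lambda rng: rng.randint(1, 4),
}


def sample_params(rng):
    """Draw one random configuration from SEARCH_SPACE."""
    return {name: sampler(rng) for name, sampler in SEARCH_SPACE.items()}


def run_nightly(run_experiment, num_runs, seed):
    """Run `num_runs` random configurations and collect any failures.

    `run_experiment` is a placeholder for the project's own train/eval entry
    point; it should raise if the run crashes or a regression is detected.
    """
    rng = random.Random(seed)  # log the seed so a failing run can be replayed
    failures = []
    for _ in range(num_runs):
        params = sample_params(rng)
        try:
            run_experiment(params)
        except Exception as err:  # keep going so one crash doesn't hide others
            failures.append((params, repr(err)))
    return failures
```

Returning the failing parameter sets alongside the error text keeps the morning report actionable: each entry is enough to rerun the exact configuration that broke.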

Unfortunately, all of the hyperparameter tuners I'm aware of require substantial setup investment. Katib worked well for HAI but is rough around the edges. I recommend it only if (1) one has a GPU-enabled Kubernetes cluster available and (2) one is willing to spelunk around the Katib codebase a little bit. I hope it continues to mature and becomes a batteries-included option for Kubernetes users in the future.


Bisecting

I include this section in case the reader hasn't heard of bisecting before. "Bisecting" refers to applying binary search to a linear commit history between a known-good commit and a known-bad commit to find the change that introduced a regression.

It's incredibly useful when one notices that something (call it X) is False at one commit and True at another commit but isn't sure when X stopped being True (often X is "the thing was working"). It lets one find the exact commit where X went from True to False in logarithmic time. To emphasize how great this is: if one knew that X became False in one of 128 consecutive commits, bisecting would identify the exact offending commit in \( \log_2(128) = 7 \) steps! In my view, students should be made to linearly search through commits for a regression before being taught binary search. Then they'd really appreciate it.

Note: git has great support for bisecting so there's really no excuse to do ad hoc or linear search for a regression on more than 10 commits.
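
To make the step count concrete, here is a small simulation of the search itself. It is only a sketch of the idea, not how git implements it: the commit names and the position of the regression are made up, and `is_bad` stands in for whatever check decides whether a given commit is broken. In practice git does this bookkeeping for you via `git bisect start`, `git bisect good`, `git bisect bad`, and `git bisect run`.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first bad commit, number of commits tested).

    Assumes `commits` is ordered oldest to newest, the last one is bad, and
    once a commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid + 1  # regression is after mid
    return commits[lo], steps


commits = [f"commit-{i}" for i in range(128)]  # made-up commit names
first_bad_index = 77  # pretend the regression landed here
found, steps = bisect_first_bad(
    commits, lambda c: commits.index(c) >= first_bad_index)
print(found, steps)  # commit-77 7
```

With 128 candidates the search range halves exactly on every test, so it always takes 7 tests regardless of where the regression sits.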


Debugging

When best practices and tests fail to catch a bug, it's time to reach for the debugger. Unfortunately, debugging Tensorflow programs is more complicated than debugging pure Python programs.

Python vs Tensorflow

Before discussing debugging approaches, I'll give some context for why debugging Tensorflow programs doesn't just reduce to debugging Python. At a high level, each thread in one's program can be thought of as either "executing Python" or "executing Tensorflow" at any given time:

  • "Executing Python" means that the Python interpreter is running commands. This happens on the CPU (almost always) and usually constitutes the "glue" that binds the core computational portions of the program together.

  • "Executing Tensorflow" means that the Tensorflow runtime is executing a computation graph (that was defined in Python, but isn't Python!). The Tensorflow runtime runs as a shared library invoked by Python. When it is invoked, control passes from the Python interpreter to the Tensorflow runtime, and Python-level tooling stops working.

This complicates debugging; it means that the tools best suited to debug parts of a program "executing Python" usually won't work well for parts "executing Tensorflow".

Debugging in Python

The Python standard library includes pdb, a very powerful debugger that can essentially provide a Python shell at any point during a program's execution (as long as that execution is "in Python"). In tests or other contexts where bugs can be reproduced, one can pause execution and start debugging by inserting:

import pdb; pdb.set_trace()

Alternatively, I recommend using pdb's post_mortem functionality to drop into a debug shell if/when an uncaught exception occurs. If you use Abseil, you can simply specify the --pdb_post_mortem commandline flag and this happens automatically. Otherwise, the following snippet does the job:

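import contextlib

@contextlib.contextmanager
def launch_pdb_on_exception():
    try:
        yield
    except Exception:
        import sys
        import traceback
        import pdb
        info = sys.exc_info()
        traceback.print_exception(*info)  # show the traceback first
        pdb.post_mortem(info[2])  # then debug the frame that raised
# Usage: wrap the code to debug in `with launch_pdb_on_exception(): ...`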

Tip: To make pdb even better, consider installing the pdbpp package. It adds tab completion and syntax highlighting among other things.

Debugging in Tensorflow

Tooling for debugging inside the Tensorflow runtime is not as robust as tooling for Python. tfdbg allows for poking around a Tensorflow graph either as it's executing or in retrospect (after it's executed). It's useful for investigating issues that are difficult to reproduce outside of actual training. For example, it offers great support for finding the root cause of Infs or NaNs during training.

If possible, though, I recommend reproducing an issue in eager mode and then applying Python tooling to solve the problem.
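
As one example of the kind of Python-level tooling this enables, the sketch below checks a loss value on every step and, when it first becomes Inf or NaN, saves the offending batch before failing. It assumes the loss has already been converted to a Python float and that the batch is picklable; `check_loss`, `NonFiniteLossError`, and `dump_path` are hypothetical names, not part of Tensorflow or tfdbg.

```python
import math
import pickle


class NonFiniteLossError(RuntimeError):
    """Raised when a training step produces an Inf or NaN loss."""


def check_loss(loss, step, batch, dump_path):
    """Fail fast on a non-finite loss, saving the batch that caused it.

    All names here are hypothetical; `batch` can be any picklable object.
    """
    if math.isfinite(loss):
        return loss
    # Record enough context to replay the failing step later.
    with open(dump_path, "wb") as f:
        pickle.dump({"step": step, "loss": loss, "batch": batch}, f)
    raise NonFiniteLossError(
        f"loss={loss} at step {step}; offending batch saved to {dump_path}")
```

The saved file can later be reloaded with `pickle.load` and fed back through the model under pdb, which mirrors the approach described earlier for tracking down Infs/NaNs in HAI.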

Closing Thoughts

I was surprised at how much deep learning blends traditional software engineering with theoretical or heuristic research insights (at least in the case of HAI). It suggests that software engineering (both practices and tooling) could be as important for project success as research ideas themselves.

I hope this post provides a solid primer on what tools software engineering has to offer deep learning.

As always, feel free to let me know if there are any errors or improvements I could make to this post.