Drop into REPL when your Python script crashes


Too long; didn’t read: Add this at the beginning of your “main” script.

import sys
import os
import pdbpp


def debug_hook(exc_type, exc_value, traceback):
    if exc_type is KeyboardInterrupt:
        sys.__excepthook__(exc_type, exc_value, traceback)
        return
    print(f"Uncaught exception: {exc_type.__name__}: {exc_value}")
    pdbpp.post_mortem(traceback)


if os.environ.get('DEBUG'):
    sys.excepthook = debug_hook

Replace pdbpp with pdb if you don’t want to pull in the pdbpp dependency. When you want to drop into a REPL when your program throws an exception, invoke your script as,

# Doesn’t matter what you pass into DEBUG as long as the environment variable exists.
$ DEBUG=1 python main.py

When debugging Python scripts, the exception value (the string that tells you why the program crashed) isn’t always helpful. You often need to dig deeper into the program’s state at the moment of the crash. For instance, when PyTorch throws a mismatch error, the error value itself may not be enough to pinpoint the problem. An error like:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x1024 and 128x512)

doesn’t tell you which two matrices caused the issue. So, one strategy is to use the Python debugger, pdb, either by manually setting breakpoints in the code or, if you don’t want to touch the code, by invoking your script with something like,

python -m pdb main.py

You are dropped into a GDB-like interface where you can step through the code, list the current location in the code, and inspect your state variables. At this point, you typically let the program run until it hits the exception. However, if you’re not careful, pdb will exit, and you’ll lose the program’s state.

This becomes tedious if you are doing a lot of debugging or writing a pipeline from scratch.

You are a detective. Your process is dead. You want to examine the body (ahem, process) while it is still warm and Linux has not reclaimed the process heap and stack. You want to gather as much information as you can. What people generally do is start the debugger in the except block:

import sys
import pdb

try:
    raise ValueError("On Purpose.")
except Exception:
    exc_type, exc_value, traceback = sys.exc_info()
    pdb.post_mortem(traceback)

This starts a post-mortem debugging session: you get a GDB-like interface as well as a full Python REPL right at the point the program failed, with all your state variables, backtrace, and program counter location intact. You can move up and down the stack frames and run any Python statement. For instance, if I suspect the matmul between the weights and the input is causing the error, I can verify their shapes, just as PyTorch reported:

print(weights.shape, input.shape)

and if not, I can print the local variables with,

pp locals()

and inspect the shapes of all the matrices involved, so that when I rerun my experiment, I don’t have to second-guess a fix. I’ll save a deep dive into using the Python debugger effectively for another time.
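For quick reference, though, a few standard pdb commands cover the frame navigation described above; the (Pdb++) prompt is what pdbpp shows (plain pdb shows (Pdb)):

(Pdb++) where         # print the backtrace leading to the crash
(Pdb++) u             # move up one frame, towards the caller
(Pdb++) d             # move back down, towards the crash site
(Pdb++) pp locals()   # pretty-print the local variables of the current frame
(Pdb++) interact      # start a full interactive interpreter in the current frame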

However, adding pdb.post_mortem to every try/except block gets old fast. We want it to happen automatically, while also being able to control whether post-mortem debugging is enabled at all (off in production, on in development). We will use the excepthook for that purpose.

So, when a Python script raises an exception that nothing catches, the interpreter calls the function sys.excepthook and passes it three positional arguments: the exception type, the exception value, and the entire traceback. You can reassign sys.excepthook to any function you want, provided that the function takes those three arguments.

We define a new function debug_hook that takes in those arguments:

import sys
import os
import pdbpp


def debug_hook(exc_type, exc_value, traceback):
    if exc_type is KeyboardInterrupt:
        sys.__excepthook__(exc_type, exc_value, traceback)
        return
    print(f"Uncaught exception: {exc_type.__name__}: {exc_value}")
    pdbpp.post_mortem(traceback)

I am using pdbpp here, which I find to be nicer than both pdb and ipdb. ipdb breaks pdb in mysterious ways, with way too much magic in there, so I avoid it. You can use ipdb or pdb if you want. We also don’t want to drop into the REPL if we are deliberately killing the process, so we call the default excepthook, sys.__excepthook__, in that case.
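If you want the nicer debugger when it happens to be installed but don’t want a hard dependency on it, one option is a fallback import. A minimal sketch (a variation on the hook above, not part of the original snippet):

import sys

# Prefer pdbpp when it is installed; fall back to the stdlib pdb otherwise.
try:
    import pdbpp as debugger
except ImportError:
    import pdb as debugger


def debug_hook(exc_type, exc_value, traceback):
    if exc_type is KeyboardInterrupt:
        sys.__excepthook__(exc_type, exc_value, traceback)
        return
    print(f"Uncaught exception: {exc_type.__name__}: {exc_value}")
    debugger.post_mortem(traceback)

post_mortem has the same signature in both modules, so nothing else changes.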

Now we just have to set the excepthook to this function, but only when a DEBUG environment variable is set (it doesn’t matter what value it holds).

if os.environ.get('DEBUG'):
    sys.excepthook = debug_hook

This ensures that our custom post-mortem fallback only kicks in during development, not in production. We enable it by invoking,

DEBUG=1 python main.py
DEBUG=true uv run python main.py  # also works
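One caveat with that check: os.environ.get('DEBUG') is truthy for any non-empty string, so DEBUG=0 or DEBUG=false would also enable the hook. If you’d rather have those spellings mean “off”, a slightly stricter variation of the check works:

# Treat common "falsy" spellings as off; any other non-empty value as on.
if os.environ.get('DEBUG', '').lower() not in ('', '0', 'false', 'no'):
    sys.excepthook = debug_hook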

That’s it. We now have a robust way of doing post-mortem debugging.
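To see it fire, here is a minimal, hypothetical main.py: the hook from the top of the post plus a function that crashes on purpose:

# main.py (hypothetical example)
import sys
import os
import pdbpp


def debug_hook(exc_type, exc_value, traceback):
    if exc_type is KeyboardInterrupt:
        sys.__excepthook__(exc_type, exc_value, traceback)
        return
    print(f"Uncaught exception: {exc_type.__name__}: {exc_value}")
    pdbpp.post_mortem(traceback)


if os.environ.get('DEBUG'):
    sys.excepthook = debug_hook


def divide(a, b):
    return a / b


if __name__ == "__main__":
    divide(1, 0)  # ZeroDivisionError

Running DEBUG=1 python main.py drops you into the post-mortem prompt inside divide, where pp locals() shows a and b; without DEBUG you get the usual traceback.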

Although I have been emphasizing enabling post-mortem for development purposes only, I have been secretly using this even when running my full experiments. If a three-week-long training run fails at 80% completion, I don’t have to anticipate what I will need: I can save checkpoints, or anything else required, and restart from there.
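For instance, from the post-mortem prompt you can save whatever you need before letting the process die. Assuming a PyTorch model and optimizer are reachable from one of the frames (the names here are hypothetical), something like:

(Pdb++) u                                                  # walk up to the frame that holds the training objects
(Pdb++) import torch
(Pdb++) torch.save(model.state_dict(), 'rescue_model.pt')
(Pdb++) torch.save(optimizer.state_dict(), 'rescue_optimizer.pt')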

