How to troubleshoot Postgres when it fails to start
Yesterday, my laptop ran out of battery while I was working on an art project, and my computer shut down unexpectedly.
I plugged it in and rebooted, and continued to work on my art project.
The next day, when I began to work on my programming project, I was greeted with this error:
In this blog post, you'll learn how to troubleshoot Postgres installed with Homebrew on a Mac, and how this example fits into a larger framework that you can use to troubleshoot any software error.
UPDATE (Nov 2022): Homebrew now stores data in /opt/homebrew, not /usr/local as shown in the screenshots.
UPDATE (Nov 2022): Homebrew now separates Postgres packages by version, e.g. brew info postgresql@14 or brew services start postgresql@14.
The game plan: how do we troubleshoot automated systems?
For now, it's enough to mention the general model of how to troubleshoot part of an automated system:
- try shotgun solutions, in case you can skip troubleshooting completely
- design an experiment to isolate and reproduce the error reliably
- gather log samples from your experiment and make a diagnosis
- apply the fix, and verify that your experiment now passes
A simple math problem: why we should avoid troubleshooting
I've found that troubleshooting is time-consuming, non-productive time--and contrary to my first intuition, it's actually a very poor substitute for intentional, directed learning.
Therefore, it's in our best interests to minimize troubleshooting time to the smallest possible value.
We can accomplish this by never troubleshooting at all!
For example, if you know that...
- restarting XYZ will fix your issue
- you need to restart XYZ once per day
- it takes 10 seconds to restart XYZ
- it would take about 45 minutes to find root cause and actually troubleshoot the issue
... then you won't break even on troubleshooting for 270 days.
Chances are, in 270 days, you'll upgrade the particular package, stumble across the root cause of the problem somewhere else, or the problem will mysteriously go away.
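The arithmetic behind that 270-day figure can be sketched in a few lines of shell. The function name break_even_days is made up for this post, and the sketch assumes whole minutes and whole seconds so that integer arithmetic is enough:

```shell
#!/bin/sh
# break_even_days FIX_MINUTES WORKAROUND_SECONDS_PER_DAY
# Prints how many days of applying the daily workaround add up to the
# one-time cost of finding and fixing the root cause.
# (Hypothetical helper; integer division rounds down.)
break_even_days() {
  fix_seconds=$(( $1 * 60 ))     # one-time troubleshooting cost, in seconds
  echo $(( fix_seconds / $2 ))   # days until the workaround has cost as much
}

# 45 minutes of troubleshooting vs. a 10-second restart once per day
break_even_days 45 10   # prints 270
```

Plugging in your own estimates is a quick way to decide whether a recurring annoyance is worth a real investigation.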
Trying some shotgun solutions
First, get the status of all services:
brew services list
I got this output, which tells me that Postgres failed to start:
Next, attempt a restart:
brew services restart postgresql@14
... Dang, that doesn't work--we need to actually troubleshoot this.
Design an experiment to isolate and reproduce the issue, reducing the iteration time for fix attempts
In software, we're generally troubleshooting pieces of automated systems. We could make hypotheses, apply solutions, and attempt to test the entire system as a whole, but this generally takes too much time.
The reason is that these long chains of automation generally have very long iteration loops:
- rebooting computers
- making API requests
- compiling the entire project's source code
Since we know the shotgun solutions didn't work, and we're going to commit time to troubleshooting, our first action is to reduce this iteration loop to the smallest possible time.
Write down some manual commands you can run to reproduce the issue
We know there is a problem starting Postgres through Homebrew, therefore we look at the source code for the automated system that's executing that command:
vim ~/Library/LaunchAgents/homebrew.mxcl.postgresql@14.plist
- oftentimes, the call site will point you to special options or config used to run the command
- find the call site first, and then move on to configuration second, if there's nothing interesting
Under the ProgramArguments key, we can see the exact command and arguments used:
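If you'd rather not pick the command out of the XML by eye, a small helper can print just those entries. This is a rough sketch, not part of Homebrew: the function name program_arguments is made up, and it assumes the plist is stored as XML text with one <string> element per line. It has not been checked against every plist layout.

```shell
#!/bin/sh
# program_arguments PLIST_FILE
# Prints each <string> entry of the ProgramArguments array, one per line.
# (Hypothetical helper; assumes an XML plist with one <string> per line.)
program_arguments() {
  awk '
    /<key>ProgramArguments<\/key>/ { in_args = 1; next }   # start of the section
    in_args && /<\/array>/         { exit }                 # end of the array
    in_args && /<string>/ {
      gsub(/^[ \t]*<string>|<\/string>[ \t]*$/, "")         # strip the tags
      print
    }
  ' "$1"
}

# Example, using the file from this post:
# program_arguments ~/Library/LaunchAgents/homebrew.mxcl.postgresql@14.plist
```

On macOS, plutil -p followed by the file path also prints a plist in a readable form, which may be all you need.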
Step 2 of the troubleshooting model is done--since we have the exact command that Homebrew runs, we can just run it ourselves:
postgres -D /usr/local/var/postgres
This is our "experiment"--we can apply potential fixes, manually run this command, and iterate rapidly until the error is fixed.
No need to reboot our computer, or wonder if the Homebrew commands are hiding command output, or wonder if some other part of the system is interfering with starting Postgres--we have isolated and reproduced the error in a little test tube, on command.
Gather log samples from your experiment
In this case, the Postgres command logs additional output to the console:
And look at that--running the command manually produces a verbose error message, which basically points to our solution.
Aha! That must be a stale postmaster.pid file that was never deleted when my computer shut down yesterday.
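Before deleting anything, it can be worth confirming that the file really is stale. The first line of postmaster.pid holds the process ID of the server that wrote it, so one way to check is to ask whether that process still exists. The sketch below is an assumption-laden helper rather than anything Postgres ships: the name pid_file_is_stale is made up, and it assumes Postgres runs under your own user account (as it does with a Homebrew install), because kill -0 also fails for processes you aren't allowed to signal.

```shell
#!/bin/sh
# pid_file_is_stale PID_FILE
# Exit status 0: no running process has the PID on the file's first line.
# Exit status 1: a process with that PID is running (possibly an unrelated
#                one that reused the number after a reboot), so be careful.
# Exit status 2: the first line is missing or not a number; no claim is made.
# (Hypothetical helper; assumes the server runs as the current user.)
pid_file_is_stale() {
  pid=$(head -n 1 "$1" | tr -d '[:space:]')
  case "$pid" in
    ''|*[!0-9]*) return 2 ;;
  esac
  if kill -0 "$pid" 2>/dev/null; then   # signal 0 only tests for existence
    return 1
  fi
  return 0
}

# Example, using the data directory from this post:
# if pid_file_is_stale /usr/local/var/postgres/postmaster.pid; then
#   rm -f /usr/local/var/postgres/postmaster.pid
# fi
```

Removing the PID file while a server is actually running against that data directory is risky, which is why the helper errs on the side of reporting "not stale".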
Apply the fix and retry the experiment
Remove the offending file:
rm -f /usr/local/var/postgres/postmaster.pid
Retry the experiment:
postgres -D /usr/local/var/postgres
It starts up normally--we're done!
Conclusion
I hope this blog post has helped those of you searching for an answer to your Postgres start-up problem, and given you the seeds of a troubleshooting system you can use for the rest of your career.
Resources
- https://apple.stackexchange.com/questions/437618/why-is-homebrew-installed-in-opt-homebrew-on-apple-silicon-macs Discussion of why Homebrew moved its storage from /usr/local to /opt/homebrew