We use logs for everything, to track errors, measure performance, keep a journal of how our software runs, or even debug code in production. Since we use it so much, we should be good at it by now, right?
Surprisingly, well written, useful logs are not frequent. The norm is logs full of garbage, old debug statements ("I was here"), unhandled exceptions, and non actionable information.
You read the file, looking for that one word, that should be followed by another word, but two threads are performing the same task and each will output the two log messages. Maybe, if you look at the timing, you'll know which is which.
You try to make sense of multiple interleaving messages, but they come from two machines with desynchronized clocks, and one of them is sending batches of messages every second instead of sending them as they are generated. You wonder if the code is executing in the right order and write essays about causality and determinism.
I have a hundred stories like those, and there's a common theme, a root cause for the way we write logs. As a developer, we think of them first as a developer's tool, not as the main interface to check your app's health in production. They are meant to be read with tail
while the application runs on small workloads, for a limited time.
Developer logs VS ops logs
Managing software in production means detecting when something is wrong, but knowing every bit of the program's state is a counterproductive way of doing it.
You want to know the important metrics:
- someone consulted a web page: leave that to your web analytics software
- someone consulted a web page, but the app failed to answer: log it
- someone consulted a web page, and the page's size is larger than the buffer:
don't log it, it's useless most of the time - someone bought something: log it, even if you have a backend to see that
- someone bought something and the server failed: log it and send an email to
the dev team to fix it right now
More generally, here is a scale from ops log to developer log:
- a transaction has been performed, successfully or not => ops logs
- individual steps of that transaction failed => ops logs
- individual steps of that transaction succeeded => dev logs
- value of variable X at point Y in code => definitely developer logs
Here is an example (from code I wrote) of developer centric logs:
I am not saying developer info should not appear in logs. Just that it is best hidden under "debug" or "trace" levels, if your logging system supports levels (most do). It is fine to have useless messages in there, as long as they are not used in production/ It is fine to use "printf debugging" in development, but it should never appear in production.
Your goal in writing logs is to spare time for the person that will read them. It may be a sysadmin in your company, a client using your software, or yourself, six months from now, trying to put the app back online at 2 AM. Please think of your future self.
Making the logs more ops friendly
To make the life of the journal's reader easier, you need to optimize for two reading engines:
- filtering software, like grep
- the human eye
Why the human eye?
Because we are good at detecting patterns, and filtering out the useless parts of an image. If a sequence of three lines appears regularly, we will see it easily. If we only care about the message part of the log, not the prefix (time, PID, etc), we will focus our attention on it and ignore the rest.
The consequence is that everything that will break the brain's flow will make your logs harder to read. if the user id appears at the beginning of one message, but at the end of another one, it will be much harder to see which messages are related.
Fortunately, what works well for the human eye also works for filtering tools. Standardized, common prefixes are easy to search for, and easy to recognize. Related information should have identifying information, always stored in the same place.
Practical advice
How can you apply those principles right now? You begin by making a small wrapper over your logging library to automatically insert useful information, in a fixed format:
- a timestamp, preferably in ISO8601 (easier to read) at UTC (no timezone conversion
when reading). By the way, make sure your servers are all set up to use the same
timezone, this will save headaches - a timestamp from a monotonic clock if your application is time critical
- identify the current instance: add a server identifier (name, IP, whatever)
and an instance identifier (process id, thread id) - the running code's version (commit id or version number)
- the file's name, line number, class and function names are useful for debugging,
so add them for the "debug" and "trace" levels (but don't activate those levels in
production unless you have a good reason) - add some correlation information: the user id, a request id, anything that
will let you track which action resulted in which messages - code status: are we in the middle of an error? Is something pending?
- then, at the end, you can put a written message. You can use structured logging
instead of raw text if you want to track data with more automated tools
This is a lot of information to put on one line, but we have great tools at our disposal. We can filter on one of these fields and remove it from the output. We can use terminals larger than 80 characters. We can even color parts of those logs to let the eye separate them easily.
If I had to rewrite the previous example logs that way:
Side note: beware exceptions. It is fine to display an exception's stack trace while debugging the code, but an exception in production means two things:
- you forgot to replace the stacktrace with a proper error message for an exception
you handled - there's an exception that you do not handle in your code
In both cases, it is a signal that something wrong happened and that it should be fixed soon. Also, it usually messes up the log's format, and fills up the log with useless information. I once heard about an app that needed to run on machines with big disks and big CPUs, because there were so many exceptions logged that the machine could not keep up.
Anyway, The goal of that approach is to have a common format for every message, simple to parse and filter. Make sure that those common parts have the same length, and that you use the same separators everywhere. Tabs are usually better than spaces, since we rarely use them in log messages. This will make the logs easier to read, and much easier to filter for.
With a good logging discipline, you will soon see non conforming messages as bugs, and you will be much more efficient when debugging and operating your application.