01 July 2016

🎶They roll down stairs and travel in pairs and over your devops dog🎶

I swear this dog looks like our devops guy

Well. Hopefully. Large numbers of software applications (including some of our own at work) don't log enough, log the wrong things, aren't actually logs, or don't end up anywhere useful. But what's a developer to do? How do we know when to log? Or what?

Logs are pretty cheap. In a typical configuration, logs are collected locally where they are spooled before being sent to a collection point (or a collection of collection points). rsyslog is capable of efficiently handling thousands of records per second. It is also capable of filtering the logs based on arbitrary properties of the log record, including the message itself. In a network with sufficient throughput the network overhead associated with even verbose logging from a busy application is rarely a bottleneck. All of these factors combined mean that from an infrastructure standpoint we should be going wild with logs. Assuming your application can be easily configured by your administrator (or by yourself) you can safely log large volumes of information.

At the application level, logging is often done synchronously. In Python (particularly the cPython implementation), there is no real threading and no opportunity to offload logs to a separate worker thread. While unfortunate, applications rarely suffer from performance issues here, especially when configured to send logs locally. It is safe to log frequently in this manner with little performance impact.

More frequently performance issues can arise when attempting to log large amounts of information that involve a side effect. For example, an application loads some information from the database and logs it. If the logging process causes additional information to be loaded from the database, this can be problematic especially when iterating over a large collection and performing a load at each step. If your application is typically long running, slowdowns like this can go unnoticed. Caution must be exercised when logging data that may have network, disk, or process related side effects.

Even with the above caveats, you still probably aren't logging enough. In most cases, the performance issues encountered above are more acceptable than a complete lack of logging.

Not Logging Enough

There are three aspects of not logging enough: information, context, and frequency.

What information you should log is actually a lot more straightforward than most people realize. When you call a function and want to log that it is being called, some obvious information to log are the parameters passed to the function. You may, additionally, want to note the return value of the function or the fact that the function is returning. This pattern (ingress and egress logging) can also provide useful information in the form of timestamp deltas so you can see how long the function call took. You may also want to actually log the execution time itself. This sort of logging should generally be used for functions that are major entry points for your application or any function that has an extended running time (as a guideline, anything over a second or two is probably "extended").

You should also strongly consider logging whenever your service makes a call to an external resource, regardless of the time taken. External resources like databases, files (yes, logging that you're going to open a file and where it is can be very helpful), APIs (even internal), and subprocess invocations all should be logged. And you should include as many of the parameters as you are able to when doing so. Return values as well (see my comments on timing above) are useful especially when the call is to a database service or API.

Your logs need context. "updated client" is not a particularly useful message. Don't assume you'll be able to locate your log message based on a string like that (and don't log empty strings. Jimmy logs empty strings). If you updated a client, that client record probably has an identifier and a name. Include them! We use SQLAlchemy as our ORM and our base class that all other models inherit from include a __repr__ that interpolates the class name and the primary key column values into the output. The Client model can override this to include the name. SQLAlchemy also provides for an info attribute where you can stash any information you want. It's not a bad place to indicate extra fields you want interpolated into your __repr__. So the above message becomes "updated client <Client 1 'Jimmy, Inc.'>".

Of course, you probably want to include what it is you updated! SQLAlchemy's change events are great for this because they will allow to define once that you want to listen for attribute changes and then allow you to log them all. "set attribute 'phone" on <Client 1 'Jimmy, Inc.'> to '999-999-9999' from '000-000-0000'" can be automatically generated for every change. Pretty cool.

What other context is there? In the information examples I gave, I typically also log the function name so I can tell where it is being called from. You can also use stack inspection techniques to locate the module, class, function, file, and line, but there can be an associated performance penalty associated with doing so. The example above for automatically logging attribute changes can suffer from a lack of context but can be corrected using this technique.

Logging via syslog (and many other methods) also typically include other helpful context by default. Things like time and hostname are some examples. Software versions are also helpful, though you may want to log these just once at application start. If you're logging from a device and it has a physical location, you may want to include that location in your log records, especially if the device can move. Other aspects of the environment are natural targets for inclusion as well.

We have many processes that log some information when they begin running. For processes running from cron, this serves as a kind of heartbeat. For web applications, logging when a request begins and returns can serve the same purpose, especially when a health check is being frequently executed on the service. Long running daemons that wait for work (for example, by selecting on a socket) should log after each select timeout. Just because your application's process is alive doesn't mean it is doing anything useful. Logging frequently can help diagnose problems when your application hangs or becomes deadlocked.

Record twice, log once, read over and over again

Log records should be easily readable by humans. You aren't really doing anyone any favors by logging in XML, so please never, ever do that. JSON is also not a log format that you really want to read. If log in binary from Python, consider getting a job in a different field (obviously we all have constraints. I build server based applications in Python. Embedded devices in C might be a different story). Logs should read like a biography of your process' lifetime.

But that doesn't mean we should create logs that can't be read by machines. Very often, we would like to mine information from logs, get statistics (we're all using some form of statistics collection, right?), and sometime just want grep to be useful. These considerations also require that our logs be readable by programs as well. Typical formats often include metadata and context in a structured manner. This is good. But we should also be mindful of the fact that our messages need some structure, too. When we log times, people like formats like "1 day, 3 hours, 46 minutes, and 37.23 seconds". This is how people think about and communicate time. But searching for processes that ran longer than four hours with this format is less than ideal. You should also include the time in milliseconds so that you can easily extract and filter on that value from a program.

Identifying Logs

yep, that's wood
print("updated client {}".format(repr(client)))

Not a log. Don't do that. Especially if you are a module author. Your users will hate you.

sys.stderr.write is also not logging.

with open("/tmp/derp.log", "a") as logfp:
    logfp.write("updated client {}".format(repr(client)))

Still not logging. These approaches work in a pinch but are definitely not a structured approach.

logger = logging.getLogger(__name__)
logger.debug("updated client {}".format(repr(client)))

Nailed it. This is how you log in most Python applications. And this is the method you should always use. You should also be loading your logging configuration using logging.fileConfig. Your sysadmin will love you. Using fileConfig allows complete control over your logs, their format, and how they're emitted to other locations. Unless you have special needs or are using an incompatible system (I'm looking at you, twisted) this is the method you WILL use.

As an added bonus, the entire Python ecosystem logs this way. Modules can be instrumented with logging and with the appropriate configuration you can collect the logs from them as well. Want to see what SQLAlchemy is doing? How about requests? It's as simple as configuring your logging to collect from their loggers.

What level should we log at?

debug is your go-to level. When in doubt, it's usually a safe choice. Most items mentioned above will be debug level. Some items like start events, quit events, and the execution of slow functions should log slightly higher at info. This affords your system administrator the ability to discard debug level logs if so desired without sacrificing important information regarding the application's current state (and you should let them make that decision. Log!) Warn should be used when the application encounters a state in which it can continue but that should be brought to the attention of someone. A condition such as an unexpectedly slow network call is worthy of a warning. Errors are anything that prevent the normal functioning of your application. If at all possible, your application should fail gracefully. But it should most certainly log the condition that resulted in the error. If you're executing a query and need to ensure some data is present before proceeding, that missing data should result in a log emitted at error level. Python also includes a logger.exception method. You should use this when you are handling an exception and want the stack trace automatically logged for you as well.

Now that we've got logs, what do we do with them?

Send them to a collection server. rsyslog supports collecting logs locally, relaying logs, and also storing your logs somewhere useful. You can then take those logs and stash them in other tools, like the popular elasticsearch-logstash-kibana (ELK) stack. You can send them to a third party log collection and analysis service like loggly. You can configure rsyslog to email you errors. Just don't throw them away! There's a lot of valuable information in those logs that you may have never known that you needed, especially when your devops gets up at 3am and subsequently wants to kill you.