
Reliable UNIX Log Collection in the Cloud

One way organizations can improve their security and operational capabilities is to collect logs in a central location. Centralized logging gives engineers across the entire organization a “common view” of the system under load, and can provide vital shared context when things go wrong.

Over the last few months, we at Threat Stack have been reworking how we handle all aspects of our logging system. This project encompasses everything from the content of our log data to the infrastructure that collects it. In this post you’ll learn how our internal applications send log data, where they send it, and the trade-offs we considered in making our collection system reliable.

Where Will The Logs Go?

Before figuring out what we were going to capture, we looked at a variety of logging platforms. Since we were going to pick something, it was worth evaluating how extensible each option was, how well we could understand its inner workings, and so on. Some solutions were very expensive; some were very rudimentary and involved a lot of UNIX pipes. Our goal was to find a balance between these two extremes. On one hand, spending money to solve a complicated problem is an excellent way for an engineering organization to free itself up for other projects. On the other, we don’t have the resources to build the most robust logging platform on our own.

We didn’t want to stick with our existing SaaS vendor or select a new SaaS offering because centralized logging functionality is largely commoditized (i.e., little additional feature value beyond available non-SaaS options) and the risk of application data leaking sensitive information was too large. We’re fortunate that we don’t have to enable debug logging in production often. Even so, it’s important to ensure that our engineering team can enable this verbose logging when necessary. We made a balanced risk-oriented decision based on the type of data our logs could collect, leading us to a non-SaaS solution.

In the end, we found Graylog2 to be a convenient balance between these extremes for the bulk of our log data. It’s a tool that would get us most of what we were looking for, without having to write a lot of our own tooling around it. Graylog2 supports many log input mechanisms and formats, allows us to split up log data into “streams” that we can apply access control rules to, and has a decent user interface for configuration and parsing. A plugin for Slack allows us to alert on certain log messages, which is useful for collecting information on application errors in one place.

Graylog2 also uses ElasticSearch for data storage. Since we already have ElasticSearch in production along with all the accompanying knowledge and automation, we were okay with incurring that infrastructural cost.

How Do They Get There?

Applications generally write logs to a file on the local system. Many infrastructure engineers (including myself) have started and ended their implementation of centralized logging by looking at where these files are written, configuring some sort of agent to collect them, sending them to the central host, and then flying the “Mission Accomplished” banner.

One early lesson learned from working on centralized logging: this approach works well for syslog data, which generally follows a pattern of one line providing enough context for one event. It is not “one size fits all” though. Many of our applications emit full stack traces when they hit errors, which requires (often complex) multi-line parsing so that the logging system understands where a particular message starts and ends, which fields we’d like to search on, and so forth. This can get complicated quickly.
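As an illustration of the kind of multi-line handling this takes, here is a sketch of a Filebeat prospector that joins Java-style stack trace lines (indented continuations) onto the preceding event. The log path is hypothetical; the multiline options are standard Filebeat settings:

```yaml
filebeat.prospectors:
  - paths:
      - /var/log/myapp/app.log   # hypothetical application log path
    # Lines starting with whitespace (e.g. "    at com.example...") are
    # appended to the previous line, so a stack trace stays one event.
    multiline.pattern: '^[[:space:]]'
    multiline.negate: false
    multiline.match: after
```

This only solves the "where does the event end" problem; extracting searchable fields out of the trace is a separate parsing step.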

With this in mind, we decided to go further: Were there improvements we could make to our collection of services (written in Scala and Node.JS) so that they logged in formats that would be easier to bring into Graylog2 and search on? It turns out the answer was yes, and it was less complicated than we had imagined.

Shipping Logs From Scala Applications

Our collection of services written in Scala uses the LOGBack logging framework. The logstash-gelf plugin for LOGBack (and many other common JVM logging frameworks) will take a message and send it off to Graylog2 with minimal configuration. After adding the dependency to our applications, we used Chef to add the following block (and a corresponding appender-ref entry) to logback.xml:

<appender name="gelf" class="biz.paluch.logging.gelf.logback.GelfLogbackAppender">
    <timestampPattern>yyyy-MM-dd HH:mm:ss,SSSS</timestampPattern>
</appender>
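
The appender-ref entry mentioned above is what actually activates the appender; a minimal sketch of that wiring (the level is a placeholder and would depend on the service):

```xml
<root level="INFO">
    <appender-ref ref="gelf" />
</root>
```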

Once our app restarted, logs started flowing into Graylog2. The ability to add in customized additional fields at the application level allowed us to create customized “streams” that capture all application data in a particular environment.

Shipping Logs From Node.JS Applications

On the Node.JS side of the house, we moved to the Bunyan logging library. Bunyan outputs log data as JSON and allows you to add arbitrary objects to the log output. Additionally, built-in serializers can parse certain types of objects (req, res, err) and keep only the important bits of that information so you don’t store more than you need. We use the gelf-stream library to send that data to Graylog2.

Moving away from entirely human-readable logs can cause some engineering teams a certain amount of angst: What happens if I’m debugging on a particular machine? One of Bunyan’s big selling points is that it ships with a CLI tool to parse and print records in a more legible format.

For example, if we call log.info({ test: true }, "Success: took " + sec + " seconds") in our code, the log output would be:

  "msg":"Success: took 4.09 seconds"

When parsed by the Bunyan CLI tool, this message displays as:

[2017-02-06T14:59:22.632Z]  INFO: api/28182 on i-12341234123412341: Success: took 4.09 seconds (logtype=application, environment=development, test=true)

Messages with a larger object attached would have the object placed under the log message; key/value parameters appear in parentheses.
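The transformation the Bunyan CLI performs can be sketched in a few lines of plain Node (no dependencies). The record below mirrors the JSON shown earlier; the rendering logic here is our own simplified version, not Bunyan's actual implementation:

```javascript
// Fields Bunyan treats as structural; everything else becomes a
// key/value parameter in parentheses.
const CORE = new Set(['v', 'name', 'pid', 'hostname', 'level', 'msg', 'time']);
const LEVELS = { 30: 'INFO', 40: 'WARN', 50: 'ERROR' };

function render(line) {
  const rec = JSON.parse(line);
  const extras = Object.keys(rec)
    .filter((k) => !CORE.has(k))
    .map((k) => `${k}=${rec[k]}`)
    .join(', ');
  return `[${rec.time}]  ${LEVELS[rec.level]}: ${rec.name}/${rec.pid}` +
    ` on ${rec.hostname}: ${rec.msg}` + (extras ? ` (${extras})` : '');
}

// One JSON log line, as Bunyan would emit it.
const line = JSON.stringify({
  v: 0, name: 'api', pid: 28182, hostname: 'i-12341234123412341',
  level: 30, time: '2017-02-06T14:59:22.632Z',
  logtype: 'application', environment: 'development', test: true,
  msg: 'Success: took 4.09 seconds',
});
console.log(render(line));
```

Running this prints the same human-readable line shown above, which is the shape the real CLI produces.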

One thing that made this a little more complicated than our Scala projects is that it required modifying the applications themselves to provide more context. We chose to do this for the majority of our larger services at the time of implementation, rather than waiting.

Shipping Syslog

For logs still on hosts that actually do subscribe to the “one line, one event” paradigm, we use Graylog2’s “Collector Sidecar” to grab logs. Graylog2’s collector calls out to the Graylog2 server, and configures Filebeat to send files based on a series of tags. One neat feature of Graylog2 is that it supports “extractors” which can be used to parse log entries into specific fields and events. One way we used this functionality was to parse specific log files, such as dpkg logs or VPN logs.
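To illustrate the kind of pattern an extractor applies, here is a sketch of a regex pulling searchable fields out of a typical dpkg.log line. The field names are our own choices, and the sample line is hypothetical:

```javascript
// dpkg.log lines look roughly like:
//   "2017-02-06 14:59:22 install nginx:amd64 <none> 1.10.0-1"
// Capture groups: timestamp, action, package.
const dpkgPattern = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\S+)/;

const line = '2017-02-06 14:59:22 install nginx:amd64 <none> 1.10.0-1';
const [, timestamp, action, pkg] = line.match(dpkgPattern);
console.log({ timestamp, action, pkg });
```

In Graylog2 the equivalent pattern lives in the extractor configuration rather than in application code, but the matching logic is the same.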

Shipping Our Chef “First Boot” Logs

We have a lot of tooling to keep track of host status with Chef. That said, one issue with the cloud is that when bringing hosts up we lack a system “console” that we can connect to for information on any issues with bootstrapping a new node. As a result, we continue to use a SaaS service for this purpose. Since Chef allows us to filter out sensitive bits of data, we’re able to safely utilize a third party for this data.

Technical Issue: Making UDP More Ops-Friendly

Many tools that revolve around system logging use UDP as a transport mechanism. UDP (as opposed to TCP) is connectionless: a client just sends packets toward a server, with no guarantee that they are received. This has both speed and operational advantages. Imagine your logging service is down. Should your applications block while trying to set up a connection to the log service? In some cases, perhaps, but many applications are not written with this error case in mind.

With TCP connections, load balancers are often used to distribute traffic across multiple hosts to solve this problem. With UDP, since there are no connections, we can’t use existing tooling to help us out. Load-balancing software generally (and correctly) wants a response from the target service to determine its health, not an unrelated check, and Graylog2’s UDP receivers don’t send any “acknowledged” response, which makes monitoring them more difficult.

Our solution is to use Consul to monitor Graylog2’s status API on each host, NGINX on each host to actually “load balance” the UDP service (a capability announced last March), and consul-template to dynamically reconfigure NGINX when a listener drops out. It’s not a perfect solution, but we lose fewer messages and can safely perform upgrades and other maintenance this way.
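As a sketch of what this looks like, here is a hypothetical consul-template fragment for the NGINX stream configuration; the service name, port, and upstream name are placeholders, and the {{ ... }} blocks are Consul queries that consul-template re-renders as listeners come and go:

```nginx
stream {
    upstream graylog_gelf_udp {
        {{- range service "graylog" }}
        server {{ .Address }}:12201;
        {{- end }}
    }
    server {
        listen 12201 udp;
        proxy_pass graylog_gelf_udp;
        # GELF UDP receivers never reply, so tell NGINX not to wait
        # for a response from the upstream.
        proxy_responses 0;
    }
}
```

When Consul marks a Graylog2 node unhealthy, consul-template rewrites this file without that server line and reloads NGINX, so senders keep pointing at the same local port throughout.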

Wrapping It Up

We’ve been using this infrastructure in our application environment for the last six weeks or so. Our largest production services are logging there now, and we’re using the infrastructure for searching today. Our engineers love it and we’re pleased with the pace that new features get added to the application.

If you’re looking to do the same: set up a proof of concept, show it to your staff, and come up with a few example use cases that solve problems they have. Don’t be afraid to roll up your sleeves and dig into some of their code if that’s an option; many times this work is critically important (just like all the other things that are critically important). Being able to help out will move things along more quickly.

Revamping our logging infrastructure has been a multi-month project that has happened in concert with other security efforts within Threat Stack. Hopefully this post can help you with your own logging journey. We’d love to have links in the comments!