If I learned one thing at Monitorama 2016 in Portland, Oregon, it’s this: it has never been easier to monitor your infrastructure. Not only have the tools come a long way in the last few years, but the community and perspectives on monitoring have rallied as well, by focusing on the people who build and use monitoring systems.
Here are some of the key observations I made at this year’s conference along with a few tips and recommendations:
- You can build the metrics infrastructure you need, right now. The last five years have produced a sea change in the quality, coverage, and scalability of infrastructure monitoring software, as tools have modernized and started planning for cloud scale.
- Dashboards and visualization. Dashboarding and visualization are helping to improve tools for monitoring infrastructure stability. Aggregating monitored data in pluggable dashboards is becoming a more common and desirable state for operations professionals. Grafana deserves a special shout-out for doing great work on presenting monitoring data to humans.
- Reduce noise, improve data, and boost the effectiveness of your operations people. The signal-to-noise ratio of your alerts affects your ops team’s ability to work on active products. Too many alerts are either not actionable or too frequent. Alert fatigue can lead directly to short- or long-term productivity loss and increase churn in your ops or on-call organizations. Be sure you’re getting the right alerts so you can effectively resolve situations that require the dramatic problem-solving and adaptive powers of the human brain! Do not waste your people! If your on-call team is getting too many alerts, change your monitoring system so you’re getting the RIGHT alerts rather than ALL of them. On a similar note, review your alerts regularly (monthly or quarterly). When you do, you will likely remove alerts that no longer apply to existing systems and brainstorm a few new ones that cover more relevant scenarios and make your life easier. This will dramatically reduce noisy alerts, so that actionable and understandable data is clearly and immediately available to your operations people.
- Use the tools that meet your current scale. Premature optimization is still the devil. Gradually build the system you need. Investigate solutions: focus on composability, pluggability, and customizability. This will allow you to give the right tools to your developers and operations teams to solve the real problems or milestones you have to meet today. Do not spend your time implementing a vendored or internal solution that does not help you at your current scale. There are products out there that you can use to start solving problems the right way (with monitoring) today.
- Monitoring is a platform. Monitoring solutions need to be a platform so they can adapt to the individual customers, users, developers, or management teams that use them. This will make it easier for the people using your monitoring tools, and especially your operations engineers, to focus their attention on urgent problems. Give your dev team full ownership of their monitoring by providing them with a strong platform and great tools to use. Empower your users to understand and act more effectively on the monitored data you are producing.
- A final word: Monitoring tools are getting better. But people still need to be the center of progress in operations. It is time to start emphasizing that operations is about the behavior of people USING systems. Monitoring can make operations better as long as your tools help make the lives and jobs of operations professionals better. Be sure you work with vendors and colleagues who understand this; there is great work being done.
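
To make the regular alert review mentioned above a little more concrete, here is a minimal sketch in Python. It assumes you can export your alert history as a list of events and that someone has marked whether each one required action. The `AlertEvent` type, the alert names, and the thresholds are all hypothetical and are not tied to any particular monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertEvent:
    name: str         # alert rule name (hypothetical)
    actionable: bool  # did the responder actually have to do something?


def review_candidates(events, min_fires=5, max_actionable_ratio=0.2):
    """Return alerts that fire often but rarely require action."""
    fires = defaultdict(int)
    acted = defaultdict(int)
    for event in events:
        fires[event.name] += 1
        if event.actionable:
            acted[event.name] += 1

    noisy = []
    for name, count in fires.items():
        ratio = acted[name] / count
        if count >= min_fires and ratio <= max_actionable_ratio:
            noisy.append((name, count, ratio))
    # Most frequent offenders first.
    return sorted(noisy, key=lambda row: row[1], reverse=True)


# Made-up history for one review period.
history = (
    [AlertEvent("disk_usage_warn", False)] * 12
    + [AlertEvent("api_5xx_rate", True)] * 6
    + [AlertEvent("cpu_spike", False)] * 2
)
print(review_candidates(history))  # [('disk_usage_warn', 12, 0.0)]
```

The thresholds are only a starting point. The output is a short list to bring to the monthly or quarterly review, where people still decide whether each alert should be tuned, removed, or kept.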