Logs and Ops – Streamlining Troubleshooting and Management of a Hosted Atlassian Environment

Simplifying and streamlining troubleshooting and management is critical for getting the most out of your Operations team. Today I’d like to talk about some of the tools we use at Avant to help us do just that.

So Many Logs, So Little Time

A huge challenge with any online hosted service is working with logs, and extracting meaningful data from them.

Imagine an application with the following components running on a number of Linux VMs:

  • nginx reverse proxy / cache
  • Tomcat container running JIRA
  • PostgreSQL database

Once this stack hits production, it will generate a lot of log data to sift through.

What is the alternative to logging on to N VMs and grepping through logs? The ELK Stack (ElasticSearch + Logstash + Kibana). This tool set has been a huge boon for troubleshooting efficiency. Searching and reporting based on logs has never been easier.

In our configuration, each node forwards the logs for its role, plus intrusion detection logs, to a centralized Logstash instance, which filters them and outputs the results to the ElasticSearch backend. A monitoring tool such as Nagios or Zenoss also forwards its check results, which are aggregated in ElasticSearch alongside the logs.
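To make the filtering step concrete, here is a minimal Python sketch of the kind of transformation a log filter performs: it turns one raw nginx access log line into a structured event that a search backend can index. This is an illustration, not our actual Logstash configuration. It assumes nginx's default "combined" log format, and the `host` and `role` field names are hypothetical.

```python
import re
from datetime import datetime

# Pattern for nginx's default "combined" access log format (an assumption;
# a custom log_format directive would need a different pattern).
COMBINED = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line, host, role):
    """Turn one access log line into a structured event, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Store numeric fields as numbers so they can be range-queried later.
    event["status"] = int(event["status"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    # nginx writes timestamps like 25/Feb/2014:16:42:52 +0000
    stamp = datetime.strptime(event.pop("time"), "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = stamp.isoformat()
    # Hypothetical field names that tag where the line came from.
    event["host"] = host
    event["role"] = role
    return event
```

Tagging every event with the node and its role is what later makes it possible to pull up, say, all nginx entries across every VM in a single search.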

Now we have centralized logs that are quick and easy to search, which makes it simple to correlate events. You can easily pre-configure dashboards that gather all the usual suspects in one place for troubleshooting (nginx error logs, Tomcat errors, slow database queries, etc.).
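As a rough idea of what sits behind such a dashboard panel, the sketch below builds a search request body for "recent errors from one role". It only constructs the query as a Python dictionary and sends nothing. The `role` and `message` field names are hypothetical, and the `bool`/`filter` syntax assumes a reasonably recent ElasticSearch release, so check it against the version you run.

```python
import json


def recent_errors_query(role, minutes=15, size=50):
    """Build a search body for the newest error messages from one role."""
    return {
        "size": size,
        # Newest events first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Full-text match on the log message (hypothetical field name).
                "must": [{"match": {"message": "error"}}],
                "filter": [
                    # Exact match on the role tag (hypothetical field name).
                    {"term": {"role": role}},
                    # Only look at the last N minutes.
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(recent_errors_query("tomcat"), indent=2))
```

Swapping the role or the time window gives the nginx and database variants, which is why a handful of saved searches covers most routine troubleshooting.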

Why Am I Still Doing This Manually?

Doing things manually has some major drawbacks: it is labor-intensive and error-prone, and you can only trust people with the right skills to do it. When your Operations team has limited capacity, it is time to automate the routine, time-consuming tasks.

Enter Rundeck.

This excellent tool does exactly what it says on the tin: it ‘enables self-service operations’. Combined with the ELK stack, an operations engineer can look at the logs, troubleshoot the problem, and follow up by restarting a failed service, without having to log on to a box or tail logs. Here are some of the things we use Rundeck for:

  • Managing services (start/stop/refresh)
  • Re-configuring JVMs (add debugging, change memory parameters)
  • Initiating and scheduling backups and maintenance windows
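Tasks like these can also be triggered from scripts through Rundeck's HTTP API. The sketch below builds (but does not send) a request to run a job. It is a sketch under assumptions: the `/api/<version>/job/<id>/run` path, the `X-Rundeck-Auth-Token` header, and the JSON `options` body reflect the Rundeck API as we understand it and vary by API version, so verify them against the documentation for your release. The server URL, API version, job ID, token, and option names shown are all hypothetical.

```python
import json
import urllib.request


def build_run_request(base_url, api_version, job_id, token, options=None):
    """Build a POST request that asks Rundeck to run one job."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    # Job options (for example, which service to restart) go in the JSON body.
    body = json.dumps({"options": options or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,  # API token generated in Rundeck
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    request = build_run_request(
        "https://rundeck.example.com",
        18,
        "restart-jira-job-id",
        "example-token",
        {"service": "jira"},
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it. Keeping the build step separate makes it easy to inspect or log the request before anything touches a production service.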

Rundeck also serves as a centralized cron server – with the added benefits of a slick web interface for activity reports, as well as notifications. This gives visibility to the operations team on everything that is supposed to happen as a scheduled task.

So What Does This Mean To Our Customers?

In the end, we provide a very high quality hosting service. Our operations team has the tool set at its fingertips to proactively address issues before they become problems. If a problem does occur, we can rapidly address it with our streamlined troubleshooting process.

Mistakes arising from human error are minimized with a library of tested, repeatable operations tasks. This means less risk of downtime and faster maintenance windows.

 

 
