From proof-of-concept to production: Building the centralized logging system using ELK
As an intern on the Platform API team at MediaMath, I worked on developing an initial proof-of-concept for a centralized logging system, using the Elasticsearch, Logstash, and Kibana (ELK) stack.
Before a centralized logging system was built, the Platform API team faced the challenge of logs being scattered across multiple servers. Investigating an issue meant searching one server, then the next, and so on, and then stitching the evidence together to form a theory. It was hard enough to investigate an already-reported problem; spotting problems ahead of time was pretty much impossible.
The solution: build a centralized logging system.
For my proof-of-concept, I used the ELK stack. ELK comprises three tools: Elasticsearch, Logstash, and Kibana. Together they form a great tool for analyzing logs, providing a centralized way to search and visualize log data.
- Elasticsearch is a distributed full-text search and analytics engine that is easy to scale.
- Logstash collects, parses, and enriches event data and abstracts away log collection and management.
- Kibana is a powerful and beautiful data visualization tool for the collected logs.
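To make the division of labor concrete, here is a minimal Logstash pipeline sketch: Logstash tails a log file and ships events to Elasticsearch, which Kibana then queries. The file path and host are placeholders for illustration, not our actual configuration (this follows the Logstash 1.x-era config syntax).

```conf
# Minimal Logstash pipeline: tail a log file, index events into Elasticsearch.
input {
  file {
    path => "/var/log/app/*.log"   # hypothetical application log path
    type => "applog"
  }
}
output {
  elasticsearch {
    host => "localhost"            # Elasticsearch node; Kibana reads from the same cluster
  }
}
```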
When I initially installed ELK on my local machine, I opted for the default configuration settings. To get familiar with how ELK works, I experimented with pushing data through the command line, and played with the search and query features through the Elasticsearch REST API. Next, I collected logs from remote QA servers over a reverse SSH tunnel, and then I created a distributed setup where each subsystem ran on an independent OpenStack instance.
My initial ELK architecture:
- Logstash Shippers: Collect log data from the QA and database servers.
- Message Queue: Buffers the logs coming from the shippers. We used RabbitMQ because of its persistence: even if the receiver side fails, logs keep accumulating in the queue and are pushed through once the receivers are up again. A pub/sub pattern is also an option if there are multiple consumers.
- Logstash Indexer: Parses and filters logs. I relied heavily on Grok to bring structure to the logs, which were of four types: syslog, Postgres, Apache, and error logs. I tried the multiline filter to carry state between two log events, but it did not work as required.
- Elasticsearch: Stores and indexes the logs. We went with one cluster of three extra-large nodes, since Elasticsearch requires a lot of storage, CPU, and JVM heap space. Elasticsearch provides both a configuration file and a REST interface for setting parameters such as the number of shards, the number of replicas, and the amount of memory to use.
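As a sketch of how these pieces wire together, here is roughly what the shipper and indexer configurations look like in Logstash 1.x-era syntax. The hostnames, queue names, and file paths are illustrative, not our production values:

```conf
# --- Shipper (runs on each QA/database server) ---
input {
  file { path => "/var/log/syslog" type => "syslog" }
  file { path => "/var/log/apache2/access.log" type => "apache" }
}
output {
  rabbitmq {
    host          => "mq.internal"   # hypothetical broker host
    exchange      => "logs"
    exchange_type => "direct"
    key           => "logstash"
    durable       => true            # queue survives broker restarts
    persistent    => true            # messages are persisted to disk
  }
}

# --- Indexer (consumes from the queue, parses with Grok, indexes) ---
input {
  rabbitmq { host => "mq.internal" queue => "logstash" exchange => "logs" key => "logstash" }
}
filter {
  if [type] == "apache" {
    grok { match => [ "message", "%{COMBINEDAPACHELOG}" ] }
  }
  if [type] == "syslog" {
    grok { match => [ "message", "%{SYSLOGLINE}" ] }
  }
}
output {
  elasticsearch { host => "es.internal" }   # hypothetical cluster host
}
```

The `durable`/`persistent` options are what give RabbitMQ the buffering behavior described above: if the indexer goes down, messages sit on disk until it reconnects.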
Fig 1. Screen grab of Kibana showing log data for the last two months.
Since Elasticsearch is the central component of the stack, getting its configuration right was vital. After tinkering with different settings, we ended up allocating half the machine's RAM to Elasticsearch because of its heavy use of JVM heap. We set it up with five shards and one replica per shard. I wrote a shell script that ran Elasticsearch as a service/daemon process. Logs for a single day went into one index, giving us one index per day.
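In Elasticsearch 1.x-era configuration, the settings described above look roughly like this; note that the heap size is set through the environment rather than the YAML file (the `mlockall` line is a common recommendation I am including for completeness, not necessarily part of our exact setup):

```conf
# elasticsearch.yml
index.number_of_shards: 5      # five primary shards per index
index.number_of_replicas: 1    # one replica per shard
bootstrap.mlockall: true       # keep the JVM heap from being swapped out

# In the service environment (not the YAML file), e.g. for a 16 GB node:
#   ES_HEAP_SIZE=8g            # half the machine's RAM, as described above
```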
We found Elasticsearch to be good for log analysis for several reasons:
- It provides a schema-free data model, and hence can store logs whose fields vary from source to source.
- The search functionality includes filtering, pattern matching/text search, range queries, and faceting (aggregation of statistics).
- Percolation is reverse search: instead of running a query against stored documents, Elasticsearch matches a document against a list of stored queries. The percolation feature can be used to raise alerts/notifications when the system shows unusual behavior.
Logs were preserved only as long as they were needed, and everything older was deleted. Because we had one index per day, deleting old logs was just an API call specifying the index's date stamp.
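Because daily indices carry a zero-padded date stamp (Logstash's default pattern is `logstash-YYYY.MM.DD`), finding which ones have aged out reduces to a string comparison. A minimal sketch; the retention window and index names here are illustrative:

```python
from datetime import date, timedelta

def index_name(day):
    # Logstash's default daily index pattern: logstash-YYYY.MM.DD
    return "logstash-" + day.strftime("%Y.%m.%d")

def expired_indices(today, retention_days, existing):
    """Return the index names older than the retention window.

    Zero-padded date stamps sort lexicographically, so a plain string
    comparison against the cutoff index name is enough.
    """
    cutoff = index_name(today - timedelta(days=retention_days))
    return sorted(name for name in existing if name < cutoff)
```

Each expired name then becomes a single `DELETE /<index-name>` call against the Elasticsearch REST API.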
The use cases supported by the project were:
- Time-series event-log analysis. Since most of the logs register timestamps, analyzing those accordingly gives good insights. Any anomalous spikes clearly suggest something is wrong.
- Verifying daily CRON jobs.
- Finding patterns in log data by querying. Using logical operators can allow for refined search results and drilling down to a particular set of events.
- Basic percolation was implemented as a trigger mechanism upon seeing the term “exception.”
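With the Elasticsearch 1.x percolator, that trigger amounts to registering a stored query and then percolating each incoming event against it. A sketch of the request/response shapes; the index name, query ID, and field names are illustrative:

```
# Register the stored query under the special .percolator type:
#   PUT /logstash-2014.07.04/.percolator/exception-alert
{ "query": { "match": { "message": "exception" } } }

# Percolate an incoming event against the stored queries:
#   GET /logstash-2014.07.04/message/_percolate
{ "doc": { "message": "unhandled exception in worker" } }

# The response lists which stored queries matched, e.g.:
{ "matches": [ { "_index": "logstash-2014.07.04", "_id": "exception-alert" } ] }
```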
Logstash filters are CPU-heavy, so if the logs are properly formatted at the source itself, this step becomes very fast. Properly configured, Logstash can process and emit over 2,000 events/sec.
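One way to push the formatting to the source is to have applications emit each log record as a single JSON line, so the indexer can use a JSON codec instead of CPU-heavy Grok parsing. A minimal sketch in Python; the field names follow Logstash's `@timestamp` convention, and the exact fields are an assumption for illustration:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each record as one JSON line, so the Logstash indexer can
    decode it directly instead of grokking free-form text."""
    def format(self, record):
        return json.dumps({
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })
```

Attached to a handler, this turns every log call into a structured event that needs no server-side parsing.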
Fig 2. Screen grab of Kibana showing the pattern for CRON jobs.
What I learned:
In the case of Elasticsearch, increasing the number of shards improves indexing performance, since documents are indexed across shards in parallel. Increasing the number of replicas improves search throughput, since replicas can serve queries concurrently. These two parameters should be tuned to the application/use case.
As for the memory requirements of Elasticsearch, they depend on the volume of incoming data to be indexed and on the types of search queries that need to be supported.
I did basic profiling of the Elasticsearch cluster using the BigDesk tool, looking at heap space, system memory, read/write ops, and GC activity to find which instances were hardest hit. BigDesk proved to be a great tool that provides most of the metrics needed for profiling.
Finally, I learned that ELK has a great support community. There are lots of online resources for configuring and setting up ELK, and they were a great help to me.
My Future Plans:
As far as the proof-of-concept goes, most of the design worked out great. As I leave MediaMath and head back to school, I am handing over the task of rolling out ELK into production to the TechOps team. And as a far-reaching goal, I would like to unify and correlate the events in order to get better insights.