As a SaaS business, we’re regularly trying new products and tools to help us with DevOps. We recently started to evaluate a lot of different SaaS monitoring tools and wanted to share some of our findings. You’ll find some thoughts about the following tools in this post.
- New Relic, the dominant player with multiple agents, even for Ruby and Python.
- ruxit, the new kid on the block for full-fledged and integrated all-in-one monitoring with smart AI technology.
- Sumo Logic, the cloud-based tool for consolidation and analysis of log files.
- Elasticsearch, the general purpose full-text search engine which is often combined with Logstash and Kibana forming the ELK stack.
- OpsGenie, an alert and notification management tool which provides integrations for a vast number of monitoring tools.
- PagerDuty, an alarm aggregation and dispatching service for incident management.
- Amazon CloudWatch, the monitoring service for AWS infrastructure resources.
- Pingdom, a simplistic availability and performance monitoring tool for web sites.
New Relic provides application monitoring, customer experience monitoring, and analytics of monitored data.
For application monitoring, you pick from a broad range of agents that you download and deploy through scripting and then configure to your technology-specific applications. For example, there is an agent for Java, one for .NET, another for PHP, one for Ruby, one for Python, etc. For every component you instrument with an agent, you get extra charts and visibility.
To better analyze monitoring data, they have built their Insights product, which allows you to run queries against the monitored data and third-party data. This is useful, for example, for calculating conversion rates or doing geo-analysis.
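To make the conversion-rate use case concrete, here is a minimal Python sketch of the calculation itself. It does not use New Relic's query language or API; the event shape (`session` and `type` keys) and the event names are assumptions made up for illustration.

```python
def conversion_rate(events, start="PageView", goal="Purchase"):
    """Share of sessions that saw a `start` event and also reached `goal`.

    `events` is a list of dicts with hypothetical keys "session" and "type".
    """
    started = {e["session"] for e in events if e["type"] == start}
    converted = {e["session"] for e in events if e["type"] == goal} & started
    return len(converted) / len(started) if started else 0.0


# Hypothetical sample data: two sessions viewed a page, one of them purchased.
sample = [
    {"session": "a", "type": "PageView"},
    {"session": "a", "type": "Purchase"},
    {"session": "b", "type": "PageView"},
]
rate = conversion_rate(sample)
```

An analytics product would run the same kind of aggregation on the server side over much larger event sets.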
However, the best thing about New Relic is that they offer their products in a light version for free. You get “lite” application monitoring, mobile monitoring, browser monitoring, and synthetic monitoring with 24h data retention for free. So if you don’t have much money, this is the place to go.
So why should you consider New Relic?
- Freemium pricing – application, browser, synthetic monitoring
- Ruby and Python support
- Scripting approach to synthetic monitoring
- Single servers or independent (i.e., unconnected) servers
ruxit is the new challenger to New Relic. Although it launched only recently, it takes a totally different approach to monitoring and uses artificial intelligence to power an all-in-one offering targeted at modern DevOps operations and e-business people. Their UI is different too: they have created an infographic-fueled, touch-first dashboard.
ruxit’s all-in-one approach combines application monitoring with user-experience, network, server, cloud, and infrastructure monitoring, and soon also synthetic monitoring, into a single offering. This sounds heavy, but it’s actually the opposite. Deployment is easy, because all you need to do is install a single agent on the host, whereas other products require multiple types of agents. From there it auto-discovers your full stack with all running components. What’s cool is that it not only visualizes all your environment dependencies, but also uses this so-called “Smartscape” for its analytics. This is actually the most important part, as it means several things.
In typical SaaS environments nothing is static forever. Instances come and go, failover, load-balancing, and so on. So understanding how multiple components interact with each other and which end-user is impacted is a nightmare as soon as you have more than a single Apache deployed. That’s where ruxit’s Smartscape comes in. It gives you a real-time map of all the dependencies across tiers and stack and tells you how they interact with each other and how end-users will be impacted if you deploy a change or take an instance down.
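The idea of a dependency map can be sketched in a few lines of Python. This is not how ruxit's Smartscape is implemented; it is a simplified illustration, with a hypothetical topology, of how walking dependency edges answers the question "what is affected if this component goes down?".

```python
from collections import deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "web-frontend": ["apache-1", "apache-2"],
    "apache-1": ["app-server"],
    "apache-2": ["app-server"],
    "app-server": ["mysql"],
    "reporting": ["mysql"],
}


def impacted_by(component, depends_on):
    """Return every component that directly or transitively depends on `component`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Note that this sketch over-approximates: it reports `web-frontend` as affected by `apache-1` even though `apache-2` could still serve traffic. A real tool would combine the map with live metrics to decide whether end users are actually impacted.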
For me it’s always about the customer. So I don’t care too much about slower IO on a new Amazon instance, unless it is impacting end users. That’s what I like about ruxit. It always tells me what the impact on my customers is. It does that by applying predictive analytics to monitoring metrics and leveraging the Smartscape to automatically understand the causal dependencies. Thus it knows whether a slow IO impacts end users or not.
Why should you consider ruxit?
- All-in-one monitoring, instead of multiple monitoring tools – real-user, application, cloud, server, network, infrastructure monitoring
- Easy deployment and automatic dependency discovery
- AI analytics approach to root-cause analysis and fewer alerts (ruxit calls it “no-alerts” technology)
- WebUI, Java, node.js, and .NET based app support
Sumo Logic is a log analyzer as a service. You can deploy a collector on the machines with the log files that ships your logs to Sumo Logic for analysis. There are many parsers that come out of the box but you can configure your own log file parsing, too. The collector picks up the local log files and ships them securely and compressed to the cloud. There is an option to reduce the log data being shipped at the collector, but for best results, pretty much all log events are shipped.
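The collector's job of optionally filtering and then compressing log data before shipping can be illustrated with a small Python sketch. This is not Sumo Logic's collector or its wire format; the function name and the `keep` filter are assumptions for illustration, and the actual upload step is left out.

```python
import gzip


def build_batch(lines, keep=lambda line: True):
    """Filter log lines with `keep`, then gzip the remainder into one payload."""
    selected = [line.rstrip("\n") for line in lines if keep(line)]
    payload = "\n".join(selected).encode("utf-8")
    return gzip.compress(payload)


# Hypothetical usage: drop DEBUG noise at the collector before shipping.
raw = ["INFO started\n", "DEBUG tick\n", "ERROR disk full\n"]
batch = build_batch(raw, keep=lambda line: not line.startswith("DEBUG"))
```

Filtering at the source saves bandwidth, but as noted above, shipping nearly everything gives the analysis more to work with.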
You do not need to deploy a collector when you hook it up to remote interfaces like AWS CloudTrail. There is a broad set of adapters available for any case. The log events end up at Sumo Logic’s cloud storage for analysis with a moderate delay of only a few minutes.
Obviously log analytics is now the interesting piece. The Auto-Summarize capability is great, especially when you just start out using such a log analysis product, as it helps to consolidate different sources of log files, from multiple hosts and systems into a single view so you can understand how events may correlate. A query language helps you to better find what you need and also makes searches repeatable. An additional Log Reduce capability condenses similar log entries so you have shorter reports and a better overview. Anomaly Detection is another capability that helps the daily review process.
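To give a feel for what condensing similar log entries means, here is a minimal Python sketch. It is not Sumo Logic's Log Reduce algorithm; it swaps in a much simpler technique (masking digits and counting identical signatures), and the function names are made up for illustration.

```python
import re
from collections import Counter


def signature(line):
    """Replace every run of digits with '*' so similar lines collapse together."""
    return re.sub(r"\d+", "*", line)


def reduce_logs(lines):
    """Group lines by signature and return (signature, count), most frequent first."""
    counts = Counter(signature(line.strip()) for line in lines if line.strip())
    return counts.most_common()


# Hypothetical log lines: two timeouts differ only in their numbers.
logs = [
    "Timeout after 30 ms on host 7",
    "Timeout after 45 ms on host 12",
    "Disk full",
]
summary = reduce_logs(logs)
```

Even this crude approach turns thousands of near-identical lines into a handful of patterns, which is the effect that makes daily log review manageable.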
Once you get the hang of it, auto-summarize might not suffice, and you will want to cover more sophisticated use cases using their REST or Java API interface, which enables a high level of automation. The analysis run is reasonably quick, from a range of a few seconds to a minute, depending on data volume and query type.
Why should you consider Sumo Logic?
- Consolidated and correlated review of log files from multiple machines and sources for anomalies like crashes and erroneous app behavior
- For non-tech reviews like analysis of AWS CloudTrail audit logs, which is important in SaaS environments
- As a complement to monitoring with the aforementioned ruxit or New Relic, as log analysis covers technologies that may not be handled by all agents, including Windows events, syslog, custom sources, load balancers, etc.
- Getting started quickly
- The ability to go beyond auto-summarize capabilities and use the REST interfaces to perform your own analytics on Sumo Logic’s collected data
Elasticsearch is a general purpose full-text search engine which is often used for log analysis when combined with tools for log retrieval and pre-processing. A common implementation uses Logstash for fetching/preprocessing, Redis for caching/buffering, and Kibana for visualization. However it is also used for other things like webshop search functionality, and data analytics.
Integration with different data providers is achieved via Logstash, a tool for collecting, parsing, and forwarding logs. There are a large number of built-in and add-on integrations for most common formats, and you can also define collection rules yourself.
Elasticsearch is a tool, not a hosted solution, so it needs the traditional hardware acquisition and deployment work. That said, it runs nicely in the cloud and is built for distributed and high-availability deployments. Additionally, there are providers who offer hosted Elasticsearch environments.
You usually send text to Elasticsearch together with JSON metadata and later you perform searches via REST interfaces, making it well suited for integration into a custom DevOps workflow.
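The REST workflow can be sketched with nothing but the Python standard library. The host, port, index name, and field names below are assumptions for illustration, and the endpoint paths (`_doc`, `_search`) follow recent Elasticsearch versions, so check them against the version you run. The sketch only builds the requests; sending them is one `urllib.request.urlopen(...)` call away.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node
INDEX = "logs"                      # hypothetical index name


def index_request(message, metadata):
    """Build a request that stores `message` together with its JSON metadata."""
    doc = dict(metadata, message=message)
    return Request(
        f"{BASE_URL}/{INDEX}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_request(text):
    """Build a full-text search request using a simple `match` query."""
    body = {"query": {"match": {"message": text}}}
    return Request(
        f"{BASE_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


store = index_request("connection timeout", {"host": "web-1", "level": "ERROR"})
find = search_request("timeout")
```

Because everything is plain JSON over HTTP, the same calls are easy to issue from a deployment script or a chat bot.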
Various techniques like sharding and parallel query execution help keep the response-times short even with large amounts of data.
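The principle behind sharding and parallel query execution can be shown with a toy Python sketch. It is not Elasticsearch's implementation: the CRC32 hash, the substring match, and the thread pool are stand-ins chosen for brevity, and all names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def shard_for(doc_id, num_shards):
    """Route a document to a shard by hashing its id (CRC32 as a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(docs, num_shards):
    """Distribute {doc_id: text} across `num_shards` lists."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)].append((doc_id, text))
    return shards


def search(shards, term):
    """Scan every shard in parallel and merge the partial results."""
    def scan(shard):
        return [doc_id for doc_id, text in shard if term in text]

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(scan, shards))
    return sorted(doc_id for part in partials for doc_id in part)


# Hypothetical documents spread over three shards.
docs = {"1": "disk full", "2": "login ok", "3": "disk error", "4": "timeout"}
hits = search(build_shards(docs, 3), "disk")
```

Each shard only holds a fraction of the data, so each scan is short, and the result is the same no matter how the documents were distributed.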
Expect a bit of a learning curve: the tool can do a lot and therefore takes some getting used to.
Why consider Elasticsearch?
- When you prefer more of a home-grown style of data analytics or have special use cases
- When you need to handle more than log files
- When you want to create home-grown visualizations
- Low cost (if you already have hardware)
OpsGenie provides alert and notification management as a service, including on-call scheduling and escalation capabilities. Although it’s cheaper than well-known tools like PagerDuty, it holds up well in a direct comparison.
A simple and flexible mail integration as well as a RESTful API allow almost every tool to integrate with OpsGenie. Newcomers like ruxit can therefore feed in notifications too, although OpsGenie doesn’t yet provide an official integration module for ruxit as it does for New Relic, Nagios, Pingdom, and others.
OpsGenie has a simple UI that allows you to implement schedules, escalation policies, and incident routings. The capabilities are similar to what PagerDuty provides. You can not only define several schedules and various escalation policies, but also route incidents depending on source, content or tags to mobile devices running the OpsGenie App or simply receive incidents via email, SMS, or an automated call.
Developer-centric DevOps teams will really love OpsGenie for its flexibility in alerting workflows that are helpful in reducing reliance on network operations teams. A simple UI allows you to define complex alerting rules, so you can combine several incoming notifications into a single meaningful and actionable alert. Cloud natives will appreciate the ability to define delay periods for automatic failovers before deciding whether escalations are required.
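The delay-period idea can be expressed as a small decision function. This Python sketch is not OpsGenie's rule engine or API; the function name, parameters, and the 120-second default are assumptions for illustration.

```python
def should_escalate(alert_time, recovery_time, now, grace_seconds=120):
    """Escalate only if the grace period has passed without a recovery.

    All times are seconds since some common epoch; `recovery_time` is None
    while the automatic failover has not (yet) restored the service.
    """
    if recovery_time is not None and recovery_time >= alert_time:
        return False  # failover worked, nobody needs to be woken up
    return now - alert_time >= grace_seconds


# Hypothetical timeline: alert at t=0, failover succeeds at t=45.
quiet = should_escalate(alert_time=0, recovery_time=45, now=200)
# Hypothetical timeline: alert at t=0, still down at t=200.
loud = should_escalate(alert_time=0, recovery_time=None, now=200)
```

The grace period trades a slightly later page for far fewer pages about problems that automation already fixed.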
You can send periodic heartbeat messages from your monitoring tools to OpsGenie to make sure you don’t miss alerts because your monitoring tool is offline.
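The heartbeat mechanism boils down to remembering when each source last reported and flagging the ones that have gone quiet. The Python sketch below illustrates that logic only; it does not call OpsGenie, and the class and method names are hypothetical.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per source and report sources that went silent."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_seen = {}

    def beat(self, source, timestamp):
        """Record a heartbeat from `source` at `timestamp` (seconds)."""
        self.last_seen[source] = timestamp

    def expired(self, now):
        """Return the sources whose last heartbeat is older than the interval."""
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.interval
        )


# Hypothetical usage: two monitoring tools report, one stops.
monitor = HeartbeatMonitor(interval_seconds=60)
monitor.beat("nagios", 0)
monitor.beat("pingdom", 0)
monitor.beat("pingdom", 90)
silent = monitor.expired(now=100)
```

A silent monitoring tool is itself an incident, which is why the check has to live outside the tool being watched.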
Forwarding alerts to a HipChat room or to Slack, or simply leaving a note before forwarding alerts, allows you to collaborate on incidents and solve them together as a team. The built-in reporting provides some basic KPIs like “mean time to resolve” and some rough trends. It also allows you to limit the number of alerts you get.
Why should you consider OpsGenie?
- Smaller DevOps/NoOps Teams using a high level of automation
- Complex alerting workflows, for instance checking whether failover automation works as expected before starting an escalation
- Alerting in case of emergencies
PagerDuty is an alarm aggregation and dispatching service. It allows you to integrate all your monitoring systems, APM solution, API management, and customer support system. More than 100 integrations are already provided out of the box, and an email gateway or a RESTful API allows you to integrate not only monitoring tools like New Relic or ruxit and log analyzer tools like Elasticsearch or Sumo Logic, but also your own tools, as long as they can send emails or make REST calls.
With PagerDuty you can define escalation policies and route incidents to registered mobile devices running the PagerDuty app. It supports different phone providers for automated calling, SMS service providers, and email providers. It also allows you to route alarms depending on their source and incident type. The capabilities for defining escalations and routing are simple to use yet powerful. As long as you don’t need a workflow to decide whether an escalation is needed, nearly everything is covered. There are teams, schedules, and escalation delays, and you can cycle through escalation policies until someone responds. There is also a fallback: if you overlook an alert type, an auto-escalation mechanism will still cover it.
It integrates with many common collaboration tools like HipChat, Slack, Flowdock, and Campfire to support effective teamwork. Automated incident progress updates for other teams help to eliminate unnecessary email chains.
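Cycling through an escalation policy until someone responds can be sketched as follows. This is an illustration of the concept in Python, not PagerDuty's API or data model; the function names, the responder names, and the 15-minute delay are made up.

```python
def escalation_schedule(levels, delay_minutes, cycles):
    """Return (minute_offset, responder) pairs, repeating the levels `cycles` times."""
    schedule = []
    minute = 0
    for _ in range(cycles):
        for responder in levels:
            schedule.append((minute, responder))
            minute += delay_minutes
    return schedule


def notified_before_ack(schedule, ack_minute):
    """List who gets notified before the incident is acknowledged.

    `ack_minute` is None if nobody ever acknowledges.
    """
    return [
        responder
        for minute, responder in schedule
        if ack_minute is None or minute < ack_minute
    ]


# Hypothetical policy: primary on-call first, secondary after 15 minutes, two cycles.
plan = escalation_schedule(["primary", "secondary"], delay_minutes=15, cycles=2)
paged = notified_before_ack(plan, ack_minute=20)
```

Repeating the cycle is the safety net: if the first round is missed entirely, the same people are paged again instead of the incident silently stalling.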
PagerDuty has a very easy setup. Workflow and escalation definition require no deep tech skills. You can access the incident management and configuration UI through web or mobile app. This enables you to react to alerts right from your mobile.
Last but not least, there is an analytics module that allows you to create statistics and reports on incident frequency, MTTR, and similar metrics. This helps you to optimize support processes and incident management.
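MTTR itself is a simple average, as this Python sketch shows. The input format (pairs of opened and resolved timestamps) and the sample incidents are assumptions for illustration; a reporting module would derive the same number from its incident history.

```python
from datetime import datetime


def mean_time_to_resolve(incidents):
    """Average resolution time in minutes for (opened, resolved) datetime pairs."""
    durations = [
        (resolved - opened).total_seconds() for opened, resolved in incidents
    ]
    return sum(durations) / len(durations) / 60 if durations else 0.0


# Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
history = [
    (datetime(2015, 6, 1, 10, 0), datetime(2015, 6, 1, 10, 30)),
    (datetime(2015, 6, 2, 22, 0), datetime(2015, 6, 2, 23, 30)),
]
mttr_minutes = mean_time_to_resolve(history)
```

Tracking this number per team or per service over time shows whether process changes actually shorten incidents.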
Why should you consider PagerDuty?
- If your escalation workflow or policy needs are not too complicated
- You want an easy but powerful way to consolidate alerts from various monitoring systems
- Your support or operation teams are spread across the globe and you need to reliably get them notified of incidents by phone, SMS, or email
- You want to add an easy-to-use incident management system to your existing monitoring tools
- You want to have analytics and a high-level overview of all your systems’ health
- You want to have reporting about incident handling that helps to reduce resolution time
Amazon CloudWatch is a monitoring service for AWS resources and applications that’s hosted in Amazon’s cloud.
A common use case for CloudWatch is to keep services running in a healthy and efficient way. That’s achieved through collecting and tracking metrics for AWS resources such as EC2 instances, Elastic Load Balancers, EBS volumes, Relational Database instances and more. In addition CloudWatch allows you to set alarms and take automated actions, like launching or removing EC2 instances within an auto-scaling group.
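The alarm concept can be illustrated without touching AWS at all. The Python sketch below is not the CloudWatch API; it only mimics the idea of comparing the most recent datapoints against a threshold and uses CloudWatch's three alarm state names. The function name and sample values are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a threshold alarm over the most recent datapoints.

    Returns "ALARM" if the last `evaluation_periods` values all exceed the
    threshold, "OK" if not, and "INSUFFICIENT_DATA" if there are too few values.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Hypothetical CPU utilization samples (percent), one per period.
cpu = [40, 55, 82, 91, 88]
state = alarm_state(cpu, threshold=80, evaluation_periods=3)
```

Requiring several consecutive breaches avoids scaling out on a single spike; an "ALARM" result is the point at which an automated action such as adding an instance would fire.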
CloudWatch also offers a decent level of customization – you can send and store metrics for custom apps as well as system and application log files in order to gain a better understanding of how your apps and systems operate.
A large number of third-party vendors see the CloudWatch API as a means to offer greater value to Amazon customers. The number of integrations keeps growing and spreads across different areas. Quite common are integrations with notification services like OpsGenie or PagerDuty, which take CloudWatch alerts to the next level. The other frequently integrated solutions are monitoring tools. This category groups vendors with diverse market positions, from key APM players like New Relic, through newcomers like ruxit, to companies for which AWS monitoring is the core of their business, for example CopperEgg or Stackdriver.
Monitoring tools complement infrastructure level counters collected over the CloudWatch API with application performance data. With agent technology they are capable of providing insights into system processes and OS level metrics, which are crucial for cloud-based deployments, for example CPU steal time.
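CPU steal time is a good example of an OS-level metric an agent can compute on the host. On Linux, the aggregate `cpu` line in `/proc/stat` lists cumulative counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The Python sketch below derives the steal percentage from two samples of that line; the sample values are made up, and how any particular agent does this internally is not something this post can confirm.

```python
def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples.

    `before` and `after` are the aggregate "cpu" lines from /proc/stat.
    """
    def parse(line):
        fields = line.split()
        # Fields 1..8: user, nice, system, idle, iowait, irq, softirq, steal.
        values = [int(v) for v in fields[1:9]]
        return sum(values), values[7]

    total_before, steal_before = parse(before)
    total_after, steal_after = parse(after)
    delta_total = total_after - total_before
    if delta_total <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / delta_total


# Hypothetical samples taken a few seconds apart.
first = "cpu  100 0 50 800 10 0 5 35 0 0"
second = "cpu  180 0 90 1600 20 0 10 100 0 0"
stolen = steal_percent(first, second)
```

A persistently high value means the instance is competing with noisy neighbors for physical CPU, something infrastructure-level counters alone do not reveal.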
CloudWatch provides infrastructure counters that are valuable for monitoring solutions like ruxit. Such tools are able to automatically detect dependencies between applications, services, and AWS infrastructure components. All these data are combined to perform intelligent root-cause analysis with precise indications of end-user impact. This is in contrast to CloudWatch itself, which only alerts you when metric thresholds are exceeded.
CloudWatch offers two weeks of metrics retention, which may be enough for your needs. However, most monitoring tools keep data longer than two weeks, which some customers appreciate.
Why should you consider CloudWatch?
- Infrastructure-level coverage for AWS services is sufficient
- If 2 weeks of data retention meets your needs
- You run all workloads in Amazon Cloud and aren’t considering moving towards a hybrid cloud
Pingdom is a simplistic availability and performance monitoring tool that focuses on answering one important question: is my website up and performing well? With pricing options catering to everyone from small bloggers to enterprise businesses, their solutions are a good fit for many organizations. With both synthetic and real-user monitoring baked into the product suite, they provide web application monitoring at all levels.
Configuring checks to your taste is only half the battle. Pingdom also offers a variety of ways to consume external monitoring data. The UI is clean and modern and holds its own against newer monitoring solutions such as ruxit. Real-time dashboards for incidents, uptime, and transaction checks offer operational views that help you keep a close eye on your monitoring status. Emailed reports and optional public status pages allow you to easily share your website’s health with colleagues and partners. When there is a problem, you are notified of each incident via SMS, Twitter, email, or push notifications to Pingdom’s Android and iOS apps.
Pingdom also offers access to your data via RESTful APIs as well as pre-made WordPress plugins. With a variety of monitoring options, Pingdom is a cheap and effective way to make sure your website is behaving.
Why should you consider Pingdom?
- It is ideal for quick and easy external availability and performance monitoring
- Different pricing packages provide flexibility for your business
- You want to integrate your Pingdom metrics into third-party solutions such as WordPress
- You want to be notified of incidents immediately and easily share your results with colleagues
Hopefully, you have found this list of SaaS monitoring tools useful. Would you like to add another tool you have used and found helpful? Please leave a comment and share it with us!
This article was brought to you by Usersnap – a visual bug tracking and screenshot tool for every web project.