As a SaaS business, we’re regularly trying new products and tools to help us with DevOps. We recently started to evaluate a lot of different SaaS monitoring tools and wanted to share some of our findings. You’ll find some thoughts about the following tools in this post.
Sematext Cloud is an all-in-one SaaS that covers infrastructure performance monitoring, log monitoring, real user monitoring, and frontend, API, website, and uptime monitoring.
Every modern organization depends on software. Where there is software, there are metrics and logs, and monitoring both of these critically important types of data is crucial for the success of the business. Sematext offers a platform that bridges the gap between performance monitoring, real user & synthetic monitoring, transaction tracing, and logs.
Source: https://sematext.com/
Organizations that use multiple tools from multiple vendors ultimately pay a higher price, both in licensing fees and in ongoing operational inefficiencies. Sematext Cloud (SaaS) and Sematext Enterprise (on-premises) provide infrastructure monitoring, application performance monitoring, transaction tracing, real user & synthetic monitoring and log management in a single, unified solution.
The best part about Sematext is that they offer a 14-day free trial on their pricing plans. You can get “Basic” infrastructure monitoring for up to 3 containers/hosts at no cost.
New Relic provides application monitoring, customer experience monitoring, and analytics of monitored data.
For application monitoring, you pick from a broad range of agents that you download and deploy through scripting and then configure to your technology-specific applications. For example, there is an agent for Java, one for .NET, another for PHP, one for Ruby, one for Python, etc. For every component you instrument with an agent, you get extra charts and visibility.
Source: http://newrelic.com/application-monitoring
They also recently started to offer a synthetic monitoring product to execute simple HTTP pings for availability monitoring. Optionally you can script a click-path to simulate user activity. The real-user monitoring requires that you inject a JavaScript tag into each of your application’s web pages. Once you have done that, it provides deeper web performance information on pages than Google Analytics. This is why you want to add it to your web app. For mobile monitoring, you add a library to your app. With this you can add additional instrumentation in the app source code.
To better analyze monitoring data, they have built their Insights product, which allows you to perform queries against their monitored data and 3rd party data. This is useful for example in calculating conversion rates or geo-analysis.
However, the best thing about New Relic is that they offer their products in a light version for free. You get “lite” application monitoring, mobile monitoring, browser monitoring, and synthetic monitoring with 24h data retention for free. So if you don’t have much money, this is the place to go.
Middleware is an all-in-one cloud monitoring solution available in software as a service model. It offers comprehensive monitoring capabilities for mobile, web, and server-based applications and extensive dashboarding support. With features like distributed tracing, logs, real user monitoring, and synthetics monitoring, Middleware ensures end-to-end visibility.
The best part of Middleware is that they offer three industry-leading services: logs, metrics, and traces, free of charge. Scale your infrastructure effortlessly and gain valuable insights into your system’s performance.
So why should you consider Middleware?
Dynatrace is the new contender to New Relic. Despite having launched just recently, it takes a totally different approach to monitoring and uses artificial intelligence to power an all-in-one offering that’s targeted at modern DevOps operations and eBiz people. Their UI is different too: they have created an infographic-fueled, touch-first dashboard.
Dynatrace’s all-in-one approach combines application monitoring with user experience, network, server, cloud, and infrastructure monitoring, and soon also synthetic monitoring, into a single offering. This sounds heavy, but actually it’s the opposite. Deployment is super easy, because all you need to do is install a single agent on the host, whereas other products require multiple types of agents. From there it auto-discovers your full stack with all running components. What’s cool is that it not only visualizes all your environment dependencies but also uses the so-called “Smartscape” for its analytics. This is actually the most important part, as it means several things.
Source: https://dynatrace.com/
In typical SaaS environments nothing is static forever. Instances come and go, fail over, get load-balanced, and so on. So understanding how multiple components interact with each other and which end users are impacted is a nightmare as soon as you have more than a single Apache deployed. That’s where Dynatrace’s Smartscape comes in. It gives you a real-time map of all the dependencies across tiers and stack and tells you how they interact with each other and how end users will be impacted if you deploy a change or take an instance down.
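To make the idea of a dependency map concrete, here is a minimal sketch of how an impact question ("what is affected if this instance goes down?") can be answered once the dependencies are known. It is not how Smartscape is implemented; the topology, the component names, and the `impacted_by` helper are all hypothetical and only illustrate the general technique of walking a dependency graph.

```python
from collections import deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "checkout-page": ["cart-service"],
    "cart-service": ["mysql-1"],
    "search-page": ["search-service"],
    "search-service": ["elasticsearch-1"],
    "mysql-1": [],
    "elasticsearch-1": [],
}


def impacted_by(component, depends_on):
    """Return every component that directly or transitively depends on `component`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk upwards through the dependents.
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("mysql-1", DEPENDS_ON)))
```

In this toy topology, taking down `mysql-1` reaches the cart service and, through it, the checkout page, while the search path stays untouched.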
For me it’s always about the customer. So I don’t care too much about getting slower IO on a new Amazon instance unless it is impacting end users. That’s what I like about Dynatrace. It always tells me what the impact on my customers is. It does that by applying predictive analytics to monitoring metrics and leveraging the Smartscape to automatically understand the causal dependencies. Thus it knows whether slow IO impacts end users or not.
Now, when there is a problem, regardless of whether it’s IO, Node.js, AWS, a web server, third parties like Facebook, or a Java or JavaScript exception, Dynatrace gives you the root cause right away, down to the code level and SQL/NoSQL database statements. You might find it useful to replay a problem from its inception to see its entire evolution.
Sumo Logic is a log analyzer as a service. You can deploy a collector on the machines with the log files that ships your logs to Sumo Logic for analysis. There are many parsers that come out of the box but you can configure your own log file parsing, too. The collector picks up the local log files and ships them securely and compressed to the cloud. There is an option to reduce the log data being shipped at the collector, but for best results, pretty much all log events are shipped.
You do not need to deploy a collector when you hook it up to remote interfaces like AWS CloudTrail. There is a broad set of adapters available for any case. The log events end up at Sumo Logic’s cloud storage for analysis with a moderate delay of only a few minutes.
Obviously log analytics is now the interesting piece. The Auto-Summarize capability is great, especially when you just start out using such a log analysis product, as it helps to consolidate different sources of log files, from multiple hosts and systems into a single view so you can understand how events may correlate. A query language helps you to better find what you need and also makes searches repeatable. An additional Log Reduce capability condenses similar log entries so you have shorter reports and a better overview. Anomaly Detection is another capability that helps the daily review process.
Source: https://www.sumologic.com/
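The idea of condensing similar log entries can be illustrated with a small sketch. This is not Sumo Logic’s Log Reduce algorithm; it is a deliberately simple stand-in that masks numeric tokens and counts the lines that then look identical. The function names `signature` and `reduce_logs` are hypothetical.

```python
import re
from collections import Counter


def signature(line):
    """Mask standalone numbers (including dotted ones such as IP addresses)
    so that lines differing only in those values share one signature."""
    return re.sub(r"\b\d+(\.\d+)*\b", "<NUM>", line)


def reduce_logs(lines):
    """Group lines by signature and return (signature, count), most frequent first."""
    return Counter(signature(line) for line in lines).most_common()


if __name__ == "__main__":
    sample = [
        "Timeout after 30 ms for user 17",
        "Timeout after 45 ms for user 99",
        "Disk full on /dev/sda1",
    ]
    for sig, count in reduce_logs(sample):
        print(count, sig)
```

A real product would use far more robust clustering, but even this crude masking shows why a reduced view is shorter than the raw log.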
Once you get the hang of it, auto-summarize might not suffice, and you will want to cover more sophisticated use cases using their REST or Java API interface, which enables a high level of automation. The analysis run is reasonably quick, from a range of a few seconds to a minute, depending on data volume and query type.
Elasticsearch is a general-purpose full-text search engine which is often used for log analysis when combined with tools for log retrieval and pre-processing. A common implementation uses Logstash for fetching/preprocessing, Redis for caching/buffering, and Kibana for visualization. However, it is also used for other things, such as webshop search functionality and data analytics.
Integration with different data providers is achieved via Logstash, a tool for log collecting, parsing and forwarding. There are a large number of built-in and add-on integrations with most common formats, and you can also define collection rules yourself.
Elasticsearch is a tool, not a hosted solution, so it needs the traditional hardware acquisition and deployment work, though it runs nicely in the cloud and is built for distributed and high-availability deployments. Additionally, there are providers who offer hosted Elasticsearch environments.
Source: https://www.elastic.co/
You usually send text to Elasticsearch together with JSON metadata and later you perform searches via REST interfaces, making it well suited for integration into a custom DevOps workflow.
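As a rough sketch of what “text plus JSON metadata in, REST search out” looks like, the snippet below builds (but does not send) one indexing request and one search request. The node URL, the index name `app-logs`, and the field names are assumptions made for illustration, and the `_doc` endpoint shape follows recent Elasticsearch versions; older releases used a different path layout.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a locally reachable node
INDEX = "app-logs"                # hypothetical index name


def _json_post(url, body):
    """Build a POST request carrying a JSON body."""
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_request(message, metadata):
    """Request that stores a log message together with its JSON metadata."""
    return _json_post(f"{ES_URL}/{INDEX}/_doc", {"message": message, **metadata})


def search_request(text):
    """Request that runs a full-text `match` query on the message field."""
    return _json_post(
        f"{ES_URL}/{INDEX}/_search",
        {"query": {"match": {"message": text}}},
    )


if __name__ == "__main__":
    req = index_request("disk full", {"host": "web-1", "level": "error"})
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a cluster.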
Various techniques like sharding and parallel query execution help keep the response-times short even with large amounts of data.
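The following toy example shows the general shape of sharding with parallel query execution: documents are routed to a shard by hashing their ID, and a search fans out to all shards at once before merging the partial results. It is only a sketch; the CRC32 hash, the shard count, and the substring match are stand-ins chosen for brevity, not what Elasticsearch does internally.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # assumption: fixed number of shards


def shard_for(doc_id):
    """Route a document to a shard with a stable hash of its ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def build_shards(docs):
    """Distribute {doc_id: text} over NUM_SHARDS lists."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)].append((doc_id, text))
    return shards


def search(shards, term):
    """Query every shard in parallel and merge the partial hit lists."""
    def search_one(shard):
        return [doc_id for doc_id, text in shard if term in text]

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(search_one, shards))
    return sorted(hit for part in partials for hit in part)


if __name__ == "__main__":
    docs = {"a": "disk full", "b": "login ok", "c": "disk replaced"}
    print(search(build_shards(docs), "disk"))
```

Each shard only scans its own slice of the data, which is why adding shards (and machines to host them) keeps response times manageable as volume grows.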
Expect a bit of a learning curve: the tool can do a lot and thus takes some getting used to.
OpsGenie provides alert and notification management as a service, including on-call scheduling and escalation capabilities. Although it’s cheaper than well-known tools like PagerDuty, it doesn’t need to shy away from a comparison.
A simple and flexible mail integration as well as a RESTful API allow almost every tool to integrate with OpsGenie. Newcomers like Dynatrace can thus feed in notifications too, although there is not yet an official integration module for it, as there is for New Relic, Nagios, Pingdom, and others.
OpsGenie has a simple UI that allows you to implement schedules, escalation policies, and incident routings. The capabilities are similar to what PagerDuty provides. You can not only define several schedules and various escalation policies, but also route incidents depending on source, content or tags to mobile devices running the OpsGenie App or simply receive incidents via email, SMS, or an automated call.
Developer-centric DevOps teams will really love OpsGenie for its flexibility in alerting workflows that are helpful in reducing reliance on network operations teams. A simple UI allows you to define complex alerting rules, so you can combine several incoming notifications into a single meaningful and actionable alert. Cloud natives will appreciate the ability to define delay periods for automatic failovers before deciding whether escalations are required.
Source: Product-Screenshot – https://www.opsgenie.com
You can send periodic heartbeat messages from your monitoring tools to OpsGenie to make sure you don’t miss alerts because your monitoring tool is offline.
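The logic behind heartbeat monitoring is simple enough to sketch: record when each source last checked in and raise an alert for any source that has been silent for longer than the agreed interval. The class below is a hypothetical illustration of that idea and does not use or mirror the OpsGenie API.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per source and report sources that went silent."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_seen = {}

    def ping(self, source, now):
        """Record a heartbeat from `source` at timestamp `now` (in seconds)."""
        self.last_seen[source] = now

    def expired(self, now):
        """Return the sources whose last heartbeat is older than the interval."""
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.interval
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(interval_seconds=60)
    monitor.ping("nagios", now=0)
    monitor.ping("cloudwatch", now=50)
    print(monitor.expired(now=100))
```

Timestamps are passed in explicitly so the behavior is easy to reason about; a real sender would use the current time and post the heartbeat over the network.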
Forwarding alerts to a HipChat room or to Slack, or simply leaving a note before forwarding alerts, allows you to collaborate on incidents and solve them together as a team. The built-in reporting provides some basic KPIs like “mean time to resolve” and some rough trends. It also allows you to limit the number of alerts you get.
PagerDuty is an alarm aggregation and dispatching service. It allows you to integrate all your monitoring systems, APM solution, API management, and customer support system. More than 100 integrations are already provided out of the box, and an email gateway or a RESTful API allows you to integrate not only monitoring tools like New Relic or Dynatrace and log analyzer tools like Elasticsearch or Sumo Logic, but also your own tools, as long as they send emails or make REST calls.
With PagerDuty you can define escalation policies and route incidents to registered mobile devices running the PagerDuty app. It supports different phone providers for automated calling, SMS service providers, and email providers. It also allows you to route alarms depending on their source and incident type. The capabilities for defining escalations and routing are powerful yet quite simple to use. As long as you don’t need a workflow to decide whether an escalation is needed or not, nearly everything is covered. There are teams, schedules, and escalation delays, and you can cycle through escalation policies until a response from someone is received. There is also a fallback: if you overlook an alert type, you still have an auto-escalation mechanism that will cover it.
Source: http://www.pagerduty.com/incident-resolution/
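The escalation behavior described above (escalation delays, plus cycling through the policy until someone responds) can be sketched in a few lines. This is an illustration of the concept rather than PagerDuty’s implementation; the function name, the fixed delay per level, and the `cycles` limit are assumptions.

```python
def who_to_notify(levels, delay_minutes, minutes_unacknowledged, cycles=2):
    """Return the responder for an unacknowledged incident, or None.

    levels: responders in escalation order (hypothetical names).
    delay_minutes: how long each level has before the incident escalates.
    cycles: how many times the whole policy is repeated before giving up.
    """
    step = minutes_unacknowledged // delay_minutes
    if step >= len(levels) * cycles:
        return None  # policy exhausted, nobody left to notify
    return levels[step % len(levels)]


if __name__ == "__main__":
    policy = ["primary on-call", "secondary on-call", "team lead"]
    for minutes in (0, 20, 30, 45, 90):
        print(minutes, who_to_notify(policy, 15, minutes))
```

With a 15-minute delay, the incident moves one level up every quarter of an hour and wraps back to the first responder after the last level, until the configured number of cycles is used up.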
It integrates with many common collaboration tools like HipChat, Slack, Flowdock and Campfire in order to allow effective teamwork. Automated incident progress updates for other teams help to eliminate unnecessary email chains.
PagerDuty has a very easy setup. Workflow and escalation definition require no deep tech skills. You can access the incident management and configuration UI through web or mobile app. This enables you to react to alerts right from your mobile.
Last but not least, there is an analytics module that allows you to create statistics and reports on incident frequency, MTTR, and similar metrics. This helps you to optimize support processes and incident management.
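For readers who have not computed MTTR before, it is simply the average time between an incident being opened and being resolved. The sketch below uses made-up timestamps; the `mttr` helper is hypothetical and not part of any PagerDuty API.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolve for a list of (opened, resolved) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    incidents = [
        (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),    # 30 minutes
        (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 90 minutes
    ]
    print(mttr(incidents))
```

Two incidents that took 30 and 90 minutes to resolve give an MTTR of one hour.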
Amazon CloudWatch is a monitoring service for AWS resources and applications that’s hosted in Amazon’s cloud.
A common use case for CloudWatch is to keep services running in a healthy and efficient way. That’s achieved through collecting and tracking metrics for AWS resources such as EC2 instances, Elastic Load Balancers, EBS volumes, Relational Database instances and more. In addition CloudWatch allows you to set alarms and take automated actions, like launching or removing EC2 instances within an auto-scaling group.
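A threshold alarm of this kind boils down to checking whether a metric stayed above a limit for a number of consecutive evaluation periods. The function below sketches that check with plain lists instead of the AWS API; the state names mirror the ones CloudWatch uses, while the function itself and the sample CPU values are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a simple "greater than threshold" alarm.

    datapoints: metric values, oldest first, one per period.
    evaluation_periods: how many of the latest periods must all breach
    the threshold (assumed to be at least 1).
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


if __name__ == "__main__":
    cpu_percent = [40, 85, 92, 95]
    print(alarm_state(cpu_percent, threshold=80, evaluation_periods=3))
```

An automated action, such as adding an instance to an auto-scaling group, would then be tied to the transition into the alarm state.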
CloudWatch also offers a decent level of customization – you can send and store metrics for custom apps as well as system and application log files in order to gain a better understanding of how your apps and systems operate.
Source: Product-Screenshot – https://aws.amazon.com/cloudwatch/
A large number of third-party vendors see the CloudWatch API as a means to offer greater value to Amazon customers. The number of integrations keeps growing and spreads across different areas. Integrations with notification services like OpsGenie or PagerDuty, which take CloudWatch alerts to the next level, are quite common. Monitoring tools are the other frequently integrated solutions. This category groups vendors with diverse market positions, from key APM players like New Relic, through newcomers like Dynatrace, to companies for which AWS monitoring is the core of their business, such as CopperEgg or Stackdriver.
Monitoring tools complement infrastructure level counters collected over the CloudWatch API with application performance data. With agent technology they are capable of providing insights into system processes and OS level metrics, which are crucial for cloud-based deployments, for example CPU steal time.
CloudWatch provides infrastructure counters that are valuable for some monitoring solutions like Dynatrace. They are able to automatically detect dependencies between applications, services and AWS infrastructure components. All these data are combined to perform intelligent root-cause analysis with precise indications of end-user impact. This is in contrast to CloudWatch, which only alerts you about exceeded metrics thresholds.
CloudWatch offers two weeks of metrics retention, which may be enough for your needs. However, most monitoring tools keep data longer than two weeks, which some customers appreciate.
Pingdom is a simple availability and performance monitoring tool that focuses on answering one important question: is my website up and performing well? With pricing options catering to everyone from small bloggers to enterprise businesses, their solutions are a good fit for many organizations. With both synthetic and real user monitoring solutions baked into the product suite, they provide web application monitoring at all levels.
Source: https://www.pingdom.com/
Pingdom offers a variety of ways to keep an eye on your site. An uptime check is the most popular way to monitor: it sends a scheduled HTTP/HTTPS GET request to any given URL and alerts you if an outage is detected from any of their North American or European locations. Developers can take advantage of Pingdom’s text-based scripting language, which allows you to build transaction checks using CSS locators. Although these multi-step transactions run within a browser, they lack the detailed header information that competitors such as New Relic provide. Schedulable pings, TCP, UDP, DNS and email server checks accommodate all networking needs. Last but not least, Pingdom also offers a JavaScript tag that provides metrics from your real users. These values will differ from your uptime checks but provide second-to-none insight into your end users’ availability and performance.
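At its core, an uptime check is a timed GET request whose result is classified as up, slow, or down. The sketch below shows that idea with the Python standard library. It is not Pingdom’s implementation, and the two-second “slow” threshold and the function names are assumptions.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def classify(status, elapsed_seconds, slow_after=2.0):
    """Turn an HTTP status (None if unreachable) and a duration into a verdict."""
    if status is None or status >= 400:
        return "down"
    return "slow" if elapsed_seconds > slow_after else "up"


def check(url, timeout=10.0):
    """Send one GET request to `url` and classify the outcome."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as response:
            status = response.status
    except HTTPError as error:   # server answered with a 4xx/5xx status
        status = error.code
    except (URLError, OSError):  # DNS failure, refused connection, timeout
        status = None
    return classify(status, time.monotonic() - start)


if __name__ == "__main__":
    print(check("https://example.com/"))
```

A hosted service adds the parts that are hard to do yourself: running the check on a schedule from several regions and alerting when the verdict changes.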
Configuring checks to your taste is only half the battle. Pingdom also offers a variety of ways to consume external monitoring data. The UI is clean and modern and contends with newer monitoring solutions such as Dynatrace. Real-time dashboards for incidents, uptime, and transaction checks offer operational views to help you keep a close eye on your monitoring status. Emailed reports and optional public status pages allow you to easily share your website’s health with colleagues and partners. When there is a problem, you are notified of each incident via SMS, Twitter, email, or push notifications to Pingdom’s Android and iOS apps.
Pingdom also offers access to your data via RESTful APIs as well as pre-made WordPress plugins. With a variety of monitoring options, Pingdom is a cheap and effective way to make sure your website is behaving.
Hopefully, you have found this list of SaaS monitoring tools and management platforms useful. Would you like to add a tool you have used and found helpful? Please leave a comment and share it with us!
This article was brought to you by Usersnap – a top-rated customer feedback solution to make confident product decisions.