Minimize Downtime by Creating a Health-check for Your NodeJS Application

When an API receives more traffic than it can handle, it may crash and stay unavailable until someone brings it back online. This is referred to as downtime. One of the most important aspects of maintaining a production application is minimizing downtime.

Downtime can arise due to certain crucial parts of your app breaking, thereby bringing down the whole system. For instance, if the database crashes, your application will no longer be able to properly service most requests. The best way to minimize this is to have data about running servers, giving us room to react before anything goes wrong.

The most efficient way to achieve metrics tracking is via third-party tools such as the TICK stack (Telegraf, InfluxDB, Chronograf and Kapacitor), Sentry and Prometheus. However, each of these tools requires a significant time investment to set up, track and maintain consistently.

A simpler way to track an API is to create a single route that responds with a simple “UP” or “DOWN” message to indicate whether key parts of our application are working or not. This could be useful when debugging often cryptic client-side errors.

For example, rather than a bare 500 error message, an authorized user can see additional details about the errors the server is experiencing. It also gives us room to react to a server that is ‘unhealthy’ (i.e. in need of attention) but not necessarily ‘down’.

Why create a health-check route?

The most compelling reason to have a health-check route is in order to provide an API-accessible way to determine how ‘healthy’ the API is. This information could then be used in different ways:

  • By platforms like Statuspage.io to create automated incident reports.
  • By a load balancer to determine whether to route traffic to an instance or not.
  • By client apps to determine what kind of error messages to display to users.
  • By developers to debug server-side errors at the route level.

Tools like Prometheus have their place in a production environment, and should probably be used to some degree. However, they are usually a huge undertaking – from configuration to maintenance – so many developers tend to avoid them or aren’t very consistent in their use. Nonetheless, shipping an application without any uptime/downtime monitoring isn’t a great option.

A health-check route is a simple compromise that can either be built on top of or scrapped for a more mature and robust solution at later stages. The goal here isn’t to create a fully-fledged status monitoring system for your API. Rather, it’s to enforce some level of monitoring early into an application. Some monitoring is better than none at all.

How it works

Your server’s health could depend on any number of metrics. Slow response and long query times are good indicators of an ‘unhealthy’ server, for example.

In this guide, we will first focus on simple core metrics such as whether your database accepts connections or not, and get to more complicated issues like query times later on. Services such as Redis instances and other microservices can be handled in a similar fashion, but we won’t cover those here.

We’re going to create a single route that is responsible for collecting any data from the underlying server and returning an appropriate response.
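As a rough sketch of that idea, the route can run a set of named checks and reduce their results to a single overall status. The helper names (runChecks, overallStatus) and the three status strings are assumptions made for illustration; they are not part of any library.

```javascript
// Hypothetical helper: reduce individual statuses to one overall status.
// 'down' outranks 'unhealthy', which outranks 'up'.
function overallStatus(statuses) {
  if (statuses.includes('down')) return 'down';
  if (statuses.includes('unhealthy')) return 'unhealthy';
  return 'up';
}

// Runs every check in parallel. `checks` maps a name to an async function
// that resolves to an object such as { status: 'up' }.
// A check that throws is reported as 'down' instead of crashing the route.
async function runChecks(checks) {
  const names = Object.keys(checks);
  const results = await Promise.all(
    names.map(async (name) => {
      try {
        return await checks[name]();
      } catch (err) {
        return { status: 'down', error: err.message };
      }
    })
  );

  const report = {};
  names.forEach((name, i) => {
    report[name] = results[i];
  });

  return {
    status: overallStatus(results.map((r) => r.status)),
    checks: report,
  };
}
```

Keeping the reduction in its own function makes it easy to change what counts as ‘unhealthy’ later without touching the individual checks.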

If you’re confident with your Nginx prowess, you could also implement the same using extensions. This has the benefit of providing data directly to the load balancer, rather than on isolated instances.

Monitoring Database Metrics

The database is the backbone of your application, making it the most important part of your stack to monitor and analyze. Different databases will have different monitoring requirements, and for this guide, we’re going to concentrate on Postgres.

A lot of functionality can be optionally enabled using Postgres extensions. When enabled, pg_stat_statements silently records queries run against your database, strips out some variables from them and saves useful information about each query, e.g. query time, affected rows and the query itself.

To enable pg_stat_statements, connect to your database and run the statement: CREATE EXTENSION pg_stat_statements;

Then head over to your postgresql configuration file (found in /etc/postgresql//main/postgresql.conf for Ubuntu 18.04+ users)

And add the following lines:
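A typical set of lines looks like the fragment below. Only shared_preload_libraries is required to load the module; the track and max values shown are assumptions that you should tune for your own workload.

```
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
```

A change to shared_preload_libraries only takes effect after the Postgres server is restarted.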

Now, every query we make will be recorded in pg_stat_statements, and will look like:
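To see the recorded rows for yourself, you can select a few columns from the view. Here is a minimal sketch from Node, assuming a db object with a pg-style query method (for example a Pool from the pg package). The function name is hypothetical, and the timing column names depend on your server version, as noted in the comments.

```javascript
// Column names differ by server version: PostgreSQL 13+ exposes
// total_exec_time / mean_exec_time, while 12 and earlier use
// total_time / mean_time. Adjust these two constants to match.
const TOTAL_TIME_COLUMN = 'total_exec_time';
const MEAN_TIME_COLUMN = 'mean_exec_time';

// `db` is assumed to be anything with a pg-style query(text, values)
// method that resolves to { rows }.
async function readRecordedStatements(db, limit = 5) {
  const sql = `
    SELECT query, calls, rows,
           ${TOTAL_TIME_COLUMN} AS total_ms,
           ${MEAN_TIME_COLUMN} AS mean_ms
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT $1`;
  const result = await db.query(sql, [limit]);
  return result.rows;
}
```

Each returned row carries the normalized query text, how many times it was called, the rows it affected and its timings in milliseconds.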

To learn more about what each of these fields means, visit the Postgres documentation.

Creating a health-check route with Express

Our application is going to be created using express-starter and will have the following dependencies:

  • pg – used to connect and interact with a Postgres database.
  • dotenv – for loading environment variables

To install them, run: npm install pg dotenv

Create a file named ‘.env’ in the root directory of your project and add the following variables
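When a pg Pool or Client is created without explicit settings, it falls back to the standard PG* environment variables, so a file along these lines should work. Every value below is a placeholder to replace with your own.

```
PGHOST=localhost
PGPORT=5432
PGUSER=your_db_user
PGPASSWORD=your_db_password
PGDATABASE=your_db_name
```

Remember to call require('dotenv').config() before creating the connection, so the variables are loaded into process.env.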

These will be used by pg to connect to our database.

A simple check we could perform is to ensure that the database is simply up and accepts queries. Here is the method we’ll use to perform this check:
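A minimal sketch of such a check is shown below. The function name checkDatabase and the shape of the returned object are assumptions for illustration. The db argument is assumed to expose a pg-style async query method.

```javascript
// `db` is assumed to expose a pg-style async query(text) method,
// for example: const { Pool } = require('pg'); const db = new Pool();
async function checkDatabase(db) {
  try {
    // Any successful read shows that the server accepts connections and
    // that pg_stat_statements is installed and loaded.
    await db.query('SELECT 1 FROM pg_stat_statements LIMIT 1');
    return { status: 'up' };
  } catch (err) {
    // Connection failures and missing-extension errors both end up here.
    return { status: 'down', error: err.message };
  }
}
```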

This code tries to query for stats from pg_stat_statements and if an error occurs, it marks the database as being down and returns an error message.

We can make this a bit more powerful by leveraging the wealth of information available at our disposal from pg_stat_statements. For instance, if we wanted to track the total amount of time (in hours) and the average time (in milliseconds) queries take, we could use the following query:
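One way to write such a query is sketched below. It is wrapped in a small Node helper so the timing column can be switched per server version. The helper names are hypothetical, and db is again assumed to be a pg-style client. pg_stat_statements reports times in milliseconds, hence the division to get hours.

```javascript
// Builds the timing query. `totalTimeColumn` is 'total_exec_time' on
// PostgreSQL 13+ and 'total_time' on 12 and earlier.
function buildTimingQuery(totalTimeColumn = 'total_exec_time') {
  // Only allow the two known column names, since the value is
  // interpolated into the SQL text.
  if (!['total_exec_time', 'total_time'].includes(totalTimeColumn)) {
    throw new Error(`Unexpected column: ${totalTimeColumn}`);
  }
  return `
    SELECT query,
           calls,
           ${totalTimeColumn} / 1000 / 60 / 60 AS total_hours,
           ${totalTimeColumn} / calls AS avg_ms
    FROM pg_stat_statements
    ORDER BY total_hours DESC
    LIMIT 100`;
}

// Runs the query and returns one row per recorded statement.
async function fetchQueryTimings(db, totalTimeColumn) {
  const result = await db.query(buildTimingQuery(totalTimeColumn));
  return result.rows;
}
```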

You could then set up an alert (an email, for instance) for any queries that run consistently slow. Note: this guide treats any query that averages more than 1ms as ‘slow’. That is a strict threshold, so adjust it to fit your specific requirements.

Let’s create a function that will filter out long-running queries and mark the database as unhealthy.
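A minimal sketch of such a function follows. To keep the example independent of any database client, it takes a getTimings argument: a hypothetical async function that resolves to rows shaped like { query, avg_ms }. The returned object shape is also an assumption.

```javascript
// `getTimings` is a hypothetical async function resolving to rows shaped
// like { query: string, avg_ms: number }, e.g. read from pg_stat_statements.
async function analyzeDatabaseHealth(getTimings, slowThresholdMs = 1) {
  let rows;
  try {
    rows = await getTimings();
  } catch (err) {
    // The database could not be queried at all.
    return { status: 'down', error: err.message };
  }

  // Number() guards against drivers that return numeric values as strings.
  const slowQueries = rows.filter(
    (row) => Number(row.avg_ms) > slowThresholdMs
  );

  if (slowQueries.length > 0) {
    // Still answering queries, so 'unhealthy' rather than 'down'.
    return { status: 'unhealthy', slowQueries };
  }
  return { status: 'up' };
}
```

Passing the threshold as a parameter keeps the 1ms default easy to override per environment.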

And here is how you’d use it:
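A sketch of the route wiring might look like this. The handler is written in the Express (req, res) style and receives the database check as an argument, so it does not depend on a particular client. Responding with 503 when something is down is an assumption made here so that load balancers can react to the status code alone. The Redis and third-party checks are placeholder stubs.

```javascript
// Stub checks standing in for real ones; each resolves to { status: ... }.
const analyzeRedis = async () => ({ status: 'up' });
const analyzeThirdPartyConnection = async () => ({ status: 'up' });

// Returns an Express-style (req, res) handler. `analyzeDatabaseHealth` is
// any async function that resolves to an object with a `status` field.
function createHealthHandler(analyzeDatabaseHealth) {
  return async function healthHandler(req, res) {
    try {
      const [database, redis, thirdParty] = await Promise.all([
        analyzeDatabaseHealth(),
        analyzeRedis(),
        analyzeThirdPartyConnection(),
      ]);
      const checks = { database, redis, thirdParty };
      const statuses = Object.values(checks).map((c) => c.status);

      let status = 'up';
      if (statuses.includes('unhealthy')) status = 'unhealthy';
      if (statuses.includes('down')) status = 'down';

      // Assumption: 503 for 'down', 200 for 'up' and 'unhealthy'.
      res.status(status === 'down' ? 503 : 200).json({ status, checks });
    } catch (err) {
      // A check that throws unexpectedly is treated as 'down'.
      res.status(503).json({ status: 'down', error: err.message });
    }
  };
}

// Usage with Express (not executed here):
//   app.get('/health', createHealthHandler(() => analyzeDatabaseHealth(pool)));
```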

This returns responses such as:

All the code is commented, but let’s go over it again.

  • When the /health route is called, three core checks are run to ensure the API responds correctly. A check that returns an ‘up’ status is considered healthy. A ‘down’ status on any of the core checks indicates the API cannot respond to requests at all.
  • The analyzeDatabaseHealth method filters for any queries that take more than 1ms on average to complete and, if it finds any, reports the database as unhealthy; otherwise the database is marked as healthy. Note that an unhealthy database isn’t marked as being down, since it still executes queries.
  • Any number of checks can be integrated into the system – analyzeRedis and analyzeThirdPartyConnection are stub methods that represent this.

Conclusion

This guide took a brief look at how to implement a health-check route for your NodeJS application. Remember – this isn’t meant to be a fully-fledged solution for your API monitoring, but it can make a good entry point for learning how to deliver metrics that fulfil your expectations.

Other important metrics you might want to track include:

  • Uptime
  • Cache-hit ratio
  • Remaining disk space
  • Used vs available RAM
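Two of these, uptime and memory usage, can be read straight from Node’s standard library. The sketch below uses a hypothetical function name and field names, and covers only those two metrics. Cache-hit ratio and disk space need database- or platform-specific queries.

```javascript
const os = require('os');

// Collects process uptime and memory figures using only built-in modules.
function collectProcessMetrics() {
  const totalBytes = os.totalmem();
  const freeBytes = os.freemem();
  return {
    // Seconds since this Node process started (not the machine's uptime).
    uptimeSeconds: Math.round(process.uptime()),
    memory: {
      totalBytes,
      freeBytes,
      usedBytes: totalBytes - freeBytes,
      // Resident set size of this process alone.
      processRssBytes: process.memoryUsage().rss,
    },
  };
}
```

The returned object could be attached to the health-check response alongside the other checks.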

Source code of the lesson https://github.com/Bradleykingz/nodejs-health-check
