When an API receives more traffic than it can handle, it can crash or become unresponsive and stay that way until someone brings it back online. This is referred to as downtime. One of the most important aspects of maintaining a production application is minimizing downtime.
Downtime can arise when a crucial part of your app breaks and brings down the whole system. For instance, if the database crashes, your application will no longer be able to properly service most requests. The best way to minimize downtime is to collect data about your running servers, giving you room to react before anything goes wrong.
The most robust way to achieve metrics tracking is via third-party tools such as the TICK stack (Telegraf, InfluxDB, Chronograf and Kapacitor), Sentry and Prometheus. However, each of these tools requires a significant time investment to set up, use consistently and maintain.
A simpler way to track an API is to create a single route that responds with a simple “UP” or “DOWN” message to indicate whether key parts of our application are working or not. This could be useful when debugging often cryptic client-side errors.
For example, rather than a bare 500 error message, an authorized user can see additional details about the errors the server is experiencing. It also gives us room to react to a server that is ‘unhealthy’ (i.e. in need of attention) but not necessarily ‘down’.
Why create a health-check route?
The most compelling reason to have a health-check route is in order to provide an API-accessible way to determine how ‘healthy’ the API is. This information could then be used in different ways:
- Fed to platforms like Statuspage.io to create automated incident reports.
- Used by a load balancer to decide whether to route traffic to an instance.
- Used by client apps to determine what kind of error messages to display to users.
- Used by developers to debug server-side errors at the route level.
Tools like Prometheus have their place in a production environment, and should probably be used to some degree. However, they are usually a huge undertaking – from configuration to maintenance – so many developers tend to avoid them or aren’t very consistent in their use. Nonetheless, shipping an application without any uptime/downtime monitoring isn’t a great option either.
A health-check route is a simple compromise that can either be built on top of or scrapped for a more mature and robust solution at later stages. The goal here isn’t to create a fully-fledged status monitoring system for your API. Rather, it’s to enforce some level of monitoring early into an application. Some monitoring is better than none at all.
How it works
Your server’s health could depend on any number of metrics. Slow response and long query times are good indicators of an ‘unhealthy’ server, for example.
In this guide, we will first focus on simple core metrics, such as whether your database accepts connections, and get to more complicated issues like query times later on. Services such as Redis instances and other microservices can be handled in a similar fashion, but we won’t cover those here.
We’re going to create a single route that is responsible for collecting any data from the underlying server and returning an appropriate response.
If you’re confident in your Nginx prowess, you could also implement a similar check using Nginx modules. This has the benefit of providing data directly to the load balancer, rather than on isolated instances.
Monitoring Database Metrics
The database is the heart of your application, making it the most important part of your stack to monitor and analyze. Different databases will have different monitoring requirements, and for this guide, we’re going to concentrate on Postgres.
A lot of functionality can be optionally enabled using Postgres extensions. Once enabled, pg_stat_statements silently records the queries run against your database, normalizes them by stripping out variable values, and saves useful information about each one, e.g. query time, affected rows and the query text itself.
To enable pg_stat_statements, run the following command in your database:
CREATE EXTENSION pg_stat_statements;
Then head over to your PostgreSQL configuration file (found at /etc/postgresql//main/postgresql.conf for Ubuntu 18.04+ users) and add the following lines:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000

Note that changing shared_preload_libraries requires a restart of the Postgres server before it takes effect.
Now, every query we make will be recorded in pg_stat_statements, and will look like:
{
  "userid": 16384,
  "dbid": 24576,
  "queryid": "-1144474632245246520",
  "query": "SELECT NOW()",
  "calls": "11",
  "total_time": 0.15964,
  "min_time": 0.010483,
  "max_time": 0.037042000000000005,
  "mean_time": 0.014512727272727271,
  "stddev_time": 0.007268825078135321,
  "rows": "11",
  "shared_blks_hit": "0",
  "shared_blks_read": "0",
  "shared_blks_dirtied": "0"
}
To learn more about what each of these fields means, visit the Postgres documentation.
Creating a health-check route with Express
Our application is going to be created using express-starter and will have the following dependencies:
- pg – used to connect and interact with a Postgres database.
- dotenv – for loading environment variables.
To install them, run
npm install pg dotenv
# or
yarn add pg dotenv
Create a file named ‘.env’ in the root directory of your project and add the following variables:
PGUSER= # postgres user
PGPASSWORD= # postgres password
PGDATABASE= # postgres database
These will be used by pg to connect to our database.
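For reference, here is a minimal sketch of how the connection pool used in the snippets below could be set up. The db.js filename is only an assumption for illustration; pg reads the PG* variables from the environment automatically once dotenv has loaded them.

// db.js – a minimal, shared connection pool (assumed filename).
require('dotenv').config(); // load PGUSER, PGPASSWORD and PGDATABASE from .env

const { Pool } = require('pg');

// pg picks the PG* variables up from process.env,
// so no explicit configuration object is needed here.
const pool = new Pool();

module.exports = { pool };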
A simple check we could perform is to ensure that the database is up and accepting queries. Here is the method we’ll use to perform this check:
// 'pool' is the pg connection pool created earlier (e.g. exported from db.js).
const analyzeDatabaseHealth = async function () {
  let status = "up";
  let name = "database";
  try {
    // Try to make a query
    const result = await pool.query('SELECT * FROM pg_stat_statements');

    // if no rows come back, no queries have been recorded yet
    if (!result.rows.length) return {
      status,
      name,
      condition: {
        health: "unhealthy",
        cause: "no stats"
      }
    };

    // otherwise, our database is okay and accepts queries
    return {
      status,
      name,
      condition: {
        health: "healthy"
      }
    };
  } catch (e) {
    // If the query fails, pg_stat_statements is not enabled
    // or some other database error occurred.
    // Mark the database as 'down'.
    status = 'down';
    return {
      status,
      name,
      condition: {
        health: "down",
        cause: "unable to execute queries"
      }
    };
  }
};
This code tries to query for stats from pg_stat_statements and if an error occurs, it marks the database as being down and returns an error message.
We can make this a bit more powerful by leveraging the wealth of information pg_stat_statements puts at our disposal. For instance, if we wanted to track the total amount of time (in minutes) and the average time (in milliseconds) queries take, we could use the following query (note that on PostgreSQL 13 and later, total_time was split into total_exec_time and total_plan_time, so adjust the column names accordingly):
SELECT
  (total_time / 1000 / 60) as total_time,
  (total_time / calls) as avg_time,
  query
FROM pg_stat_statements
ORDER BY 1 DESC
LIMIT 100;
And set up an alert (an email, for instance) for any queries that consistently run slow. Note: for this guide, we’ll treat any query that takes more than 1ms on average as ‘slow’. This threshold can be adjusted to fit your specific requirements.
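As a rough illustration of what such an alert could look like, here is a sketch that emails a summary of slow queries using nodemailer. The SMTP_HOST, SMTP_USER, SMTP_PASS and ALERT_EMAIL environment variables, the function name and the one-millisecond threshold are all assumptions for the example.

const nodemailer = require('nodemailer');
// 'pool' is the shared pg pool from earlier.

// Emails a plain-text list of queries whose average time exceeds the threshold.
async function alertOnSlowQueries(thresholdMs = 1) {
  const { rows } = await pool.query(`
    SELECT (total_time / calls) AS avg_time, query
    FROM pg_stat_statements
    ORDER BY 1 DESC
    LIMIT 20;
  `);

  const slow = rows.filter(row => Number(row.avg_time) > thresholdMs);
  if (!slow.length) return;

  const transporter = nodemailer.createTransport({
    host: process.env.SMTP_HOST, // hypothetical SMTP settings
    port: 587,
    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
  });

  await transporter.sendMail({
    from: process.env.SMTP_USER,
    to: process.env.ALERT_EMAIL,
    subject: `${slow.length} slow queries detected`,
    text: slow.map(row => `${row.avg_time} ms – ${row.query}`).join('\n'),
  });
}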
Let’s create a function that will filter out long-running queries and mark the database as unhealthy.
const analyzeDatabaseHealth = async function () {
  let status = "up";
  let name = "database";
  try {
    // run the query
    const { rows: stats } = await pool.query(`
      SELECT
        (total_time / 1000 / 60) as total_time,
        (total_time / calls) as avg_time,
        query
      FROM pg_stat_statements
      ORDER BY 2 DESC
      LIMIT 1;
    `);

    // if there are no stats, something is probably wrong
    if (!stats.length) return {
      status,
      name,
      condition: {
        health: "unhealthy",
        cause: "no stats"
      }
    };

    // flag any query that takes more than 1ms on average
    const longRunningQueries = stats.filter(queryObj => {
      return queryObj.avg_time > 1;
    });

    // mark the database as unhealthy
    if (longRunningQueries.length) return {
      status,
      name,
      condition: {
        health: "unhealthy",
        cause: "long-running queries"
      }
    };

    // if there are no errors, the database is fine
    return {
      status,
      name,
      condition: {
        health: "healthy"
      }
    };
  } catch (e) {
    // mark the database as 'down'
    status = 'down';
    return {
      status,
      name,
      condition: {
        health: "down",
        cause: "unable to execute queries"
      }
    };
  }
};
And here is how you’d use it:
router.get("/health", async function (req, res) {
  let status = "up";
  let cause = "";

  const health = {
    // aggregate status
    status: "up",
    services: [],
  };

  const coreMetrics = [
    analyzeDatabaseHealth,       // method analyzing database health
    analyzeRedis,                // method analyzing redis
    analyzeThirdPartyConnection, // method analyzing a third-party dependency
  ];

  for (let i = 0; i < coreMetrics.length; i++) {
    // run every check in turn. Note that each is async.
    let func = coreMetrics[i];
    const metricResult = await func();

    // if any core metric is down, mark the API as down,
    // since some requests won't be processed.
    if (metricResult.status === "down") {
      status = "down";
      cause = `dependent service '${metricResult.name}' is down`;
    }

    // add this result to the 'services' array
    health.services.push(metricResult);
  }

  // if status is not 'up', we set it to 'down' and state a cause.
  if (status !== "up") {
    health.status = status;
    health.cause = cause;
  }

  // return the api health.
  return res.send(health);
});
This returns responses such as:
{
  "status": "up",
  "services": [
    {
      "status": "up",
      "name": "database",
      "condition": {
        "health": "healthy"
      }
    },
    {
      "name": "redis",
      "status": "up",
      "condition": {
        "health": "healthy"
      }
    },
    {
      "name": "third-party",
      "status": "up",
      "condition": {
        "health": "healthy"
      }
    }
  ]
}
All the code is commented, but let’s go over it again.
- When the /health route is called, the three core metric checks are run to ensure the API can respond correctly. If every check returns an ‘up’ status, the API is considered healthy. A ‘down’ status on any of the core metrics indicates the API cannot respond to requests at all.
- The analyzeDatabaseHealth method filters for any queries that take more than 1ms on average to complete and reports the database as unhealthy if it finds any. Note that the database isn’t marked as being down, since it still executes queries. It is marked as healthy otherwise.
- Any number of checks can be integrated into the system – analyzeRedis and analyzeThirdPartyConnection are stub methods that represent this; a sketch of one such check follows this list.
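For completeness, here is a rough sketch of what one of those stub methods might look like, assuming the ioredis client is used to reach a Redis instance; any client with a ping command would work the same way.

const Redis = require('ioredis');
const redis = new Redis(); // connects to localhost:6379 by default

// Reports Redis as up if it answers a PING, and down otherwise.
const analyzeRedis = async function () {
  let name = "redis";
  try {
    await redis.ping();
    return { status: "up", name, condition: { health: "healthy" } };
  } catch (e) {
    return { status: "down", name, condition: { health: "down", cause: "unable to reach redis" } };
  }
};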
Conclusion
This guide took a brief look at how to implement a health-check route for your Node.js application. Remember – this isn’t meant to be a fully-fledged API-monitoring solution, but it can be a good entry point to learning how to deliver metrics that fit your needs.
Other important metrics you might want to track (two of which are sketched below) include:
- Uptime
- Cache-hit ratio
- Remaining disk space
- Used vs available RAM
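Uptime and memory usage, for instance, can be read straight from Node’s built-in APIs, so they make easy additions to the coreMetrics array. Here is a minimal sketch of such a check; the checkSystemResources name and the 90% memory threshold are assumptions for illustration.

const os = require('os');

// Reports process uptime and flags the host as unhealthy
// when more than 90% of system memory is in use.
const checkSystemResources = async function () {
  const usedMemoryRatio = 1 - os.freemem() / os.totalmem();

  return {
    status: "up",
    name: "system",
    condition: {
      health: usedMemoryRatio > 0.9 ? "unhealthy" : "healthy",
      uptimeSeconds: Math.round(process.uptime()),
      memoryUsedPercent: Math.round(usedMemoryRatio * 100),
    },
  };
};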
Source code for this lesson: https://github.com/Bradleykingz/nodejs-health-check