Monitoring a UPS with Telegraf and Grafana


Our power supply is normally pretty reliable, but last week we had an outage.

Although we've got solar, we don't (currently) have an islanding switch, so when the grid goes down, so do we.

This power outage only lasted about 45 minutes, but came at a really bad time: I was due to be interviewing someone, so had to try and get signal so that I could at least send an SMS and tell them that we'd need to re-schedule.

I used to have a UPS, but didn't replace it after the battery reached end-of-life - at the time we had a young child in the house, so having something be persistently energised seemed like quite a bad idea.

That's no longer a concern though, so I decided that it was time to plug important things (laptop, switch, router etc.) into a UPS - partly to protect them from damage, but also so that there's something that I can do during an outage (this week, I couldn't do much more than sit and work my way through a Toblerone).

This post details the process of installing Network UPS Tools (NUT) and configuring Telegraf to collect metrics from it, allowing graphing and alerting in Grafana.

The UPS

It doesn't matter too much what model of UPS you have: NUT supports a wide range of kit. Mine has a USB connection, so we're using NUT's usbhid-ups driver.

My UPS is a Powerwalker VI Series UPS and shows up in lsusb like this:

Bus 006 Device 015: ID 0764:0601 Cyber Power System, Inc. PR1500LCDRT2U UPS
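The two hex values after ID are the USB vendor and product IDs, and they reappear as vendorid and productid in the NUT config further down. As a minimal sketch (the helper name parse_lsusb_ids is made up for illustration), they can be pulled out of an lsusb line like this:

```python
import re


def parse_lsusb_ids(line):
    """Return (vendor_id, product_id) from one line of lsusb output, or None."""
    match = re.search(r"\bID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\b", line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


line = "Bus 006 Device 015: ID 0764:0601 Cyber Power System, Inc. PR1500LCDRT2U UPS"
print(parse_lsusb_ids(line))  # ('0764', '0601')
```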

The UPS has 4 mains plug sockets on the back, so I've got a few things plugged in:

My router/firewall (our fiber ONT is in a different room and has its own battery backup)
My main switch
My NAS
An external HDD array
The extension lead which runs my desk

Running my desk means that it has to power a couple of monitors and a couple of laptops.

This isn't quite as bad as it sounds though:

If I'm not at my desk, the monitors will be off and the laptops will be (relatively) idle
If I am at my desk, the plan is to unplug the laptops and have them run off battery so that they're not using the UPS's capacity

NUT setup

Installing

NUT is in the Ubuntu repos, so:

sudo apt update
sudo apt install nut nut-client nut-server

Next we confirm that NUT can actually see the UPS:

sudo nut-scanner -U

If all is well, this'll write out a config block:

[nutdev1]
    driver = "usbhid-ups"
    port = "auto"
    vendorid = "0764"
    productid = "0601"
    product = "2200"
    serial = "11111111111111111111"
    vendor = "1"
    bus = "006"

We need to write that into NUT's config, so invoke again but redirect:

sudo nut-scanner -UNq 2>/dev/null | sudo tee -a /etc/nut/ups.conf

The name nutdev1 isn't particularly informative, though, so we can also hand edit ups.conf to change it (and add a desc attribute to provide a description of the UPS):

sudo nano /etc/nut/ups.conf

I set mine like this:

[deskups]
    desc = "Cyber Power System UPS"
    driver = "usbhid-ups"
    port = "auto"
    vendorid = "0764"
    productid = "0601"
    product = "2200"
    serial = "11111111111111111111"
    vendor = "1"
    bus = "006"

Make a note of the name (the bit in square brackets), as we'll need it shortly.

Setting Up For Monitoring

Next, we want to set up credentials for the NUT server.

I used my gen_passwd utility to generate a random password, but use whatever method suits you:

NUT_PW=`gen_passwd 24 nc`
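If you don't have a password generator to hand, Python's standard secrets module will do the job. This is only a sketch of an alternative, not a description of how gen_passwd works: the function name and the alphanumeric-only alphabet are assumptions (sticking to letters and digits avoids characters that would need quoting in NUT's config files).

```python
import secrets
import string


def generate_password(length=24):
    """Generate a random alphanumeric password from a cryptographically secure source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(generate_password())
```

Saved to a file (say, a hypothetical gen_pw.py), the result can be captured in the same way: NUT_PW=$(python3 gen_pw.py)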

Create the user:

echo -e "\n[monitor]\n\tpassword = ${NUT_PW}\n\tupsmon master\n" | sudo tee -a /etc/nut/upsd.users

Now provide the credentials to upsmon. Change the value of UPS_NAME to match the one that you set for the UPS in ups.conf earlier:

# Change to match the name in ups.conf
UPS_NAME="deskups"
echo -e "\nMONITOR $UPS_NAME@localhost 1 monitor $NUT_PW master\n" | sudo tee -a /etc/nut/upsmon.conf

Keep a note of the UPS name and these credentials; we'll need them again when configuring Telegraf.

Configure NUT to run as a netserver (so that Telegraf can talk to it):

sudo sed -e 's/MODE=none/MODE=netserver/' -i /etc/nut/nut.conf

Restart services:

for i in nut-server nut-client nut-driver nut-monitor
do
    sudo systemctl restart $i
done

Confirm that the NUT server (upsd) is listening:

$ sudo netstat -lnp | grep 3493
tcp        0      0 127.0.0.1:3493          0.0.0.0:*               LISTEN      3854210/upsd
tcp6       0      0 ::1:3493                :::*                    LISTEN      3854210/upsd
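If netstat isn't installed, the same check can be approximated by simply trying to open a TCP connection to the port. This is a rough sketch rather than a replacement for the command above (it can't tell you which process owns the socket), and the function name is made up for illustration:

```python
import socket


def is_listening(host="127.0.0.1", port=3493, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts
        return False


print(is_listening())
```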

Check that we get data back about the UPS:

upsc $(upsc -l 2>/dev/null) 2>/dev/null
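upsc prints one "name: value" pair per line, which makes the output easy to consume from a script. A minimal sketch of a parser follows; the function name is hypothetical and the sample text is illustrative rather than captured from the UPS in this post:

```python
def parse_upsc(output):
    """Parse upsc's 'name: value' lines into a dict of strings."""
    variables = {}
    for line in output.splitlines():
        # Variable names never contain a colon, so split on the first one only
        name, sep, value = line.partition(":")
        if sep:
            variables[name.strip()] = value.strip()
    return variables


# Illustrative sample only
sample = """battery.charge: 100
ups.load: 12
ups.status: OL
"""

print(parse_upsc(sample)["ups.status"])  # OL
```

To run it against live data, feed it the stdout of something like subprocess.run(["upsc", "deskups"], capture_output=True, text=True), swapping in your own UPS name.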

If all is well, we're ready to move on to collecting data.

Collection and Visualisation

With NUT now able to report on the UPS, the next step is to have that data collected so that we can visualise it and (optionally) alert based upon it.

Telegraf

We're going to use the upsd input plugin to talk to NUT. This was introduced in Telegraf v1.24.0 so, if you're using an existing install, make sure that your Telegraf is recent enough:

telegraf version
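If you want to script that check, compare the version as a tuple of integers rather than as a string (as strings, "1.9.0" sorts after "1.24.0"). The sketch below assumes the output contains a version number in x.y.z form; the sample string is hypothetical and the exact wording may differ between builds.

```python
import re

MINIMUM = (1, 24, 0)  # release that introduced the upsd input, per the text above


def parse_version(text):
    """Extract the first x.y.z version number from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_upsd(version_output):
    """Return True if the reported version is at least MINIMUM."""
    return parse_version(version_output) >= MINIMUM


# Hypothetical sample output
print(supports_upsd("Telegraf 1.28.5 (git: HEAD@abcdef12)"))  # True
```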

If you don't have Telegraf, there are install instructions here (note: you're also going to want an InfluxDB instance or free cloud account because the Dashboard that we'll use for visualisation uses Flux).

The input plugin is pretty simple to configure. Append the following to /etc/telegraf/telegraf.conf:

[[inputs.upsd]]
  ## A running NUT server to connect to.
  ## IPv6 addresses must be enclosed in brackets (e.g. "[::1]")
  server = "127.0.0.1"
  port = 3493
  # The values for these are found in /etc/nut/upsmon.conf
  # (the username is the account added to upsd.users earlier, "monitor",
  # rather than the UPS name)
  username = "monitor"
  password = "[redacted]"
  additional_fields = ["*"]

# Map enum values according to given table.
##
## UPS beeper status (enabled, disabled or muted)
## Convert 'enabled' and 'disabled' values back to string from boolean
[[processors.enum]]
  [[processors.enum.mapping]]
    field = "ups_beeper_status"
    [processors.enum.mapping.value_mappings]
      true = "enabled"
      false = "disabled"

After restarting (or reloading) Telegraf, you should start to see metrics appearing in InfluxDB:

Screenshot of Chronograf showing the upsd measurement and some of the fields

Visualisation

I use Grafana for visualisation and, conveniently, there was already a community dashboard (the source for which can be found on GitHub).

On the community page, click Download JSON.

Then, in Grafana:

New Dashboard
Import JSON
Drag the JSON file over

You'll be presented with a set of options for the Dashboard - choose the relevant InfluxDB datasource to query against:

Screenshot of the dashboard import page

You'll then be taken to the dashboard itself.

It's quite likely that the dashboard will be broken - by default it looks for a bucket called upsd-Telegraf (I write into a bucket called telegraf).

To fix it, go to:

Settings
Variables
bucket

Scroll down to find Values separated by comma and change it to contain the name of your bucket.

Screenshot of the values separated by comma field having been overridden to use the name telegraf

Click Back to Dashboard and the dashboard should now load:

Screenshot of the dashboard

I already track electricity costs, plus we're on a 30 minute tariff, so I also edited the dashboard to remove the cost-related row (and then the associated variables).

Alerting

The upsd measurement contains a field called ups_status which will normally be OL (online).

If the mains cuts out (or someone unplugs it to test behaviour...) the value will change to report that the UPS is running from battery:

Screenshot showing the field change value after the UPS was unplugged

Note: The new state OB DISCHRG isn't actually a single status: it's reporting two (closely related) status flags.

After power is restored, the UPS reports itself back online but also notes that the battery is now charging:

Screenshot of the new state - OL CHRG

This means that creating an alert is not as simple as if r.ups_status != "OL".

I also only really wanted an email notification to warn me of the following status symbols:

We're running from battery (flag: OB)
The UPS is reporting an alarm (flag: ALARM)
The UPS is reporting that the battery charge is too low (flag: LB)
The UPS is reporting overload (flag: OVER)
The UPS requires battery replacement (flag: RB)

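The status symbols defined in RFC 9271 work in our favour here: none of the flags listed above appears as a sub-string of any other defined symbol (that isn't true of every symbol, as CHRG is contained within DISCHRG), so we can safely do something like: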

for flag in ["OB", "ALARM", "LB", "OVER", "RB"]:
    if flag in ups.status:
        alarm()
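A runnable variant of that idea is sketched below. Rather than sub-string matching, it splits the status on whitespace and compares whole tokens, so the result doesn't depend on how symbols happen to overlap. The function name is made up for illustration, and "OB LB DISCHRG" is a plausible combination rather than one observed in this post.

```python
ALERT_FLAGS = {"OB", "ALARM", "LB", "OVER", "RB"}


def alerting_flags(status):
    """Return the alert-worthy flags present in a ups.status string."""
    return ALERT_FLAGS & set(status.split())


for status in ["OL", "OL CHRG", "OB DISCHRG", "OB LB DISCHRG"]:
    flags = alerting_flags(status)
    print(status, "->", sorted(flags) if flags else "ok")
```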

Of course, to do that with Grafana's alerting we need to translate the logic into a Flux query:

// Define the regex to use when checking for alertable states
alarm_regex = /(OB|LB|OVER|RB|ALARM)/

// Extract reported status
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "upsd")
  |> filter(fn: (r) => r["_field"] == "ups_status")
  |> group(columns: ["ups_name", "_field"])
  |> keep(columns: ["_time", "_value", "_field", "ups_name"])
  |> aggregateWindow(every: 1m, fn: last, createEmpty: false)

  // Identify whether the status contains any flags of concern
  // Grafana alerting requires the main column to be numeric
  // so we need to shuffle things around
  |> map(fn: (r) => ({
    _time: r._time,
    //flags: r._value,
    ups_name: r.ups_name,
    _value: if r._value =~ alarm_regex then 1 else 0
  }))
  |> group(columns: ["ups_name"])

The return values of this query are based on whether any of the problematic flags exist - if they don't, it'll return 0, if they do the value will be 1.
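To sanity-check the regex outside of Grafana, the map() step can be mirrored in a few lines of Python. This is only a sketch of the same test (an unanchored regex match, like Flux's =~ operator), using the status strings seen earlier in this post:

```python
import re

# Same alternation as alarm_regex in the Flux query
ALARM_REGEX = re.compile(r"(OB|LB|OVER|RB|ALARM)")


def alert_value(status):
    """Mirror the Flux map(): 1 if any flag of concern is present, else 0."""
    return 1 if ALARM_REGEX.search(status) else 0


for status in ["OL", "OL CHRG", "OB DISCHRG"]:
    print(f"{status!r} -> {alert_value(status)}")
```

Only the on-battery state yields 1; both OL and OL CHRG map to 0, so the alert clears once mains power returns even though the battery is still charging.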

This allows use of a simple threshold in the Grafana alerting config:

Screenshot of the Grafana alert config

With the alert saved, I unplugged the UPS and waited:

Screenshot of the alert, it's moved to a pending state

A minute later, the alert was escalated to PagerDuty:

Screenshot of the notification email from PagerDuty

A couple of minutes after plugging the UPS back in, the alert recovered.

Conclusion

Setting up monitoring of the UPS was pretty easy - NUT supports a wide range of devices and exposes status in a standardised way.

NUT is well supported by Telegraf and there was already a community dashboard available to visualise UPS status.

This means that, in practice, the hardest part of all of this was fishing the relevant power leads out of the rack to plug into the back of the UPS.

Now, if the power fails, I should (depending on whether our fiber connection is still lit up) get a page to warn me. Either way, the UPS will provide some coverage for small outages.
