Replacement Server Monitoring – Part 2: Building the replacement
This is part two of a three part series of blog posts about picking a replacement monitoring solution, getting it running and ready, and finally moving our customers over to it.
In our last post we discussed our need for a replacement monitoring system and our pick for the software stack we were going to build it on. If you haven’t already, you should go and read that before continuing with this blog post.
This post aims to detail the setup and configuration of the different components to work together, along with some additional customisations we made to get the functionality we wanted.
Component Installation
As mentioned in the previous entry in this series, InfluxData, the creators of the TICK stack, provide package repositories where pre-built, ready to use packages are available. This eliminates the need to configure and compile source code before we can use the software: it can be installed and run with a few commands and very predictable results, as opposed to the often many commands needed for compilation, with sometimes wildly varying results. Great stuff.
All components are available from the same repository. Here's how you install them (the example shown is for an Ubuntu 16.04 "Xenial" system):
curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt-get update && sudo apt-get install influxdb
sudo systemctl start influxdb
The above steps are also identical for the other components, Telegraf, Chronograf and Kapacitor. You’ll just need to replace “influxdb” with the correct name in lines 4 and 5.
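For example, Telegraf would be installed and started from the same repository like so:

sudo apt-get update && sudo apt-get install telegraf
sudo systemctl start telegraf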
Configuring and linking the components
As all of the components are created by the same people, InfluxData, linking them together is fortunately very easy (another reason we went with the TICK stack). I'll show you what additional configuration was put in place for each component and how we then linked them together. Note that the components are covered out of order here, as the configuration of some components is a prerequisite to linking them to another.
InfluxDB
The main change that we make to InfluxDB is to have it listen for connections over HTTPS, meaning any data flowing to/from it will be encrypted. (To do this, you will need an SSL certificate and key pair to use; obtaining that cert/key pair is outside the scope of this blog post.) We also require authentication for logins, and disable the query log. We then restart InfluxDB for these changes to take effect.
sudo vim /etc/influxdb/influxdb.conf

[http]
  enabled = true
  bind-address = "0.0.0.0:8086"
  auth-enabled = true
  log-enabled = false
  https-enabled = true
  https-certificate = "/etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem"

sudo systemctl restart influxdb
Note that the path used for the "https-certificate" parameter will, of course, need to exist on your system.
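Unless a separate "https-private-key" is set, InfluxDB uses the "https-certificate" file for the private key as well, so the .pem generally needs to be a bundle of certificate and key. Creating it looks something like this (the source paths are illustrative, not ours):

cat /path/to/fullchain.pem /path/to/privkey.pem | sudo tee /etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem
sudo chown influxdb:influxdb /etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem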
We then need to create an administrative user like so:
influx -ssl -host ivory.dogsbodyhosting.net
> CREATE USER admin WITH PASSWORD 'superstrongpassword' WITH ALL PRIVILEGES
Telegraf
The customisations for Telegraf involve telling it where to report its metrics to, and which metrics to record. We have an automated process, using Ansible, for rolling these customisations out to customer servers, which we'll cover in the next part of this series. Make sure you check back for that. These are essentially the changes that are made:
sudo vim /etc/telegraf/telegraf.d/outputs.conf

[[outputs.influxdb]]
  urls = ["https://reporting-endpoint.dogsbodytechnology.com:8086"]
  database = "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
  username = "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
  password = "supersecurepassword"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
The above dictates that Telegraf should connect securely over HTTPS and tells it the database, username and password to use for its connection.
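Note that the database and user referenced above need to exist in InfluxDB before Telegraf can write to them. Creating them looks something like this (the names and credentials match the example config above, run from an influx session authenticated as the admin user we created earlier):

influx -ssl -host ivory.dogsbodyhosting.net -username admin -password 'superstrongpassword'
> CREATE DATABASE "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
> CREATE USER "3340ad1c-31ac-11e8-bfaf-5ba54621292f" WITH PASSWORD 'supersecurepassword'
> GRANT ALL ON "3340ad1c-31ac-11e8-bfaf-5ba54621292f" TO "3340ad1c-31ac-11e8-bfaf-5ba54621292f"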
We also need to tell Telegraf what metrics it should record. This is configured like so:
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = true

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]

[[inputs.diskio]]
[[inputs.net]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

[[inputs.procstat]]
  pattern = "."
The above tells Telegraf what metrics to report, and customises how they are reported a little. For example, we tell it to ignore some pseudo-filesystems in the disk section, as these aren’t important to us.
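A handy way to sanity-check the inputs before restarting the service is Telegraf's test mode, which collects each metric once and prints it to stdout rather than sending it to InfluxDB (paths assume the default package layout):

telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --test
sudo systemctl restart telegraf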
Kapacitor
The customisations for Kapacitor primarily tell it which InfluxDB instance it should use, and the channels it should use for sending out alerts:
sudo vim /etc/kapacitor/kapacitor.conf

[http]
  log-enabled = false

[logging]
  level = "WARN"

[[influxdb]]
  name = "ivory.dogsbodyhosting.net"
  urls = ["https://reporting-endpoint.dogsbodytechnology.com:8086"]
  username = "admin"
  password = "supersecurepassword"

[pushover]
  enabled = true
  token = "yourpushovertoken"
  user-key = "yourpushoveruserkey"

[smtp]
  enabled = true
  host = "localhost"
  port = 25
  username = ""
  password = ""
  from = "alerts@example.com"
  to = ["sysadmin@example.com"]
As you can probably work out, we use Pushover and email to send and receive our alert messages. This is subject to change over time; during the development phase I used the Slack output.
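Part 3 covers our actual alerting rules, but to give a flavour of how these channels get used, a minimal TICKscript alert might look something like this (the threshold and database name are just example values):

// cpu_alert.tick - alert when a host's CPU is nearly maxed out
stream
    |from()
        .measurement('cpu')
        .where(lambda: "cpu" == 'cpu-total')
    |alert()
        .crit(lambda: "usage_idle" < 10)
        .message('High CPU usage on {{ index .Tags "host" }}')
        .pushover()
        .email('sysadmin@example.com')

The script would then be loaded and switched on with:

kapacitor define cpu_alert -type stream -tick cpu_alert.tick -dbrp 3340ad1c-31ac-11e8-bfaf-5ba54621292f.autogen
kapacitor enable cpu_alert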
Grafana (instead of Chronograf)
Although the TICK stack offers its own visualisation (and control) tool, Chronograf, we ended up using the very popular Grafana instead. At the time we were building the replacement solution, Chronograf, although very pretty, was somewhat lacking in features, and the features that did exist were sometimes buggy. Do note that Chronograf was the only component still in beta at that point; it has since had a full release and another ~5 months of development, so you should definitely try it out for yourself before jumping straight to Grafana. We intend to re-evaluate Chronograf ourselves soon, especially as it can control the other components in the TICK stack, something Grafana does not offer at all.
The Grafana install is pretty straightforward, as it also has a package repository:
sudo vim /etc/apt/sources.list.d/grafana.list

deb https://packagecloud.io/grafana/stable/debian/ jessie main

sudo apt update
sudo apt install grafana
We then of course make some customisations. The important part here is setting the base URL, which is required because we have Grafana running behind an nginx reverse proxy. (We love nginx and use it wherever we get the chance. We won't detail those customisations here though, as they're not strictly related to the monitoring solution, and Grafana works just fine on its own.)
sudo vim /etc/grafana/grafana.ini

[server]
domain = display-endpoint.dogsbodytechnology.com
root_url = %(protocol)s://%(domain)s/grafana

sudo systemctl restart grafana-server
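With Grafana up and running, the final linking step is adding InfluxDB as a data source. This can be done through the web UI, or scripted against Grafana's HTTP API; a rough sketch of the latter is below (the localhost port, default admin credentials and database details are example values, not our real setup):

curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
        "name": "InfluxDB",
        "type": "influxdb",
        "access": "proxy",
        "url": "https://reporting-endpoint.dogsbodytechnology.com:8086",
        "database": "3340ad1c-31ac-11e8-bfaf-5ba54621292f",
        "user": "3340ad1c-31ac-11e8-bfaf-5ba54621292f",
        "password": "supersecurepassword"
      }'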
Summary
The steps above left us with a very powerful and customisable monitoring solution, which worked fantastically for us. Be sure to check back for future instalments in this series. In part 3 we cover setting up alerts with Kapacitor, creating awesome visualisations with Grafana, and getting all of our hundreds of customers’ servers reporting in and alerting.
Replacement Server Monitoring
- Part 1: Picking a Replacement
- Part 2: Building the replacement (you are here)
- Part 3: Kapacitor alerts and going live!
Feature image background by tomandellystravels licensed CC BY 2.0.
Trackbacks & Pingbacks
[…] “most interesting” metric to calculate. For both Telegraf, which we discuss setting up here, and Node Exporter I found looking at the kernel docs most useful for confirming that disk […]
[…] Part two is here. […]
[…] in this series of blog posts we’ve discussed picking a replacement monitoring solution and getting it up and running. This instalment will cover setting up the actual alerting rules for our customers’ servers, […]
Leave a Reply
Want to join the discussion?Feel free to contribute!