Alerting on metrics

You can set up alerts based on the metrics collected for SpatialOS cloud deployments. For example, you might want to receive alerts if the number of nodes is lower than it should be, or if snapshots keep failing.

You can’t set up alerts within the Console, so you’ll need to set up the metrics on your own analytics platform (such as Grafana), or access them through code.
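If you go the code route, one option is to run a query over HTTP and inspect the result yourself. The following is a minimal sketch, assuming your metrics are reachable through a Prometheus-compatible query API (the `/api/v1/query` endpoint). The base URL `https://metrics.example.com` and both helper names are hypothetical and not part of SpatialOS; authentication is left out.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL: replace with wherever your metrics are served.
BASE_URL = "https://metrics.example.com"


def build_query_url(promql: str, base_url: str = BASE_URL) -> str:
    """Build an instant-query URL in the Prometheus HTTP API format."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> list[float]:
    """Extract the sample values from an instant-vector JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries "value": [unix_timestamp, "value_as_string"].
    return [float(item["value"][1]) for item in payload["data"]["result"]]
```

You could fetch the URL with `urllib.request.urlopen` and pass the decoded response body to `parse_instant_vector`, then apply whatever alert condition you need to the returned values.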

Example: setting up alerts using Grafana

In this example, you want to receive alerts if there are no workers connected for a period of 5 minutes or more.

  1. In Grafana, create a query to return the number of workers connected (of a particular type).

    You can copy the query text from the "Workers connected" example below.

    For more details, see Grafana’s documentation on querying using Prometheus, and Prometheus’s documentation on querying.

    For more examples of how to query the SpatialOS deployment metrics, see the metrics reference page.

  2. Create an alert.

    Set the conditions so that you’ll be alerted if the maximum value returned from the query, within the last 5 minutes, is below 1. In other words, you’ll be alerted if no workers were connected at any point in the last 5 minutes.

    For more details, see Grafana's documentation on configuring alerts.

  3. Set up notifications for the alert. For details, see Grafana's documentation on alert notifications.
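To make the alert condition concrete, here is a minimal sketch of the same rule ("the maximum value in the last 5 minutes is below 1") applied to a list of timestamped samples. It only illustrates the logic and is not how Grafana implements it; the function names and the sample format are assumptions.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def max_in_window(samples: Iterable[Sample], now: float, window_s: float) -> Optional[float]:
    """Return the largest value with a timestamp in [now - window_s, now], or None."""
    values = [v for t, v in samples if now - window_s <= t <= now]
    return max(values) if values else None


def no_workers_alert(samples: Iterable[Sample], now: float, window_s: float = 300.0) -> bool:
    """True when the maximum over the window is below 1."""
    peak = max_in_window(samples, now, window_s)
    # Assumption: an empty window counts as "no workers connected".
    # Grafana lets you choose how "no data" is handled, so adjust to match.
    return peak is None or peak < 1
```

Using the maximum rather than the latest value means a single sample of 1 or more anywhere in the window is enough to suppress the alert, so brief disconnections do not trigger it.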

Examples of sensible alerts

Many of your alerts will need to be game-specific, but there are a few that we think are useful for everyone as a starting point.

These examples use Grafana (see above).

Snapshot failure

Alert me if, in the last hour, more than 30% of snapshots failed.

  • Query:
spatialos_snapshot_count::sum{project="project_name", dpl="deployment_name", outcome="failure"} / ignoring(cluster, dpl, outcome, project) sum(spatialos_snapshot_count::sum{project="project_name", dpl="deployment_name"}) > 0

The > 0 works around an issue to do with NaN values.

  • Alert:
WHEN max() OF query (A, 1h, now) IS ABOVE 0.3

If the first snapshot of your deployment fails, you’ll get an alert (since at that point, it’s 100% of the snapshots). If you don’t want an alert in these circumstances, adjust the alert conditions to specify a minimum number of snapshots that you want SpatialOS to attempt before you get alerted. Which number you choose will depend on how often you take snapshots (which you can customise using snapshot_write_period_seconds).
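The minimum-attempts idea can be sketched as a small decision function. This shows the logic only, not a Grafana configuration; the function name and the default `min_attempts=5` are hypothetical, so pick a value that suits your snapshot period.

```python
def snapshot_failure_alert(failed: int, attempted: int,
                           max_failure_ratio: float = 0.3,
                           min_attempts: int = 5) -> bool:
    """True when enough snapshots were attempted and too many of them failed."""
    if attempted < max(min_attempts, 1):
        # Too few attempts to judge; this also avoids dividing by zero.
        return False
    return failed / attempted > max_failure_ratio
```

With these defaults, a single failed first snapshot (`failed=1, attempted=1`) does not raise an alert, while 4 failures out of 10 attempts does.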

Workers connected

Alert me if, in the last 5 minutes, there were no workers connected (of a particular type).

  • Query:
spatialos_worker_connected::sum{project="test_project", dpl="test_deployment", dpl_tag="tag", worker_type="worker_type"} > 0

The > 0 works around an issue to do with NaN values.

  • Alert:
WHEN max() OF query (A, 5m, now) IS BELOW 1
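As a rough illustration of why a trailing `> 0` removes NaN values, the sketch below mimics a comparison filter in Python. It assumes the `> 0` in the query keeps only the samples for which the comparison is true, which is how Prometheus describes comparisons between a vector and a scalar; `keep_positive` is a hypothetical helper.

```python
import math


def keep_positive(values: list[float]) -> list[float]:
    """Mimic a `> 0` comparison filter: keep only values for which v > 0 holds."""
    # Any comparison with NaN is false, so NaN samples are dropped here.
    # Zero-valued samples are dropped too, since 0 > 0 is also false.
    return [v for v in values if v > 0]


samples = [math.nan, 0.0, 2.0]
print(keep_positive(samples))  # prints [2.0]
```

Under this assumption, a series whose values are all zero or NaN comes back empty, so it is worth checking how your alert treats a query that returns no data.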

Other metrics to alert on

You might also want to set up alerts for the following metrics:

How you set up the queries and alerts for these metrics depends on your game. Think about what values you expect to see, and what values you’d consider problematic.

Updated about a year ago
