Naming metrics, or why my statsd client doesn't support tags

It may not be a popular claim, but I still think logs and metrics serve an important role in modern observability, alongside tools like tracing. There are a number of tools for collecting metrics: Etsy's StatsD and other implementations, Datadog, Prometheus, the ELK stack, and others.

One critical distinction between these tools is whether or not they support tags for metrics. A tag is either a string or a key-value pair, and it fundamentally changes how to approach metric naming.

For example, let's consider a metric called stream.users.connected for a chat application. It is a gauge: it reflects the absolute state of a value at a point in time, not a count or rate.

Because we care about uptime, we want to ensure that we have redundancy and the ability to scale horizontally, so we can have more than one server (or pod) running our streaming connection service at a time.

Since this is a gauge, if two or more running copies of the service send values, we won't get all of them. We'll only get the last one, and if we're lucky we may see some rapid fluctuation. We need a way to disaggregate this metric and then sum up those individual values in order to know the total number of connected users.
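To make the problem concrete, here's a toy model of a gauge store in Python. It is only a sketch: the gauges dict and set_gauge function are made up for illustration and don't correspond to any real server's internals, and the user counts are invented.

```python
# Toy gauge store: a gauge keeps only the most recent value per metric name.
gauges = {}

def set_gauge(name, value):
    # Each write replaces whatever was there before.
    gauges[name] = value

set_gauge("stream.users.connected", 120)  # reported by server A
set_gauge("stream.users.connected", 85)   # reported by server B, overwrites A

# We wanted 205 connected users, but only the last write survives.
visible = gauges["stream.users.connected"]
```

Here visible ends up as server B's value alone, and server A's 120 users are nowhere to be found.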

Without tags, we might do something like add a per-host prefix or suffix to the metric name itself, e.g. stream.users.connected.192_168_3_18. This means we disaggregate by metric name and then do our aggregation on the fly later, e.g. we might make a graph of sum(stream.users.connected.*).
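Here's a rough sketch of that approach in Python. The host_metric and sum_matching helpers are hypothetical, the in-memory dict stands in for a real metrics backend, and I'm using shell-style matching from the standard library as an approximation of whatever wildcard syntax your graphing tool supports.

```python
from fnmatch import fnmatchcase

def host_metric(name, ip):
    # Dots separate name segments, so the IP's dots become underscores.
    return f"{name}.{ip.replace('.', '_')}"

# Stand-in for the metrics backend: one gauge per host-specific name.
gauges = {
    host_metric("stream.users.connected", "192.168.3.18"): 120,
    host_metric("stream.users.connected", "192.168.3.19"): 85,
}

def sum_matching(store, pattern):
    # Aggregate at query time across every name the wildcard matches.
    return sum(v for k, v in store.items() if fnmatchcase(k, pattern))

total = sum_matching(gauges, "stream.users.connected.*")
```

Each host now has its own gauge, so nothing is overwritten, and the sum is computed when we query.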

With tags, we have another tool available: we may not need to change the metric name at all. We can aggregate by metric name, and disaggregate by tags later, e.g. stream.users.connected#host:192_168_3_18 (where host is a tag key and the IP address its value).
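The same idea with tags might look like the sketch below. Again, this is a toy model rather than any particular backend: set_gauge, total, and by_tag are hypothetical names, and the values are invented.

```python
# Toy tag-aware store: each series is identified by (name, sorted tags).
gauges = {}

def set_gauge(name, value, tags):
    key = (name, tuple(sorted(tags.items())))
    gauges[key] = value

set_gauge("stream.users.connected", 120, {"host": "192_168_3_18"})
set_gauge("stream.users.connected", 85, {"host": "192_168_3_19"})

def total(name):
    # Aggregate by metric name alone, ignoring tags.
    return sum(v for (n, _), v in gauges.items() if n == name)

def by_tag(name, tag_key):
    # Disaggregate later by whichever tag we care about.
    out = {}
    for (n, tags), v in gauges.items():
        if n == name:
            tag_value = dict(tags).get(tag_key)
            out[tag_value] = out.get(tag_value, 0) + v
    return out

all_users = total("stream.users.connected")
per_host = by_tag("stream.users.connected", "host")
```

The metric name never changes. The host only shows up when we ask for it.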

Another great example is counting or timing HTTP requests. Without tags, the metric name needs to contain all the things you want to be able to count by, e.g.

%HOST.http.request.%METHOD.%PATH.%STATUS_CODE

Then if you want to look at all requests to a particular path, you fill in the other variables with wildcards, e.g. *.http.request.*.some_path.*. Disaggregate the metric name in the application, and aggregate in the query.
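As a sketch, here's how building and querying those names could look in Python. The request_metric, incr, and matches helpers are hypothetical, and the host names are invented. I match one dot-separated segment at a time, which is my assumption about how the wildcards should behave. Check how your own backend treats them.

```python
from fnmatch import fnmatchcase

def clean(part):
    # Dots and slashes would break the segment structure, so replace them.
    return str(part).strip("/").replace(".", "_").replace("/", "_")

def request_metric(host, method, path, status):
    parts = [clean(host), "http", "request", clean(method), clean(path), clean(status)]
    return ".".join(parts)

counters = {}

def incr(name):
    counters[name] = counters.get(name, 0) + 1

def matches(name, pattern):
    # Compare segment by segment so "*" never spans a dot.
    n, p = name.split("."), pattern.split(".")
    return len(n) == len(p) and all(fnmatchcase(a, b) for a, b in zip(n, p))

incr(request_metric("web1", "GET", "/some_path", 200))
incr(request_metric("web2", "GET", "/some_path", 500))
incr(request_metric("web1", "POST", "/other", 200))

some_path_total = sum(
    v for k, v in counters.items() if matches(k, "*.http.request.*.some_path.*")
)
```

Note that every dimension you might ever want to query has to be baked into the name up front, in a fixed order.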

With tags, the metric name can be much simpler, e.g.

http.request#host:%HOST,method:%METHOD,path:%PATH,status:%STATUS

Using a single metric with several tags means that counting sum(http.request) will give us the total number of requests, and we can dig further by specifying more tags. We aggregate by the metric name, and later disaggregate in the query.
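A minimal sketch of the tagged version, for comparison. The incr and total functions are hypothetical, and the Counter stands in for a tag-aware backend.

```python
from collections import Counter

# One counter per unique (metric name, tag set) combination.
counts = Counter()

def incr(name, **tags):
    counts[(name, frozenset(tags.items()))] += 1

def total(name, **where):
    # With no filters this sums everything under the name;
    # each extra tag narrows the result.
    wanted = set(where.items())
    return sum(
        n for (metric, tags), n in counts.items()
        if metric == name and wanted <= tags
    )

incr("http.request", host="web1", method="GET", path="some_path", status="200")
incr("http.request", host="web2", method="GET", path="some_path", status="500")
incr("http.request", host="web1", method="POST", path="other", status="200")

all_requests = total("http.request")
some_path = total("http.request", path="some_path")
some_path_errors = total("http.request", path="some_path", status="500")
```

The order of the tags doesn't matter, and the query only mentions the dimensions it cares about.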

That's why it's critically important to know whether both the client and the server support tags. The StatsD server and other implementations of the original protocol don't support tags, and so the client I maintain does not, either. If you named a metric expecting to use tags to disaggregate it later, you would lose data, if the metric was recorded at all.
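Here's a sketch of both failure modes. It assumes the plain wire format name:value|type and a tag suffix of the form |#key:value, which is the shape of Datadog's extension as I understand it. Both parsers are toys I wrote for illustration (they ignore sample rates, for one), not the behavior of any specific server.

```python
def parse_plain(line):
    # Toy parser for "name:value|type"; extra fields are rejected outright.
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    if not name or len(fields) != 2:
        return None
    value, kind = fields
    return name, float(value), kind

def parse_dropping_tags(line):
    # Toy lenient parser: silently discards a trailing "|#..." tag section.
    return parse_plain(line.split("|#", 1)[0])

tagged_a = "stream.users.connected:120|g|#host:192_168_3_18"
tagged_b = "stream.users.connected:85|g|#host:192_168_3_19"

# Failure mode 1: the strict parser drops the metric entirely.
rejected = parse_plain(tagged_a)

# Failure mode 2: tags vanish, so both hosts collapse into one gauge.
gauges = {}
for line in (tagged_a, tagged_b):
    name, value, _ = parse_dropping_tags(line)
    gauges[name] = value
```

Either way, the per-host data you were counting on is gone.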

(The true purpose of this post is that I wanted this idea somewhere easier to find than buried in GitHub issues or the statsd client docs.)