This is a follow-up to my previous post about Linkerd and microservices, and covers adding Linkerd-viz to the stack so we get aggregated metrics from our services. Thanks to @siggy for pointing it out and for reading my blog :P

So Linkerd-viz provides a way to scrape and display metrics from a set of Linkerd instances. This is very much needed if you are running sidecar deployments of Linkerd, as having to access N different admin dashboards to monitor your stack is not practical.

Note that I did find Linkerd-viz to be more of an example setup than a ready-to-use system, but it certainly gets you pointed in the right direction.

Linkerd-viz uses Prometheus for metric aggregation and Grafana for display. I had not used Prometheus before, but it works great and was fairly easy to configure; Grafana I have used quite a bit already.

As the official Linkerd-viz repo is focused on Kubernetes and DC/OS, I needed to create my own Prometheus config for Consul, and because I use a Linker-to-Linker sidecar setup I actually needed two dashboards in Grafana rather than one.

My Prometheus config ended up looking like this:

global:  
  scrape_interval:     5s
  evaluation_interval: 5s

scrape_configs:  
- job_name: 'linkerd'
  metrics_path: /admin/metrics/prometheus

  consul_sd_configs:
    - server:   '172.17.0.1:8500'

  relabel_configs:
  # only collect from Linkerd instances
  - source_labels: ['__meta_consul_service']
    regex:         (.*(9990)$)
    action: keep
  # strip off port
  - source_labels: ['__meta_consul_service']
    action: replace
    target_label: instance
    regex: (.*)-9990
    replacement: $1

  metric_relabel_configs:
  # remove the port suffix from Consul's auto-naming
  - source_labels: [__name__]
    action:        replace
    target_label:  __name__
    regex:         rt:outgoing:dst:id:_:io_l5d_consul:dc1:([^:]+)_4140:(.*)
    replacement:   rt:outgoing:dst:id:_:io_l5d_consul:dc1:$1:$2
  # write service names into label {service:"app1"}
  # rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  service
    regex:         rt:outgoing:dst:id:_:io_l5d_consul:dc1:([^:]+):.*
    replacement:   $1
  # write service names into label {service:"app1"}
  # rt:incoming_forex_currency_converter:dst:id:_:inet:127_1:80:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  service
    regex:         rt:incoming_(.+):dst:id:_:inet:127_1:80.*
    replacement:   $1
  # remove service name from metric (linkerd:outgoing:requests)
  # rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  __name__
    regex:         rt:outgoing:dst:id:_:io_l5d_consul:dc1:[^:]+:(.*)
    replacement:   linkerd:outgoing:$1
  # remove service name from metric (linkerd:incoming:requests)
  # rt:incoming_forex_currency_converter:dst:id:_:inet:127_1:80:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  __name__
    regex:         rt:incoming_(.+):dst:id:_:inet:127_1:80:(.*)
    replacement:   linkerd:incoming:$2

The scraping config is simple enough:

- job_name: 'linkerd'
  metrics_path: /admin/metrics/prometheus

  consul_sd_configs:
    - server:   '172.17.0.1:8500'

This tells Prometheus where to scrape metrics from and where to discover the services to scrape, in this instance Consul.
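
Each Linkerd instance serves its stats in Prometheus exposition format on that path, so at this point Prometheus is already collecting raw metrics. The names look something like this (the counter values here are made up; the names are the ones from my setup, as used in the config comments above):

# raw metrics as served by a Linkerd instance on /admin/metrics/prometheus
rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate_4140:requests 1027
rt:incoming_forex_currency_converter:dst:id:_:inet:127_1:80:requests 1027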

Next I configured relabelling, which formats our metrics into something we can easily use in Grafana.

  relabel_configs:
  # only collect from Linkerd instances
  - source_labels: ['__meta_consul_service']
    regex:         (.*(9990)$)
    action: keep
  # strip off port
  - source_labels: ['__meta_consul_service']
    action: replace
    target_label: instance
    regex: (.*)-9990
    replacement: $1

Here we make sure we only collect from Linkerd instances, and strip the port from the instance name to make it look nice.
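
As a quick example of the effect, assuming a Consul service registered as app1-9990 (a hypothetical name, matching the {service:"app1"} examples in the config comments):

# before relabelling, as discovered from Consul
__meta_consul_service = "app1-9990"
# after the keep + replace rules
instance = "app1"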

Next up was rewriting the metrics into something the dashboards could read easily. This is where I found my Linker-to-Linker setup did not work out of the box with the official Linkerd-viz repo. In that repo the dashboard only displays incoming metrics; I need both incoming and outgoing, and incoming has a slightly different meaning in my setup: my incoming router is the local HTTP router which routes incoming requests to localhost:80.

The first part is to strip the ports:

  metric_relabel_configs:
  # remove the port suffix from Consul's auto-naming
  - source_labels: [__name__]
    action:        replace
    target_label:  __name__
    regex:         rt:outgoing:dst:id:_:io_l5d_consul:dc1:([^:]+)_4140:(.*)
    replacement:   rt:outgoing:dst:id:_:io_l5d_consul:dc1:$1:$2

Here we take the name of the metric and rewrite it to remove the port number; again, this is to make the service names nice to read and consistent.
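
So this rule turns the raw name into the version without the port, e.g.:

# before
rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate_4140:requests
# after
rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate:requests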

After that we need to assign each metric a service label:

  # write service names into label {service:"app1"}
  # rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  service
    regex:         rt:outgoing:dst:id:_:io_l5d_consul:dc1:([^:]+):.*
    replacement:   $1
  # write service names into label {service:"app1"}
  # rt:incoming_forex_currency_converter:dst:id:_:inet:127_1:80:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  service
    regex:         rt:incoming_(.+):dst:id:_:inet:127_1:80.*
    replacement:   $1

The first rule takes the outgoing metric's name, extracts the service from it, and sets it as the service label.
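
For example, using the metric from the config comments, the outgoing rule produces:

# metric name (unchanged by this rule)
rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate:requests
# label added
service = "forex_exchange_rate"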

The second rule takes the incoming metric's name and extracts the service name from it. This time though we extract the name from the router label, and this is one of the things I had to change to get the metrics working. I was expecting the incoming metric to include the service name like the outgoing metrics do, but for some reason it doesn't. So rather than rt:incoming:dst:id:_:inet:127_1:80:forex_currency_converter:requests I was just getting rt:incoming:dst:id:_:inet:127_1:80:requests, which was a pain :) To work around this I renamed my Linkerd router label to include the service name, meaning I could extract it from the metric line easily. There is probably a better way to do this, or maybe it's a bug in the Linkerd metrics, but for now it's fine :)
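
For reference, the rename itself lives in the Linkerd config. This is only a minimal sketch of what my routers look like, assuming the standard linkerd 1.x YAML layout; the dtabs and port numbers here are illustrative placeholders rather than my exact routing rules:

routers:
# the outgoing router keeps its plain label
- protocol: http
  label: outgoing
  dtab: |
    /svc => /#/io.l5d.consul/dc1;
  servers:
  - port: 4140
    ip: 0.0.0.0
# the incoming router's label carries the service name,
# so it shows up in the rt:incoming_... metric names
- protocol: http
  label: incoming_forex_currency_converter
  dtab: |
    /svc => /$/inet/127.1/80;
  servers:
  - port: 4141
    ip: 0.0.0.0

The router label is what ends up after the rt: prefix in the metric names, which is why the rename makes the service name extractable in the rules above.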

Lastly, we just need to get the metrics into their final form for the dashboards.

  # remove service name from metric (linkerd:outgoing:requests)
  # rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  __name__
    regex:         rt:outgoing:dst:id:_:io_l5d_consul:dc1:[^:]+:(.*)
    replacement:   linkerd:outgoing:$1
  # remove service name from metric (linkerd:incoming:requests)
  # rt:incoming_forex_currency_converter:dst:id:_:inet:127_1:80:requests
  - source_labels: [__name__]
    action:        replace
    target_label:  __name__
    regex:         rt:incoming_(.+):dst:id:_:inet:127_1:80:(.*)
    replacement:   linkerd:incoming:$2

This part is pretty easy: take the full metric names and rewrite them into the correct bucket, keeping only the metric name.
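
Putting the whole chain together, a raw scraped metric ends up as a clean name with a service label:

# raw, as scraped from Linkerd
rt:outgoing:dst:id:_:io_l5d_consul:dc1:forex_exchange_rate_4140:requests
# after all the metric_relabel_configs
linkerd:outgoing:requests{service="forex_exchange_rate"}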

Once Prometheus was configured, the rest was just a simple task of taking the Grafana dashboard config from the official repo and making one for incoming and another for outgoing. With that done and configured in Rancher, everything just worked!
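
I won't paste the dashboard JSON here, but with the rewritten names the panels reduce to very simple queries. A minimal sketch of the kind of PromQL each dashboard runs (my guess at sensible panels, not taken from the official repo):

# outgoing dashboard: request rate per service
sum(rate(linkerd:outgoing:requests[1m])) by (service)
# incoming dashboard: same, for the incoming router
sum(rate(linkerd:incoming:requests[1m])) by (service)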

So now I have fancy stats for my simple microservices system! Next I need to look at getting better logs from my containers to make debugging easier, but this gives me great insight into general system health.