Recently I’ve been working on a toolset (see swarm-traefik-state-analyzer on GitHub) intended to aid in the health analysis of Docker Swarm services that are proxied by Traefik, in the Docker Swarm footprint architecture I described in a previous post (click here to read).
In short, you have 1-N Docker Swarm clusters, each segmented into one or two networks (internal/external). All inbound HTTP(S) traffic for either segment passes through higher-level proxies (layer 4) or direct LB-bound FQDNs (layer 3), on to the corresponding hardware/software load balancer (layer 2), to one of several Traefik instances (layer 1), and then on to individual Swarm service containers (layer 0).
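To make the layer model concrete, here is a minimal sketch of enumerating every ingress avenue to a single service, one candidate check URL per layer. This is illustrative only: the layer numbering follows the post, but the per-layer entry-point hosts and the `/health` path are hypothetical placeholders, not the tool's actual data model.

```python
# Layer numbers as described in the post; descriptions for reference only.
LAYERS = {
    4: "higher-level proxy",
    3: "direct LB-bound FQDN",
    2: "hard/software load balancer",
    1: "Traefik instance",
    0: "Swarm published port",
}

def ingress_checks(service_path, hosts_by_layer):
    """Yield (layer, url) for every available ingress avenue to one service,
    outermost layer first. hosts_by_layer maps layer number -> entry host."""
    for layer in sorted(hosts_by_layer, reverse=True):
        yield layer, f"https://{hosts_by_layer[layer]}{service_path}"
```

Checking the same path through each layer independently is what lets you isolate the failing hop rather than just observing that "the site is down."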
Triaging “where the hell does the problem reside” in such a setup can be a daunting task: there are many possible points of misconfiguration, hardware failures, and software failures that can be the culprit of the root problem.
- Are the service containers themselves ok?
- Are all my swarm nodes up?
- Is my service accessible and responding on the swarm pub port?
- Is Traefik functioning?
- Are all my Traefik labels working?
- Is Traefik on the right network to talk to the service?
- Is the DNS for these labels correct?
- Is the load-balancer pointing to the right Traefik backend?
- Is something busted in front of my load-balancer?
- Is the name/fqdn even pointing to the correct balancer or whatever is in front of that?
- Are there issues with TLS/SSL?
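Some of these questions can be answered directly from the Docker Engine API. As a sketch (not code from the tool), the "are all my swarm nodes up?" question amounts to scanning the `GET /nodes` response for anything not ready and active; the field names below (`Status.State`, `Spec.Availability`, `Description.Hostname`) are the Engine API's, but the helper itself is hypothetical.

```python
def unhealthy_nodes(nodes):
    """Return hostnames of swarm nodes that are not both ready and active,
    given node entries shaped like the Docker Engine API's GET /nodes response."""
    return [
        n["Description"]["Hostname"]
        for n in nodes
        if n["Status"]["State"] != "ready" or n["Spec"]["Availability"] != "active"
    ]
```

In practice you would feed this from the Docker SDK or a raw API call against a manager node; a node that is `down` or drained is an immediate lead before digging into Traefik or the load balancers.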
Ugh… those are the kinds of questions this tool is intended to help answer by narrowing down where to look next. The swarm-traefik-state-analyzer scripts collect relevant information from a Docker Swarm, generate all possible avenues of ingress across all layers for service checks against services on a swarm, and execute those checks, giving detailed results.
By validating access directly through each possible layer of the Swarm/Traefik footprint described above, you can determine which layers are having issues and properly stop the bleeding.
Using this tool assumes you are running under the service architecture described in my previous post, and it is based around the [swarm-name].yml and [service-state].yml files described here.
The design of swarm-traefik-state-analyzer clearly separates the roles of the scripts, each doing one particular thing, and decouples them so that in the future the “state” of the services could be fetched from an entirely different orchestrator, such as Kubernetes.
- swarmstatedb.py – consumes the current “state” of requested “services” from the swarm manager’s API. Data is stored in a fairly generic JSON format that is not tightly coupled to Swarm specifics.
- servicechecksdb.py – consumes a swarmstatedb.json file and decorates it with service check data for each layer/endpoint. This file could be consumed by any program to actually execute or instrument another system to actually do the health checks.
- servicechecker.py – a simple script which actually executes all healthchecks in parallel as defined in a servicechecksdb.json file.
- servicecheckerreport.py – script that gives a CLI summary of the servicecheckerdb.json raw results file
- servicecheckerdb2prometheus.py – consumes servicecheckerdb.json raw results files (polling) and presents metrics for Prometheus consumption
- testsslcmdsgenerator.py – Consumes a servicechecksdb.json file to generate testssl.sh commands (see https://github.com/bitsofinfo/testssl.sh-processor)
- analyze-swarm-traefik-state.py – orchestrates calls to all the above in a simplified single CLI
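Conceptually, the execution step (what servicechecker.py does) is an embarrassingly parallel fan-out over all the generated check URLs. Here is a minimal sketch of that idea using Python's `concurrent.futures`; the function names and result shape are mine, not the tool's, and the actual check is injected so a real HTTP call (or a stub) can be swapped in.

```python
import concurrent.futures

def run_checks(urls, do_check, max_workers=8):
    """Run every check in parallel; return {url: result dict}.
    `do_check(url)` performs a single check (e.g. an HTTP GET) and
    returns a result dict; any exception is captured as a failure."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(do_check, u): u for u in urls}
        for fut in concurrent.futures.as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as err:
                results[url] = {"ok": False, "error": str(err)}
    return results
```

Because failures are captured per URL rather than aborting the run, one dead layer (say, a broken load balancer VIP) doesn't prevent the direct-to-Traefik or direct-to-published-port checks from completing, which is exactly the contrast that makes the report useful.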
Check it out at: https://github.com/bitsofinfo/swarm-traefik-state-analyzer