Fully Automated Let’s Encrypt TLS certs with ACME-DNS on Kubernetes

This article covers fully automating DNS and the issuance of TLS certificates on Kubernetes for Ingress-based workloads (both public and private) using cert-manager, external-dns, acme-dns and kubernetes-acme-dns-registrar.

Scenario

You are a busy DevOps professional. You want to set up a Kubernetes platform that can accept any typical HTTP-based workload (Ingress-based) with minimal management overhead. You don’t want to be bothered with day-to-day requests for new DNS hostnames and TLS certificates… nor with all the long-term management overhead that can come with them. Workloads from your team, particularly in lower dev/qa environments, come and go all day long; new apps are released, old ones go away. Sometimes these workloads are publicly accessible (public DNS zones), other times internal only (private DNS zones). At the end of the day you simply want to create an automated ecosystem to support this and “set it and forget it”, so to speak.

Reality

There are numerous solutions and tool sets out there to help you solve this type of problem. They generally range from some kind of manual process using various tools to a full-blown service mesh, or something in between. This article covers a solution that sits somewhere in between and that, across multiple variations over the years, has served me quite well in addressing the scenario above.

Tooling

As stated above, there is a plethora of existing open source tools out there to aid in building a solution for this on top of Kubernetes. Here are some of the key ones that make all the automation actually happen.

external-dns: this project is fantastic. Basically, one or more external-dns instances run on your cluster, detect the presence of new objects (such as Ingress objects) and automatically create DNS entries in a target DNS server/service that point to the IP of the Service backing that Ingress’ controller. As you would expect, each external-dns instance is fully configurable to filter which types of objects it will listen for and which target DNS backend service it will manage entries in.
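To make the external-dns behaviour a bit more concrete, here is a minimal Python sketch of the mapping it performs conceptually: from an Ingress object's hosts and load balancer IPs to the A records that should exist. This is not external-dns code; the function name `desired_a_records` and the example hostname and IP are made up for illustration, and real external-dns handles many more sources, record types and filters.

```python
def desired_a_records(ingress: dict) -> list[tuple[str, str]]:
    """Return (hostname, ip) pairs an external-dns-like controller would want to exist.

    Hypothetical helper for illustration only; it reads the standard Ingress
    fields spec.rules[].host and status.loadBalancer.ingress[].ip.
    """
    rules = ingress.get("spec", {}).get("rules", [])
    hosts = [rule["host"] for rule in rules if rule.get("host")]
    lb_entries = ingress.get("status", {}).get("loadBalancer", {}).get("ingress", [])
    ips = [entry["ip"] for entry in lb_entries if entry.get("ip")]
    # One A record per (host, ip) combination.
    return [(host, ip) for host in hosts for ip in ips]


# Example Ingress fragment; hostname and IP are placeholders.
example_ingress = {
    "spec": {"rules": [{"host": "app1.dev.example.com"}]},
    "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}},
}

print(desired_a_records(example_ingress))
```

An Ingress whose load balancer has not been assigned an IP yet simply yields no records, which is roughly why DNS entries only show up once the ingress controller has published its address.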

Workloads and Ingress objects: these are the actual application Deployments, Pods, Services and Kubernetes Ingress objects that define your running application, which host(s) it can be reached by, and which cert-manager Issuer should be used when executing TLS certificate operations for it.
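As a rough illustration of what such an Ingress looks like, the sketch below builds one as a plain Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The names `app1`, `app1-svc`, `letsencrypt-staging` and `nginx` are placeholders I made up; the `cert-manager.io/cluster-issuer` annotation assumes a ClusterIssuer is in use (a namespaced Issuer would use `cert-manager.io/issuer` instead).

```python
import json


def build_ingress(name, host, service, port, issuer, ingress_class):
    """Build a networking.k8s.io/v1 Ingress manifest as a plain dict (illustrative helper)."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            # Tells cert-manager which (Cluster)Issuer to use for this Ingress.
            "annotations": {"cert-manager.io/cluster-issuer": issuer},
        },
        "spec": {
            # Selects which ingress controller serves this Ingress.
            "ingressClassName": ingress_class,
            # cert-manager stores the issued certificate in this secret.
            "tls": [{"hosts": [host], "secretName": f"{name}-tls"}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


# All values below are placeholders.
ingress = build_ingress(
    "app1", "app1.dev.example.com", "app1-svc", 8080, "letsencrypt-staging", "nginx"
)
print(json.dumps(ingress, indent=2))
```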

Ingress Controllers: you are going to need one or more Ingress controllers to actually service the traffic for each Ingress object that its selectors apply to. The IP of the ingress controller’s backing k8s Service is the target of the A record that external-dns creates for each distinct Ingress “host”. You can have as many ingress controllers as you see fit, each of which, as with external-dns, aligns to different Ingress objects based on label/annotation (or other) selector filters.

cert-manager: Next up, we are going to want cert-manager running on our cluster, which can also react to Ingress objects and automatically request Let’s Encrypt CA-signed TLS certificates for the hosts defined by each Ingress. Once issued, these TLS certificates (and their private keys) are stored as Kubernetes secrets on the cluster and are then used by the Ingress Controller for TLS handling/termination for each Ingress. cert-manager also handles renewing them automatically before they expire. One key config item here is configuring an ACME Issuer for cert-manager that targets Let’s Encrypt (staging or prod) AND defines a Solver for handling the ACME challenges using HTTP01 or DNS01; for the architecture in this article we use DNS01.
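For reference, here is a hedged sketch of what such an Issuer can look like, again built as a Python dict and printed as JSON. It follows the shape of cert-manager's ACME Issuer with the `acmeDNS` DNS01 solver as I understand it from the cert-manager docs; the issuer name, email, acme-dns URL and the secret name/key are all placeholders, and it targets the Let’s Encrypt staging endpoint.

```python
import json

# Let's Encrypt staging ACME directory; use the production directory once things work.
LETSENCRYPT_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"


def build_cluster_issuer(name, email, acme_dns_host, secret_name, secret_key):
    """Build a cert-manager ClusterIssuer using the acme-dns DNS01 solver (illustrative)."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {"acme": {
            "server": LETSENCRYPT_STAGING,
            "email": email,
            # Secret where cert-manager keeps the ACME account private key.
            "privateKeySecretRef": {"name": f"{name}-account-key"},
            "solvers": [{"dns01": {"acmeDNS": {
                # Base URL of your acme-dns server's API.
                "host": acme_dns_host,
                # Secret key holding the host -> registration JSON mapping.
                "accountSecretRef": {"name": secret_name, "key": secret_key},
            }}}],
        }},
    }


# Placeholder values, not taken from any real setup.
issuer = build_cluster_issuer(
    "letsencrypt-staging", "devops@example.com",
    "https://auth.example.com", "acme-dns", "acme-dns.json",
)
print(json.dumps(issuer, indent=2))
```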

acme-dns: When cert-manager initiates the process to automatically create and acquire a new TLS certificate from the Let’s Encrypt CA, under the covers it is using the ACME protocol. Part of this protocol involves the CA validating a challenge to prove that the requester actually has authority over the domain name the certificate is being requested for. The ACME CA (Let’s Encrypt) can verify this challenge in various ways, the two most common being HTTP01, which checks for the challenge via HTTP at the requested host’s URL, and DNS01, which checks for the challenge in a DNS TXT record for the domain (we will be using DNS01). There are various reasons why you might want to use DNS01 over HTTP01 (such as wildcard names, or an internal non-public HTTP service); it’s your choice based on your scenario. That said, if you use DNS01, the acme-dns project provides a purpose-built DNS server that serves these challenge TXT records, so that you don’t have to provision them in your primary DNS server yourself (or for when you can’t, for example because your DNS zone is private). A good article on the reasons for this is here. In this use-case cert-manager is configured to use acme-dns with the DNS01 solver, and it places the TXT record challenge values there for each host according to the Solver’s configuration, using the acme-dns.json mapping of host/FQDN to acme-dns registration (see documentation here).
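The DNS side of this delegation can be sketched in a few lines of Python. For DNS01 the CA looks up a TXT record at `_acme-challenge.<host>` (for a wildcard name, at the base domain), and with acme-dns that name is a CNAME pointing at the `fulldomain` acme-dns handed out at registration. The helper names and the example `fulldomain` value below are made up for illustration.

```python
def challenge_name(host: str) -> str:
    """Return the DNS name the ACME CA queries for the DNS01 TXT record."""
    # A wildcard certificate is validated against the base domain.
    base = host[2:] if host.startswith("*.") else host
    return f"_acme-challenge.{base.rstrip('.')}"


def delegation_cname(host: str, fulldomain: str) -> tuple[str, str, str]:
    """Return the (name, type, target) record to create in the primary DNS zone."""
    return (challenge_name(host), "CNAME", fulldomain)


# Hypothetical fulldomain, shaped like what acme-dns returns from a registration.
record = delegation_cname(
    "app1.dev.example.com",
    "8e5700ea-a4bf-41c7-8a77-e990661dcc6a.auth.example.com",
)
print(record)
print(challenge_name("*.dev.example.com"))
```

The CNAME only needs to be created once per host; after that, every renewal just updates the TXT record on the acme-dns side.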

So how does this work?

So given the above components, how does this all work? This article is not intended to provide you with a step-by-step guide to setting this all up, but I will provide you with the high level AND, most importantly, where the gaps in the automation are that need to be filled in. Once all the above is configured, let’s walk through how it all works together in a typical use-case:

  1. DevOps staff are informed by the Dev team of a new k8s application and Ingress host FQDN that will need to be accessible on the k8s cluster. This triggers the DevOps team to manually do the items listed in #2 in preparation for the app deployment.
  2. First, DevOps manually registers a new domain/host with the acme-dns server’s API and gets back an acme-dns “registration”. Second, they manually update the “acme-dns.json” secret bound in the cert-manager Issuer with a new entry for the domain/hostname that contains this registration information. Third, they manually update the primary DNS server with the CNAME record that tells the verifier how to locate the challenge TXT record; the CNAME target is obtained from the fulldomain field in the registration. Read here for the manual steps.
  3. Now that things are ready, the application is deployed and a k8s Ingress is created/updated on Kubernetes for the app’s hostname.
  4. external-dns detects the Ingress and creates an A record in the DNS zone of the Ingress’ host FQDN that points to the IP address of the Ingress controller Service’s load balancer.
  5. cert-manager detects the Ingress and initiates the configured Issuer’s ACME TLS certificate request against the Let’s Encrypt CA. Assuming the Issuer’s Solver is configured w/ DNS01 and acme-dns (see document here), cert-manager will look up the hostname/domain’s registration info in the acme-dns.json secret referenced by the solver’s configuration (prepared manually by DevOps in step #2 above) and use that registration info to authenticate against the acme-dns server and set the challenge TXT record for that domain.
  6. Let’s Encrypt eventually resolves the challenge TXT record via DNS, validates that the domain/host is controlled by the requestor, and issues the certificate to cert-manager.
  7. cert-manager stores the issued TLS certificate as a k8s secret and the Ingress Controller uses it for TLS termination for the app’s Ingress definition.
  8. The app can be accessed via its Ingress host name w/ valid TLS security! boom!
  9. All is good! … except for the manual steps at #2….
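To show what the manual work in step #2 amounts to, here is a small Python sketch. It builds (but does not send) the registration request for acme-dns, whose API exposes a `POST /register` endpoint, and then derives the two things DevOps has to update by hand: the new entry in the acme-dns.json secret and the CNAME for the primary DNS zone. The server URL, hostname and the sample registration values are all invented; a real response would come from your own acme-dns server.

```python
import json
import urllib.request

# Fields of an acme-dns registration that the acme-dns.json entry carries.
REGISTRATION_FIELDS = ("username", "password", "fulldomain", "subdomain", "allowfrom")


def build_register_request(acme_dns_url, allowfrom=None):
    """Build the POST /register request for acme-dns (not sent here)."""
    body = json.dumps({"allowfrom": allowfrom}).encode() if allowfrom else None
    return urllib.request.Request(
        f"{acme_dns_url.rstrip('/')}/register",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def add_registration(acmedns_json, host, registration):
    """Return a copy of the acme-dns.json mapping with an entry for host."""
    entry = {field: registration[field] for field in REGISTRATION_FIELDS}
    return {**acmedns_json, host: entry}


def delegation_cname(host, registration):
    """Return (name, target) of the CNAME to add to the primary DNS zone."""
    return (f"_acme-challenge.{host}", registration["fulldomain"])


request = build_register_request("https://auth.example.com/")
# Made-up registration, shaped like an acme-dns /register response.
sample = {
    "username": "user-placeholder",
    "password": "password-placeholder",
    "fulldomain": "8e5700ea-a4bf-41c7-8a77-e990661dcc6a.auth.example.com",
    "subdomain": "8e5700ea-a4bf-41c7-8a77-e990661dcc6a",
    "allowfrom": [],
}
secret_data = add_registration({}, "app1.dev.example.com", sample)
print(request.get_method(), request.full_url)
print(json.dumps(secret_data, indent=2))
print(delegation_cname("app1.dev.example.com", sample))
```

Actually sending the request would be a single `urllib.request.urlopen(request)` call followed by parsing the JSON body; I left that out so the sketch runs without a live acme-dns server.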

What can be improved

Well, ideally we would like this to be fully automated and eliminate the manual steps #1 and #2 in the process described above. Doing so would permit new applications w/ Ingress definitions to simply be deployed to the k8s cluster and automatically get everything described previously with no human intervention. How can we achieve this?

During my time working with a prior team I lived through this scenario. It was annoying and made things slower. When presented an opportunity to create a new Kubernetes platform for a new group, I decided to solve this problem and ended up developing kubernetes-acme-dns-registrar to provide the glue that fills this gap in the automation. So what exactly does this project do?

There are 3 key things that kubernetes-acme-dns-registrar automates for you, which together cover what was described in #1 and #2 above:

  1. Automatic detection of Ingress objects and ensuring an acme-dns registration exists in the acme-dns server
  2. Automatic creation of the required DNS CNAME in a target DNS server/service for the host/domain per the acme-dns registration
  3. Automatic updating of the cert-manager acme-dns.json secret with the new registration info
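Conceptually, those three things boil down to a small reconcile loop. The Python sketch below is only my simplified illustration of that idea, not the actual implementation of kubernetes-acme-dns-registrar; the function names, the fake registration callback and the hostnames are all hypothetical.

```python
from typing import Callable


def reconcile(
    ingress_hosts: list[str],
    acmedns_json: dict[str, dict],
    register: Callable[[str], dict],
) -> tuple[dict[str, dict], list[tuple[str, str]]]:
    """Ensure every Ingress host has a registration.

    Returns the updated acme-dns.json mapping and the CNAME records
    (name, target) that still need to be created in the primary DNS.
    """
    updated = dict(acmedns_json)
    cnames = []
    for host in ingress_hosts:
        if host in updated:
            continue  # already registered, nothing to do
        registration = register(host)  # 1. register with acme-dns
        cnames.append((f"_acme-challenge.{host}", registration["fulldomain"]))  # 2. CNAME
        updated[host] = registration  # 3. new acme-dns.json entry
    return updated, cnames


def fake_register(host: str) -> dict:
    """Stand-in for a real acme-dns registration call (illustration only)."""
    sub = host.replace(".", "-")
    return {
        "username": "u", "password": "p", "allowfrom": [],
        "subdomain": sub, "fulldomain": f"{sub}.auth.example.com",
    }


existing = {"old.dev.example.com": fake_register("old.dev.example.com")}
updated, cnames = reconcile(
    ["old.dev.example.com", "new.dev.example.com"], existing, fake_register
)
print(sorted(updated))
print(cnames)
```

Hosts that already have a registration are skipped, so running the loop repeatedly over the same Ingress objects does not create duplicate registrations or CNAMEs.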

Below is a diagram that helps to explain the overall flow of automation:

With kubernetes-acme-dns-registrar running, it provides that extra little bit of automation glue that can make end to end TLS automation for Ingress based workloads a bit more seamless. Now when an Ingress appears on a cluster, within about 1-3 minutes it is fully accessible at its defined hostname with no human intervention. This project is currently in use on some of the team’s production clusters.

Please check out the kubernetes-acme-dns-registrar project and its README for full information on installation, setup, configuration, etc. I hope this helps others out there with this kind of need!
