Architecture for a data lake REST API using Delta Lake, Fugue & Spark

"Hey, we need some kind of a REST API over all our data lakes to let analysts and other integrations query records on demand. Can we please get this done?" That was the use case laid out that needed a solution. If you've had any experience with data lakes you know that they can be …

Architecture for generative Terragrunt & Terraform infrastructure as code (IaC)

This article covers a specific scenario: despite leveraging as many of the DRY (don't repeat yourself) features as the underlying IaC (infrastructure as code) frameworks make available to us, we sometimes still need to raise the abstraction another level to fully reduce code duplication and gain larger economies of scale when deploying large platforms …

Fully Automated Let's Encrypt TLS certs with ACME-DNS on Kubernetes

This article covers fully automating DNS and the issuance of TLS certificates on Kubernetes for Ingress-based workloads (both public and private) utilizing cert-manager, external-dns, acme-dns and kubernetes-acme-dns-registrar. Scenario: You are a busy DevOps professional. You want to set up a Kubernetes platform that can accept any typical HTTP-based workload (Ingress-based) with minimal management …

Reacting to K8s Events with k8s-watcher

As part of a recent project that needs to automatically issue new TLS certificates for hosts defined in Kubernetes Ingress objects, I ended up creating a library that detects such events in a simplified manner, for use in a larger Python program that needs to react to them. My …

Architecture for non-deterministic mass data collection: part 2: dynamic data lake schemas

Note, this is the final part of a two-part series about this project; article #1 is here. Continuing from where we left off: now that we had a functioning collection engine producing full graphs of crawled data all the way down to interrogable dataset_items, it was time to get down to …

Architecture for non-deterministic mass data collection: part 1: collection engine

Note, this is part one of a two-part series about this project; article #2 is here. One of my more recent projects was spawned from a pretty interesting idea. The team wanted to build a system that would permit them to scour the Internet for information regarding a particular set of targets; a "target" …

Serverless AWS Lambda architecture for large scale data ingestion

I was recently faced with a requirement to build out an extensible data import framework that would be able to consume various file formats provided by third parties... but make it faster than the current implementation. The mechanism in place was using a proprietary packaged legacy file ETL product whose output was an …

Using private Python Azure Artifacts feeds in Alpine Docker builds

This one will be relatively short; figured I'd post this for anyone else who was struggling with this use case. Your goal: your application needs to use a Python module that is available in a private Azure Artifacts feed and you want to pip install this module in an Alpine-based Docker build. Was recently working …

AWS Glue: Continuation for job JobBookmark does not exist

This will be a quick post, but I could not find much on this error, so figured I'd post it for others. {"service":"AWSGlue","statusCode":400,"errorCode":"EntityNotFoundException","requestId":"xxxxx","errorMessage":"Continuation for job JobBookmark for accountId=xxxxx, jobName=myjob, runId=jr_xxxxx does not exist. not found","type":"AwsServiceError"} Was recently working on a PySpark job in AWS Glue and was attempting to use the Job Bookmarks feature which lets …

Immutable health check management

If you've ever had to monitor an application, endpoint or website, you've likely come across literally hundreds of monitoring services that can execute simple HTTP-based checks from N global endpoints and then notify an operator when certain thresholds are met. One of the more widely known services that can do this is Pingdom. On a …