Architecture for non-deterministic mass data collection: part 2: dynamic data lake schemas

Note: this is the final part of a two-part series about this project; article #1 is here. Continuing from where we left off: now that we had a functioning collection engine producing full graphs of crawled data all the way down to interrogable dataset_items, it was time to get down to …

Architecture for non-deterministic mass data collection: part 1: collection engine

Note: this is part one of a two-part series about this project; article #2 is here. One of my more recent projects was spawned from a pretty interesting idea. The team wanted to build a system that would let them scour the Internet for information regarding a particular set of targets; a "target" …

Serverless AWS Lambda architecture for large scale data ingestion

Recently I was faced with a requirement to build out an extensible data import framework that could consume various file formats provided by third parties... but make it faster than the current implementation. The mechanism in place used a proprietary, packaged legacy file ETL product whose output was an …

AWS Glue: Continuation for job JobBookmark does not exist

This will be a quick post, but I could not find much on this error, so I figured I'd post it for others. {"service":"AWSGlue","statusCode":400,"errorCode":"EntityNotFoundException","requestId":"xxxxx","errorMessage":"Continuation for job JobBookmark for accountId=xxxxx, jobName=myjob, runId=jr_xxxxx does not exist. not found","type":"AwsServiceError"} I was recently working on a PySpark job in AWS Glue and was attempting to use the Job Bookmarks feature, which lets …

Hazelcast discovery with Etcd

I've used Hazelcast for years and have generally relied upon the availability of multicast for Hazelcast cluster discovery and formation (within a single data center). Recently I was faced with two things: first, expanding the footprint into a non-multicast-enabled data center, and second, prepping the service for containerization, where nodes will come and go as scaling policies dictate …

Hazelcast discovery with Consul

I've used Hazelcast for years and have generally relied upon the availability of multicast for Hazelcast cluster discovery and formation (within a single data center). Recently I was faced with two things: first, expanding the footprint into a non-multicast-enabled data center, and second, prepping the service for containerization, where nodes will come and go as scaling policies dictate …

Copying lots of files into S3 (and within S3) using s3-bucket-loader

Recently a project I've been working on had the following requirements for a file-set of roughly a million files, varying in individual size from one byte to over a gigabyte, with a total size between 500 GB and one terabyte: store this file-set on Amazon S3; make this file-set accessible to applications …

Clustering Liferay globally across data centers (GSLB) with JGroups and RELAY2

Recently I've been looking into options for solving the problem of GSLB'ing (global server load balancing) a Liferay Portal instance. This article is a work in progress... and a long one. Jan Eerdekens states it correctly in his article, "Configuring a Liferay cluster is part experience and part black magic"... however doing it …

Testing yas3fs: a distributed S3 FUSE filesystem

I've recently been evaluating a few S3 filesystems; one in particular is yas3fs, which so far is quite impressive. I plan on doing a more detailed post about it later; however, for now I'd like to share a little tool I wrote to help me evaluate it. You …