Tagged: AWS

Hazelcast discovery with Etcd

I’ve used Hazelcast for years and have generally relied on the availability of multicast for Hazelcast cluster discovery and formation (within a single data-center). Recently I was faced with two requirements: first, expand the footprint into a data-center that does not support multicast; second, prepare the service for containerization, where nodes will come and go as scaling policies dictate. In that world, hardwired Hazelcast clustering via XML configuration and/or reliance on multicast is a no-go.

With version 3.6, Hazelcast now supports a pluggable cluster discovery mechanism called the Discovery SPI (Discovery Strategy). Perfect timing: given we are already playing with Etcd as part of our Docker container strategy, this was an opportunity to let our application’s native clustering mechanism (coded on top of Hazelcast) leverage Etcd to discover/remove peers both within, and potentially across, data-centers.

So I coded up hazelcast-etcd-discovery-spi available on GitHub.


This works with Hazelcast 3.6-EA+ and Etcd to provide (optional) automatic registration of your hazelcast nodes as Etcd services and automatic peer discovery of the Hazelcast cluster.

Note that the automatic registration of each Hazelcast instance as an Etcd service is OPTIONAL; you can skip it if you want to maintain these key-paths in Etcd manually. I added it simply because I think it will be convenient for folks, especially when containerizing a Hazelcast-enabled app (such as via Docker), where the fewer “dependencies” and manual steps (i.e. registering your Hazelcast nodes by hand) the better. You can embed this functionality entirely within the discovery strategy SPI.
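
To make the register/discover idea concrete, here is a minimal sketch in plain Java. It is NOT the API of hazelcast-etcd-discovery-spi or of Etcd: the class name, the key-path layout and the addresses are all made up for illustration, and an in-memory sorted map stands in for the Etcd keyspace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** In-memory stand-in for a key/value store such as Etcd (illustration only). */
final class PeerRegistrySketch {

    // Sorted map, so scanning from a prefix behaves like listing a key-path.
    private final ConcurrentSkipListMap<String, String> store = new ConcurrentSkipListMap<>();

    /** On startup a node publishes its reachable address under prefix/nodeId. */
    void register(String prefix, String nodeId, String host, int port) {
        store.put(prefix + "/" + nodeId, host + ":" + port);
    }

    /** On shutdown a node removes its own key so peers stop seeing it. */
    void deregister(String prefix, String nodeId) {
        store.remove(prefix + "/" + nodeId);
    }

    /** Discovery: return every address currently registered under the prefix. */
    List<String> discover(String prefix) {
        String from = prefix + "/";
        List<String> peers = new ArrayList<>();
        for (Map.Entry<String, String> e : store.tailMap(from, true).entrySet()) {
            if (!e.getKey().startsWith(from)) {
                break; // left the key-path; later keys belong to other prefixes
            }
            peers.add(e.getValue());
        }
        return peers;
    }

    public static void main(String[] args) {
        // Hypothetical key-path and addresses, chosen only for this example.
        PeerRegistrySketch etcd = new PeerRegistrySketch();
        etcd.register("/hazelcast/my-cluster", "node-a", "10.0.0.1", 5701);
        etcd.register("/hazelcast/my-cluster", "node-b", "10.0.0.2", 5701);
        System.out.println(etcd.discover("/hazelcast/my-cluster"));
        etcd.deregister("/hazelcast/my-cluster", "node-a");
        System.out.println(etcd.discover("/hazelcast/my-cluster"));
    }
}
```

Whether a node writes its own key (automatic registration) or an operator maintains the key-paths by hand, the discovery side is the same: list the prefix and hand the addresses to Hazelcast.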

I hope others find this helpful, and please leave your feedback, pull-requests or issues on the project!

NOTE: if you are running your app in Docker you have a separate issue: you need to determine the externally accessible IP/PORT that the Docker host has mapped for you on 5701. How can you determine that so that you can publish the correct IP/PORT info to Etcd? Check out: https://github.com/bitsofinfo/docker-discovery-registrator-consul

NOTE! Interested in consul? There is a separate project which is built around Consul for your discovery strategy located here: https://github.com/bitsofinfo/hazelcast-consul-discovery-spi



Hazelcast discovery with Consul

I’ve used Hazelcast for years and have generally relied on the availability of multicast for Hazelcast cluster discovery and formation (within a single data-center). Recently I was faced with two requirements: first, expand the footprint into a data-center that does not support multicast; second, prepare the service for containerization, where nodes will come and go as scaling policies dictate. In that world, hardwired Hazelcast clustering via XML configuration and/or reliance on multicast is a no-go.

With version 3.6, Hazelcast now supports a pluggable cluster discovery mechanism called the Discovery SPI (Discovery Strategy). Perfect timing: given we are already playing with Consul as part of our Docker container strategy, this was an opportunity to let our application’s native clustering mechanism (coded on top of Hazelcast) leverage Consul to discover/remove peers both within, and potentially across, data-centers.

So I coded up hazelcast-consul-discovery-spi available on GitHub.


This works with Hazelcast 3.6-EA+ and Consul to provide automatic registration of your hazelcast nodes as Consul services (without having to run a local Consul agent) and automatic peer discovery of the Hazelcast cluster.

Note that the automatic registration of each Hazelcast instance as a Consul service is OPTIONAL if you already have Consul agents running that define your Hazelcast service nodes. I added it simply because I think it will be convenient for folks, especially when containerizing a Hazelcast-enabled app (such as via Docker), where the fewer “dependencies” (like a Consul agent available on the host, in the container, or in another container) the better. You can embed this functionality entirely within the discovery strategy SPI.

I hope others find this helpful, and please leave your feedback, pull-requests or issues on the project!

NOTE: if you are running your app in Docker you have a separate issue: you need to determine the externally accessible IP/PORT that the Docker host has mapped for you on 5701. How can you determine that so that you can publish the correct IP/PORT info to Consul? Check out: https://github.com/bitsofinfo/docker-discovery-registrator-consul

NOTE! Interested in etcd? There is a separate project which is built around etcd for your discovery strategy located here: https://github.com/bitsofinfo/hazelcast-etcd-discovery-spi


Copying lots of files into S3 (and within S3) using s3-bucket-loader

Recently a project I’ve been working on had the following requirements for a file-set of roughly a million files, varying in individual size from one byte to over a gigabyte, with a total size between 500 GB and one terabyte:

  1. Store this file-set on Amazon S3
  2. Make this file-set accessible to applications via the filesystem; i.e. access should look no different than any other directory structure locally on the Linux filesystem
  3. Changes on nodeA in regionA’s data-center should be available/reflected on nodeN in regionN’s data-center
  4. The available window to import this large file-set into S3 would be under 36 hours (due to the upgrade window for the calling application)
  5. The S3 bucket will need to be backed up at a minimum every 24 hours (to another bucket in S3)
  6. The application that will use all of the above generally treats the files as immutable and they are only progressively added and not modified.

If you are having to deal w/ a similar problem perhaps this post will help you out. Let’s go through each item.

Make this file-set accessible to applications via the filesystem; i.e. access should look no different than any other directory structure locally on the Linux filesystem. Changes on node-A in region-A’s data-center should be available/reflected on node-N in region-N’s data-center.

So here you are going to need an abstraction that can present the S3 bucket as a local directory structure; conceptually similar to an NFS mount. Any changes made to the directory structure should be reflected on all other nodes that mount the same set of files in S3. There are several different kinds of S3 file-system abstractions and they generally fall into one of three categories (block based, 1 to 1, and native); the type has big implications for whether the filesystem can be distributed or not. This webpage (albeit outdated) gives a good overview that explains the different types.

After researching a few of these we settled on attempting to use YAS3FS (yet another S3 filesystem). YAS3FS, written in Python, presents an S3 bucket via a local FUSE mount. What YAS3FS adds above other S3 filesystems is that it can be “aware” of events that occur on other YAS3FS nodes that mount the same bucket, and can be notified of changes via SNS/SQS messages. YAS3FS keeps a local cache on disk, so it gives the benefits (up to a point) of local access and can act like a CDN for the files on S3. Note that FUSE-based filesystems are slow and limited to a block size (IF the caller will utilize it) of 131072 bytes (128 KB).

YAS3FS itself works pretty well; however, we are *still* in the evaluation process as we work through many issues that keep cropping up in our beta environment, the big ones being Unicode support and several concurrency issues. Hopefully these will be solvable within the existing code’s architecture…


The available window to import this large file-set into S3 would be under 36 hours

Ok no problem, let’s just use s3cmd. Well… tried that and it failed miserably. After several crashes and failed attempts we gave up. S3cmd is single-threaded and extremely slow to do anything against a large file-set, much less load it completely into S3. I also looked at other tools (like s4cmd, which is multi-threaded), but even these “multi-threaded” tools eventually bogged down and/or became non-responsive against this large file-set.

Next we tried mounting the S3 bucket via YAS3fs and executing rsync’s from the source files to the target S3 mount…. again this “worked” without any crashing, but was single threaded and took forever. We also tried running several rsyncs in parallel, but managing this; and verifying the result, that all files were actually in S3 correctly w/ the correct meta-data, was a challenge. The particular challenge being that YAS3FS returns to rsync/cp immediately after the file is written to the local YAS3FS cache, and then proceeds to push to S3 asynchronously in the background (which makes it more difficult to check for failures).

Given the above issues, it was time to get crazy with this, so I came up with s3-bucket-loader. You can read all about how it works here, but the short of it is that s3-bucket-loader uses massive parallelism, orchestrating many EC2 worker nodes to load (and validate!) millions of files into an S3 bucket (via an S3 filesystem abstraction) much quicker than other tools. Rather than sitting around for days waiting for the copy process to complete with other tools, s3-bucket-loader can do it in a matter of hours (and validate the results). Please check out the GitHub project, which explains it in more detail.
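
The general shape of “split the work across many workers, then validate everything” can be sketched in a few lines of Java. This is only an illustration of the idea, not how s3-bucket-loader is implemented: threads stand in for EC2 worker nodes, in-memory maps stand in for the source directory and the S3 bucket, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelLoadSketch {

    /**
     * Copies every entry of source into target using the given number of
     * workers, then validates the result. Returns the keys that failed
     * validation. The target map must be safe for concurrent writes.
     */
    static List<String> loadAndValidate(Map<String, byte[]> source,
                                        Map<String, byte[]> target,
                                        int workers) {
        List<String> keys = new ArrayList<>(source.keySet());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Phase 1 (write): worker w copies keys w, w + workers, w + 2 * workers, ...
            List<CompletableFuture<Void>> copies = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int offset = w;
                copies.add(CompletableFuture.runAsync(() -> {
                    for (int i = offset; i < keys.size(); i += workers) {
                        String key = keys.get(i);
                        target.put(key, source.get(key).clone());
                    }
                }, pool));
            }
            copies.forEach(CompletableFuture::join); // wait for every worker to finish

            // Phase 2 (validate): re-check every key only after all writes are done.
            List<String> failed = new ArrayList<>();
            for (String key : keys) {
                if (!Arrays.equals(source.get(key), target.get(key))) {
                    failed.add(key);
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            files.put("dir/file-" + i, new byte[] {(byte) i});
        }
        Map<String, byte[]> bucket = new ConcurrentHashMap<>();
        System.out.println("failed validations: " + loadAndValidate(files, bucket, 8));
    }
}
```

Keeping validation as a separate pass matters with a write-behind filesystem like YAS3FS, because a write that “returned” to the copier has not necessarily reached S3 yet.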

The S3 bucket will need to be backed up at a minimum every 24 hours (to another bucket in S3)

Again, this presents another challenge; at least with copying from bucket to bucket you don’t actually have to move the files (bytes) around yourself, and can rely on S3’s key-copy functionality. So again we looked at s3cmd and s4cmd to do the job, and again they were slow, crashed, or bogged down due to the large file-set. I don’t know how these tools manage their internal work queue, but it seems to get so large that they just crash or slow down to the point where they become inefficient. At this point you have two options for very fast bucket copying:

  1. s3-bucket-loader: I ended up adding key-copy support to the program and it distributes the key-copy operations across ec2 worker nodes. It copies the entire fileset in under an hour, and under 20 minutes with more ec2 nodes.
  2. s3s3mirror: After coding #1 above, I came across s3s3mirror. This program is a multi-threaded, well coded power-house of a program that just “worked” the first time I used it. After contributing SSL, aws-encryption and storage-class support for it, doing a full bucket copy of over 600gb and ~800k s3 objects took only 45 minutes! (running w/ 100 threads). It has good status logging/output and I highly recommend it

Overall for the “copying” bucket to bucket requirement, I really like s3s3mirror; nice tool.



Clustering Liferay globally across data centers (GSLB) with JGroups and RELAY2

Recently I’ve been looking into options to solve the problem of GSLB’ing (global server load balancing) a Liferay Portal instance.

This article is a work in progress… and a long one. Jan Eerdekens states it correctly in his article, “Configuring a Liferay cluster is part experience and part black magic”…. doing it across data-centers, however, is like wielding black magic across N black holes….

Footnotes for this article are here: https://bitsofinfo.wordpress.com/2014/05/21/liferay-clustering-internals/

The objective is a typical one.

  • You have a large Liferay portal with users accessing the portal from lots of different locations across the world
  • The Liferay “cluster” is hosted in a single data-center in a particular region
  • The users outside of that region complain that using Liferay is slow for them
  • Your goal: extend the Liferay cluster across multiple regions while keeping all the functionality intact and boosting response times for users across the world by globally load balancing (GSLB) the application, so users in Asia hit a local data-center (DC) while users in South America hit a DC closer to them in their respective region
  • Solution: Good luck finding publicly available documentation on this for Liferay

Hopefully this article will help others out there, point them in a new direction and give them some ideas on how to put something like this together.

I’d like to note that this is not necessarily the ideal way to do this…. but just one way you can do it “out of the box” without extending Liferay in other ways. (I say this because there are many optimizations one could do to slim down what Liferay sends around a cluster by default… i.e. lots of heavier object serializations as events happen.) I’d also like to note that I am not a Liferay expert and I’m sure some things I am describing here are not 100% accurate. What is described below is the result of a lot of analysis of the uncommented, undocumented Liferay source code. I’m sure the developers at Liferay could provide additional insight, corrections and clarifications to what is stated below, but in lieu of design documentation, code comments and the like, this is all that we in the community have to go on.

For my use case I tested this with two data-centers. One home-grown bare-metal DC in “regionA” and the other in AWS “regionB”. Point being we have two DC’s that are geographically separated over a WAN; could be dedicated line, VPN tunnel; whatever; point being the bandwidth available between DC’s is nothing compared to within DC.


Little bit of background:

Liferay has some light documentation for clustering, however this documentation is focused on a Liferay cluster within a single DC. I say “light documentation” because that is the nicest way I can state it. The Liferay project in general is quite void of any real documentation when it comes to how things work internally “under the hood” in the Liferay core codebase (design documents, code comments etc). If you want to know how things work, you have to crawl through tons of un-commented source code and figure it out for yourself.

First off, Liferay uses JGroups under the hood (specifically JGroups version 3.2.10 in Liferay 6.2 as of this post). JGroups is pretty much one of the de-facto “go-to’s” for building clusters in Java applications and has been around a very long time; many Java stacks use it. If you want to see a good example of an open-source project with good design documents that explain the internals, see JGroups (hint, hint Liferay guys). I’m not going to go much further into describing JGroups; you can do that on your own, as I’ll be using some JGroups terminology below.

Liferay & JGroups basics:

Liferay defines two primary JGroups channels for what Liferay calls “cluster link”.  You enable this in your portal-ext.properties by setting cluster.link.enabled=true. By default all channels in Liferay are UDP (multicast); if you are trying to cluster Liferay in a DC that does not support multicast (like AWS) you will want to configure it to use unicast (see this article)

  • Channel “control” portal-ext.properties entry = cluster.link.channel.properties.control 
  • Channel “transport” portal-ext.properties entry = cluster.link.channel.properties.transport.0(?N)

Both of these channels are used for various things such as notifying other cluster members of a given node’s state (i.e. notifying they are up/down etc) and sending ClusterRequests to other nodes for invoking operations across the cluster. There does not seem to be any consistency as to why one is used over the other. For example, streaming a Lucene index over from another node uses the control channel, while reindex() requests use the transport channel.

Out of the box, if you bring up more than one node in a local DC (configured w/ UDP multicast), the Liferay nodes will automatically discover each other, peer up into a cluster and begin to send data to each other when appropriate. For unicast, again, you have to make some changes to your portal-ext.properties, but effectively the result is the same.

Great! Now what? We have one DC with N local nodes that are all peered up with each other….. but how can this DC exchange state with DC2? Good question. When attempting to GSLB an application there are many considerations; specifically for Liferay, the big ones that I noted need to be addressed are below. Note there are some more hidden in Liferay’s internals, but for the big picture let’s just focus on these 🙂

Cluster “master” determination:

Liferay has the concept of a logical “master”, and which node is the “master” is determined by ownership of a named lock called “com.liferay.portal.cluster.ClusterMasterExecutorImpl” that resides in the “Lock_” table in the database (@see ClusterMasterTokenClusterEventListener). Note that the database is shared by all nodes across all DCs (see database section below). This presents a huge problem if the clusters in separate DCs (whose nodes, by default w/ JGroups, are only aware of peers in their own DC) can’t talk to each other across DCs, which is what this article is about: node1 in DC1 might acquire this Lock_ first, but the nodes in DC2 cannot communicate w/ the “master” because it is in an unreachable DC.
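
To illustrate the problem, here is a toy model of “first node to grab the named lock is master” in plain Java. It does not use any Liferay classes: a concurrent map stands in for the shared Lock_ table, a set of node names stands in for a node’s local JGroups view, and the node names are made up.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class MasterLockSketch {

    // Stand-in for the shared Lock_ table: lock name -> node that owns it.
    private final ConcurrentMap<String, String> lockTable = new ConcurrentHashMap<>();

    /** Tries to take the named lock; returns whichever node owns it afterwards. */
    String acquireOrGetOwner(String lockName, String nodeId) {
        String existingOwner = lockTable.putIfAbsent(lockName, nodeId);
        return existingOwner == null ? nodeId : existingOwner;
    }

    /** A node can only talk to the master if the master is in its local cluster view. */
    static boolean masterReachable(String master, Set<String> localView) {
        return localView.contains(master);
    }

    public static void main(String[] args) {
        String lock = "com.liferay.portal.cluster.ClusterMasterExecutorImpl";
        MasterLockSketch db = new MasterLockSketch(); // one database shared by both DCs

        // dc1-node1 boots first and wins the lock; dc2-node1 then reads the same owner.
        String masterSeenFromDc1 = db.acquireOrGetOwner(lock, "dc1-node1");
        String masterSeenFromDc2 = db.acquireOrGetOwner(lock, "dc2-node1");

        // Each DC's cluster view only contains its own peers.
        Set<String> dc1View = Set.of("dc1-node1", "dc1-node2");
        Set<String> dc2View = Set.of("dc2-node1", "dc2-node2");

        System.out.println("DC1 can reach master: " + masterReachable(masterSeenFromDc1, dc1View));
        System.out.println("DC2 can reach master: " + masterReachable(masterSeenFromDc2, dc2View));
    }
}
```

Both DCs agree on who the master is because they read the same database row, but only one of them can actually send it a message.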


Database:

This is an enormous topic on its own, but for this article let’s keep it dumb simple: leverage Liferay’s reader-writer database configuration. For example, in DC1 set up the master instance for your database and designate it as your “write” database, then configure two read-slave instances: one slave in DC1 and another in DC2. In the portal-ext.properties files for nodes in both DCs, configure “jdbc.write.url” to hit the master instance in DC1 and “jdbc.read.url” to hit whichever read-slave instance is local to each DC.

Cached data:

Liferay leverages Ehcache for caching within the application. Clustering for Ehcache can be enabled by reading this article. The configuration relies on a separate JGroups channel within the Liferay cluster and you need to properly configure it for unicast if your DC does not support multicast just like previously described.

Fire up two separate DC’s pointing to a database setup as previously described, go change some data in the app via a node in DC1, and because it is likely cached, when you view that same page/data in DC2 you won’t see the change visible in the UI. Why? Well because the clustering is only local to each DC. When model objects are updated in Liferay they are saved in the database, and then an event occurs (distributed via JGroups) that tells peer nodes to dump the cache entry…. but only local to that DC. So you say, “well why not just enable unicast for all nodes in every DC so they are aware of all other nodes in all other DCs?” You could, but imagine the cross-talk; no thanks. There are a few solutions to this, one will be described below (via RELAY2) and another could be provided by the ehcache-jms-wan-replicator project.
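
The stale-read scenario is easy to reproduce in a toy model. The sketch below is plain Java with made-up names (no Ehcache or Liferay classes): a map plays the shared database, each node has a private cache, and an update only evicts the cache entry on the nodes of the local cluster.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StaleCacheSketch {

    /** Stand-in for the database shared by every node in every DC. */
    static final Map<String, String> database = new HashMap<>();

    /** A node with its own local cache in front of the database. */
    static final class Node {
        final Map<String, String> cache = new HashMap<>();

        String read(String key) {
            // Cache hit returns the cached value; a miss loads from the database.
            return cache.computeIfAbsent(key, database::get);
        }

        void evict(String key) {
            cache.remove(key);
        }
    }

    /** Saves to the database, then tells ONLY the local cluster to drop the entry. */
    static void update(String key, String value, List<Node> localCluster) {
        database.put(key, value);
        for (Node peer : localCluster) {
            peer.evict(key);
        }
    }

    public static void main(String[] args) {
        database.put("user:1", "old name");
        Node dc1Node = new Node();
        Node dc2Node = new Node();
        dc1Node.read("user:1"); // both nodes warm their caches
        dc2Node.read("user:1");

        update("user:1", "new name", List.of(dc1Node)); // the change happens in DC1

        System.out.println("DC1 sees: " + dc1Node.read("user:1"));
        System.out.println("DC2 sees: " + dc2Node.read("user:1"));
    }
}
```

Passing both nodes to update() is the toy equivalent of getting the eviction event across the WAN, which is what RELAY2 or ehcache-jms-wan-replicator would be responsible for.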

Indexed data:

Liferay’s default search index is implemented using Lucene. There are several ways to configure Lucene for a Liferay cluster... but for this article let’s keep it simple: skip Solr (to avoid the complexity of having to GSLB that as well) and just enable “lucene.replicate.writes=true”. Each peer node within a local DC then has its own Lucene index copy, and once again Liferay leverages JGroups (via ClusterExecutorImpl), triggered by all sorts of fancy AOP (see IndexableAdvice.java, the @Indexable annotation, IndexWriterProxyBean + review search-spring.xml in the codebase), to essentially intercept an index write and broadcast a “reindex this thing” message to peer nodes in the local DC cluster. Note that Liferay sends the entire object w/ data to be reindexed to all peer nodes over the JGroups channel (which is not necessarily efficient over a WAN). Secondly, when you go to Server Administration in the GUI and hit the “reindex all index data” button, the server you invoke this on also invokes that operation against all peer nodes. Lastly, another hidden thing is that peer nodes will suck over the entire Lucene index via an HTTP call on startup from a donor node….. again, we’ll touch on this later and the considerations to think about.

Again, fire up two separate DCs pointing to a database/caching setup as previously described. Go to the control panel and add a new user in DC1. Great, you see the new user in the users screen when accessing through a DC1 node. Now go view the users from DC2’s perspective. You won’t see the user, nor can you find them in a search, despite them being in the database. Why? Two things: first, that Lucene “reindex this thing” message did not make it to DC2, and secondly (unless at this point you have either RELAY2 set up OR ehcache-jms-wan-replicator configured) these screens rely on a combination of what is in Ehcache and what is in the local Lucene index.

Document library files:

This is most definitely a consideration for GSLB’ing Liferay across DC’s, however it really does not have anything to do w/ JGroups and RELAY2 in particular so I’m not going to discuss it here. I’ll point you in a direction… consider putting Liferay’s files in S3 and abstracting that with something like YAS3FS which is quite promising for building a distributed file store with local node CDN read performance. Much faster than Liferay’s S3 implementation and globally aware of changes to the FS.

Job scheduling:

Liferay has two classes of “jobs” (implemented via Quartz); master jobs which run only on ONE node, the “master” node and “slave” jobs which can run on any node. Again who is the “master” job runner is determined by an entry in the Lock_ table named “com.liferay.portal.kernel.scheduler.SchedulerEngine”. Who gets the lock is essentially which server boots up and acquires it first. The same problem that exists as noted above with regards to slave->master communication exists here in separated DC to DC environments where the nodes in separate DC’s cannot talk to one another over a WAN. Liferay has a few job types denoted by com.liferay.portal.kernel.scheduler.StorageType of:

  • MEMORY: run on any node and transient in memory
  • MEMORY_CLUSTERED: run only on the “master” node but transient in memory
  • PERSISTED: runs on any node but state persisted in database

Point being, the job engine in Liferay depends on being able to communicate with all other nodes.

“Live Users”:

If you have the “live.users.enabled=true” option set in your portal-ext.properties you can do user monitoring in the cluster, and this too (if clustering is enabled) needs to see all nodes in the cluster. When a cluster node comes up it sends a ClusterMessageType.NOTIFY/UPDATE which goes to all nodes. Each of them in turn broadcasts a local ClusterEvent.JOIN, which is reacted to by sending a unicast ClusterRequest.EXECUTE for LiveUsers.getLocalClusterUsers() to the remote node, effectively signing in the other node’s users locally on the requesting node. This appears to be there so that each node will locally reflect all logged-in users across the cluster. Again, this will be incomplete if node1 in DC1 cannot talk to node2 in DC2.


There are a few other functions in Liferay that appear to leverage the underlying JGroups channels (i.e. EE Licensing, JarUtil, DataSourceFactoryUtil, SetupWizard, PortalManagerUtil.manage() (related to JMX?), PortalImpl.resetCDNHosts() and EhcacheStreamBootstrapCacheLoader). You can see my other notes article for these tidbits.


It might be helpful to diagnose what’s going on by tweaking the logging settings for Liferay. To do this you should read this article to permanently change your log settings (which will dump more info on bootup, which is important). Don’t rely on changing your logging settings via the GUI screen, as those changes are transient and not permanently saved. Below is an example webapps/ROOT/WEB-INF/classes/META-INF/portal-log4j-ext.xml file I used to triage various issues on bootup related to clustering.

<?xml version="1.0"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

<category name="com.liferay.portal.cluster">
<priority value="TRACE" />
</category>

<category name="com.liferay.portal.license">
<priority value="TRACE" />
</category>

<category name="org.jgroups">
<priority value="DEBUG" />
</category>

<category name="com.liferay.portal.kernel.cluster">
<priority value="TRACE" />
</category>

</log4j:configuration>



How JGroups RELAY2 can bridge the gap.

So at this point it should be clear to anyone reading that we need a way to get these separated clusters in DC1 and DC2 to talk to one another. First off, we could just change the control/transport channels in Liferay to force the use of TCP unicast and explicitly list all nodes globally across DC1 and DC2. This would let every node know about every other node globally; however, this won’t scale well as each node would talk to every other node. The other option is the RELAY2 functionality available in JGroups.

Essentially what RELAY2 provides is a “bridge” between N different JGroups clusters that are physically separated by a WAN or other network arrangement. By adding RELAY2 into your JGroups stack you can somewhat easily ensure that ALL messages that are passed through a JGroups channel will be distributed over the bridge to all other JGroups clusters in other DCs. Think of RELAY2 as a secondary “cluster” in addition to your local cluster; however, only ONE node in your local JGroups cluster (the coordinator) is responsible for “relaying” all messages over the bridge to the other “coordinators” in the relay cluster, for distribution to the other coordinators’ local JGroups clusters. So “out of the box” this can let you ensure that all Liferay cluster operations that occur in DC1 get transmitted to DC2 transparently. With this enabled, when we add that user to our local Lucene index across all local nodes via UDP in the local cluster, the coordinator node in DC2 will also receive that event and transmit it locally to all nodes in the DC2 cluster. Now when you view the “users” on node2 in DC2 you will see the data that was added by node1 in DC1.
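
The message flow can be modelled in a few lines of plain Java. This is a conceptual toy, not JGroups code: the Site class, the node names and the wanSends counter are all invented for the example, and the “bridge” is just a method call standing in for the coordinator-to-coordinator relay cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RelaySketch {

    /** One local cluster ("site"); member name -> messages that member received. */
    static final class Site {
        final String name;
        final Map<String, List<String>> inboxes = new LinkedHashMap<>();
        final List<Site> bridged = new ArrayList<>();
        int wanSends = 0; // how many times this site sent something across the bridge

        Site(String name, String... members) {
            this.name = name;
            for (String member : members) {
                inboxes.put(member, new ArrayList<>());
            }
        }

        /** Local delivery: every member of this site receives the message. */
        void deliverLocally(String message) {
            for (List<String> inbox : inboxes.values()) {
                inbox.add(message);
            }
        }

        /** A member multicasts: local delivery, then ONE relayed copy per remote site. */
        void multicast(String message) {
            deliverLocally(message);
            for (Site remote : bridged) {
                wanSends++;                     // coordinator-to-coordinator hop over the WAN
                remote.deliverLocally(message); // remote coordinator redistributes locally
            }
        }
    }

    /** Connects two sites in both directions. */
    static void bridge(Site a, Site b) {
        a.bridged.add(b);
        b.bridged.add(a);
    }

    public static void main(String[] args) {
        Site dc1 = new Site("dc1", "dc1-node1", "dc1-node2", "dc1-node3");
        Site dc2 = new Site("dc2", "dc2-node1", "dc2-node2");
        bridge(dc1, dc2);

        dc1.multicast("reindex user 42");

        System.out.println("dc2-node2 inbox: " + dc2.inboxes.get("dc2-node2"));
        System.out.println("WAN sends from dc1: " + dc1.wanSends);
    }
}
```

The point of the counter is the contrast with the “list every node everywhere” approach: here one copy crosses the WAN per remote site, regardless of how many nodes each site has.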

One IMPORTANT caveat is that the RPC building blocks in JGroups do not effectively cascade RPC calls over a RELAY2 bridge. See here and this thread. To Liferay’s credit, they did not implement the “RPC”-like method invocations across Liferay clusters using RPC in JGroups (I’m not sure why); rather, they serialize ClusterRequests which encapsulate what is to be invoked on the other side via method reflection, and just send these messages as serialized objects over the wire. Had they used RPC, one would have to modify Liferay’s code to get these RPC invocations across the RELAY2 bridge.
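
The “serialize a description of the call, ship the bytes, invoke via reflection on the other side” pattern looks roughly like the sketch below. It is a generic Java illustration, not Liferay’s ClusterRequest: the Request class and its fields are hypothetical, and it only handles public static methods whose parameter types exactly match the argument classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Method;

final class ReflectiveRequestSketch {

    /** Hypothetical request: names a public static method and carries its arguments. */
    static final class Request implements Serializable {
        private static final long serialVersionUID = 1L;
        final String className;
        final String methodName;
        final Object[] args;

        Request(String className, String methodName, Object... args) {
            this.className = className;
            this.methodName = methodName;
            this.args = args;
        }
    }

    /** Sender side: the request becomes plain bytes, so any transport can carry it. */
    static byte[] serialize(Request request) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(request);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Receiver side: rebuild the request and invoke the named method reflectively. */
    static Object deserializeAndInvoke(byte[] wire) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire))) {
            Request request = (Request) in.readObject();
            Class<?>[] parameterTypes = new Class<?>[request.args.length];
            for (int i = 0; i < request.args.length; i++) {
                parameterTypes[i] = request.args[i].getClass();
            }
            Method method = Class.forName(request.className)
                    .getMethod(request.methodName, parameterTypes);
            return method.invoke(null, request.args); // null target: static method
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Remote" call of Integer.parseInt("42"), described as data rather than as an RPC.
        byte[] wire = serialize(new Request("java.lang.Integer", "parseInt", "42"));
        System.out.println(deserializeAndInvoke(wire));
    }
}
```

Because the request is just an opaque payload to the transport, a relay only has to forward bytes; it does not need to understand or re-issue a remote procedure call.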


How to configure in Liferay:

What I am describing below assumes you are just using the Liferay cluster defaults of a local UDP multicast cluster. Again, if you are in AWS you will just need to adjust your unicast TCP JGroups stack accordingly; the configuration is pretty much the same w/ regards to where RELAY2 is configured in the stack.

IMPORTANT: this only covers the transport and control channels. If you want to enable this kind of relay bridge for the separate Ehcache channels that Liferay uses, you will repeat the process (described in the steps below) for the Ehcache JGroups channel definitions in Liferay as well… summary high-level steps below (note alternatively you could leave the ehcache jgroups configuration alone in liferay and just leverage the ehcache jms wan replicator.)

  • UNCOMMENT: ehcache.cache.manager.peer.provider.factory = net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory
  • MODIFY: “ehcache.multi.vm.config.location.peerProviderProperties” and add a “connect” property to manually define a JGroups stack that incorporates RELAY2, in a fashion similar to how we do it below for the “control” and “transport” channels. @see the ehcache documentation here.


Ok, first let’s define our “relay clusters” for both the control and transport channels in Liferay. Note that all the steps below need to be done on all nodes across all DCs, and you need to adjust certain things relative to the DC you are running in, particularly the “site” names in the RELAY2 configurations.

1. Create a file in your WEB-INF/classes dir called “relay2-transport.xml”.

<RelayConfiguration xmlns="urn:jgroups:relay:1.0">
    <sites>
        <site name="dc1" id="0">
            <bridges><bridge config="relay2_global_transport_tcp.xml" name="global_transport" /></bridges>
        </site>
        <site name="dc2" id="1">
            <bridges><bridge config="relay2_global_transport_tcp.xml" name="global_transport" /></bridges>
        </site>
    </sites>
</RelayConfiguration>


2. Create a file in your WEB-INF/classes dir called “relay2-control.xml”

<RelayConfiguration xmlns="urn:jgroups:relay:1.0">
    <sites>
        <site name="dc1" id="0">
            <bridges><bridge config="relay2_global_control_tcp.xml" name="global_control" /></bridges>
        </site>
        <site name="dc2" id="1">
            <bridges><bridge config="relay2_global_control_tcp.xml" name="global_control" /></bridges>
        </site>
    </sites>
</RelayConfiguration>

3. Next we need to configure the actual TCP relay clusters for both the control/transport relay configurations. The sample below is a TEMPLATE you can copy twice, once as “relay2_global_control_tcp.xml” and once as “relay2_global_transport_tcp.xml”. Change TCP.bind_port and TCPPING.initial_hosts appropriately for each file. For TCPPING.initial_hosts, you will want to list ONE known node that lives in each DC. These will be the initial relay coordinator nodes.

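    <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">

    <!-- Placeholder port (assumption): use a different bind_port in the control and transport files -->
    <TCP bind_port="7800" />

    <!-- Placeholder host names (assumption): list ONE known node from each DC, with that file's bind_port -->
    <TCPPING timeout="3000"
             initial_hosts="dc1-node1[7800],dc2-node1[7800]" />
    <!-- Attribute values on continuation lines mirror the channel stacks in step 4; tune as needed -->
    <MERGE2  min_interval="10000"
             max_interval="30000" />
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                    discard_delivered_msgs="true" />
    <UNICAST2 />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M" />
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true" />

    <MFC max_credits="2M"
         min_threshold="0.4" />
    <FRAG2 frag_size="60K"  />
    <!--RSVP resend_interval="2000" timeout="10000"/-->

    </config>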


4. At this point we have the relay cluster configuration defined. Now we need to adjust Liferay’s “control” and “transport” channel JGroups stack to utilize them. Open up portal-ext.properties and add the following to the control and transport channels. Note these are long lines, copied from portal.properties default to portal-ext.properties and then appending to the stack:

:FORWARD_TO_COORD:relay.RELAY2(site=[DC1 | DC2];config=${configRootDir}/relay2-[transport|control].xml;relay_multicasts=true)

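IMPORTANT: be sure to set the “site=” property of RELAY2 to the DC you are configuring, and make sure the value exactly matches the corresponding “site” name in the relay2-transport.xml and relay2-control.xml files. The [DC1 | DC2] shown here is a placeholder; with the relay files above you would use site=dc1 or site=dc2.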

cluster.link.channel.properties.control=UDP(bind_addr=${cluster.link.bind.addr["cluster-link-control"]};mcast_group_addr=${multicast.group.address["cluster-link-control"]};mcast_port=${multicast.group.port["cluster-link-control"]}):PING(timeout=2000;num_initial_members=20;break_on_coord_rsp=true):MERGE3(min_interval=10000;max_interval=30000):FD_SOCK:FD_ALL:VERIFY_SUSPECT(timeout=1500):pbcast.NAKACK2(xmit_interval=1000;xmit_table_num_rows=100;xmit_table_msgs_per_row=2000;xmit_table_max_compaction_time=30000;max_msg_batch_size=500;use_mcast_xmit=false;discard_delivered_msgs=true):UNICAST2(max_bytes=10M;xmit_table_num_rows=100;xmit_table_msgs_per_row=2000;xmit_table_max_compaction_time=60000;max_msg_batch_size=500):pbcast.STABLE(stability_delay=1000;desired_avg_gossip=50000;max_bytes=4M):pbcast.GMS(join_timeout=3000;print_local_addr=true;view_bundling=true):UFC(max_credits=2M;min_threshold=0.4):MFC(max_credits=2M;min_threshold=0.4):FRAG2(frag_size=61440):RSVP(resend_interval=2000;timeout=10000):FORWARD_TO_COORD:relay.RELAY2(site=[DC1 | DC2];config=${configRootDir}/relay2-control.xml;relay_multicasts=true)

cluster.link.channel.properties.transport.0=UDP(bind_addr=${cluster.link.bind.addr["cluster-link-udp"]};mcast_group_addr=${multicast.group.address["cluster-link-udp"]};mcast_port=${multicast.group.port["cluster-link-udp"]}):PING(timeout=2000;num_initial_members=20;break_on_coord_rsp=true):MERGE3(min_interval=10000;max_interval=30000):FD_SOCK:FD_ALL:VERIFY_SUSPECT(timeout=1500):pbcast.NAKACK2(xmit_interval=1000;xmit_table_num_rows=100;xmit_table_msgs_per_row=2000;xmit_table_max_compaction_time=30000;max_msg_batch_size=500;use_mcast_xmit=false;discard_delivered_msgs=true):UNICAST2(max_bytes=10M;xmit_table_num_rows=100;xmit_table_msgs_per_row=2000;xmit_table_max_compaction_time=60000;max_msg_batch_size=500):pbcast.STABLE(stability_delay=1000;desired_avg_gossip=50000;max_bytes=4M):pbcast.GMS(join_timeout=3000;print_local_addr=true;view_bundling=true):UFC(max_credits=2M;min_threshold=0.4):MFC(max_credits=2M;min_threshold=0.4):FRAG2(frag_size=61440):RSVP(resend_interval=2000;timeout=10000):FORWARD_TO_COORD:relay.RELAY2(site=[DC1 | DC2];config=${configRootDir}/relay2-transport.xml;relay_multicasts=true)
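
Since the only parts of that suffix that change are the site name and the channel, here is a minimal sketch of a helper that spells out the four possible values. The class and method names are hypothetical and not part of Liferay or JGroups; it only does string assembly so you can check your hand-edited lines against it.

```java
public class Relay2Suffix {

    /**
     * Builds the suffix appended to a cluster link channel stack.
     * site is "DC1" or "DC2"; channel is "control" or "transport".
     * (Hypothetical helper, for illustration only.)
     */
    public static String build(String site, String channel) {
        if (!site.equals("DC1") && !site.equals("DC2")) {
            throw new IllegalArgumentException("unknown site: " + site);
        }
        if (!channel.equals("control") && !channel.equals("transport")) {
            throw new IllegalArgumentException("unknown channel: " + channel);
        }
        // ${configRootDir} is left as a literal placeholder; it is resolved
        // later from the JVM options, not by this helper.
        return ":FORWARD_TO_COORD:relay.RELAY2(site=" + site
                + ";config=${configRootDir}/relay2-" + channel + ".xml"
                + ";relay_multicasts=true)";
    }

    public static void main(String[] args) {
        // Print the suffix for every site/channel combination.
        for (String site : new String[] {"DC1", "DC2"}) {
            for (String channel : new String[] {"control", "transport"}) {
                System.out.println(build(site, channel));
            }
        }
    }
}
```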


 5. Ok, now adjust Liferay’s startup Java options to add the following, then start Liferay. This is required so that the JGroups stacks can find the relay2-transport.xml and relay2-control.xml relay configuration files.


 6. In Liferay’s logs you should see some additional JGroups control/transport channels showing up that represent the bridge between the remote sites.


At this point you should have two separate Liferay clusters in two separate data-centers “bridged” using JGroups RELAY2, so that many of the issues described in this article are resolved and Liferay cluster events/messages are received across both data-centers. Note that unless you also tweaked the Ehcache JGroups configuration (as noted earlier) or are using the Ehcache JMS WAN replicator, the Ehcache clustered cache replication events will not be sent to the other DCs.

That said, again, this is not necessarily the best way to do this. The best way is still to be determined, as this is an experiment in progress. There are many factors that may or may not warrant using RELAY2 as the means to bridge multiple separated Liferay clusters. Liferay can generate a lot of cluster traffic, and if you are bridging over a WAN this may not be efficient, or could potentially block operations in DC1 while waiting on a coordinator or timeout in DC2. That results in perceived slowness on the sender side, depending on whether Liferay is invoking things remotely via ClusterRequests synchronously or asynchronously.

The latter could potentially be alleviated by tweaking your TCP bridge configuration to optimize the TCP “max_bundle_size” and “max_bundle_timeout” parameters. Doing so could reduce the sheer number of messages sent back and forth over the bridge, i.e. let the bridge queue up messages until the timeout elapses or until the total amount of data to be sent is >= N (size), effectively “batching” the data to be sent around the WAN. Note that when tweaking these configuration settings you may encounter this kind of warning, in which case you will need to adjust your OS settings accordingly:

“WARN  [localhost-startStop-1][UDP:547] [JGRP00014] the receive buffer of socket MulticastSocket was set to 500KB, but the OS only allocated 124.93KB. This might lead to performance problems. Please set your max receive buffer in the OS correctly (e.g. net.core.rmem_max on Linux)”
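
To make the “batching” idea concrete, here is a minimal sketch of a size-or-timeout bundler. This is NOT how JGroups implements bundling internally; the class, its fields and its methods are all hypothetical, and the clock is passed in explicitly to keep the example deterministic. It only illustrates the trade-off the two parameters control.

```java
import java.util.ArrayList;
import java.util.List;

public class MessageBundler {
    private final int maxBundleBytes;          // plays the role of max_bundle_size
    private final long maxBundleTimeoutMillis; // plays the role of max_bundle_timeout
    private final List<byte[]> pending = new ArrayList<byte[]>();
    private final List<List<byte[]>> sent = new ArrayList<List<byte[]>>();
    private int pendingBytes = 0;
    private long firstQueuedAt = -1;

    public MessageBundler(int maxBundleBytes, long maxBundleTimeoutMillis) {
        this.maxBundleBytes = maxBundleBytes;
        this.maxBundleTimeoutMillis = maxBundleTimeoutMillis;
    }

    /** Queues one message, flushing first if it would push the bundle over the size cap. */
    public void add(byte[] msg, long nowMillis) {
        if (!pending.isEmpty() && pendingBytes + msg.length > maxBundleBytes) {
            flush();
        }
        if (pending.isEmpty()) {
            firstQueuedAt = nowMillis; // remember when this bundle started filling
        }
        pending.add(msg);
        pendingBytes += msg.length;
        if (pendingBytes >= maxBundleBytes) {
            flush();
        }
    }

    /** Called periodically; flushes if the oldest queued message has waited long enough. */
    public void tick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - firstQueuedAt >= maxBundleTimeoutMillis) {
            flush();
        }
    }

    /** Stand-in for one send over the WAN bridge: the whole bundle goes out at once. */
    private void flush() {
        sent.add(new ArrayList<byte[]>(pending));
        pending.clear();
        pendingBytes = 0;
    }

    public int bundlesSent() { return sent.size(); }

    public int pendingCount() { return pending.size(); }
}
```

A larger size cap means fewer, bigger sends over the bridge; a longer timeout means small trickles of messages wait longer before going out.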

Potential alternatives to RELAY2 are point-specific solutions for transmitting the most visible/critical information to other DCs, such as sending cache events in a batched, optimized fashion by using something like the Ehcache JMS WAN replicator. Another example is writing an extension for Liferay that would do something similar for batching “reindex()” requests across DCs. Rather than relying on RELAY2, which would transmit full Lucene documents over the wire one-by-one, an extension could be developed to batch these in an optimized fashion where only model ids, type and operation request are conveyed rather than the entire Lucene document.
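
As a rough sketch of what such a batched reindex message could carry, assuming nothing about Liferay’s actual extension points: the class below, its field names and its line-based wire format are all hypothetical, and the model type names used with it are just illustrative strings.

```java
import java.util.ArrayList;
import java.util.List;

public class ReindexBatch {

    public enum Op { UPDATE, DELETE }

    /** One reindex request: just enough for the remote DC to rebuild the document itself. */
    public static final class Entry {
        final String modelType;
        final long modelId;
        final Op op;

        Entry(String modelType, long modelId, Op op) {
            this.modelType = modelType;
            this.modelId = modelId;
            this.op = op;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    public ReindexBatch add(String modelType, long modelId, Op op) {
        entries.add(new Entry(modelType, modelId, op));
        return this;
    }

    public int size() { return entries.size(); }

    /** Encodes the batch as one "type|id|op" line per entry (a made-up format). */
    public String encode() {
        StringBuilder sb = new StringBuilder();
        for (Entry e : entries) {
            sb.append(e.modelType).append('|')
              .append(e.modelId).append('|')
              .append(e.op).append('\n');
        }
        return sb.toString();
    }
}
```

The receiving DC would then look each model up in the shared database and index it locally, so the payload stays a few dozen bytes per entry instead of a full Lucene document.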

Also note that because Liferay streams the entire contents of its Lucene indexes to peer nodes, it is important to understand “who” Liferay considers a peer node. This is determined by calling getView() on the “receiver” of its “control” JChannel (Liferay’s BaseReceiver), which should only return local DC peers and not those across a RELAY2 bridge; this should keep a node from connecting to a node in a WAN-separated DC to stream the index (note according to these docs, a JGroups channel stack with RELAY2 enabled does NOT return views that span bridges). If you didn’t use RELAY2 but instead manually configured a giant UNICAST cluster across WANs, you would have to consider how the clusters boot up: if you start node1 in DC1, then node2 in DC1 would get its index update from node1 (DC1), which is fine because they are on the same local network. However, when node3 comes up in DC2, because it recognizes the “master” node (via the shared database Lock_ table entry) as living in DC1, it would have to stream its index over the WAN. I’d also like to note that Liferay WILL attempt to stream an index from a node in another DC if you do a manual “reindex” that is initiated by a remote DC; however, this will fail due to the way the ClusterRequest to reindex() in other DCs sends along the JGroups “address” (see the footnotes section on “ClusterLoadingSyncJob” for more details).

There is also one other oddity I noted in “who” Liferay considers to be “peers” when RELAY2 is used. See the section on ClusterMasterExecutorImpl in the footnotes for more information.

Regardless, doing any of the latter customizations requires analysis of Liferay’s behavior/code to determine which “events” in Liferay could be missed by not using RELAY2 (which will catch everything). You would also be responsible for keeping tabs on what changes as Liferay does new releases, modifies the way one particular “clusterable” action behaves, or adds totally new features that send messages over the cluster.

Hopefully this document will help others out there trying to solve this kind of problem and save others some valuable time!






Testing yas3fs: a distributed S3 FUSE filesystem

I’ve recently been doing quite a bit of evaluation of a few S3 filesystems; one in particular is yas3fs, which so far is quite impressive. I plan on doing a more detailed post about it later, however for now I’d like to share a little tool I wrote to help me in my evaluation of it. You can check it out at https://github.com/bitsofinfo/yas3fs-cluster-tester

Part 2: Nevado JMS, Ehcache JMS WAN replication and AWS

This post is a follow-up to what is now part one of my research into using Ehcache, JMS, Nevado, AWS, SNS and SQS to permit cache events to be replicated across different data-centers.

@see https://github.com/bitsofinfo/ehcache-jms-wan-replicator

In my first post I was able to successfully get Ehcache’s JMS Replication to function using Nevado JMS as the provider after patching the serialization issues. However, as noted in that article, the idea of sending the actual cached data when any put/update occurred (if the cache replicator is configured that way) in Ehcache on any given node sounded like it might get out of hand. Secondly, polling from SNS is slow! Apps can generate thousands of remove events that need to be distributed globally, and consumers in other DCs will get way behind; batching optimized for SNS is needed. Given that, I started looking at writing my own prototype of something lighter weight, but eventually came to the realization that the existing Ehcache JMS replication framework could be customized/extended to permit the modifications that were needed.

Out of this came the ehcache-jms-wan-replicator project on GitHub. There is a diagram below showing the concept in general, however I suggest you read the README.md on GitHub instead because as this research evolves I’ll be updating that project. So far it seems to work pretty well and plays fine running side-by-side with any existing Ehcache (RMI/JGroups) replication you already have configured. This has been integrated in a Liferay 6.2 Portal cluster across two data-centers in a test setup and so far is working as expected.

Hopefully this will give others some ideas or be useful to someone else as well.

Ehcache replicated caching with JMS, AWS, SQS, SNS & Nevado

Read part 2 of this research here

Recently I’ve been researching ways to GSLB a very large app that relies on Ehcache for numerous things internally, such as cached page output, cached lists etc. Anyone who has experience getting large scale applications to function active-N in a GSLB’d setup knows the challenges such topologies present. The typical challenge you will face is how to bridge events that occur in locally (DC) clustered applications, for example in DC-A (data center), with another logical instance footprint of the same application living in DC-B. This extends all the way from the data-source, all the way up the stack.

So for example, let’s say user A is accessing the application and hitting instances of it residing in DC-A. This user updates some inventory data that is cached locally in the cluster in DC-A; this cached inventory also resides in the cluster running in DC-B (which is being accessed by different users in some other region). When user A updates this inventory data, the local application instance writes it to the data-source, and then does some sort of cache operation, such as a cache key remove, or put (replace). Forgetting the entirely separate issue of how the data-source write itself becomes visible across both DCs, the point is that the cache update in DC-A is visible only to participating instances in DC-A. DC-B’s cache knows nothing of this; only its data-source is aware of this new information. So we need a way to make DC-B aware of this cache event. There are a few ways this can happen; for example we could just configure the caches to rely solely on LRU/TTL driven expiry, or actually respond to events in a near-real-time fashion.

Now before we go on I’ll state up-front that although what I am about to describe would work (to an extent), ultimately I likely will NOT go with this setup due to the inefficiencies involved, particularly the amount of data being moved across WANs if you just use the Ehcache JMS replicated caching feature alone (i.e. with the JMS replicated Ehcache feature, cached data is actually moved around, rather than just operation events).

Continuing with that train of thought, after the latter caveat: one thing I started looking at was the Ehcache JMS Replicated Caching feature. Basically this feature boils down to permitting you to configure any cache to utilize JMS (Java Message Service) for publishing cache events. So when a PUT/REMOVE happens, Ehcache wires up a cache listener that responds and subsequently relays these events (including the cached data on puts) to a JMS topic. Then any other Ehcache node configured w/ this same setup can subscribe to those topics and receive those events. Couple this with a globally accessible messaging system, and you now have a backbone for distributing these events across multiple disparate data-centers… but who in the hell wants to set up their own globally accessible, fault-tolerant messaging system implementation… not me.

Enter AWS’s SNS (Simple Notification Service, topics) & SQS (Simple Queue Service) services. I decided I’d try to get Ehcache’s JMS Replicated Caching feature to utilize AWS as the JMS provider… now enter Nevado JMS from the Skyscreamer Team (github link). Nevado is a cool little JMS implementation that fronts SNS/SQS, and it works pretty well!

Note the code is at the end of this post ….. and yes the code is very basic and NOT production ready/tested; it was just for a prototype/research and is a bit hacked together. Also note this code is reliant upon this PATCH to Nevado, which is pending discussion/approval

  1. The first step was creating an Ehcache CacheManagerPeerProviderFactory (NevadoJMSCacheManagerPeerProviderFactory), which returns a JMSCacheManagerPeerProvider to Ehcache that is configured to use Nevado on the backend
  2. The NevadoJMSCacheManagerPeerProviderFactory boots a little spring context that sets up the NevadoTopic etc
  3. Created a little test program (below) EhcacheNevadoJMSTest. I just ran several instances of this concurrently w/ breakpoints to validate that events in one JVM/ehcache instance were indeed being broadcast over JMS -> AWS -> back to other Ehcache instances on other JVM instances.
  4. The first thing I noticed was that while the events were indeed being sent over JMS to AWS and received by other Ehcache peers, the actual cached data (Element) embedded within the JMSEventMessage was NOT being included, resulting in NullPointerExceptions in the Ehcache peers who received the event.
  5. The latter was due to an Object serialization issue, and transient soft references as described in this Nevado Github issue #81
  6. Once I created a patch for Nevado to use the ObjectOutputStream, things worked perfectly.


  • Again this code was for research/prototyping
  • The viability of having the actual cached element being moved around to AWS, across WANs and back to other data-centers is likely not too optimal. It would work, but under high-volume you could spend a lot of $$ and bandwidth.
  • SQS/SNS has message size limitations; if your cached data exceeds them it would get truncated and lost, effectively making the solution useless.
  • Ideally, all one really cares about is “what happened”, meaning Ehcache KEY-A was PUT or REMOVED etc. Then let the receiving DC decide what to do (i.e. remove the cached KEY locally and let the next user driven request re-populate it from the primary source, the real data-source). This results in much smaller message sizes. The latter is what I’m now looking at; using the Ehcache listener framework w/ some custom calls to SNS/SQS would suffice for this kind of implementation.
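
To illustrate that last bullet, here is a minimal sketch of the “what happened” approach, using plain Java stand-ins rather than the real Ehcache or SNS/SQS APIs. Every class and method name here is hypothetical; the point is only that the event carries a cache name, a key and an operation, never the cached value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KeyOnlyInvalidation {

    public enum Op { PUT, REMOVE }

    /** The only thing that crosses the WAN: which key changed, and how. */
    public static final class CacheEvent {
        public final String cacheName;
        public final String key;
        public final Op op;

        public CacheEvent(String cacheName, String key, Op op) {
            this.cacheName = cacheName;
            this.key = key;
            this.op = op;
        }
    }

    /** Stand-in for the receiving DC's local cache (not the Ehcache API). */
    public static final class LocalCache {
        private final Map<String, String> entries = new ConcurrentHashMap<>();

        public void put(String key, String value) { entries.put(key, value); }

        public String get(String key) { return entries.get(key); }

        /**
         * A remote PUT and a remote REMOVE are handled the same way: drop the
         * local copy so the next request reloads it from the real data-source.
         */
        public void onRemoteEvent(CacheEvent event) {
            entries.remove(event.key);
        }
    }

    public static void main(String[] args) {
        LocalCache dcB = new LocalCache();
        dcB.put("key1", "stale value");
        // DC-A reports that key1 was updated; DC-B just evicts its copy.
        dcB.onRemoteEvent(new CacheEvent("testCache", "key1", Op.PUT));
        System.out.println(dcB.get("key1")); // prints "null"
    }
}
```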




Github patch for Nevado @ https://github.com/skyscreamer/nevado/issues/81

NevadoJMSCacheManagerPeerProviderFactory, Ehcache uses this as its cacheManagerPeerProviderFactory

package com.bitsofinfo.ehcache.jms;

import java.util.Properties;

import javax.jms.ConnectionFactory;
import javax.jms.Queue;
import javax.jms.QueueConnection;
import javax.jms.Topic;
import javax.jms.TopicConnection;

import org.skyscreamer.nevado.jms.NevadoConnectionFactory;
import org.skyscreamer.nevado.jms.destination.NevadoQueue;
import org.skyscreamer.nevado.jms.destination.NevadoTopic;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

import net.sf.ehcache.CacheManager;
import net.sf.ehcache.distribution.CacheManagerPeerProvider;
import net.sf.ehcache.distribution.CacheManagerPeerProviderFactory;
import net.sf.ehcache.distribution.jms.AcknowledgementMode;
import net.sf.ehcache.distribution.jms.JMSCacheManagerPeerProvider;

public class NevadoJMSCacheManagerPeerProviderFactory extends CacheManagerPeerProviderFactory {

    public CacheManagerPeerProvider createCachePeerProvider(CacheManager cacheManager, Properties props) {
        try {
            ApplicationContext context = new ClassPathXmlApplicationContext("/com/bitsofinfo/ehcache/jms/nevado.xml");
            NevadoConnectionFactory nevadoConnectionFactory = (NevadoConnectionFactory)context.getBean("connectionFactory");
            TopicConnection topicConnection = nevadoConnectionFactory.createTopicConnection();
            QueueConnection queueConnection = nevadoConnectionFactory.createQueueConnection();
            Topic nevadoTopic = (NevadoTopic)context.getBean("ehcacheJMSTopic");
            Queue nevadoQueue = (NevadoQueue)context.getBean("ehcacheJMSQueue");
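            // NOTE: the original call was cut off after "cacheManager,". The
            // arguments below reuse the variables declared above; the
            // acknowledgement mode and the final listenToTopic flag are
            // assumed values, not necessarily what the prototype used.
            return new JMSCacheManagerPeerProvider(cacheManager,
                    topicConnection, nevadoTopic,
                    queueConnection, nevadoQueue,
                    AcknowledgementMode.AUTO_ACKNOWLEDGE, true);
        } catch(Exception e) {
            return null;
        }
    }
}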



nevado.xml (NevadoJMSCacheManagerPeerProviderFactory boots this to init Nevado topic/queue @ AWS)

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
      <property name="locations">
    <bean id="sqsConnectorFactory" class="org.skyscreamer.nevado.jms.connector.amazonaws.AmazonAwsSQSConnectorFactory" />
    <bean id="connectionFactory" class="org.skyscreamer.nevado.jms.NevadoConnectionFactory">
      <property name="sqsConnectorFactory" ref="sqsConnectorFactory" />
      <property name="awsAccessKey" value="${aws.accessKey}" />
      <property name="awsSecretKey" value="${aws.secretKey}" />
    <bean id="ehcacheJMSTopic" class="org.skyscreamer.nevado.jms.destination.NevadoTopic">
          <constructor-arg value="ehcacheJMSTopic" />
    <bean id="ehcacheJMSQueue" class="org.skyscreamer.nevado.jms.destination.NevadoQueue">
          <constructor-arg value="ehcacheJMSQueue" />






   <diskStore path="user.home/ehcacheJMS"/>

       propertySeparator="," />
     <cache name="testCache"


EhcacheNevadoJMSTest – little test harness program, run multiple instances of this w/ breakpoints to see ehcache utilize JMS(nevado/sns) to broadcast cache events

package com.bitsofinfo.ehcache.jms;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class EhcacheNevadoJMSTest {

    public static void main(String[] args) throws Exception {
        ApplicationContext context = new ClassPathXmlApplicationContext("/com/bitsofinfo/ehcache/jms/bootstrap.xml");
        CacheManager cacheManager = (CacheManager)context.getBean("cacheManager");
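        Cache testCache = cacheManager.getCache("testCache");

        // Read both keys (set breakpoints here to inspect what other JVMs have broadcast)
        Element key1 = testCache.get("key1");
        Element key2 = testCache.get("key2");
        key1 = testCache.get("key1");

        // These puts are relayed over JMS (Nevado/SNS) to the other running instances
        testCache.put(new Element("key1", "value1"));
        testCache.put(new Element("key2", "value2"));
    }
}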




bootstrap.xml – used by the test harness

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"


<bean id="cacheManager" class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean">
<property name="configLocation" value="classpath:/com/bitsofinfo/ehcache/jms/ehcache.xml"/>











Review: Cloud Application Architectures

This is a review of the book “Cloud Application Architectures” by George Reese

At about 200 pages, this book packs a lot of solid recommendations on deploying and managing an application within the cloud. The book has an admitted AWS slant, however the author covers two other providers, GoGrid and Rackspace in the appendix. That said, the book does not treat the cloud computing topic with low-level command references or specific examples using AWS; you will not find those here (except a handy AWS command reference in the appendix), however this book covers the higher level application architectural issues within an AWS framework. The book does this nicely as the author covers many issues that face architects who need to deploy to the cloud, from both the technical and business consideration perspectives. This includes such things as application design issues, machine images, performance and disaster recovery, but also security, regulatory compliance and cost issues from the business side of things.

The author gives good coverage to the various issues you will need to keep in mind when it comes to using cloud services, specifically backup strategies, security strategies, database performance and capacity planning. However when reading this book, one cannot help but think… “How are these issues that much different than a non-cloud deployment? Why are they more important in the cloud than outside of it?” Well, the answer is that they are not; they apply to both worlds. When it comes to application design, database strategies, backups, security and capacity planning, all of the details and strategies laid out in this book are great advice for operating outside of the cloud as well. But what you will find in this book are some of those AWS nuances that the author has encountered which are very important to be aware of and will vary the ways you approach different problems when using such a service.

That said, I really recommend this book for any architect who wants to learn more about some of the issues you will face when deploying in the cloud, as well as simply a great book on general architectural and business issues that any application will face; whether it is deployed within or outside of a cloud service.

Recommended: Yes
Skill Level: Intermediate to advanced system architects, CTOs etc.

Review: Programming Amazon Web Services

Review of the book “Programming Amazon Web Services” by James Murty

So I bought this book out of curiosity and the desire to start poking around with EC2. So I sat down over a weekend and plowed through most of this thing with my laptop and brand new AWS account. This is a good book, however I don’t recommend this book for newcomers to the world of programming, network and infrastructure management as this book requires a solid baseline of knowledge in all areas in order to get through the book. In short, this book is for an experienced technical audience.

That said, this book covers (with detailed examples) about everything you will want to do with AWS. It’s all here. S3: Simple Storage Service, EC2, SQS: Simple Queue Service, FPS: Flexible Payment Service and SimpleDB. My only issue with the book was that all of the examples are coded in Ruby, which, being mainly a Java guy, required more fumbling around than it would have otherwise. The other concern is that this book is likely to become quickly outdated as AWS appears to be a constantly changing and evolving service.

Overall I enjoyed the book; it is filled with details and enabled me to get my first few EC2 instances up and running in no time. Let me tell you; wow is it cool to be able to programmatically fire up an Ubuntu server with a few quick keystrokes!

Skill level: Advanced
Recommend: Definitely!