Execute Powershell commands via Node.js, REST, AngularJS

Building on my last post on stateful-process-command-executor this post will cover how you can leverage that node.js module to expose the capabilities of Powershell cmdlets over a REST api presented through an AngularJS interface.  Why would one want to do this you ask? Well I’ve covered this in my last post but I will briefly explain it here.

(Note, what is described below could just as easily be built for Bash processes as well as the underlying module supports it)

The use case came out of the need to automate certain calls to manage various objects within Microsoft o365’s environment. Unfortunately Microsoft’s GraphAPI, does not expose all of the functionality that is available via its suite of various Powershell cmdlets for o365 services. Secondly when you need to do these operations via Powershell, its requires a per-established remote PSSession to o365…. and establishing (and tearing down) a new remote PSSession whenever you need to invoke a cmdlet against a remote resource (remote server, or o365 endpoint) is expensive. Lastly, who wants to actually sit there and manually run these commands when they could be automated and invoked on demand via other means… such as via a web-service etc. Hence this is how stateful-process-command-proxy came to be… it provides the building block bridge between node.js and a pool of pre-established Powershell consoles. Once you have node.js talking to stateful-process-command-proxy, you can build whatever you want on top of that in node.js to mediate the calls.

Layer one

The first higher level NPM module that builds on stateful-process-command-proxy is powershell-command-executor

What this adds on top of stateful-process-command-proxy is probably best described by this diagram:


So the main thing to understand is that the module provides the PSCommandService class which takes a registry of pre-defined “named” commands and respective permissible arguments. The registry is nothing more than a object full of configuration and is easy to define. You can see an example here in the project which defines a bunch of named “commands” and their arguments usable for o365 to manipulate users, groups etc.  PSCommandService is intended to serve as a decoupling point between the caller and the StatefulProcessCommandProxy… in other words a place where you can restrict and limit the types of commands, and arguments (sanitized) that can ever reach the Powershell processes that are pooled within StatefulProcessCommandProxy.

It is PSCommandService‘s responsibility to lookup the named command you want to execute, sanitize the arguments and generate a literal Powershell command string that is then sent to the StatefulProcessCommandProxy to be execute. StatefulProcessCommandProxy, once the command is received is responsible for checking that the command passes its command whitelist and blacklist before executing it. The sample o365Utils.js config file provides a set of pre-canned (usable) examples of init/destroy commands, auto-invalidation commands and whitelist/blacklist configs that you can use when constructing the StatefulProcessCommandProxy that the PSCommandService will use internally.

Layer two

The next logic step is to expose some sort of access to invoking these pre-canned “commands” to callers. One way to do this is via exposing it via a web-service.

WARNING: doing such a thing, without much thought can expose you to serious security risks. You need to really think about how you will secure access to this layer, the types of commands you expose, your argument sanitiziation and filtering of permissible commands via whitelists and blacklists etc for injection protection. Another precaution you may want to take is running it only on Localhost for experimental purposes only. READ OWASPs article on command injection.

Ok with that obvious warning out of the way here is the next little example project which provides this kind of layer that builds on top of the latter: powershell-command-executor-ui

This project is a simple Node.js ExpressJS app that provides a simple set of REST services that allows the caller to:

  • get all available named commands in the PSCommandService registry
  • get an individual command configuration from the registry
  • generate a command from a set of arguments
  • execute the command via a set of arguments and get the result
  • obtain the “status” of the underlying StatefulProcessCommandProxy and its history of commands

Given the above set of services, one can easily build a user-interface which dynamically lets the user invoke any command in the registry and see the results… and this is exactly what this project does via an AngularJS interface (albeit a bit crude…). See diagrams below.

Hopefully this will be useful to others out there, enjoy.




Encrypting and storing powershell credentials

Please see: https://github.com/bitsofinfo/powershell-credential-encryption-tools

Recently I had the need to store some credentials for a powershell script (i.e. credentials that I ultimately needed in a PSCredential object). The other requirement is that these credentials be portable and “user” independent, meaning that they could not be encrypted using the DPAPI (windows data protection api) as that binds the “secret” used for the encryption to the currently logged in user (which reduces your portability and usage of these encrypted credentials). The way to avoid this is to specify the secret key parameters in the ConvertTo-SecureString and ConvertFrom-SecureString commands which will force it to use AES (strength determined by your key size)

I ended up coding a few powershell scripts that assist with the creation of a JSON AES-256 encrypted credentials file + secret key, as well as functions you can include in other powershell scripts to load these credentials into usable formats such as PSCredentials, SecureStrings etc.

Please see: https://github.com/bitsofinfo/powershell-credential-encryption-tools

NOTE! The most important thing about using the output from this tool, is properly locking down (i.e. file permissions) the secret key!

The format of the resulting file looks something like this:

{ "username" : "AESEncryptedValue", "password": "AESEncryptedValue" }

Executing stateful shell commands with Node.js – powershell, bash etc

Hoping this will be useful for others out there, I’ve posted some code that could to serve as a lower level component/building block in a node.js application who has a need to mediate interaction with command line programs on the back-end. (i.e. bash shells, powershell etc.)

The project is on github @ stateful-process-command-proxy and also available as an NPM module

This is node.js module for executing os commands against a pool of stateful child processes such as bash or powershell via stdout and stderr streams. It is important to note, that despite the use-case described below for this project’s origination, this node module can be used for proxying long-lived bash process (or any shell really) in addition to powershell etc. It works and has been tested on both *nix, osx and windows hosts running the latest version of node.

This project originated out of the need to execute various Powershell commands (at fairly high volume and frequency) against services within Office365/Azure bridged via a custom node.js implemented REST API; this was due to the lack of certain features in the REST GraphAPI for Azure/o365, that are available only in Powershell (and can maintain persistent connections over remote sessions)

If you have done any work with Powershell and o365, then you know that there is considerable overhead in both establishing a remote session and importing and downloading various needed cmdlets. This is an expensive operation and there is a lot of value in being able to keep this remote session open for longer periods of time rather than repeating this entire process for every single command that needs to be executed and then tearing everything down.

Simply doing an child_process.exec per command to launch an external process, run the command, and then killing the process is not really an option under such scenarios, as it is expensive and very singular in nature; no state can be maintained if need be. We also tried using edge.js with powershell and this simply would not work with o365 exchange commands and heavy session cmdlet imports (the entire node.js process would crash). Using this module gives you full un-fettered access to the externally connected child_process, with no restrictions other than what uid/gid (permissions) the spawned process is running under (which you really have to consider from security standpoint!)

The diagram below should conceptually give you an idea of what this module does: process pooling, custom init/destroy commands, process auto-invalidation configuration and command history retention etc. See here for full details: https://github.com/bitsofinfo/stateful-process-command-proxy

Obviously this module can expose you to some insecure situations depending on how you use it… you are providing a gateway to an external process via Node on your host os! (likely a shell in most use-cases). Here are some tips; ultimately its your responsibility to secure your system.

  • Ensure that the node process is running as a user with very limited rights
  • Make use of the uid/gid configuration appropriately to further limit the processes
  • Never expose calls to this module directly, instead you should write a wrapper layer around StatefulProcessCommandProxy that protects, analyzes and sanitizes external input that can materialize in a command statement.
  • All commands you pass to execute should be sanitized to protect from injection attacks
  • Make use of the whitelist and blacklist command features of this module
  • WRAP this service via additional code that sanitizes all arguments to protect from command injection

Hopefully this will help others out there who have a similar need: https://github.com/bitsofinfo/stateful-process-command-proxy

Configuring PowerShell for Azure AD and o365 Exchange management

Ahhh, love it! So you need to configure a Windows box to be able to utilize DOS, sorry PowerShell, to remotely manage your Azure AD / o365 / Exchange online services via “cmdlets”. You do some searching online and come across a ton of seemingly loosely connected Technet articles, forum questions etc.

Well I hope to summarize it up for you in this single blog post and I’ll try to keep it short without a lot of “why this needs to be done” explanations. You can read up on that on your own w/ the reference links below.

#1: The first thing we need to do is setup a separate user account that we will use when connecting via PowerShell to the remote services we want to manage with it:

  1. Using an account with administrative privileges, login to your Azure account/tenant at https://manage.windowsazure.com
  2. Once logged in click on “Active Directory” and select the instance you want to add the new user account too
  3. Click on “Add user”, fill out the details. Be sure to select “Global Administrator” as the role (or a lesser one, if need be depending on what you will be doing with PowerShell)
  4. Click create and it will generate a temporary password and email it to that user + the user listed for the secondary email that you filled out
  5. Logout of the Azure management portal
  6. Login again at https://manage.windowsazure.com, however this time login as the new user you just created with the temporary password. Once logged in, reset the password to a better one, click next.
  7. You should now be logged in as the new user you just created and on the main Azure management dashboard screen
  8. Find the link for managing “Exchange” and click on it
  9. You will now be redirected to the o365 Exchange admin center
  10. Click on “Permissions”, you will now see a bunch of groups/roles, the one we care about is Organization Management.
  11. Highlight the “Organization Management” role/group and ensure that the user you are logged in as (the new user you just created) is a member of this group directly or indirectly. You need to be a member of this group in order to get the “Remote Shell” permission that lets you download the Exchange cmdlets and manage exchange remotely via PowerShell. (See here for info on this group and the Remote Shell permission)

#2: Now that our special admin user is created with all the needed permissions, we can now get our PowerShell environment ready:

  1. Get on the Windows box that you intend to run the PowerShell commands from
  2. Download and install the “Microsoft Online Services Sign-In Assistant for IT Professionals” (its ok even if you are not a “professional”)
  3. Its 2014… you need to reboot after the last step…
  4. Download and install the “Azure AD Module for Windows PowerShell 64 bit”

#3: Ok, lets verify basic Azure AD PowerShell cmdlet capabilities

  1. Now on your Desktop RIGHT click on “Windows Azure Active Directory Module for Windows PowerShell” and “Run as Administrator”
  2. In PowerShell run this command “Set-ExecutionPolicy Unrestricted”
  3. In PowerShell run this command “Connect-MsolService” a nice dialog will prompt you for your credentials (use the creds that you setup above)
  4. In PowerShell run this command “Get-Msoluser”, get data back?? Great you are good to go for basic connectivity

#4: Finally…. lets verify o365 Exchange PowerShell cmdlet capabilities

  1. In the same PowerShell as you started above…
  2. Type: “$UserCredential = Get-Credential”… again enter your user credentials
  3. Type:
    $Session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri https://outlook.office365.com/powershell-liveid/ -Credential $UserCredential -Authentication Basic -AllowRedirection
  4. Type: “Import-PSSession $Session”
  5. At this point you should see some activity at the top of your PowerShell window as 300+ Exchange online cmdlets are downloaded to your system for use
  6. Quickly verify the Exchange Online Remote Shell permission with: “Get-User YOUR_UPN | Format-List RemotePowerShellEnabled”
  7. You should get back “RemotePowerShellEnabled: true”

DONE, proceed to the next quagmire…



Managing Azure AD Using PowerShell:

o365 Exchange online: Remote Shell Permission and Organization Management

Connect to Exchange Online using Remote PowerShell:

Series: Using remote PowerShell to manage o365

Copying lots of files into S3 (and within S3) using s3-bucket-loader

Recently a project I’ve been working on had the following requirements for a file-set containing roughly a million files varying in individual size from one byte to over a gigabyte; and the file-set size in total being sized between 500gb and one terabyte

  1. Store this file-set on Amazon S3
  2. Make this file-set accessible to applications via the filesystem; i.e. access should look no different then any other directory structure locally on the Linux filesystem
  3. Changes on nodeA in regionA’s data-center should be available/reflected on nodeN in regionN’s data-center
  4. The available window to import this large file-set into S3 would be under 36 hours (due to the upgrade window for the calling application)
  5. The S3 bucket will need to be backed up at a minimum every 24 hours (to another bucket in S3)
  6. The application that will use all of the above generally treats the files as immutable and they are only progressively added and not modified.

If you are having to deal w/ a similar problem perhaps this post will help you out. Let go through each item.

Make this file-set accessible to applications via the filesystem; i.e. access should look no different then any other directory structure locally on the Linux filesystem. Changes on node-A in region-A’s data-center should be available/reflected on node-N in region-N’s data-center.

So here you are going to need an abstraction that can present the S3 bucket as a local directory structure; conceptually similar to an NFS mount. Any changes made to the directory structure should be reflected on all other nodes that mount the same set of files in S3. Now there are several different kinds of S3 file-system abstractions and they generally fall into one of three categories (block based, 1 to 1, and native), the type has big implications for if the filesystem can be distributed or not. This webpage (albeit outdated) gives a good overview that explains the different types.  After researching a few of these we settled on attempting to use YAS3FS (yet another, S3 filesystem). YAS3FS, written in Python, presents an S3 bucket via a local FUSE mount; what YAS3fs adds above other S3 filesystems is that it can be “aware” of events that occur on other YA3FS nodes who mount the same bucket, and can be notified of changes via SNS/SQS messages. YAS3FS keeps a local cache on disk, so that it gives the benefits (up to a point) of local access and can act like a CDN for the files on S3. Note that FUSE based filesystems are slow and limited to a block size (IF the caller will utilize it) of 131072. YAS3FS itself works pretty good, however we are *still* in evaluation process as we work through many issues that are creeping up in our beta-environment, the big ones being unicode support and several concurrency issues that keep coming up. Hopefully these will be solvable in the existing code’s architecture…


The available window to import this large file-set into S3 would be under 36 hours

Ok no problem, lets just use s3cmd. Well… tried that and it failed miserably. After several crashes and failed attempts we gave up. S3cmd is single-threaded and extremely slow to do anything against a large file-set, much less load it completely into S3. I also looked at other tools, (like s4cmd which is multi-threaded), but again, even these other “multi-threaded” tools eventually bogged down and/or became non-responsive against this large file-set.

Next we tried mounting the S3 bucket via YAS3fs and executing rsync’s from the source files to the target S3 mount…. again this “worked” without any crashing, but was single threaded and took forever. We also tried running several rsyncs in parallel, but managing this; and verifying the result, that all files were actually in S3 correctly w/ the correct meta-data, was a challenge. The particular challenge being that YAS3FS returns to rsync/cp immediately after the file is written to the local YAS3FS cache, and then proceeds to push to S3 asynchronously in the background (which makes it more difficult to check for failures).

Give the above issues, it was time to get crazy with this, so I came up with s3-bucket-loader. You can read all about how it works here, but the short of it is that s3-bucket-loader uses massive parallelism via orchestrating many ec2 worker nodes to load (and validate!) millions of files into an S3 bucket (via an s3 filesystem abstraction) much quicker than other tools. Rather than sitting around for days waiting for the copy process to complete with other tools, s3-bucket-loader can do it in a matter of hours (and validate the results). Please check it out for more details, as the github project explains it in more details.

The S3 bucket will need to be backed up at a minimum every 24 hours (to another bucket in S3)

Again, this presents another challenge; at least with copying from bucket to bucket you don’t actually have to move the files around yourself (bytes), and can rely on s3’s key-copy functionality. So again here we looked at s3cmd and s4cmd to do the job, and again they were slow, crashed, or bogged down due to the large file-set. I don’t know how these tools are managing their internal work queue, but it seems to be so large they just crash or slow down to the point where they become in-efficient. At this point you have two options for very fast bucket copying

  1. s3-bucket-loader: I ended up adding key-copy support to the program and it distributes the key-copy operations across ec2 worker nodes. It copies the entire fileset in under an hour, and under 20 minutes with more ec2 nodes.
  2. s3s3mirror: After coding #1 above, I came across s3s3mirror. This program is a multi-threaded, well coded power-house of a program that just “worked” the first time I used it. After contributing SSL, aws-encryption and storage-class support for it, doing a full bucket copy of over 600gb and ~800k s3 objects took only 45 minutes! (running w/ 100 threads). It has good status logging/output and I highly recommend it

Overall for the “copying” bucket to bucket requirement, I really like s33mirror, nice tool.



Is the Jasypt project dead?

I’m posting this because I don’t know where else to ask…. is the Jasypt project dead? I ask this because I have no idea where to ask questions or get community support for this Java JCE encryption library. The mailing lists seem dormant, the “forums” do not load and the issue tracker seems pretty idle….

Is anyone using this library anymore, does a support community still exist?

UPDATE: (from the project maintainer) https://sourceforge.net/p/jasypt/feature-requests/25/

Jasypt is DEFINITELY alive! Last version was released on February 25th, 2014.

Both the mailing lists and the forums were removed because of very low traffic, in favour of the SourceForge tracking system you are now using.

As for users… figures indicate almost 60,000 downloads in July 2014.

Generating Java classes for the Azure AD Graph API

NOTE: I’ve since abandoned this avenue to generate pojos for the GraphAPI service. The Restlet Generator simply has too many issues in the resulting output (i.e. not handling package names properly, generics issues, not dealing with Edm.[types] etc). However this still may be of use to someone who wants to explore it further

Recently had to write some code to talk to the Azure AD Graph API. This is a REST based API that exchanges data via typical JSON payloads. For those having to a Java client to talk to this, a good starting point is taking a look at this sample API application to get your feet wet. However to those familiar in Java, this code is less that desirable;  however it has no dependencies which is nice.

When coding something against a REST service its often nice to have a set of classes that you can marshal to/from the JSON payloads to you are interacting with.  Behind the scenes it appears that this Azure Graph API is an OData app, which does present “$metadata” about itself… cool! So now we can generate some classes…..




So what can we use to generate some Java classes against this? Lets use the Restlet OData Extension. This Restlet extension can generate code off OData schema documents.  You will want to follow these instructions as well for the code generation.

IMPORTANT: You will also need a fork/version of Restlet that incorporates this pull request fix for a NullPointer that you will encounter with the code generation. (The bug exists in Restlet 2.2.1)

The command I ran to generate the Java classes for all the Graph API entities was as follows (run this from WITHIN the lib/ directory in the extracted Restlet zip/tarball you downloaded). In the command below, do NOT specify “$metadata” in the Generator URI as the tool appends that automatically.

java -cp org.restlet.jar:org.restlet.ext.xml.jar:org.restlet.ext.atom.jar:org.restlet.ext.freemarker.jar:org.restlet.ext.odata.jar:org.freemarker_2.3/org.freemarker.jar org.restlet.ext.odata.Generator https://graph.windows.net/YOUR_TENANT_DOMAIN/ ~/path/to/your/target/output/dir


  •  If you run the command and get the error “Content is not allowed in prolog.” (which I did). There might be some extra characters that are being prepended to the starting “<edmx:Edmx”  in the “$metadata” schema document that this endpoint is returning. If this is the case do the following:
    • Download the source XML $metadata document to your hard-drive
    • Open it up in a editor and REMOVE the “<?xml version=”1.0″ encoding=”utf-8″?>” that precedes the first “<edmx…” element
    • Next, just to ensure there are no more hidden chars at the start of the document, open up a hex editor on the document to get rid of any other hidden chars that precede the first “<edmx…”  element.
    • Save the changes locally, and fire-up a local webserver (or a webserver that lives anywhere) and set it up so that http://yourwebserver/whatever/$metadata will serve up that XML file.
    • Then proceed to alter the Restlet Generator command above to reference this modified URI as appropriate. Remember that you do NOT specify “$metadata” in the Generator URI as the tool appends that automatically.

Logstash: Failed to flush outgoing items UndefinedConversionError workaround

If you have ever seen an error similar to this in Logstash it can be frustrating and can take your whole pipeline down (blocks). It appears that there are some outstanding tickets on this, one of which is here. This error can occur if you have an upstream input where the charset is defined as US-ASCII (such as from an ASCII file input), where that file contains some extended chars, and then being sent to an output where it needs to be converted (in my case it was elasticsearch_http). This was w/ Logstash 1.4.1

:message=>"Failed to flush outgoing items", :outgoing_count=>1,
:exception=>#<Encoding::UndefinedConversionError: ""\xC2"" from ASCII-8BIT to UTF-8>

For anyone else out there, here is a simple workaround fix I put in my filters which pre-processes messages from “known” upstream ASCII inputs. This fixed it for me: (using the Ruby code filter)

ruby {
code => "begin; if !event['message'].nil?; event['message'] = event['message'].force_encoding('ASCII-8BIT').encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?'); end; rescue; end;"

To create a little log file to reproduce this try:

perl -e 'print "\xc2\xa0\n"' > test.log

Encrypting Logstash data

Note, the patch described below is now merged into the official logstash-filter-cipher plugin as of January 2016, version 2.0.3

UPDATE: Note the pending patch to fix various issues and add random IV support for encrypting logstash event messages is located here here: https://github.com/logstash-plugins/logstash-filter-cipher/pull/3

Logstash can consume a myriad of data from many different sources and likewise send this to all sorts of different outputs across your network. However the question that sometimes comes up is:  “how sensitive is this data?”

The answer to that question will vary as wildly as your inputs, however if it is sensitive in nature, you need to secure it at rest and in transit. There are many ways to do this and they can vary on the combination of your inputs/outputs and what protocols are supported and whether or not they use TLS or some other form of crypto. Your concerns might also be mitigated if you are transferring this data over a secured/trusted network, VPN or destination specific ssh tunnel.

However if none of the above apply, or you simply don’t have control over the parts of the infrastructure in between your Logstash agent and your final destination, there is one cool little filter plugin available to you in the Logstash Contrib Plugins project called the cipher filter.

If the only thing you have control over is the ability to touch your config file on both ends you can use this without having to fiddle with other infrastructure to provide a modest level of security for your data.  This is assuming you have a topology where Logstash agents, send data to some queueing system, and then they are consumed by another Logstash system downstream, which subsequently filters the data and sends it to its final resting place. Using the cipher filter comes into play where your “event” field(s) are encrypted at your source agent, flows through the infrastructure, and decrypted by your Logstash processor on the other end.

The cipher filter has been around for some time, but recently I had a use case for it similar to the above so I attempted using it and ran into some problems. In lieu of any answers, I ended up fixing the issues and adding some new features in this pull request. So if you are reading this, and that pull request is not yet approved, you might need to use my fork instead.

IMPORTANT: The security of your “key” is only as secure as the readability of your logstash config file that contains it! If this config file is world readable, your encryption is worthless. When using this method, be sure to analyze how and who could potentially access this configuration file.

Here is a simple Logstash conf example of using the cipher filter for encryption

input {

    file {
        # to test, just starting writing
        # line after line to the log
        # this will become the 'message'
        # in your logstash event
        path = "/path/to/test.log"
        sincedb_path = "./.sincedb"
        start_position = "end"

filter {

    cipher {
        algorithm = "aes-256-cbc"
        cipher_padding = 1

        # Use a static "iv"
        #iv = "1234567890123456"

        # OR use a random IV per encryption
        iv_random_length = 16

        key = "12345678901234567890123456789012"
        key_size = 32

        mode = "encrypt"
        source = "message"
        target = "message_crypted"
        base64 = true

        # the maximum number of times the
        # internal cipher object instance
        # should be re-used before creating
        # a new one, default to 1
        # On high volume systems bump this up
        max_cipher_reuse = 1

    cipher {
        algorithm = "aes-256-cbc"
        cipher_padding = 1

        # Use a static "iv"
        #iv = "1234567890123456"

        # OR use a random IV per encryption
        iv_random_length = 16

        key = "12345678901234567890123456789012"
        key_size = 32

        mode = "decrypt"
        source = "message_crypted"
        target = "message_decrypted"
        base64 = true

        # the maximum number of times the
        # internal cipher object instance
        # should be re-used before creating
        # a new one, default to 1
        # On high volume systems bump this up
        max_cipher_reuse = 1


output {

    # Once output to the console you should see three fields in each event
    # The original "message"
    # The "message_crypted"
    # The "message_decrypted" verifying that the decryption worked.
    # In a real setup, you would only send along an encrypted version to an OUTPUT
    stdout {
        codec = json



Liferay clustering internals

Ever wondered how Liferay’s internal clustering works? I had to dig into it in context of my other article on globally load balancing Liferay across separate data-centers. This blog post merely serves as a place for my research notes and might be helpful for someone else who is trying to follow what is going on under the covers in Liferay in lieu of any real documentation, design document or code comments (which Liferay seems to have none of)

Current set of mostly unanswered questions at the Liferay community forums: https://www.liferay.com/community/forums/-/message_boards/recent-posts?_19_groupThreadsUserId=17865971

Liferay Cluster events, message types, join/remove

  • Anytime ClusterExecutorImpl.memberJoined/memberRemoved() is invoked all implementations of ClusterEventListener have processClusterEvent() invoked
    • MemorySchedulerClusterEventListener: calls ClusterSchedulerEngine.getMasterAddressString() which can trigger slaveToMaster or masterToSlave @see EXECUTE notes below
    • LuceneHelperImpl.LoadIndexClusterEventListener: @see NOTIFY/JOIN section below
    • LiveUsersClusterEventListenerImpl: @see DEPART/JOIN section below
    • DebuggingClusterEventListenerImpl: logs the event to the log
    • ClusterMasterTokenClusterEventListener: on any event, invokes getMasterAddressString() to determine who the master is, or become so itself if nobody is (controlled via entry in Lock_ table) After this runs it invokes notifyMasterTokenTransitionListeners() which calls ClusterMasterTokenTransitionListener.masterTokenAcquired() or masterTokenReleased()
      • Implementors of ClusterMasterTokenTransitionListener:
        • BackgroundTaskClusterMasterTokenTransitionListener: calls BackgroundTaskLocalServiceUtil.cleanUpBackgroundTasks() which in turn ends up invoking BackgroundTaskLocalServiceImpl -> cleanUpBackgroundTasks() which is annotated with @Clusterable(onMaster=true) gets all tasks with STATUS_IN_PROGRESS and changes their status to FAILED. Then gets the BackgroundTaskExecutor LOCK and UNLOCKs it if this server is the current owner/master for that lock…
    • JGroups
      • Member “removed”: triggers ClusterRequestReceiver.viewAccepted() that in turn invokes ClusterExecutorImpl.memberRemoved() which fires ClusterEventType.DEPART (see below)
    • ClusterMessageType.NOTIFY/UPDATE: These message types are requested from other nodes when a new node comes up. When received they are in-turn translated and re-broadcast locally within a individual node as a ClusterEvent of type JOIN via ClusterExecutorImpl.memberJoined() method. Note this latter class is different in the EE jar, the implementation of this method additionally invokes the EE LicenseManager.checkClusterLicense() method
      • “Live Users” message destination:
        • triggers actions done by LiveUsersMessageListener
        • invokes cluster RPC invocation of LiveUsers.getLocalClusterUsers()
        • response is processed by LiveUsers.addClusterNode
        • proceeds to “sign in” each user that is signed in on the other joined node, appears to do session tracking etc
      • LoadIndexClusterEventListener:
        • Loads a full stream of the lucene index per company id from any node that responds who has a local lucene index last generation long value that is > this node’s index’s value. This is only triggered on a JOIN event when the “control” JChannel’s total number of peer nodes minus one is <= 1, so that this only runs on a node boot and not at other times.
      • EE LicenseManager.checkClusterLicense()

        • Please see the notes below on licensing and verification in regards to clustering on bootup
    • ClusterEventType.DEPART: This event is broadcast locally on a node when it is shutdown
      • LiveUsersClusterEventListenerImpl: listens to local DEPART event and sends message over MessageBus to other cluster members letting them know this node is gone
    • ClusterMessageType.EXECUTE: These cluster messages contain serialized ClusterRequest objects that contain RPC like information on a class, parameters and a method to invoke on other node(s). Generated by ClusterRequest.createMulticastRequest and createUnicastRequest messages. The result of these are typically passed off to ClusterExecutorUtil.execute() methods (which end up calling ClusterExecutorImpl.execute())
    • ClusterRequest.createUnicastRequest() invokers:
      • LuceneHelperImpl: _getBootupClusterNodeObjectValuePair() called by getLoadIndexesInputStreamFromCluster when it needs to connect to another node only in order to load indexes on bootup from another node who’s index is up to date
      • ClusterSchedulerEngine: callMaster() called by getScheduledJob(s)(), initialize() and masterToSlave() only if MEMORY_CLUSTERED is enabled for the job engine. Initialize() calls initMemoryClusteredJobs() if the local node is a slave to get a list of of all MEMORY_CLUSTERED jobs. MEMORY_CLUSTERED jobs mean they only run on whoever the master is which is designated by which node owns the lock named “com.liferay.portal.scheduler.SchedulerEngine” in the Lock_ table in the database. masterToSlave() demotes the current node to a slave. slaveToMaster() gives current node opportunity to acquire lock above to become the scheduler master node
      • ClusterMasterExecutorImpl: used for executing EXECUTE requests against the master only. Leverages method getMasterAddressString() to determine who the master is (via which node owns the Lock_ table entry named “ClusterMasterExecutorImpl”.
    • ClusterRequest.createMulticastRequest() invokers:
      • JarUtil: Instructs other peer nodes to JarUtil.installJar(). Invoked by DataSourceFactoryUtil.initDataSource. Also by DataSourceSwapper.swap*DataSource(). SetupWizardUtil.updateSetup()
      • ClusterLoadingSyncJob: Invoked by EditServerAction.reindex(). This sends a multicast ClusterRequest of EXECUTE for LuceneClusterUtil.loadIndexesFromCluster(nodeThatWasJustReindexedAddress) to all nodes. So all nodes, even those on the other side of the RELAY2 bridge will get this, and then will attempt to connect over the bridge back to stream the “latest” index from the node that was just re-indexed. Note however that nodes across the bridge will output this stack trace in their logs when attempting to connect back to get a token on the re-indexed node (this routine is in _getBootupClusterNodeObjectValuePair() in LuceneHelperImpl and tries to do a unicast ClusterRequest to invoke TransientTokenUtil.createToken(). My guess on this error has something to do with the “source address” which is a jgroups local UUID, being sent to the remote bridge, and on receipt the local list in a different DC does not recognize that multicast address. This “source address” that is sent ClusterExecutorImpl.getLocalClusterNodeAddress(). Also see http://www.jgroups.org/manual/html/user-advanced.html#d0e3353. ClusterRequestReceiver *does* get invoked w/ a property scoped SiteUUID in the format of hostname:sitename, however ClusterRequestReceiver.processClusterRequest() defers to ClusterRequest.getMethodHandler().invoke() which has no knowledge of this originating SiteUUID and only the internal “local” jgroups UUID address that is embedded as an “argument” to loadIndexesFromCluster() in the ClusterRequest, hence this can’t effectively/properly call TransientTokenUtil.createToken() on the sender (and the error below). When it attempts the call via ClusterExecutorImpl (jgroups control channel .send()) using the “address” argument it received in the original inbound payload. We see the message is dropped cause the target address is invalid/unknown..
        • 21:10:30,485 WARN [TransferQueueBundler,LIFERAY-CONTROL-CHANNEL,MyHost-34776][UDP:1380] MyHost-34776: no physical address for 27270628-816b-ddba-872f-1f2d27327f2f, dropping message
        • 18:37:40,169 INFO [Incoming-1,shared=liferay-control][LuceneClusterUtil:47] Start loading Lucene index files from cluster node 9c4870e7-17c2-18fb-01c4-5817bb5d1290
          18:37:46,538 WARN [TransferQueueBundler,shared=liferay-control][TCP:1380] null: no physical address for 9c4870e7-17c2-18fb-01c4-5817bb5d1290, dropping message
        • 18:01:56,302 ERROR [Incoming-1,LIFERAY-CONTROL-CHANNEL,mm2][ClusterRequestReceiver:243] Unable to invoke method {arguments=[[J@431bd9ee, 5e794a8e-0b7f-c10d-be17-47ce43b66517], methodKey=com.liferay.portal.search.lucene.cluster.LuceneClusterUtil.loadIndexesFromCluster([J,com.liferay.portal.kernel.cluster.Address)}
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at com.liferay.portal.kernel.util.MethodHandler.invoke(MethodHandler.java:61)
          at com.liferay.portal.cluster.ClusterRequestReceiver.processClusterRequest(ClusterRequestReceiver.java:238)
          at com.liferay.portal.cluster.ClusterRequestReceiver.receive(ClusterRequestReceiver.java:88)
          at org.jgroups.JChannel.invokeCallback(JChannel.java:749)
          at org.jgroups.JChannel.up(JChannel.java:710)
          at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1025)
          at org.jgroups.protocols.relay.RELAY2.deliver(RELAY2.java:495)
          at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:330)
          at org.jgroups.protocols.FORWARD_TO_COORD.up(FORWARD_TO_COORD.java:153)
          at org.jgroups.protocols.RSVP.up(RSVP.java:188)
          at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
          at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
          at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
          at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)
          at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:245)
          at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:453)
          at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:763)
          at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:574)
          at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:147)
          at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:187)
          at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
          at org.jgroups.protocols.MERGE3.up(MERGE3.java:290)
          at org.jgroups.protocols.Discovery.up(Discovery.java:359)
          at org.jgroups.protocols.TP.passMessageUp(TP.java:1263)
          at org.jgroups.protocols.TP$4.run(TP.java:1181)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:744)
          Caused by: com.liferay.portal.kernel.exception.SystemException: java.lang.NullPointerException
          at com.liferay.portal.search.lucene.LuceneHelperImpl._getBootupClusterNodeObjectValuePair(LuceneHelperImpl.java:830)
          at com.liferay.portal.search.lucene.LuceneHelperImpl.getLoadIndexesInputStreamFromCluster(LuceneHelperImpl.java:451)
          at com.liferay.portal.search.lucene.LuceneHelperUtil.getLoadIndexesInputStreamFromCluster(LuceneHelperUtil.java:326)
          at com.liferay.portal.search.lucene.cluster.LuceneClusterUtil.loadIndexesFromCluster(LuceneClusterUtil.java:57)
          ... 32 more
          Caused by: java.lang.NullPointerException
          at com.liferay.portal.search.lucene.LuceneHelperImpl._getBootupClusterNodeObjectValuePair(LuceneHelperImpl.java:809)
          ... 35 more
      • PortalImpl: resetCDNHosts() does a ClusterRequest.EXECUTE for the same PortalUtil.resetCDNHosts() method on all peer nodes. To avoid recursive loop uses ThreadLocal to enter the EXECUTE block (relevant for remote nodes)
      • LuceneHelperImpl: _loadIndexFromCluster() (invoked via JOIN event above). Invoked via EXECUTE remote method calls against LuceneHelperUtil.getLastGeneration() against all peers to see who’s lucene indexes generation time is newer up earler than ME, picks that one and then initiate a direct http stream connection to it to fetch the entire index from it …. Who is a “peer” candidate is determined by invoking ClusterExecutorUtil.getClusterNodeAddresses() which is ultimately realized by calling the “control” channel’s getView() method
      • MessageSenderJob: notifyClusterMember() invoked by MessageSenderJob.doExecute() when a job execution context’s next execution time is null. The notifyClusterMember() method invokes SchedulerEngineHelperUtil.delete() via a ClusterRequest.EXECUTE
      • ClusterableAdvice: AOP advice class that intercepts any Liferay method annotatted with @Clusterable, AFTER returning, will then create a ClusterRequest.EXECUTE for the same method passed to all nodes in the cluster. Used by ClusterSchedulerEngine (delete, pause, resume, schedule, suppressError, unschedule, update) BackgroundTaskLocalServiceImpl, and PortletLocalServiceImpl. Has an ‘onMaster’ property that awkwardly controls whether or not the advice will actually be applied or not. If onMaster=false, then the advice fast returns like a no-op. But if onMaster=true then the advice runs (locally on the node if THIS node is the master, or via a ClusterRequest.EXECUTE). WHO is the master is dictated by which nodes owns the LOCK in the database named “com.liferay.portal.scheduler.SchedulerEngine”.
      • EhcacheStreamBootstrapCacheLoader: start() and doLoad() invokes EhcacheStreamBootstrapHelpUtil -> loadCachesFromCluster() broadcasts a ClusterRequest for EXECUTE of the method createServerSocketFromCluster() on all peer nodes, and then attempts to connect to a peer node to load all named caches from the peer node…
  • ClusterExecutorImpl notes:
    • This class has several methods relating to figuring out “who” is in the “cluster” and due to the implementation of these methods you will get inconsistent “views” of who are the local cluster members (if using RELAY2 in JGroups)
      • getClusterNodeAddresses() = consults the “control” JChannel’s receiver’s “view” of the cluster. This will only return local nodes subscribed to this JChannel. These are fetched real-time every time its invoked.
      • getClusterNodes() = consults a local map called “_liveInstances”. Live instances is populated when the “memberJoined()” method is invoked, in response to ClusterRequestReceiver.processClusterRequest() being invoked when a new node comes up. Note that because this is in response to ClusterRequests…. which WILL be transmitted over RELAY2 in JGroups, this will reflect MORE nodes than what getClusterNodeAddresses() returns…..
      • There is also another local member called “_clusterNodeAddresses” which also is populated w/ addresses of other nodes when “memberJoined()” is invoked…. I really don’t understand why this exists, AND getClusterNodeAddresses() exists, AND getClusterNodes() exists when they could potentially return inconsistent data.
  • Quartz/Jobs scheduling
    •  ClusterSchedulerEngine proxies all requests for scheudling jobs through SchedulerEngingProxyBean (BaseProxyBean) which just relays schedule() invocations through Messaging which ends up invoking QuartzSchedulerEngine.schedule()
    • ClusterSchedulerEngine on initialization calls getMasterAddressString() which gets the Lock from the _Lock table named “com.liferay.portal.scheduler.SchedulerEngine” via LockLocalServiceImpl.lock() which is a row based locking in the database this is based on “owner” which is based on the local IP?
    • _doMasterToSlave() – queries current master for list of all MEMORY_CLUSTERED jobs (does this over RMI or Jgroups method invoke of getScheduledJobs), gets this list of job names and unschedules them locally
    • slaveToMaster() – takes everything in _memoryClusteredJobs and schedules it locally. _memoryClusteredJobs seems to just be a global list of the MEMORY_CLUSTERED jobs that are kept track of on each node…
  • Licensing: If running Liferay EE the LicenseManager.checkClusterLicense() -> checkBinaryLicense() is invoked via ClusterExecutorImpl.memberJoined() (note that this is only invoked in the EE version as the class is different in the portal-impl.jar contained in the EE distribution). This particular invocation chain appears to respond to the JOIN event for the peer node by checking if the “total nodes” the node knows about is greater than the total number of nodes the license permits, if it is, the local node gets a System.exit() invoked on it (killing itself). For learning purposes if you are so interested to learn more about this obfuscated process, you can do so by investigating further on your own, its all in the EE portal-impl.jar. Here are some other notes on the licensing and how it relates to clustering and issues one might encounter.
    • LicenseManager. getClusterLicense() invokes ClusterExecutorUtil.getClusterNodes() and does validation to check if you have more nodes in the cluster than your installed license supports (amongst other things).
    • Another thing that happens when ClusterExecutorImpl.memberJoined() is invoked is that ClusterExecutorImpl.getClusterNodes() appears to be invoked (which @see above, in addition to local peer nodes in the same DC, will return nodes that exist across a JGroups RELAY2 bridge). It picks one of these peer nodes and remotely invokes LicenseManager.getLicenseProperties() against it, this is done via a ClusterExecutorUtil.execute(). The return is a Future with a timeout, upon reception the Future has a Map in it containing the license properties, these are then compared agains the local node’s license, if they match we are good to go, otherwise….. read below (a mixing licenses error will occurr) Why is this relvent? Well despite having legit licenses, when using RELAY2, you can come across two erronious errors that will prevent your cluster from booting, as described below.
      • The first error you might see is a timeout error, when it reaches out to a peer node to verify a license in response to a new member joining the cluster (likely a member over a RELAY2 bridge). It will look like the below. Typically just rebooting Liferay takes care of the issue, as explained in the followup bullets:
        • java.util.concurrent.TimeoutException
          at com.liferay.portal.kernel.cluster.FutureClusterResponses.get(FutureClusterResponses.java:88)
          at com.liferay.portal.license.a.a.f(Unknown Source)
          at com.liferay.portal.license.a.a.b(Unknown Source)
          at com.liferay.portal.license.a.f.d(Unknown Source)
          at com.liferay.portal.license.a.f.d(Unknown Source)
          at com.liferay.portal.license.LicenseManager.c(Unknown Source)
          at com.liferay.portal.license.LicenseManager.checkBinaryLicense(Unknown Source)
          at com.liferay.portal.license.LicenseManager.checkClusterLicense(Unknown Source)
          at com.liferay.portal.cluster.ClusterExecutorImpl.memberJoined(ClusterExecutorImpl.java:442)
          at com.liferay.portal.cluster.ClusterRequestReceiver.processClusterRequest(ClusterRequestReceiver.java:219)
          at com.liferay.portal.cluster.ClusterRequestReceiver.receive(ClusterRequestReceiver.java:88)
          at org.jgroups.JChannel.invokeCallback(JChannel.java:749)
          at org.jgroups.JChannel.up(JChannel.java:710)
          at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1025)
          at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:338)
          at org.jgroups.protocols.FORWARD_TO_COORD.up(FORWARD_TO_COORD.java:153)
          at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:178)
          at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
          at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
          at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
          at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)
          at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:245)
          at org.jgroups.protocols.UNICAST.up(UNICAST.java:414)
          at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:763)
          at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:574)
          at org.jgroups.protocols.BARRIER.up(BARRIER.java:126)
          at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:147)
          at org.jgroups.protocols.FD.up(FD.java:253)
          at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
          at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
          at org.jgroups.protocols.Discovery.up(Discovery.java:359)
          at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2610)
          at org.jgroups.protocols.TP.passMessageUp(TP.java:1263)
          at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1825)
          at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1793)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
    • When booting up several Liferay EE nodes (configured for clustering) simultaneously it appears you can run into a situation where all the nodes fail to boot and say they have “invalid licenses” (despite the fact you have valid cluster licenses installed under data/license)…and then your container processes are automatically shut down… nice
    • This appears to be due to the verification process that LicenseManager does on bootup where it receives license information from peer nodes, however if all nodes are started up at the same time it appears there is a condition of timing, when none of the nodes are yet in a “ready” state, so when peers connect to them for their license properties, they respond w/ nothing…. which is then compared against the local node’s license and so since they don’t “equals()” it decides the local license (despite being legit) is invalid..
    • Here is example output from NODE1 (booted up at same time as NODE2, note the blank remote license):
        • 13:55:20,381 ERROR [Incoming-5,shared=liferay-control][a:?] Remote license does not match local license. Local license: {productEntryName=Liferay Portal, startDate=11, expirationDate=22, description=Liferay Portal, owner=Liferay Portal, maxServers=4, licenseEntryName=Portal6 Non-Production (Developer Cluster), productVersion=6.1, type=developer-cluster, accountEntryName=Liferay Trial, version=2}
          Remote node: {clusterNodeId=11111111111, inetAddress=/, port=-1}
          Remote license: {}
          Mixing licenses is not allowed. Local server is shutting down.
      • Here is example output from NODE2 (booted up at same time as NODE1):
        • 13:55:20,388 ERROR [Incoming-1,shared=liferay-control][a:?] Remote license does not match local license. Local license: {productEntryName=Liferay Portal, startDate=11, expirationDate=22, description=Liferay Portal, owner=Liferay Portal, maxServers=4, licenseEntryName=Portal6 Non-Production (Developer Cluster), productVersion=6.1, type=developer-cluster, accountEntryName=Liferay Trial, version=2}
          Remote node: {clusterNodeId=222222222222, inetAddress=/, port=-1}
          Remote license: {}
          Mixing licenses is not allowed. Local server is shutting down.
      • How to solve this? The way I did it was to always boot one node first, with NO peers… wait for it to complete booting before bringing up all other nodes, then these errors seem to go away.


 JGroups RELAY2 notes

Liferay does NOT use RpcDispatcher but has its own “rpc” abstraction via “ClusterRequest” which is Serializable. (contains target addresses, MethodHandler, has a UUID, and type of EXECUTE) MethodHandler is just a serializable bag of arguments and a MethodKey which stores the target class, methodName and paramTypes. MethodHandler uses these deserialized parts on the receiving side to invoke the method locally via reflection

On the RECEIVING side is ClusterRequestReceiver which is the JGroups receiver that processes all inbound messages from peers. It takes the ClusterRequest and gets the MethodHandler (if type is EXECUTE) and invokes it, then returns the response as a ClusterNodeResponse