Category: cloud computing

Docker container peer IP discovery with Swarm services

 

When running applications deployed as a Docker “service” within Docker Swarm (post 1.12 swarm mode), those applications often need to discover their peer containers for purposes such as forming a cluster. Similar to “service discovery”, I’ll just use the term “peer discovery” for this use-case.

The simple example is as follows: You create a swarm service bound to a docker network “N” and launch/scale X containers over it. How do those container instances find one another on the docker network?

As it stands today, simple dynamic discovery of peer container instances within a named swarm “service” residing on a particular docker network is not trivial. The Docker engine’s service API can provide the information you need to figure this out, however this still puts a lot of work on you, the container developer, who needs to collect this info and then act upon it in order to form a higher-level cluster between your containers.

I had this specific challenge for a Java based service that needed to form a peer cluster dynamically. Out of that use case I came up with a generic library that you can drop into your Java container application called docker-discovery-swarm-service which is available at: https://github.com/bitsofinfo/docker-discovery-swarm-service

The purpose of this library is “self-discovery” from within your JVM-based Docker service application, where you need to discover your accessible docker service network bound IP, as well as those of your peers within the same named docker service. As noted above, this is critical if your container has to do further peer discovery for other services it provides or clustering groups it must form.
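
To make that concrete, here is a rough usage sketch. The class and method names below are illustrative placeholders only, not the library’s verified API; consult the project README for the real entry points.

    // Illustrative sketch only: SwarmServiceDiscovery and its methods are placeholder
    // names, not the library's confirmed API -- consult the project README.
    import java.net.InetAddress;
    import java.util.Set;

    public class PeerDiscoveryExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical facade over the swarm manager API lookups the library performs,
            // keyed by the overlay network and service name you launched with
            SwarmServiceDiscovery discovery =
                new SwarmServiceDiscovery("my_overlay_net", "my_service");

            InetAddress myIp = discovery.getMyIpAddress();             // this container's service-network IP
            Set<InetAddress> peerIps = discovery.getPeerIpAddresses(); // IPs of peer tasks in the same service

            // ...seed your application's own clustering layer with peerIps
        }
    }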

You can read all the details of how it works and how to use it here: https://github.com/bitsofinfo/docker-discovery-swarm-service

Hopefully it will be of use to you as well.


Hazelcast discovery with Docker Swarm

After going through a lot of R&D on deploying Hazelcast-enabled apps as Docker containers, first on pre-1.12 Docker (legacy swarm, with managers/agents etc.) together with Registrator and Consul, I ended up creating the hazelcast-consul-discovery-spi (available on GitHub).

This served its purpose well, but as Docker evolved and post-1.12 “swarm mode” came out with the entire concept of Docker “services”, coupled with the questionable future of the Registrator project itself, it was time to migrate our apps to “swarm mode” services… however, we again ran into issues with Hazelcast and peer discovery.

Out of this challenge I ended up creating the hazelcast-docker-swarm-discovery-spi.

 

This works with Hazelcast 3.6-EA+ and Docker Swarm services (via the manager nodes) to provide automatic peer discovery of the Hazelcast cluster. This is done by interrogating the swarm manager APIs to locate peers based on shared information such as network info, service names and service labels (which you specify when you launch your services).
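
For orientation, here is a programmatic configuration sketch of how a Discovery SPI strategy like this gets wired into Hazelcast. The Hazelcast calls are the standard 3.6+ Discovery SPI configuration; the strategy class name and property keys shown are assumptions, so verify them against the project README.

    import com.hazelcast.config.Config;
    import com.hazelcast.config.DiscoveryStrategyConfig;
    import com.hazelcast.config.JoinConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class SwarmDiscoveryBootstrap {
        public static void main(String[] args) {
            Config config = new Config();
            config.setProperty("hazelcast.discovery.enabled", "true");

            JoinConfig join = config.getNetworkConfig().getJoin();
            join.getMulticastConfig().setEnabled(false);   // no multicast on the overlay network
            join.getTcpIpConfig().setEnabled(false);

            // Strategy class and property names are assumptions -- verify against the README
            DiscoveryStrategyConfig strategy = new DiscoveryStrategyConfig(
                "org.bitsofinfo.hazelcast.discovery.docker.swarm.DockerSwarmDiscoveryStrategy");
            strategy.addProperty("docker-network-names", "my_overlay_net");
            strategy.addProperty("docker-service-names", "my_hazelcast_service");
            strategy.addProperty("hazelcast-peer-port", "5701");

            join.getDiscoveryConfig().addDiscoveryStrategyConfig(strategy);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }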

I hope others find this helpful, and please leave your feedback, pull-requests or issues on the project!

 

 

Microservices with Spring Cloud & Docker

In the recent past, a team I was working with was facing an architectural decision regarding what technology and deployment footprint to go with for a greenfield project.

It’s been about five months now since this application went into production.

Use case:

The use-case in question was to present a suite of REST services fronting a large set of “master data” dimensions for a data warehouse, as well as securing that data (record-level ACLs). In addition, the security ACLs it would manage needed to be transformed and pushed downstream to various legacy systems.

With that general use-case in mind, some other items of note that were to be considered:

  • The specific use case of “REST services facade for master data” was generic in nature, so the security model was to be agnostic of the specific data set being secured and have the capability of being applied across different data sets for different clients.
  • Changes to a given service should be easy to make and deploy, independently of the other services, with minimal interruption.
  • The services need to scale easily and be deployed across several different data centers, which are a mix of traditional bare-metal/ESX VMs as well as cloud (Azure/AWS). Tight coupling to each DC should be minimized when possible.
  • The services stack would potentially serve as the hub for orchestrating various other ETL related processes for the warehouse, so adding new “services” should be easy to integrate into the larger application.
  • Given the sensitivity of the data, all traffic should be secured with TLS, and the REST APIs locked down with OAuth2 client-credentials based access.

Given the above requirements and much discussion we decided to go with a container based microservices architecture.

Why?

First off, this team already had significant experience with the traditional monolithic approach to applications and had already run into many of its shortcomings over the long term. As new features needed to be deployed, it was becoming more of a pain to add new “services” to the monolith, since it required the entire stack to be redeployed, which is disruptive. Given this new application would have a similar lifecycle (new services needing to be added over time) we wanted to try a different approach… and who was the new kid on the block? “Microservices”; it was time to get one’s feet wet.

This shop was primarily focused on NodeJS, LAMP and Java stacks, so after doing some research the decision was made to go with Spring Cloud as the base framework to build this new suite of services. If you do any reading on the topic of microservices, you will quickly see that such architectures involve many moving parts: service discovery, configuration, call tracing (think Google Dapper), load balancing etc.

Do you want to write all of these pattern implementations yourself? Probably not; I sure didn’t. So after evaluating the space at the time, Spring Cloud was the most robust solution, and one of its biggest selling points is that it is based on many of the great frameworks that have come out of Netflix’s OSS project (Eureka, Hystrix and more).

Lastly, the decision to go with Docker was really a no-brainer. The services would potentially need to be deployed and moved across various data centers. By using Docker, DevOps would have a footprint and deployment process that would be consistent regardless of which data center we were pushing to. The only data-center-specific particulars our DevOps folks had to care about were setting up the Docker infrastructure (think Docker hosts on VMs via Ansible, coupled to DC-specific host provisioning APIs) and the DC-specific load balancers, whose coupling to the application was just a few IPs and ports (i.e. the IPs of the swarm nodes with the exposed ports of our Zuul containers). Everything downstream from that was handled by Docker Swarm and the microservices framework itself (discovery, routing etc.).

CELL

The acronym for this services backend ended up being CELL which stands for… well whatever you want it to stand for…. I guess think of it (the app) as an organism made up of various cells (services). CELL’s services are consumed by various applications that present nice user interfaces to end users.

[Diagram: high-level breakdown of the CELL services footprint]

The above diagram gives a high-level breakdown of its footprint. It’s broken up into several services:

Core services that all other app services utilize:

  • cell-discovery: Netflix Eureka: Participating services both register on startup and use this to discover the cell-config service (to bootstrap themselves) plus discover any other peer level services they need to talk to.
  • cell-config: spring-cloud-config: Git sourced application configuration (w/ encryption support). Each application connects to this on startup to configure itself.
  • oauth2-provider: All services are configured w/ an OAuth2 client credentials compliant token generation endpoint to authenticate and get tokens that all peer services validate (acting as resource servers)
  • tracing-service: zipkin: All services are instrumented w/ hooks that decorate all outbound http requests (and interpret them upon reception) with zipkin compliant tracing headers to collect call tracing metrics etc. Background threads send this data periodically to the tracing service.
  • cell-event-bus: kafka and spring-cloud-stream: Certain services publish events that other services subscribe to in order to maintain local caches or react to logic events. This provides somewhat looser coupling than direct service-to-service communication; leveraging Kafka also gives us the ability to take advantage of concepts such as consumer groups for different processing requirements (i.e. all or one).
  • cell-router: Netflix zuul: Router instances provide a single point of access to all application services under a https://router/service-name/ facade (discovered via the discovery service). Upstream data center specific FQDN bound load balancers only need to know about the published ports for the Zuul routers on the swarm cluster to be able to access any application service that is available in CELL.
  • cell-service-1-N: These represent domain-specific application services that contain the actual business logic invoked by external callers. Over time, more of these will be added to CELL, and this is where the real modularity comes into play. We try to stick to the principle of one specific service per specific business-logic use-case (a minimal sketch of such a service follows this list).
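
Here is a minimal sketch of what one of those CELL application services might look like using standard Spring Cloud (Brixton-era) annotations. The service name, endpoint path and client interface are made up for illustration; the point is the annotations plus the “call a peer by its Eureka service name” pattern, with no host/port coupling.

    import java.util.List;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
    import org.springframework.cloud.netflix.feign.EnableFeignClients;
    import org.springframework.cloud.netflix.feign.FeignClient;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;

    @SpringBootApplication
    @EnableDiscoveryClient   // registers with cell-discovery (Eureka) on startup
    @EnableFeignClients
    public class CellService1Application {
        public static void main(String[] args) {
            SpringApplication.run(CellService1Application.class, args);
        }
    }

    // Calls a peer purely by its logical service name as registered in Eureka;
    // routing and load balancing are handled by the framework.
    // The "/acls/{guid}" endpoint is hypothetical.
    @FeignClient("cell-security")
    interface CellSecurityClient {
        @RequestMapping(method = RequestMethod.GET, value = "/acls/{guid}")
        List<String> findAclsFor(@PathVariable("guid") String securableGuid);
    }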

CELL Security

As noted above, one of the requirements for CELL was that participating services could have data they manage, gated by a generic security ACL system. To fulfill this requirement, one of those domain specific apps is the cell-security service.

[Diagram: cell-security service model and its relationship to other CELL services]

The cell-security service leverages a common library that both cell-security servers and clients use to fulfill both ends of the contract; the contract being defined via some general modeling (below) and standard server/client REST contracts that can easily be exposed in any new “service” by including the library and adding some spring @[secConfig] annotations in an app’s configuration classes.

  • Securable: a securable is something that can have access to it gated by a SecurityService. Securables can be part of a chain to implement inheritance or any strategy one needs.
  • Accessor: is something that can potentially access a securable
  • ACL: Binds an Accessor to a Securable with a set of Permissions for a given context and optional expression to evaluate against the Securable
  • SecurableLocator: given a securable‘s guid, can retrieve a Securable or a chain of Securables
  • AccessorLocator: given an accessor‘s guid, can retrieve the Accessor
  • AccessorLocatorRegistry: manages information about available AccessorLocators
  • SecurableLocatorRegistry: manages information about available SecurableLocators
  • ACLService: provides access to manage ACLs
  • PrincipalService: provides access to manage Principals
  • LocatorMetadataService: provides access to manage meta-data about Securable|Accessor Locators
  • ACLExpressionEvaluator: evaluates ACL expressions against a Securable
  • SecurityService:  Checks access to a Securable for a requesting Accessor

The model above is expressed via standard REST contracts and interfaces in code, which are fulfilled by a combination of default implementations and those customized by individual CELL application services that wish to leverage the security framework. There are also a few re-usable cell-security persistence libraries we created to let services that leverage this persist their security data (both authoritative and local consumer caches) across various databases (MongoDB and/or JPA etc.), as well as another library to hook into the streams of security events that flow through CELL’s Kafka event bus.
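
For readers who think better in code, here is a minimal, purely illustrative set of Java shapes for the model above. The type names come from this write-up, but the method signatures are invented and will differ from the real library.

    import java.util.List;

    interface Securable {
        String getGuid();
        Securable getParent();          // optional chaining for inheritance strategies
    }

    interface Accessor {
        String getGuid();
    }

    interface ACL {
        Accessor getAccessor();
        Securable getSecurable();
        List<String> getPermissions();
        String getContext();
        String getExpression();         // optional expression evaluated against the Securable
    }

    interface SecurableLocator {
        List<Securable> locate(String securableGuid);   // a Securable or its chain
    }

    interface AccessorLocator {
        Accessor locate(String accessorGuid);
    }

    interface SecurityService {
        boolean hasAccess(Accessor accessor, Securable securable, String permission);
    }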

Spring Cloud impressions

When I started using Spring Cloud (in the early days of the Brixton release), I developed a love-hate relationship with it. After a few initial early successes with a few simple prototypes I was extremely impressed with the discovery, configuration and abstract “service name” based way of accessing peer services (via Feign clients bound to the discovery service)… you could quickly see the advantages of using these libraries to build a true platform that could scale to N in several different ways and take care of a lot of the boilerplate “microservices” stuff for you.

That said, once we really got into developing CELL we ended up having two development paths.

The first was one team working on creating a set of re-usable libraries for CELL applications to leverage and integrate into the CELL microservice ecosystem. This consisted of creating several abstractions that bring together some of the required spring cloud libraries, pre-integrated via base configuration for CELL, to make it easier to “drop in” to a new CELL app without having to wade into the details of spring cloud too much, letting the service developer focus on their service. This was about 70% of the development effort, heavily front-loaded at the start of the project.

The second was the other team using those libraries to actually build the business logic services, which was the whole point of this thing in the first place. This accounted for about 30% of the work in the beginning and today… about 80-90% of the work now that the base framework of CELL is established.

The hate part (well, not true hate, but you know what I mean… friendly frustration) ended up being the amount of man-hours spent at the start of the project dealing with and learning spring-cloud. There is a tangible learning curve to be aware of: working around bugs, finding issues in spring-cloud, both real ones and perceived ones worked through via misunderstandings stemming from the complexity of spring-cloud itself.

I’m not going to go into each specific issue here, however there were simply a lot of issues and time spent debugging spring cloud code trying to figure out why certain things failed, or learning how they behaved so we could customize and properly configure things. In the end most of the issues could be worked around or were not that hard to fix… it’s just the time it took to figure out the underlying causes, produce a reproducible sample and then convey it to the spring-cloud developers to get help. (The spring-cloud developers, BTW, are excellent and VERY responsive; kudos to them for that.)

Lastly, taking each CELL artifact (jar) and getting it wrapped up in a Docker container was not a huge ordeal. In the deployed footprint, each CELL artifact is a separate Docker Swarm service that is deployed on its own overlay network (a separate one per CELL version). As stated previously, the CELL router (Zuul) is the only service that needs to be exposed on a published swarm port, and the upstream datacenter load balancers can just point to that.

So would I recommend Spring-Cloud?

Yes. Spring Cloud at its heart is really a pretty impressive wrapper framework around a lot of other tools that are out there for microservices. It has a responsive and helpful community (definitely leverage Gitter.im if you need help!). The project has matured considerably since I first used it, and many of the issues I was dealing with are now fixed. Compared to writing all the necessary pieces of a robust microservices ecosystem yourself… I’ll take this framework any day.

Final note: I would NOT recommend using spring-data-rest. We used it on a few of the CELL application logic services; its main benefit is providing you a lot of CRUD REST services in a HATEOAS fashion, but it’s just not that easy to customize its behavior, it has a lot of bugs, and it was generally a pain to work with. At the end of the day it would have been easier to code our own suite of CRUD services instead of relying on it.

 

 

Docker container IP and port discovery with Consul and Registrator

Do you use Docker?

Does your containerized app have the need to discover both its own IP and one or more mapped ports?

How can another container access my exposed ports, and how can I do the same for my peers?

As it stands today, simple self discovery of your container’s accessible IP and one or more of its mapped ports is not exposed to your Docker container process as a native feature of the engine itself.

If you’ve attempted to containerize an app that attempts to discover its peers in order to form its own peer-level cluster etc, you’ve likely run into this challenge.

That said, there are several tools out there which can help you with this issue. One of them is Registrator, a special container that listens for events from a Docker host and acts as a service discovery bridge, relaying this info into other tooling such as Consul and etcd. In short, when your container is launched, the Registrator container collects all the info about the docker host it is running on and its exposed ports, and registers this under a named service in one of the aforementioned backends.

This is all fine and great, however this still puts a lot of work on you, the container developer who needs to collect this info and then act upon it in order to form a higher level cluster between your containers.

I had this exact same problem for a Java based service that needed to form a Hazelcast cluster dynamically. Out of that use case I came up with a generic library that you can drop into your Java container application called docker-discovery-registrator-consul which is available at: https://github.com/bitsofinfo/docker-discovery-registrator-consul

The purpose of this library is “self-discovery” from within your JVM-based Docker application, where you need to discover what your accessible docker-host bound IP and mapped port(s) are, as well as those of your peers within the same service. As noted above, this is critical if your container has to do further peer discovery for other services it provides or clustering groups it must form.
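
As a rough illustration of what that looks like from application code: note the class and method names here are placeholders, not the library’s verified API; see the project README for the real builder and options.

    // Illustrative sketch only: ConsulDiscovery and its methods are placeholder names.
    import java.net.InetSocketAddress;
    import java.util.Collection;

    public class RegistratorDiscoveryExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical facade over the Consul catalog entries created by Registrator
            ConsulDiscovery discovery = new ConsulDiscovery("http://my-consul:8500", "my-service");

            // docker-host IP/port pair(s) under which *this* container is reachable
            Collection<InetSocketAddress> myAddresses = discovery.discoverMe();

            // docker-host IP/port pairs of all peer containers registered under the same service
            Collection<InetSocketAddress> peerAddresses = discovery.discoverPeers();

            // ...use these to form your higher-level cluster (e.g. Hazelcast members)
        }
    }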

You can read all the details of how it works and how to use it here: https://github.com/bitsofinfo/docker-discovery-registrator-consul

Hopefully it will be of use to you as well.

Reactive re-engineering with Akka

Every once in a while during the life cycle of any given piece of software comes that time where you have the opportunity to improve it in a major way… if, that is, it’s lucky enough to still be in production.

One particular system I’ve been involved with is responsible for processing a lot of data and keeping that data in sync across many systems. For purposes of this little case study I’ve dumbed down the overall use-case, concept, architecture and implementation details to this simple idea: we need to synchronize data.

Use-Case

Something in the environment (i.e. a user or other process) makes a request for some operation to be done that generates a change operation against a “DataEntry”. This DataEntry is manipulated in the primary database and then the change needs to be synchronized to numerous other systems. The changes could be described as “create DataEntry item number XYZ”, “mutate DataEntry XYZ in fashion Z” or simply “delete DataEntry item XYZ”.

Each target system where a DataEntry is to be synchronized is called a DataStore, and each involves its own complicated process of mutating our representation of a DataEntry into the target DataStore’s representation; the means to do it can vary wildly, i.e. web-service calls, RDBMS DML, nosql operations etc. Not to mention, as with any integration, each of these DataStore sync calls has the possibility of being fast, very slow, not working at all, or experiencing random transient failures.

Version 1

For most of its life the system functioned as follows: each DataEntry mutated in the system was placed on a queue, and then processed by a consumer node’s DataSyncProvider, whose responsibility is to determine all the DataStores to process the DataEntry in (via interrogating a DataStoreLocator) and then make sure it happens. It worked similar to the diagrams below (highly simplified!); note the bottleneck.

[Diagram: Version 1, synchronization flow, within one node]

[Diagram: Version 1, overall cluster view]

Version 1 issues

Version 1 functioned fine for most of its life; however, its biggest issues were simply a lack of efficiency and speed in synchronizing any given DataEntry across all of the DataStores it was applicable for. More often than not, any given DataEntry mutation would result in dozens of target DataStores that it needed to be synchronized against. Due to the sequential processing of each DataStore, accommodating retries, and waiting for the overall result before moving on to the next one, this resulted in a sizable delay until the mutation materialized in all target DataStores (not to mention poor core utilization across the cluster). What did this mean? Well, an opportunity for improvement.

Version 2

Obviously, the choice here was to move to asynchronous, parallel DataStore execution, decoupled from the main DataEntry mutation consumer thread(s)… and there are many ways you could go about doing that. Fortunately the overall modeling of the synchronization engine enabled considerable flexibility in swapping out the implementation with a little refactoring. The key points were introducing the concept of a DataEntry logic execution engine, aptly named LogicExecutionEngine, and adding a new implementation of our DataStoreLocator that could decouple any given DataStore’s location from any dependency on its actual residency within the local JVM.
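
In code terms, the seams introduced by that refactoring look roughly like the following. The type names come from this write-up; the signatures are illustrative only and use a CompletableFuture to stand in for the Future<Result> described below.

    import java.util.Collection;
    import java.util.concurrent.CompletableFuture;

    interface DataEntry { String getId(); }

    interface Result { boolean isSuccess(); }

    // One implementation per target system; hides web-service calls, RDBMS DML, nosql ops, etc.
    interface DataStore {
        CompletableFuture<Result> sync(DataEntry entry);
    }

    // Decouples *where* a DataStore lives from the caller; an implementation may hand back
    // local instances, or proxies (e.g. ActorRef-backed) that execute elsewhere in the cluster.
    interface DataStoreLocator {
        Collection<DataStore> locate(DataEntry entry);
    }

    // The new execution seam: fan a DataEntry out to all applicable DataStores in parallel.
    interface LogicExecutionEngine {
        CompletableFuture<Collection<Result>> execute(DataEntry entry);
    }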

Great. Now that the modeling is adjusted, what about implementation? For one, there was no interest in writing a multi-threaded execution engine, even though one could with the modeling in place; any implementation could have been developed and plugged in. That said, after looking around for a good framework that provided location transparency, parallel execution management, clustering and good resiliency, it was decided that Akka, and moving to an Actor model for the new engine, would be a good fit.

[Diagram: Version 2, Actor-based DataStores and LogicExecutionEngine]

As shown above, the DataStores are now implemented via an ActorRef version which is then passed to the LogicExecutionEngine, whose new Actor-based implementation injects them into yet another Actor for the DataEntry logic processing, awaiting a Future<Result>. This model improved overall time to completion by roughly 80%, as everything now executes in parallel.

Another benefit was additional resiliency and distribution of load due to the location transparency of the actual DataStore itself. Utilizing Akka’s various routers, in this case the ClusterRouterGroup, we were able to further redistribute the processing of any given DataStore workload across the cluster and react appropriately as nodes came on and offline. See the exploded view below.

[Diagram: Version 2, exploded view of DataStore location transparency]
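
As a rough sketch of the pattern (using the newer Akka Java API; names, message types and config are illustrative, not the actual engine’s code): a DataStore wrapped as an actor, plus a cluster-aware group router that resolves instances of it anywhere in the cluster.

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.routing.FromConfig;

    // Illustrative message type
    class SyncRequest {
        final String dataEntryId;
        SyncRequest(String dataEntryId) { this.dataEntryId = dataEntryId; }
    }

    // A DataStore wrapped as an actor; the actual target-system call happens in receive
    class DataStoreActor extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(SyncRequest.class, req -> {
                    // ...invoke the target system here (web service, RDBMS DML, nosql, ...)
                    getSender().tell("synced:" + req.dataEntryId, getSelf());
                })
                .build();
        }
    }

    public class EngineMain {
        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("syncEngine");

            // Local routee, registered under a well-known path on every node that hosts it
            system.actorOf(Props.create(DataStoreActor.class), "dataStoreX");

            // Cluster-aware group router defined in application.conf, e.g. (HOCON):
            //   akka.actor.deployment."/dataStoreXRouter" {
            //     router = round-robin-group
            //     routees.paths = ["/user/dataStoreX"]
            //     cluster { enabled = on, allow-local-routees = on }
            //   }
            ActorRef router = system.actorOf(FromConfig.getInstance().props(), "dataStoreXRouter");

            router.tell(new SyncRequest("XYZ"), ActorRef.noSender());
        }
    }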

Lastly, the diagram below shows how execution of these DataEntry tasks is now more evenly distributed across the entire set of available nodes in the cluster. All nodes can now potentially be involved in processing any DataEntry workload. Also, by feeding dynamic configuration into the construction of each ClusterRouterGroup actor, the system can fine-tune the distribution and number of Actors in the cluster that are available to process entries targeted at any given DataStore. This permits custom down-scaling based on the limitations or load ceilings that any given downstream target DataStore may present; in other words, it permits throttling of loads.

 

[Diagram: Version 2, better utilization of core resources across the cluster]

Overall my experience with Akka was positive. After working some of the bugs out, this solution has been quite stable in production, as has Akka’s clustering protocol. If you are considering moving to a more reactive design approach for the back end of a system, I highly recommend giving Akka consideration.

Lastly, as always, I highly recommend going with a pure interface-oriented design in any system you build. In this use-case, the fact that the system’s entire platform was designed from the ground up using interfaces extensively, with different “providers” (i.e. things like Camel or Akka) plugged in for each aspect of the implementation, has proved to be very important as the system has evolved. This gives it tremendous flexibility and additional longevity as it matures over time.

Hazelcast discovery with Etcd

I’ve used Hazelcast for years and have generally relied upon the availability of multicast for Hazelcast cluster discovery and formation (within a single data-center). Recently I was faced with two things: expanding the footprint into a non-multicast enabled data-center, and pre-prepping the service for containerization, where nodes will come and go as scaling policies dictate… hardwired Hazelcast clustering via an XML configuration and/or reliance on multicast is a no-go.

With Hazelcast 3.6, they now support a pluggable implementation of the cluster discovery mechanism called the Discovery SPI (Discovery Strategy). Perfect timing: given we are already playing with Etcd as part of our Docker container strategy, this was an opportunity to let our application’s native clustering mechanism (coded on top of Hazelcast) leverage Etcd as well, to discover/remove peers both within, and potentially across, data-centers.

So I coded up hazelcast-etcd-discovery-spi available on GitHub.

[Diagram: hazelcast-etcd-discovery-spi overview]

This works with Hazelcast 3.6-EA+ and Etcd to provide (optional) automatic registration of your hazelcast nodes as Etcd services and automatic peer discovery of the Hazelcast cluster.

Note that the automatic registration of each hazelcast instance as an Etcd service is OPTIONAL if you want to manually maintain these key-paths in etcd. I added it simply because I think it will be convenient for folks, especially when containerizing a Hazelcast-enabled app (such as via Docker), where the fewer “dependencies” and manual things to do (i.e. registering your hazelcast nodes manually) the better. You can totally embed this functionality with this discovery strategy SPI.

I hope others find this helpful, and please leave your feedback, pull-requests or issues on the project!

NOTE: if you are running your app in Docker you have a separate issue, namely that you need to determine the externally accessible IP/PORT that the docker host has mapped for you on 5701… so how can you determine that so that you can publish the correct IP/PORT info to Etcd? Check out: https://github.com/bitsofinfo/docker-discovery-registrator-consul

NOTE! Interested in consul? There is a separate project which is built around Consul for your discovery strategy located here: https://github.com/bitsofinfo/hazelcast-consul-discovery-spi

 

Hazelcast discovery with Consul

I’ve used Hazelcast for years and have generally relied upon the availability of multicast for Hazelcast cluster discovery and formation (within a single data-center). Recently I was faced with two things: expanding the footprint into a non-multicast enabled data-center, and pre-prepping the service for containerization, where nodes will come and go as scaling policies dictate… hardwired Hazelcast clustering via an XML configuration and/or reliance on multicast is a no-go.

With Hazelcast 3.6, they now support a pluggable implementation of the cluster discovery mechanism called the Discovery SPI (Discovery Strategy). Perfect timing: given we are already playing with Consul as part of our Docker container strategy, this was an opportunity to let our application’s native clustering mechanism (coded on top of Hazelcast) leverage Consul as well, to discover/remove peers both within, and potentially across, data-centers.

So I coded up hazelcast-consul-discovery-spi available on GitHub.

[Diagram: hazelcast-consul-discovery-spi overview]

This works with Hazelcast 3.6-EA+ and Consul to provide automatic registration of your hazelcast nodes as Consul services (without having to run a local Consul agent) and automatic peer discovery of the Hazelcast cluster.

Note that the automatic registration of each hazelcast instance as a Consul service is OPTIONAL if you already have Consul agents running that define your Hazelcast service nodes. I added it simply because I think it will be convenient for folks, especially when containerizing a Hazelcast-enabled app (such as via Docker), where the fewer “dependencies” like a Consul agent available on the host, or in the container (or another container), the better. You can totally embed this functionality with this discovery strategy SPI.
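
For orientation, a programmatic configuration sketch: the Hazelcast Discovery SPI wiring is the standard 3.6+ mechanism, while the strategy class name and property keys are assumptions to verify against the project README.

    import com.hazelcast.config.Config;
    import com.hazelcast.config.DiscoveryStrategyConfig;
    import com.hazelcast.config.JoinConfig;
    import com.hazelcast.core.Hazelcast;

    public class ConsulDiscoveryBootstrap {
        public static void main(String[] args) {
            Config config = new Config();
            config.setProperty("hazelcast.discovery.enabled", "true");

            JoinConfig join = config.getNetworkConfig().getJoin();
            join.getMulticastConfig().setEnabled(false);   // multicast unavailable in this data-center
            join.getTcpIpConfig().setEnabled(false);

            // Strategy class and property names are assumptions -- verify against the README
            DiscoveryStrategyConfig strategy = new DiscoveryStrategyConfig(
                "org.bitsofinfo.hazelcast.discovery.consul.ConsulDiscoveryStrategy");
            strategy.addProperty("consul-host", "my-consul.example.com");
            strategy.addProperty("consul-port", "8500");
            strategy.addProperty("consul-service-name", "my-hazelcast-cluster");

            join.getDiscoveryConfig().addDiscoveryStrategyConfig(strategy);
            Hazelcast.newHazelcastInstance(config);
        }
    }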

I hope others find this helpful, and please leave your feedback, pull-requests or issues on the project!

NOTE: if you are running your app in Docker you have a separate issue, namely that you need to determine the externally accessible IP/PORT that the docker host has mapped for you on 5701… so how can you determine that so that you can publish the correct IP/PORT info to Consul? Check out: https://github.com/bitsofinfo/docker-discovery-registrator-consul

NOTE! Interested in etcd? There is a separate project which is built around etcd for your discovery strategy located here: https://github.com/bitsofinfo/hazelcast-etcd-discovery-spi

 

Book review: Building Microservices

Recently I read Sam Newman’s “Building Microservices”; at ~280 pages it’s a fairly quick read. The reviews on this book overall are mixed and I can see where readers are coming from. By the title of this book one might expect some coverage of the microservices frameworks out there, concrete examples, maybe some actual code… but you won’t really find that here. Instead you will find a pretty good overview of various architectural approaches to modern application design in today’s world, covering general topics such as proper separation of concerns, unit-testing, continuous integration, automation, infrastructure management, service discovery, fault tolerance, high-availability and security etc.

In reality, none of the principles covered in this book are the exclusive domain of “microservice” application architectures, but rather can (and should) be applied to any application you are considering deploying; whether it’s a “monolithic” application or a suite of microservices interacting as parts of a larger functioning application.

In that regard I think this book is definitely a good read and worth a look, if for nothing more than to ensure your team gets a refresher on good design principles and how they can be materialized with some of the newer frameworks and tool sets that have come out of our community in recent years. The material presented is sound.

Configuring PowerShell for Azure AD and o365 Exchange management

Ahhh, love it! So you need to configure a Windows box to be able to utilize DOS, sorry, PowerShell, to remotely manage your Azure AD / o365 / Exchange Online services via “cmdlets”. You do some searching online and come across a ton of seemingly loosely connected Technet articles, forum questions etc.

Well, I hope to summarize it all for you in this single blog post, and I’ll try to keep it short without a lot of “why this needs to be done” explanations. You can read up on that on your own with the reference links below.

#1: The first thing we need to do is set up a separate user account that we will use when connecting via PowerShell to the remote services we want to manage:

  1. Using an account with administrative privileges, login to your Azure account/tenant at https://manage.windowsazure.com
  2. Once logged in click on “Active Directory” and select the instance you want to add the new user account to
  3. Click on “Add user”, fill out the details. Be sure to select “Global Administrator” as the role (or a lesser one, if need be depending on what you will be doing with PowerShell)
  4. Click create and it will generate a temporary password and email it to that user + the user listed for the secondary email that you filled out
  5. Logout of the Azure management portal
  6. Login again at https://manage.windowsazure.com, however this time login as the new user you just created with the temporary password. Once logged in, reset the password to a better one, click next.
  7. You should now be logged in as the new user you just created and on the main Azure management dashboard screen
  8. Find the link for managing “Exchange” and click on it
  9. You will now be redirected to the o365 Exchange admin center
  10. Click on “Permissions”, you will now see a bunch of groups/roles, the one we care about is Organization Management.
  11. Highlight the “Organization Management” role/group and ensure that the user you are logged in as (the new user you just created) is a member of this group directly or indirectly. You need to be a member of this group in order to get the “Remote Shell” permission that lets you download the Exchange cmdlets and manage exchange remotely via PowerShell. (See here for info on this group and the Remote Shell permission)

#2: Now that our special admin user is created with all the needed permissions, we can now get our PowerShell environment ready:

  1. Get on the Windows box that you intend to run the PowerShell commands from
  2. Download and install the “Microsoft Online Services Sign-In Assistant for IT Professionals” (its ok even if you are not a “professional”)
  3. Its 2014… you need to reboot after the last step…
  4. Download and install the “Azure AD Module for Windows PowerShell 64 bit”

#3: Ok, let’s verify basic Azure AD PowerShell cmdlet capabilities

  1. Now on your Desktop RIGHT click on “Windows Azure Active Directory Module for Windows PowerShell” and “Run as Administrator”
  2. In PowerShell run this command “Set-ExecutionPolicy Unrestricted”
  3. In PowerShell run this command “Connect-MsolService” a nice dialog will prompt you for your credentials (use the creds that you setup above)
  4. In PowerShell run this command “Get-MsolUser”. Get data back? Great, you are good to go for basic connectivity

#4: Finally… let’s verify o365 Exchange PowerShell cmdlet capabilities

  1. In the same PowerShell as you started above…
  2. Type: “$UserCredential = Get-Credential”… again enter your user credentials
  3. Type:
    $Session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri https://outlook.office365.com/powershell-liveid/ -Credential $UserCredential -Authentication Basic -AllowRedirection
    
  4. Type: “Import-PSSession $Session”
  5. At this point you should see some activity at the top of your PowerShell window as 300+ Exchange online cmdlets are downloaded to your system for use
  6. Quickly verify the Exchange Online Remote Shell permission with: “Get-User YOUR_UPN | Format-List RemotePowerShellEnabled”
  7. You should get back “RemotePowerShellEnabled: true”

DONE, proceed to the next quagmire…

 

REFERENCE LINKS:

Managing Azure AD Using PowerShell:
http://technet.microsoft.com/en-us/library/jj151815.aspx

o365 Exchange online: Remote Shell Permission and Organization Management
http://technet.microsoft.com/en-us/library/dd638114(v=exchg.150).aspx

Connect to Exchange Online using Remote PowerShell:
http://technet.microsoft.com/en-us/library/jj984289(v=exchg.150).aspx

Series: Using remote PowerShell to manage o365
http://o365info.com/using-remote-powershell-to-manage_212/

Copying lots of files into S3 (and within S3) using s3-bucket-loader

Recently a project I’ve been working on had the following requirements for a file-set containing roughly a million files, varying in individual size from one byte to over a gigabyte, with the file-set totaling between 500gb and one terabyte:

  1. Store this file-set on Amazon S3
  2. Make this file-set accessible to applications via the filesystem; i.e. access should look no different than any other directory structure locally on the Linux filesystem
  3. Changes on nodeA in regionA’s data-center should be available/reflected on nodeN in regionN’s data-center
  4. The available window to import this large file-set into S3 would be under 36 hours (due to the upgrade window for the calling application)
  5. The S3 bucket will need to be backed up at a minimum every 24 hours (to another bucket in S3)
  6. The application that will use all of the above generally treats the files as immutable and they are only progressively added and not modified.

If you are having to deal with a similar problem, perhaps this post will help you out. Let’s go through each item.

Make this file-set accessible to applications via the filesystem; i.e. access should look no different than any other directory structure locally on the Linux filesystem. Changes on node-A in region-A’s data-center should be available/reflected on node-N in region-N’s data-center.

So here you are going to need an abstraction that can present the S3 bucket as a local directory structure; conceptually similar to an NFS mount. Any changes made to the directory structure should be reflected on all other nodes that mount the same set of files in S3. Now there are several different kinds of S3 file-system abstractions and they generally fall into one of three categories (block based, 1-to-1, and native); the type has big implications for whether the filesystem can be distributed or not. This webpage (albeit outdated) gives a good overview that explains the different types. After researching a few of these we settled on attempting to use YAS3FS (yet another S3 filesystem). YAS3FS, written in Python, presents an S3 bucket via a local FUSE mount; what YAS3FS adds over other S3 filesystems is that it can be “aware” of events that occur on other YAS3FS nodes that mount the same bucket, and can be notified of changes via SNS/SQS messages. YAS3FS keeps a local cache on disk, so that it gives the benefits (up to a point) of local access and can act like a CDN for the files on S3. Note that FUSE based filesystems are slow and limited to a block size (IF the caller will utilize it) of 131072. YAS3FS itself works pretty well, however we are *still* in the evaluation process as we work through many issues that are creeping up in our beta environment, the big ones being unicode support and several concurrency issues that keep coming up. Hopefully these will be solvable within the existing code’s architecture…

 

The available window to import this large file-set into S3 would be under 36 hours

Ok, no problem, let’s just use s3cmd. Well… tried that and it failed miserably. After several crashes and failed attempts we gave up. S3cmd is single-threaded and extremely slow to do anything against a large file-set, much less load it completely into S3. I also looked at other tools (like s4cmd, which is multi-threaded), but again, even these other “multi-threaded” tools eventually bogged down and/or became non-responsive against this large file-set.

Next we tried mounting the S3 bucket via YAS3FS and executing rsyncs from the source files to the target S3 mount… again this “worked” without any crashing, but it was single threaded and took forever. We also tried running several rsyncs in parallel, but managing this, and verifying the result (that all files were actually in S3 correctly with the correct meta-data), was a challenge. The particular challenge being that YAS3FS returns to rsync/cp immediately after the file is written to the local YAS3FS cache, and then pushes to S3 asynchronously in the background (which makes it more difficult to check for failures).

Given the above issues, it was time to get crazy with this, so I came up with s3-bucket-loader. You can read all about how it works here, but the short of it is that s3-bucket-loader uses massive parallelism, orchestrating many ec2 worker nodes to load (and validate!) millions of files into an S3 bucket (via an s3 filesystem abstraction) much quicker than other tools. Rather than sitting around for days waiting for the copy process to complete with other tools, s3-bucket-loader can do it in a matter of hours (and validate the results). Please check it out, as the github project explains it in more detail.

The S3 bucket will need to be backed up at a minimum every 24 hours (to another bucket in S3)

Again, this presents another challenge; at least with copying from bucket to bucket you don’t actually have to move the bytes around yourself, and can rely on S3’s key-copy functionality. So again here we looked at s3cmd and s4cmd to do the job, and again they were slow, crashed, or bogged down due to the large file-set. I don’t know how these tools manage their internal work queue, but it seems to be so large that they just crash or slow down to the point where they become inefficient. At this point you have two options for very fast bucket copying:

  1. s3-bucket-loader: I ended up adding key-copy support to the program and it distributes the key-copy operations across ec2 worker nodes. It copies the entire fileset in under an hour, and under 20 minutes with more ec2 nodes.
  2. s3s3mirror: After coding #1 above, I came across s3s3mirror. This program is a multi-threaded, well-coded powerhouse that just “worked” the first time I used it. After contributing SSL, aws-encryption and storage-class support to it, a full bucket copy of over 600gb and ~800k s3 objects took only 45 minutes (running with 100 threads)! It has good status logging/output and I highly recommend it. (The per-key copy call that both tools parallelize is sketched below.)
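
For reference, here is a minimal sketch of S3’s server-side key-copy using the AWS SDK for Java (v1); bucket names and keys are placeholders. Tools like the two above essentially fan this single call out across many threads and worker nodes.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CopyObjectRequest;

    public class BucketKeyCopy {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // Server-side copy: S3 moves the bytes internally, nothing flows through this client
            s3.copyObject(new CopyObjectRequest(
                "source-bucket", "path/to/file.bin",   // placeholder bucket/key names
                "backup-bucket", "path/to/file.bin"));
        }
    }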

Overall, for the bucket-to-bucket copying requirement, I really like s3s3mirror; nice tool.