Tagged: lucene

Aggregate, backup elasticsearch fs snapshots across a widely distributed cluster

One of the Elasticsearch clusters I’ve worked on spans multiple data-centers around the world and stores some very large indexes. Occasionally we need to get a backup of one of these indexes off of the cluster for restoration onto another cluster, but due to the sheer size of these indexes, it’s not practical for us to snapshot to S3 or even a shared NFS mount (as the cluster spans multiple data-centers). The local file-system “fs” snapshot type is therefore the only one really usable for us in this scenario, but what you end up with is parts of the snapshot distributed across individual nodes all over the world.

So there was a need for a tool to automate the task of collecting all of the individual snapshot “parts” and downloading them to a central machine. If you’ve ever looked into the actual format of an Elasticsearch snapshot, it’s a little tedious: you can’t just blindly copy over the contents of a snapshot’s shard directories, because ES smartly takes snapshots as diffs and keeps track of which files are relevant to each snapshot in metadata files. See here for an excellent overview: https://www.found.no/foundation/elasticsearch-snapshot-and-restore/.
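To make the “metadata decides which files matter” point concrete, here is a minimal Python sketch. It assumes a simplified, hypothetical shard-level metadata document with a `files` list whose entries carry a blob `name`; the real on-disk layout varies by Elasticsearch version, and this is not necessarily how elasticsearch-snapshot-manager does it. Large blobs that are split into multiple parts are ignored here.

```python
import json

# Hypothetical, simplified metadata for one shard of snapshot "snap2".
# Field names are assumptions for illustration, not a documented format.
EXAMPLE_METADATA = """
{
  "name": "snap2",
  "files": [
    {"name": "__0", "physical_name": "_0.cfs", "length": 1024},
    {"name": "__3", "physical_name": "_1.cfs", "length": 2048}
  ]
}
"""


def referenced_blobs(metadata_text):
    """Return the set of blob names that the snapshot metadata refers to."""
    metadata = json.loads(metadata_text)
    return {entry["name"] for entry in metadata.get("files", [])}


def files_to_copy(shard_dir_listing, metadata_text):
    """Pick only the blobs in a shard directory that this snapshot needs.

    Blobs left over from other snapshots are skipped, which is why a
    blind copy of the whole directory would transfer more than needed.
    """
    needed = referenced_blobs(metadata_text)
    return sorted(name for name in shard_dir_listing if name in needed)


if __name__ == "__main__":
    listing = ["__0", "__1", "__2", "__3", "snapshot-snap1", "snapshot-snap2"]
    print(files_to_copy(listing, EXAMPLE_METADATA))
```

In this made-up listing, `__1` and `__2` belong to some other snapshot, so they are left behind.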

So in the end I came up with elasticsearch-snapshot-manager (Scala) as a tool for handling all of this (analyzing, aggregating, downloading).

This tool is intended to aid with the following scenario:

  1. You have a large elasticsearch cluster that spans multiple data-centers
  2. You have a “shared filesystem snapshot repository” whose physical location is local to each node and NOT actually on a “shared device” or logical mountpoint (i.e. due to (1) above, the snapshots reside on local disk only).
  3. You need a way to execute the snapshot, then easily collect all the different parts of that snapshot which are located across N nodes across your cluster
  4. This tool is intended to automate that process…
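As a rough illustration of step 3, the Python sketch below builds the two standard snapshot REST calls (register an “fs” repository, then create a snapshot) and groups shards by the node that holds them, so you know which machines to collect from. The host, repository name, snapshot name, path, and the `(shard, node)` input shape are all made up for the example. Nothing is sent over the network, and this is not a description of how elasticsearch-snapshot-manager itself works.

```python
import json
from urllib.parse import quote


def register_fs_repo_request(base_url, repo, location):
    """Build the request that registers an 'fs' snapshot repository."""
    body = {"type": "fs", "settings": {"location": location}}
    url = f"{base_url}/_snapshot/{quote(repo)}"
    return ("PUT", url, json.dumps(body))


def create_snapshot_request(base_url, repo, snapshot, index):
    """Build the request that snapshots one index and waits for completion."""
    body = {"indices": index}
    url = (
        f"{base_url}/_snapshot/{quote(repo)}/{quote(snapshot)}"
        "?wait_for_completion=true"
    )
    return ("PUT", url, json.dumps(body))


def nodes_to_collect_from(shard_locations):
    """Group shard ids by node name.

    shard_locations is a hypothetical list of (shard_id, node_name) pairs
    for the primary shards of the index that was snapshotted.
    """
    by_node = {}
    for shard_id, node_name in shard_locations:
        by_node.setdefault(node_name, []).append(shard_id)
    return by_node


if __name__ == "__main__":
    base = "http://localhost:9200"  # assumed address of any cluster node
    print(register_fs_repo_request(base, "local_backup", "/var/es-snapshots"))
    print(create_snapshot_request(base, "local_backup", "snap1", "my_index"))
    print(nodes_to_collect_from([("0", "node-a"), ("1", "node-b"), ("2", "node-a")]))
```

Each node in the resulting map would then be visited (over SSH, for instance) to pull its share of the snapshot files back to the central machine.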

Please see the GitHub project for all the details at https://github.com/bitsofinfo/elasticsearch-snapshot-manager (feedback appreciated).

Book Review: Solr 1.4 Enterprise Search Server

This is a book review of Solr 1.4 Enterprise Search Server, by David Smiley and Eric Pugh

I picked up this book after hearing about Solr. I was looking into Drupal and trying to see what indexing engines were available for it, and the only option that seemed to fit the bill was Solr. It became quite apparent to me that Solr, being built on top of Lucene, was quickly becoming a favorite of developers out there. Having a fair amount of previous experience working directly with Lucene, I figured this would be a good book to get an introduction to this package.

Overall this book is pretty good. I’ve never read a book by Packt Publishing before, and compared to the Wroxes, O’Reillys and Mannings of the world, the publication does seem a bit rougher, with some minor grammatical and editing errors, but overall those things don’t bother me.

This book pretty much appears to cover all the guts to get you up and running with Solr. Chapter 1 gives a solid overview of the platform, while Chapter 2 dives right into one of the most important items for anyone working with an indexing engine: text analysis (stemming, tokenization, index vs. query time analysis etc). Having dived into the guts of writing my own search engine using Lucene, I felt the authors did a pretty good job covering this important topic in the 2nd chapter.

Chapters 3 and 4 cover the basics of indexing and searching, while chapters 5 and 6 jump into the higher level components that Solr provides and which lots of people are interested in nowadays: faceting, term highlighting, suggestions, spell checking etc.

Chapters 7 through 9 cover more systems administration related topics, such as deployment options, logging, monitoring, non-Java clients and languages (PHP, JavaScript, JSON), and finally how to scale Solr, with tips for scaling vertically and solutions for scaling horizontally (master/slave scenarios).

Overall I would highly recommend this book for anyone looking at Solr as a solution to add an indexing engine to their application. Having written a Lucene implementation in the past, I can appreciate a lot of the features that Solr appears to bring to the table so you don’t have to write them from scratch. The book presents much of the material in a straightforward manner targeted towards intermediate to advanced readers. Solr’s scaling capabilities look very attractive as well. Either way, I hope to get an opportunity to try this project out in the near future.