The saga continues …

This blog has now been moved to

Book Review: Systems Performance: Enterprise and the Cloud


Welcome back for another book review. This time, I am going to review a book that I bought when it came out, in late 2013. I have always wanted to review this one, but it seemed I had two options:

  1. Write a short review that probably does not do the book justice.
  2. Postpone the review for a more suitable time, when $IRL and $DAYJOB allow …

I opted for the second option, as I consider this book to be indispensable (yes, this is going to be a positive review). So, here is the table of contents:

  1. Introduction
  2. Methodology
  3. Operating Systems
  4. Observability Tools
  5. Applications
  6. CPUs
  7. Memory
  8. File Systems
  9. Disks
  10. Network
  11. Cloud Computing
  12. Benchmarking
  13. Case Study
  14. Appendices (which you SHOULD read)

Wow, a lot of content, huh? (That is to be expected, given that the book is more than 700 pages long.) Do not let the size daunt you, however. Chapters are self-contained, as the author understands that the book might be read under pressure, and they contain useful exercises at the end.

What really makes this book stand out is not the top-notch technical writing or the abundance of useful one-liners, but the fact that the author goes a step further and suggests a methodology for troubleshooting and performance analysis, as opposed to the ad-hoc methods of the past (a checklist in the best-case scenario and, $DEITY forbid, the “blame someone else methodology” in the worst). In particular, the author suggests the USE method, USE standing for Utilization – Saturation – Errors, to methodically and accurately analyze and diagnose problems: for every resource, check its utilization, its saturation and its errors. This methodology (which can be adapted/expanded at will, last time I checked the book was not written in stone) is worth the price of the book alone.
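To make the idea a bit more concrete, here is a minimal Python sketch of a USE-style pass over a list of resources. It is my own illustration, not code from the book: the `ResourceSample` fields, the 70% utilization threshold and the sample numbers are all assumptions picked for the example.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource; all fields are illustrative."""
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: int     # queued or waiting work, e.g. run-queue length
    errors: int         # error events counted in the interval


def use_findings(samples, utilization_limit=0.7):
    """Walk every resource and report U, S and E findings in that order."""
    findings = []
    for s in samples:
        if s.utilization >= utilization_limit:
            findings.append(f"{s.name}: high utilization ({s.utilization:.0%})")
        if s.saturation > 0:
            findings.append(f"{s.name}: saturated ({s.saturation} waiting)")
        if s.errors > 0:
            findings.append(f"{s.name}: {s.errors} errors")
    return findings


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from a real system.
    samples = [
        ResourceSample("cpu", 0.95, 4, 0),
        ResourceSample("disk", 0.20, 0, 0),
        ResourceSample("network", 0.10, 0, 3),
    ]
    for line in use_findings(samples):
        print(line)
```

The point of the structure is that no resource gets skipped: every one is asked the same three questions, which is what separates this from ad-hoc poking around.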

The author correctly maintains that you must have an X-ray (so to speak) of the system at all times. By utilizing tools such as DTrace (available for Solaris and BSD) or the Linux equivalent, SystemTap, much insight can be gained into the internals of a system.

Chapters 5-10 are self-explanatory: the author presents what the chapter is about, common errors and common one-liners used to diagnose possible problems. As said before, chapters aim to be self-contained and can be read while actually troubleshooting a live system, so there are no lengthy explanations. At the end of each chapter, the bibliography section provides useful pointers to resources for further study, something that is greatly appreciated. Finally, the exercises can easily be turned into interview questions, which is another bonus.

Cloud computing, and the special considerations it presents, gets its own chapter, and the author tries to keep it platform-agnostic (even though employed by a “Cloud Computing” company), which is a nice touch. This is followed by a chapter of useful advice on how to actually benchmark systems, and the book ends with a, sadly too short, case study.

The appendices that follow should be read, as they contain a lot of useful one-liners (as if the ones in the book were not enough), concrete examples of the USE method, a guide to porting DTrace to SystemTap and a who's who of the world of systems performance.

So how to sum up the book? “Incredible value” is one thought that comes to mind, “timeless classic” is another. If you are a systems {operator|engineer|administrator|architect}, this book is a must-have and should be kept within reach at all times. Even if your $DAYJOB does not have “systems” in the title, the book is going to be useful if you have to interact with Unix-like systems on a frequent basis.

PS. Some reviews of this book complain about its binding. In the three physical copies that I have seen with my own eyes, the binding was of the highest quality, so I do not know if this complaint is still valid.


Conference review: Distributed Matters Berlin 2015

“Kept you waiting, huh?” – to start the post with a pop culture reference.

Yesterday, I was privileged enough to attend Distributed Matters Berlin 2015. The focus of the conference is, you guessed it, distributed systems, often within a NoSQL context. It was hosted at the awesome KulturBrauerei, a refurbished brewery. The format of the conference was 45-minute presentations, including Q&A, thankfully followed by a 15-minute break between talks, in two tracks. The overall level of the presentations was above average and, given that you could only attend one at a time, it made for a hard choice.

Owing to the greatness of Berlin taxi drivers (you know what I am talking about if you used a taxi in Berlin recently), I managed to attend only half of the keynote by @aphyr, so I am not going to comment on this one. My main takeaway is “always, always read the documentation carefully”.

The next presentation I attended was NoSQL meets Microservices, by Michael Hackstein. This one was labelled as beginner. It presented the main paradigms of the NoSQL landscape (KV/Graph/Document) and certain topologies, followed by a presentation of the new-ish ArangoDB, a NoSQL database built on V8 JavaScript that claims to support all three paradigms at once, eliminating the need for multiple network hops. Overall, it was well presented, if a tad on the product side, and it served nicely to kick off my conference experience.

After the coffee break, where I was lucky enough to meet some old colleagues from $DAYJOB-1, I attended A tale of queues, from ActiveMQ over Hazelcast to Disque. @xeraa presented his journey with various queueing solutions. He kicked off by stating that the hard problems in distributed systems are exactly-once delivery and guaranteed delivery. He then presented the landscape of existing message queues, giving the rationale behind deciding what to use and, more importantly, what not to use. The talk was quite technical and gave me a lot of pointers for future research. Overall a solid talk, well done!

It was followed by @pcalcado and No Free Lunch, Indeed: Three Years of Microservices at Soundcloud. Phil has amazing presentation skills and described the journey of Soundcloud from a monolithic Ruby on Rails app towards a microservices-oriented architecture. What I liked most about this presentation was not just the great technical content but also the honesty. Evolving your architecture is no trivial task and the road to it is full of potential pitfalls. Phil was kind enough to share some of his hard-won experience with us, which is greatly appreciated.

The lunch break was BAD, ’nuff said. The queue was too long and, by the time I got to the food, the good stuff was gone.

After lunch, I attended Scalable and Cost Efficient Server Architecture by Matti Palosuo. One of the more solid talks, this no-frills presentation did what it said on the tin: it presented the service infrastructure behind EA’s SimCity BuildIt mobile title. Dealing with casual mobile games presents a unique challenge service-wise and Matti covered all angles in his presentation, diving deep into the specifics of their implementation.

The next presentation was Containers! Containers! Containers! And now? by Michael Hausenblas. I am not going to comment a lot on this one, since it had no slides and it was more like a tech demo. Mesos is an AMAZING product and I would have preferred some technical discussion, as opposed to a hands-on demo, but hey! this is just me.

Microservices with Netflix OSS and Spring Cloud by Arnaud Cogoluegnes was the next presentation that I attended. It focused on FOSS software by Netflix and how it can be utilized in the form of Java annotations within an application context. Useful and well presented; the only thing I personally did not like was certain slides full of code, but this does not take away from the value of the presentation. A bonus point is that, for a Java engineer, this presentation was immediately actionable, with some nice coding takeaways.

Before proceeding with the next presentation, the astute reader of this blog should have noticed by now a pattern forming: microservices. The topic of the next talk was no exception: Microservices – stress-free and without increased heart-attack risk by Uwe Friedrichsen. I really loved this talk. Uwe has a strong opinion regarding microservices (and the experience to back it up). In a nutshell, while microservices can be viable, one should keep a clear head and not fall into the trap of hype-driven architecture. This was my favorite talk of the conference and without further ado, here are the slides. I cannot speak highly enough of this presentation so please, have a look at the slides. It was extremely nice to see the microservices hype deconstructed and a realistic case presented.

It was time for the last talk. The choice was between Antirez’s Disque implementation talk and Just Queue it! by Marcos Placona. Given that almost everyone went to Antirez’s presentation (which I am sure was excellent), I decided to give the underdog a chance and went to Marcos’ presentation instead. I was not disappointed: Marcos described his experience with using MQ while migrating a project and gave another overview of the MQ landscape.

After that, I had some food and some orange juice and decided to call it a day. Overall, it was quite a nice conference, good talks, not a lot of marketing and I will definitely visit the next one, if I am able. Met some interesting people as well and grabbed a lot of pointers for future research. Kudos to the organizers.

See you at DevOps Days Berlin 2015.

Book Review: DevOps Troubleshooting


Hello everyone and welcome back for another book review at woktime. Today’s edition is a short review of a short book called “DevOps Troubleshooting: Linux Server Best Practices”. Without further ado, below is the table of contents:

  1. Troubleshooting best practices
  2. Why is the server so slow? Running out of CPU, RAM and Disk I/O
  3. Why won’t the system boot? Solving boot problems
  4. Why can’t I write to the disk? Solving full or corrupt disk issues
  5. Is the server down? Tracking down the source of network problems
  6. Why won’t the hostnames resolve? Solving DNS server issues
  7. Why didn’t my email go through? Tracing email problems
  8. Is the website down? Tracking down web server problems
  9. Why is the database slow? Tracking down database problems
  10. It’s the hardware’s fault? Diagnosing common hardware problems

So let’s start at the title. “DevOps” can be an overloaded term – it means different things to different people and unfortunately an “according-to-Hoyle” definition does not exist. I belong to the school of thought that DevOps is more of a cultural movement within an organization than, say, a specific job title, so the title of the book “DevOps troubleshooting” is meaningless (I would have strongly preferred the term “Linux Systems Troubleshooting”, as it would have been more accurate for reasons that I am going to explain below).

The author is clearly experienced within the realm of Linux administration and he attempts to cover a broad range of topics. The book is approximately 205 pages long, which means that it never gets too deep into a subject, opting instead to cover as many topics as possible. The writing style of the author is quite readable and he goes out of his way to explain things in relative detail. On the plus side, there are no glaring errors – the proofreaders and the author really did go the extra mile to ensure that the content was accurate in the vast number of examples this book provides.

However, my gripe with the book is that the material covered is really basic. Granted, the intended audience is not a veteran system administrator or engineer – this book, by its own admission, is aimed at developers or QA personnel who, owing to some definition of DevOps, are thrown into operational duties. The author makes an effort NOT to use random troubleshooting; however, a complete methodology is never introduced.

Overall, this is a well-written book that provides value to a non-operations member of a team doing operations or to a novice system administrator. Its small size makes it portable enough to be carried around as a level-1 reference; however, for system-level debugging there are better options out there (keep watching this space for the definitive follow-up on this sentence).

Installing python fabric on OSX 10.9 Mavericks


Assuming you have Xcode and the C build toolchain installed, it is literally two commands!

sudo easy_install pip

(and pip can be useful for a lot of different things)

sudo pip install fabric

wait and Presto!

Monitoring Tip: Resolving mms.mongodb host is unreachable error

I think I will save a few people some time by sharing this tip, which is not covered in the FAQ. Assume that you have a working munin setup and you set up the cloud monitoring from (and you should). All seems to be fine, but you are getting “Host Unreachable”. If the connectivity between hosts and, specifically, between the agent and mongod is flawless, then it might be a name resolution issue.
sudo vi /etc/hosts
Enter a hosts entry for the host as it appears in the mms dashboard (for example, if staging-host1 with IP is appearing on the mms dashboard as not reachable, add it to /etc/hosts).
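If you want to double-check that the entry actually made it into the file, a small Python sketch like the following can do the lookup for you. The function name is my own, and 192.0.2.10 is only a placeholder address from the documentation range, not the real IP of any host; use whatever name and address the dashboard shows for your machine.

```python
def hosts_lookup(hosts_text, hostname):
    """Return the address mapped to hostname in hosts-file text, or None."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        if hostname in names:
            return address
    return None


if __name__ == "__main__":
    # Placeholder content; 192.0.2.10 is not a real host address.
    sample = "127.0.0.1 localhost\n192.0.2.10 staging-host1  # mms agent target\n"
    print(hosts_lookup(sample, "staging-host1"))
    # On a real machine, read the actual file instead:
    # print(hosts_lookup(open("/etc/hosts").read(), "staging-host1"))
```

If the function returns None for the name shown in the dashboard, the entry is missing or misspelled.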

Book Review: The Art of Capacity Planning


The next entry in my series of Systems Engineering book reviews is The Art of Capacity Planning, published by O’Reilly. This book is a worthy addition to every systems engineer’s bookshelf, as capacity planning is a valuable skill to have and should constantly be applied to infrastructure of any given size. Capacity planning is a task that you will be doing frequently as a systems engineer (and if not, be prepared to pay some dire consequences). For the rest of the review I will assume that you do know what capacity planning is and why you need it.

Below is the table of contents:

  • Goals, Issues and Processes in Capacity Planning
  • Setting Goals For Capacity
  • Measurement: Units of Capacity
  • Predicting Trends
  • Deployment
  • Virtualization and Cloud Computing
  • Dealing with Instantaneous Growth
  • Capacity Tools

The first chapter of the book serves as a quick introduction. The author is quick to state that this is not about complex simulations and that the maths used are mostly back-of-the-envelope calculations, as opposed to formal models. He then starts with the distinction between performance tuning and capacity planning – two closely related activities that cannot be used interchangeably. The need for extensive measurement is stressed over and over again and the need for a system to tell stories is clearly and explicitly stated.

Goal determination is the name of the game for chapter two. Concepts such as Service Level Agreements (both formal and informal), user expectations and business capacity requirements are introduced, and then the chapter moves on to architectural design goals, with a focus on providing accurate measurement points. Once measurement points for each role are established, possible scaling points per role are introduced, as well as the different kinds of scaling. The author does not take sides: the merits of both horizontal and vertical scaling are discussed and the term diagonal scaling is introduced. Finally, the chapter briefly touches on Disaster Recovery and Business Continuity Planning.

Chapter three is mostly about metrics and finding your limits. It is common sense that in order to measure something, you should first define sensible units of measurement. The fact that introducing measurement into a system slightly affects system performance is stated and a baseline for measurement tools is given. The author, like me, seems to be a big fan of measurement and with good reason. System, network and application metrics can be used to proactively identify problems and the more metrics you gather, the more informed decisions you can make about your overall system’s health, performance and capacity. Different contexts are introduced via different real-world case studies and the effects of technologies such as caching are discussed. The chapter closes by stressing again the importance of extensive measurement.

All the metrics in the world are useless if you do not know how to classify and use them in your decision-making process. The focus of chapter four is plotting trends and making forecasts. As stated elsewhere in the book, capacity planning goes hand in hand with resource procurement (be it real or cloud) and resources carry a price tag with them. It is a hands-on chapter that uses a spreadsheet to show concepts such as curve fitting (personally I would have preferred to have some R code samples) and the effect that capacity planning has on procurement (delving into topics such as procurement time as well). This is a nice chapter to read if you interact with CFOs frequently.
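For readers who prefer code to spreadsheets, the kind of trend fitting the chapter describes can be sketched in a few lines. This is my own minimal illustration in Python (not R, and not taken from the book): it assumes growth is roughly linear, and the weekly disk usage figures and the 500 GB ceiling are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_full(xs, ys, capacity):
    """Week number at which the fitted trend reaches capacity, or None."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no ceiling in sight
    return (capacity - intercept) / slope


if __name__ == "__main__":
    weeks = [0, 1, 2, 3, 4]
    used_gb = [100, 120, 140, 160, 180]  # made-up disk usage per week
    print(weeks_until_full(weeks, used_gb, capacity=500))
```

The made-up series starts at 100 GB and grows by 20 GB per week, so the fitted trend reaches 500 GB at week 20. Subtract your procurement lead time from that and you know when the purchase order has to go out, which is exactly the conversation to have with the CFO.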

Chapter five is mostly about deploying and managing the capacity. It kicks off with a set of goals, such as centralized log management, hints at configuration management (it does not point out any tools for it, such as Puppet or Chef, but it describes the need for it) and highlights server consistency and automatic start-up. I personally am a big fan of automation and I am glad to see that I am on the same page as the author.

The book closes with three appendixes. The first deals with virtualization and cloud computing in general, combined with a number of case studies. The next one gives some advice on what to do if you have spikes of too much traffic (what we used to call the Slashdot effect back in the day) and finally the author points to some tools of the trade, mostly free and open source software.

The TL;DR version of this review is the following: this book is a quality volume that belongs on your bookshelf or in your e-book reader. But let’s elaborate.

John Allspaw is a seasoned professional and this shows: the book is packed with case studies from Allspaw’s employers, allowing easy transfer of knowledge. Compared to other capacity planning volumes, such as the excellent “Guerrilla Capacity Planning”, this book is lighter on the math side, using tons of graphs and schematics to convey information. It is also a short book, clocking in at just under 150 pages, making it a quick read that guarantees re-visits. The author makes a strong effort to keep things platform-agnostic. Having said that, while this book is not a tutorial, any decent Unix/GNU/Linux engineer will be able to apply the knowledge contained therein immediately. A point that I cannot stress enough is that the printing quality of the book is excellent – something that perhaps can be expected from O’Reilly. I would really welcome a second, expanded edition, but even as-is, this is an excellent book.