Thursday, December 27, 2012

2012 Year in Review: devops blog posts and articles

This is a whirlwind tour of blog posts and articles that I saved in Instapaper in 2012.

January
February
March
July
August

Tuesday, December 18, 2012

15 things I learned from "Shipping Greatness"

I finished reading "Shipping Greatness" by Chris Vander Mey (ex-Googler, ex-Amazonian). It's a great book, and I highly recommend it to anyone who is interested in learning how to ship software as close to schedule as possible, with as few bugs as possible, which is really the best we can hope for in this industry. Here are some things I jotted down while reading the book. You should really read the whole book, if only for the 'in-the-trenches' vignettes from Google and Amazon.

1. When tackling a new product, always start with the needs/problems of your customers (of course Jeff Bezos is famous for this approach at Amazon).

2. Have a mission statement that inspires and that also fits on a t-shirt if possible (for example the mission statement of the personalization team at Amazon is "increase customer delight").

3. Define your product by writing a press release (Amazon does it; it forces you to be succinct and to capture the essence of your product).

4. Avoid building a product that nobody wants. Engage real customers in testing early versions of the product or in giving feedback on mockups and wireframes.

5. When putting together project schedules, remember that 60% is the best you can hope for in terms of productivity, so adjust the schedules accordingly.

6. Never release on a Friday! You don't want to scramble to find engineers available to fix critical issues over the weekend.

7. Apply the High School Embarrassment Test to the product you want to ship: will you be embarrassed if one of your old high school friends sees your product?

8. Run periodic bug bashes that are scheduled in the project plan and that are short (1 hour or so). Incentivize participation by giving away t-shirts, and be grateful for every bug found -- better you find it than your customers!

9. Report on meaningful metrics: for example Google measures "seven-day active user count" (unique users who used the product in the last 7 days) so they can compare week to week.

10. Have a launch checklist: it makes sure all moving pieces are accounted for and facilitates communication across different functions of the team (commercial pilots go through a checklist before every flight).

11. When you write email messages, make sure they are "crisp", i.e. clear, succinct, and covering a single topic if possible.

12. Before the daily standup meeting, have the dev lead give a 30-second brief on the status of the project (it sets the context of the meeting and reminds the team of critical things).

13. If one person in a team meeting is being strongly negative and dismissive of other people or of the product, take it offline for a one-on-one meeting rather than arguing in front of the team. 

14. If you need to give a presentation, make it short (10-15 min), deliver one and only one message, and tell a story if you can. Also use a blue background with yellow and white fonts because it looks great on any projector.

15. In terms of getting things done, start by doing things that you, and only you, can do, so you're not blocking others (e.g. approve an expense report).

I really struggled to keep the list somewhat short. The book is full of similar nuggets. Go read it!
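
As an aside on item 9, a seven-day active user count is easy to compute from raw usage events. Here is a minimal Python sketch, assuming the events are available as (user, date) pairs; the function name and the data layout are my own illustration, not something taken from the book:

```python
from datetime import date, timedelta


def seven_day_active_users(events, as_of):
    """Count unique users with at least one event in the 7 days ending on as_of (inclusive)."""
    window_start = as_of - timedelta(days=6)
    return len({user for user, day in events if window_start <= day <= as_of})
```

Running it once per week with the same `as_of` weekday gives numbers that can be compared week to week.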

Thursday, November 29, 2012

Code performance vs system performance

Just a quick thought: as non-volatile storage becomes faster and more affordable, I/O will cease to be the bottleneck it currently is, especially for database servers. Granted, there are applications/web sites out there which will always have to shard their database layer because they deal with a volume of writes well above what a single DB server can handle (and I'm talking about mammoth social media sites such as Facebook, Twitter, Tumblr etc).  By database in this context I mean relational databases. NoSQL-like databases worth their salt are distributed from the get go, so I am not referring to them in this discussion.

For people who are hoping not to have to shard their RDBMS, things like memcached for reads and super fast storage such as FusionIO for writes give them a chance to scale their single database server up for a much longer period of time (and by a single database server I mostly mean the server where the writes go, since reads can be scaled more easily by sending them to slaves of the master server in the MySQL world for example).
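
The "memcached for reads" part usually takes the shape of a cache-aside lookup: check the cache first and go to the database only on a miss. A minimal sketch, assuming a plain dict stands in for memcached and a callable stands in for the database query (both are hypothetical stand-ins, not a real client API):

```python
def cached_read(cache, key, load_from_db):
    """Return the value for key, calling load_from_db only on a cache miss."""
    if key in cache:
        return cache[key]
    value = load_from_db(key)  # the expensive trip to the database
    cache[key] = value         # populate the cache for the next reader
    return value
```

A real deployment would also need expiry and invalidation on writes, which this sketch leaves out.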

In this new world, the bottleneck at the database server layer becomes not the I/O subsystem, but the CPU. Hence the need to squeeze every ounce of performance out of your code and out of your SQL queries. Good DBAs will become more important, and good developers writing efficient code will be at a premium. Performance testing will gain a greater place in the overall testing strategy as developers and DBAs will need to test their code and their SQL queries against in-memory databases to make sure there are no inefficiencies in the code.

I am using the future tense here, but the future is upon us already, and it's exciting!

Friday, November 09, 2012

Quick troubleshooting of Sensu 'no keepalive from client' issue

As I mentioned in a previous post, we started using Sensu as our internal monitoring tool. We also integrated it with Pager Duty. Today we terminated an EC2 instance that had been registered as a client with Sensu. I started to get paged soon after with messages of the type:

 keepalive : No keep-alive sent from client in over 180 seconds

Even after removing the client from the Sensu dashboard, the messages kept coming. My next step was of course to get on the #sensu IRC channel. I immediately got help from robotwitharose and portertech.  They had me try the following:

1) Try to remove the client via the Sensu API.

I used curl and ran:

curl -X DELETE http://sensu.server.ip.address:4567/client/myclient

2) Try to retrieve the client via the Sensu API and make sure I get a 404

curl -v http://sensu.server.ip.address:4567/client/myclient

This indeed returned a 404.
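
The same two API calls can be scripted. Below is a minimal Python sketch using only the standard library; it assumes the /client/NAME path and port 4567 shown in the curl commands above, and the function names are my own:

```python
import urllib.error
import urllib.request


def client_url(base_url, client_name):
    """Build the API URL for a client, e.g. http://host:4567/client/myclient."""
    return f"{base_url.rstrip('/')}/client/{client_name}"


def delete_request(base_url, client_name):
    """Build (but do not send) the DELETE request that removes a client."""
    return urllib.request.Request(client_url(base_url, client_name), method="DELETE")


def client_is_gone(base_url, client_name, timeout=5):
    """Return True if the API answers 404 for the client, False otherwise."""
    try:
        urllib.request.urlopen(client_url(base_url, client_name), timeout=timeout)
    except urllib.error.HTTPError as err:
        return err.code == 404
    return False
```

To send the delete, pass the request to `urllib.request.urlopen(delete_request("http://sensu.server.ip.address:4567", "myclient"))`, then call `client_is_gone` with the same arguments.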

3) Check that there is a single redis process running

BINGO -- when I ran 'ps -def | grep redis', the command returned TWO redis-server processes! I am not sure how both of them came to be running, but this solved the mystery: sensu-server was talking to one redis-server process, and sensu-api was talking to another. When the client was removed via sensu-api, the Sensu server was still seeing events sent by the client, such as this one from /var/log/sensu/sensu-server.log:


{"timestamp":"2012-11-10T01:41:14.154418+0000","message":"handling event","event":{"client":{"subscriptions":["all"],"name":"myclient","address":"10.2.3.4","timestamp":1352502348},"check":{"name":"keepalive","issued":1352511674,"output":"No keep-alive sent from client in over 180 seconds","status":2,"history":["2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2"],"flapping":false},"occurrences":305,"action":"create"},"handler":{"type":"pipe","command":"/etc/sensu/handlers/ops_pagerduty.rb","api_key":"myapikey","name":"ops_pagerduty"},"level":"info"}

To actually solve this, I killed the 2 redis-server processes (since 'service redis-server stop' didn't seem to do it), then stopped sensu-server and sensu-api, then started redis-server, and finally started sensu-server and sensu-api again.
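
The duplicate-process check from step 3 is a good candidate for a monitoring check of its own. A minimal Python sketch that counts processes in the text output of 'ps -def'; it assumes the usual eight-column full format (UID, PID, PPID, C, STIME, TTY, TIME, CMD), and the function name is hypothetical:

```python
import os


def count_processes(ps_output, command_name):
    """Count lines of `ps -def` output whose executable is named command_name."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 7)         # the 8th field is the full command line
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]
        if os.path.basename(executable) == command_name:
            count += 1
    return count
```

Feeding it the output of `subprocess.check_output(["ps", "-def"], text=True)` and alerting when the count for redis-server is not exactly 1 would have caught this issue. Matching on the executable name also avoids counting the grep process itself.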

At this point, the Sensu dashboard showed the 'myclient' client again. I removed it one more time from the dashboard (I could have done it via the API too) and it finally went away for good.

This was quite an obscure issue. I wouldn't have been able to solve it were it not for the awesomeness of the #sensu IRC channel (and kudos to the aforementioned robotwitharose and portertech!)

I hope google searches for 'sensu no keepalive from client' will result in this blog post helping somebody out there! :-)

Sensu rocks BTW.

Saturday, October 27, 2012

Monitoring doesn't have to suck

This is a follow up to my previous post, where I detailed some of the things I want in a modern monitoring tool. A month has passed, and I thought I'd give a quick overview of some tools we started to use as part of our monitoring and graphing strategy.

Several people recommended Sensu, so my colleague Jeff Roberts has given it a shot, and he liked what he saw (blog post with more technical details hopefully soon from Jeff!). We're still tinkering with it, but we're already using it to monitor our Linux-based machines (both Ubuntu and CentOS). Jeff is working on our Chef infrastructure, and soon we'll be able to deploy the Sensu client via Chef. What's nice about Sensu, and different from other tools, is the queueing mechanism it uses for client-server communication, and for posting events such as 'send this metric to Graphite' or 'send this alert to Pager Duty' or 'send this notification to this email address'. It does have a few rough edges, but @portertech seems to be on IRC 24x7, so that helps a lot :-)

Sensu, similar to many other monitoring tools, doesn't have good support for Windows. We looked around and we settled on Server Density for monitoring our Windows machines (mostly used for our backend infrastructure). We also monitor the Sensu server itself with Server Density, and we're looking into adding RabbitMQ-specific checks for Sensu.

We integrated both Sensu and Server Density with Pager Duty, and we consider every alert sent to Pager Duty as critical (which means the engineer on duty has to acknowledge, solve or escalate it). Pager Duty is an amazingly useful service, and most useful of all are probably the escalation processes and policies it provides. No more excuses for people who are on pager that they weren't notified when shit hit the fan!

For non-critical alerts we send email notifications outside of Pager Duty. With Sensu we do this via its mailer handler, and for Server Density by using a profile that goes to an email alias.

Two more monitoring tools we signed up for are Boundary, which we haven't been using in anger yet, but will do so in the next couple of weeks, and New Relic, whose sweet spot seems to be application-level monitoring. We'll deploy some Java app servers soon and we'll try the JVM New Relic plugin at the same time. Boundary will be very useful for looking at network traffic, establishing patterns and baselines, and getting notified when something gets out of whack. It will tell us what we don't know that we don't know.

Update: I forgot to mention Pingdom, which is a very inexpensive but extremely useful external monitoring service. We use it to monitor important user-facing resources (mostly web pages, but Pingdom can also do generic TCP checks, and mail-related checks) so we can get alerted when something is wrong from the perspective of our users, looking outside in so to speak.

For graphing, we deployed Graphite. We haven't started to use it very heavily, but again this will change soon, as we'll send there various business metrics obtained by querying the critical databases within our infrastructure. Jeff already integrated it with Sensu, so we'll be able to use the same queuing mechanism for Graphite as we are using for the monitoring alerts.

So there you have it: Sensu, Pingdom, Server Density, Pager Duty, Boundary, New Relic and Graphite. Modern tools that give a good name to monitoring. No, monitoring no longer sucks.

Sunday, September 30, 2012

What I want in a monitoring tool

I started a new job a few weeks ago, and I'm now at a point where I'm investigating monitoring options. At past jobs I used Nagios, which I know will work, but I would like to look into other more modern tools. I am aware that #monitoringsucks, and I am pretty sure people have hashed these topics before, but here are some of the things I want from a modern monitoring tool:
  • Ideally open source, or if not, affordable per-host, per-month pricing (we already signed up as a paying customer of Boundary, for example)
  • Installation and configuration should be easily scriptable
    • server installation, as well as addition/modification of clients should be easily automated so it can be done with Puppet/Chef
    • API would be ideal
  • Robust notifications/alerting rules
    • escalations
    • service dependencies
    • event handler scripts
    • alerts based on subsets of hosts/services
      • for example alert me only when 2+ servers of the same type are down
  • Out-of-the-box plugins
    • database-specific checks for example
  • Scalability
    • the monitoring server shouldn't become a bottleneck as more clients are added
      • Nagios is OK with 100-200 clients (with passive checks)
    • hierarchy of servers should be supported
    • agent-based clients
  • Reporting/dashboards
    • Hosts/services status dashboards
    • Downtime/outages dashboards
    • Latency (for HTTP checks)
  • Resource graphing would be great
    • but in my experience very few tools do both alerting and resource/metrics graphing well
    • in the past I used Nagios for alerting and Ganglia/Graphite for graphing
  • Integration with other tools
    • Send events to graphing tools (Graphite), alerting tools (PagerDuty), notification mechanisms (irc, Campfire), logging tools (Graylog2)
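
To make the "alert me only when 2+ servers of the same type are down" requirement concrete, here is a minimal Python sketch of such a rule. It assumes host status is available as (hostname, role, is_up) tuples; that layout and the function name are my own illustration, not the API of any tool listed here:

```python
from collections import Counter


def roles_to_alert(host_status, min_down=2):
    """Return the sorted roles that have at least min_down hosts down."""
    down = Counter(role for _host, role, is_up in host_status if not is_up)
    return sorted(role for role, n in down.items() if n >= min_down)
```

With two of three web servers down and one of two database servers down, only the web role would alert at the default threshold.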
I also asked a question on Twitter about what monitoring tool people would recommend, and here are some of the tools mentioned in the replies:
  • Sensu
  • OpenNMS
  • Icinga
  • Zenoss
  • Riemann
  • Ganglia
  • Datadog
Several people told me to look into Sensu, and a quick browsing of the home page tells me it would be worth giving it a whirl. So I think I'll do that next. Stay tuned, and also please leave a comment if you know of other tools that might fit the profile I am looking for.

Update: more tools mentioned in comments or on Twitter after I posted a link to this blog post:
  • New Relic (which I am actually in the process of evaluating, having paid for 1 host)
  • Circonus
  • Zabbix
  • Server Density
  • Librato
  • Comostas
  • OpsView
  • Shinken
  • PRTG
  • NetXMS
  • Tracelytics

Sunday, September 23, 2012

3 things to know when starting out with cloud computing

In the same vein as my previous post, I want to mention some of the basic but important things that someone starting out with cloud computing needs to know. Many times people see 'the cloud' as something magical, as the silver bullet that will solve all their scalability and performance problems. These people are in for a rude awakening if they don't pay attention to the following points.

Expect failure at any time

There are no guarantees in the cloud. Failures can and will happen, suddenly and mercilessly. Their frequency will increase as you increase the number of instances that you run in the cloud. It's a sickening feeling to realize that one of your database instances is gone, and there's not much you can do to bring it back. At that point, you need to rely on your disaster recovery plan (and you have a DR plan, don't you?) and either launch a new instance from scratch, or, in the case of a MySQL master server for example, promote a slave to a master. The silver lining in all this is that you need a disaster recovery plan anyway, even if you host your own servers in your own data center space. The cloud just makes failures happen more often, so it will test your DR plan more thoroughly. In a few months, you will be an expert at recovering from such failures, which is not a bad thing.

In fact, knowing that anything can fail in the cloud at any time forces you to architect your infrastructure such that you can recover from failures quickly. This is a worthy goal to have no matter where your infrastructure is hosted.

There's more to expecting failures at any time. You need a solid monitoring system to alert you that failures are happening. You need a solid configuration management system to spin up new instances of a given role quickly and with as little human intervention as possible. All these are great features to have in any infrastructure.

Automation is key

If you only have a handful of servers to manage, 'for' loops with ssh commands in a shell script can still seem like a good and progressive way of running tasks across the servers. This method breaks at scale. You need to use more industrial-grade configuration management tools such as Puppet, Chef or CFEngine. The learning curve can be pretty steep for most of these tools, but it's worth investing your time in them.

Deploying automatically is not enough though. You also need ways to ensure that things have been deployed as expected, and that you haven't inadvertently introduced new bugs. Automated testing of your infrastructure is key. You can achieve this either by writing scripts that check to see if certain conditions have been met, or, even better, by adding those checks to your monitoring system. This way, your monitoring checks are your deployment unit tests.
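
As a sketch of what such a check can look like, here is a small Python function that follows the Nagios plugin convention of status 0 for OK, 1 for WARNING and 2 for CRITICAL. The version comparison itself is a hypothetical example of a deployment condition; how you obtain the running version is up to you:

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_deployed_version(expected, actual):
    """Return (status_code, message) comparing the running version to the expected one."""
    if actual is None:
        return CRITICAL, "CRITICAL: service did not report a version"
    if actual != expected:
        return WARNING, f"WARNING: running {actual}, expected {expected}"
    return OK, f"OK: running {expected}"
```

A wrapper script would print the message and call `sys.exit` with the status code, so that the monitoring system can pick up the result.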

If the cloud is a kingdom, its APIs are the crown jewels. You need to master those APIs in order to automate the launching and termination of cloud instances, as well as the creation and management of other resources (storage, load balancers, firewall rules). Libraries such as jclouds and libcloud help, but in many situations you need to fall back to the raw cloud provider API. Having a good handle on a scripting language (any decent scripting language will do) is very important.

Dashboards are essential

One of the reasons, and maybe the main reason, that people use the cloud is that they want to scale their infrastructures horizontally. They want to be able to add new instances at every layer (web app, database, caching) in order to handle anything that their users throw at their web site. Simply monitoring those instances for failures is not enough. You also need to graph the resources that are consumed, and the metrics that are generated at all layers: hardware, operating system, application, business logic. Tools such as Graphite, Ganglia, Cacti, etc. can be of great assistance. Even homegrown dashboards based on tools such as the Google Visualization API can be extremely useful (I wrote about this topic here).

One aspect of having these graphs and dashboards is that they are essential for capacity planning, by helping you in establishing correlations between traffic that hits your web site and resources that are consumed and that can become a bottleneck (such as CPU, memory, disk I/O, network bandwidth etc). If you run your database servers in the cloud for example, you will very quickly realize that disk I/O is your main bottleneck. You will be able to correlate high CPU wait numbers with slowness in your database, which reflects in slowness in your overall application. You will then know that if you reach a certain CPU wait threshold, it's time to either scale your database server farm horizontally (easier said than done especially if you do manual sharding), or vertically (which may not even be possible in the cloud, because you have pretty restrictive choices of server hardware).
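
If you want a number to go with what the graphs show, a plain Pearson correlation between two metric series is a reasonable first step. A minimal sketch in pure Python, assuming both series were sampled at the same timestamps (the function name is my own):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A value close to 1 between CPU wait and query latency supports the idea that disk I/O is what slows the database down; it does not prove causation on its own.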

As an aside: do not run your relational database servers in the cloud. Disk I/O is so bad, and relational databases are so hard to scale horizontally, that you will soon regret it. Distributed NoSQL databases such as Riak, Voldemort or HBase are a different matter, since they are designed from the ground up to scale horizontally. But for MySQL or PostgreSQL, go for bare metal servers hosted at a good data center.

One last thing about scaling horizontally is the myth of autoscaling. I've very rarely seen companies that are able to scale their infrastructure up and down based on traffic patterns. Many say they do it, but when you get down to it it's mostly a manual process. It's somewhat easier to scale it up, but scaling it down is a totally different matter. You need to worry at that point that resources that you are taking away from your infrastructure are properly accounted for. Let's say you terminate a web server instance behind a load balancer -- is the load balancer aware of the fact that it shouldn't send traffic to that instance any more?

Another aspect of having graphs and dashboards is to see how business metrics (such as sales per day, or people that abandoned their shopping carts, or payment transactions, etc.) are affected by bottlenecks in your infrastructure. This assumes that you do graph business metrics, which you should. Without dashboards, you are flying blind.


I could go on with more caveats about the cloud, but I think that these 3 topics are a good starting point. The interesting thing about all 3 topics though is that they apply to any infrastructure, not only to cloud-based ones. But the scale of the cloud puts these things into their proper perspective and forces you to think them through.



Sunday, August 26, 2012

10 things to know when starting out as a sysadmin

This post was inspired by Henrik Warne's post "Top 5 Surprises When Starting Out as a Software Developer". I thought it was a good idea to put together a similar list for sysadmins. I won't call them 'surprises', just 'things to know'. I found them useful when I started out, and I still find them useful today. I won't prioritize them either, because they're all important in their own way.

Backups are good only if you can restore them

You would be right to roll your eyes and tell yourself this is so obvious, but in my experience most people run backups regularly and yet neglect to test restoring from those backups periodically. Especially if you have a backup scheme with one full backup every N days followed by either incremental or differential backups every day, it's important to test that you can obtain a recent backup (yesterday's at a minimum) by applying those incrementals or differentials to the full backup. And remember, if it's not backed up, it's not in production.
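
To show what a restore has to apply in each scheme, here is a minimal Python sketch that picks the backups needed to restore to a given day. It assumes backups are (day_number, kind) tuples and that a scheme uses either incrementals or differentials after a full backup, not a mix; the names are my own:

```python
def restore_chain(backups, target_day):
    """Return the backups to apply, in order, to restore the state as of target_day."""
    candidates = sorted(b for b in backups if b[0] <= target_day)
    fulls = [b for b in candidates if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target day")
    base = fulls[-1]
    later = [b for b in candidates if b[0] > base[0]]
    differentials = [b for b in later if b[1] == "differential"]
    if differentials:
        # A differential holds all changes since the full, so only the latest is needed.
        return [base, differentials[-1]]
    # Incrementals each hold one day's changes, so every one of them is needed.
    return [base] + [b for b in later if b[1] == "incremental"]
```

The incremental chain grows by one backup every day, and a single unreadable link breaks the restore, which is why testing the whole chain matters.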

If it's not monitored, it's not in production

This is one of those things that you learn pretty quickly, especially when your boss calls you up in the middle of the night telling you the site is down. I wrote before on how in my opinion monitoring is for ops what testing is for dev, and I also wrote how monitoring is the foundation for whipping your infrastructure into shape.

If a protocol has an acronym, you need to learn it

SNMP, LDAP, NFS, NIS, SMTP are just some examples of such protocols. As a sysadmin, you need to be deeply familiar with them if you want to have any chance of troubleshooting complex issues. And I want to single out two protocols that are the most important in my book: DNS and HTTP. Get the RFCs out and study them. And speaking of troubleshooting complex issues...

The most important skill you need to master is problem solving

The issues you'll face in your career as a sysadmin will get more and more complex, in direct relation with the complexity of the infrastructures you'll build and maintain. You need to be able to analyze a problem and come up with several variables that could cause the issue, then eliminate the variables one by one until you discover the root cause. This one-by-one variable elimination strategy is really important, and I've been struck throughout my career by how many people have never mastered it, and instead flail around hopelessly when faced with a non-trivial issue.

You need at least 2 of everything in production

As soon as you are in charge of a non-trivial Web site, you realize that you need to eliminate single points of failure as much as possible. It starts with border routers, it continues with firewalls and load balancers, then web/app/database servers, and network switches that tie everything together. All of a sudden, you have a pretty complex infrastructure to build and maintain.

One of the most important things you can do in this context is to test the failover of the various devices (firewalls, load balancers, routers, switches), which are usually in an active/passive configuration. I've been bitten many times by forced failovers (when the active device unexpectedly failed) which didn't go well because the passive device wasn't configured properly or wasn't syncing properly from the active one.

I also want to mention in this context the necessity of deeply understanding how networks work both at Layer 2 (MAC) and Layer 3 (IP routing). You can only fake an understanding of these issues for so long. The most subtle and hard to solve issues I've faced in my career as a sysadmin have all been networking issues (which for some reason involved ARP tables many times). You need to become best friends with tcpdump.


Keep your systems secure

The days when telnet was enabled by default in most OSes are long gone, but you still need to worry about security issues. Fortunately there are simple things you can do that go a long way towards improving the security of your infrastructure -- things like putting firewalls in front of everything and only allowing the ports necessary for your production traffic, disabling services you don't need on your servers, monitoring your logs for unauthorized access attempts, and not running Windows (just kidding, kinda).

One issue I faced in this context was applying security patches to various OSes. You need to be careful when doing this, and make sure you test out those patches in staging before applying them in production (otherwise you run the risk of rebooting a production server and having it not come back because of the effects of that patch -- trust me, it happens).

Logging is your best friend

Logging goes hand in hand with monitoring as one of those sine-qua-non conditions for having a good grasp of what's going on with your infrastructure. You'll learn soon that you need to have a strategy for logging, in most cases a central log server where you send logs from your other systems. There are tools such as Flume and Scribe that help these days, but even good old syslog-ng works just fine for this purpose. Logging by itself is not enough -- you need to also monitor the logs and send alerts when you identify error conditions. It's not easy, but it needs to be done.
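
The simplest form of log monitoring is scanning for lines that match an error pattern. A minimal Python sketch follows; the pattern is an assumption on my part, and you would adapt it to whatever your applications actually log:

```python
import re

# Hypothetical pattern: adjust to the severity keywords your applications use.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b")


def find_error_lines(lines, pattern=ERROR_PATTERN):
    """Return the lines (without trailing newline) that match the error pattern."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

You can pass it an open file object for a log on the central log server and send an alert whenever the returned list is not empty.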

You need to know a scripting language

You can only go so far in your sysadmin career if you don't master a decent scripting language. I started with Perl (after programming in C/C++ for a living for several years) but discovered Python in 2004 and never looked back. Ruby will do the trick too. You don't need to be a ninja programmer, but you need to have decent skills -- know how to split a program into modules, know how to use OOP techniques, know enough of the language to be able to read and extend other people's code, and maybe most important of all, KNOW HOW TO TEST YOUR CODE! Always test your code in staging before you put it in production.

Document everything

This is very important when you start out because you learn something new every day. Write it down (I used to do it with old fashioned pen and paper) but also share it with your team. Wikis are decent for this purpose, although they become hard to organize as they grow. But having some sort of searchable knowledge base is definitely 'a good thing', especially as your team grows and new people need to be brought up to speed. Of course, these days you can also use 'executable documentation' in the form of Chef recipes or Puppet manifests.

And speaking of teams...

Always try to be a leader

You start out on the bottom rung of the ladder, but you can still be a leader. I once saw a definition of leadership that really resonated with me: "a leader is somebody who makes something happen which otherwise wouldn't happen". There are countless opportunities to do just that even if you are just starting out in your career. If something is hard (or 'not that fun') and people on your team either postpone it or seem to just forget to do it, that's a good sign you need to step up and be a leader and do it. You will help your team and you will help yourself in the process.

One thing you can make happen (for example by blogging) is to share lessons that you've learned the hard way. Many of the solutions I've found to thorny issues I've faced have come from blogs, so I am always happy to contribute back to the community by sharing some of my own experiences via blogging. I strongly advise you to do the same.

Monday, August 13, 2012

The dangers of uniformity

This blog post was inspired by the Velocity 2012 keynote given by Dr. Richard Cook and titled "How Complex Systems Fail". Approximately 6 minutes into the presentation, Dr. Cook relates a story which resonated with me. He talks about upgrading hospital equipment, specifically infusion pumps, which perform and regulate the infusion of fluids in patients. Pretty important and critical task. The hospital bought brand new infusion pumps from a single vendor. The pumps worked without a glitch for exactly 1 year. Then, at 20 minutes past midnight, the technician on call was alerted to the fact that one of the pumps stopped working. He fiddled with it, rebooted the equipment and brought it back to life (not sure about the patient attached to the pump though). Then, minutes later, other calls started to pour in. It turns out that approximately 20% of the pumps stopped working around the same time that night. Nightmare night for the technician on call, and we can only hope he retained his sanity.

The cause of the problems was this: the pumps have a series of pretty complicated settings, one of which being the period of time that needs to elapse until a mandatory software upgrade. That period of time was initially set to, you guessed it, 1 year, because it seemed like such a distant point in time. Well, after 1 year, the pumps begged to be upgraded (it's not clear whether that was a manual process to be initiated by the technician, or an automated process) -- but the gotcha was that normal functionality was suspended during the upgrade process, so the pumps effectively stopped working.

This story resonates with me on 2 fronts related to uniformity: the first is uniformity in time (most of the pumps were put in production around the same time), and the second is uniformity or monoculture in setup (and by this I mean single vendor/hardware/OS/software). These 2 aspects can introduce very subtle and hard to avoid bugs, which usually hit you when you least expect it.

I have a few stories of my own to tell regarding these issues.

First story: at Evite we purchased Dell C2100 servers to act as our production database servers. We got them in the summer of 2011 and we set them up in time for our high season of the year, which is late November/early-mid December. We installed Ubuntu 10.04 on all of them, and they performed magnificently, with remarkable uptime -- in fact, once we set them up, we never needed to reboot them. That is, until they started to crash one by one with kernel panic messages approximately 200 days after putting them in production. This seemed to be too much of a coincidence. After consulting with Dell support specialists, we were pointed to this Linux kernel bug which says that the scheduler code for kernel 2.6.32.11 crashes after 200+ days of uptime. We had 2 crashes in 24 hours, after which we preemptively rebooted the other servers during the night and this seemed to solve that particular issue. We also started to update Ubuntu to 12.04 on all servers so we can get an updated kernel version not affected by that bug.

As you can see, we had both uniformity/monoculture in setup (same vendor, same hardware, same OS), and uniformity in time (all servers had been put in production within a few days of each other). This combination hit us hard and unexpectedly.
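
One cheap safeguard against this kind of bug is to watch uptime and get warned well before the danger zone. A minimal Python sketch that parses the contents of /proc/uptime on Linux (the first field is the uptime in seconds); the 180-day threshold is an arbitrary margin I picked for illustration:

```python
SECONDS_PER_DAY = 86400


def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime into days of uptime."""
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def needs_attention(proc_uptime_text, threshold_days=180):
    """Return True once uptime reaches the threshold (assumed value, tune as needed)."""
    return uptime_days(proc_uptime_text) >= threshold_days
```

On a server you would call it as `needs_attention(open("/proc/uptime").read())` from a monitoring check.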

Second story: one of the Dell C2100 servers mentioned above was displaying an unusual behavior. Every Saturday morning at 9 AM, we would get a monitoring alert about increased CPU I/O wait on that server. This would last for about 1 hour, after which things would return to normal. At first we thought it was a one-off, but after 2 or 3 consecutive Saturdays we looked more deeply into the system and we figured out (with the help of Percona, since those servers were running the Percona MySQL distribution) that the behavior was due to the RAID controller battery discharging and recharging itself as part of a relearning process. This had the effect of disabling the RAID write cache, so the I/O on the system suffered. I wrote more extensively about these battery issues in another post. The lesson here: if you see a cyclic behavior (which belongs to the issue of uniformity in time), investigate your hardware settings, especially the RAID controller!
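
Cyclic behavior like this is easy to miss by eye but easy to spot by grouping alert timestamps by weekday and hour. A minimal Python sketch, assuming you can export alert times as datetime objects; the function name and the threshold are my own:

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def recurring_slots(timestamps, min_count=2):
    """Return {(weekday, hour): count} for slots that saw at least min_count alerts."""
    counts = Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return {slot: n for slot, n in counts.items() if n >= min_count}
```

Three alerts landing in the ("Sat", 9) slot on consecutive weeks would stand out immediately in the result.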

Third story: this is the well known story of the 'leapocalypse', the addition of a leap second at midnight on July 1st 2012. While we weren't actually affected by the leap second bug, we still monitored our servers nervously as soon as word broke out on Twitter that servers running mostly Java apps, but also some MySQL servers, would become almost unresponsive, with 100% CPU utilization. The fix found by Mozilla was to run the date command and set it to the current date. The fact that they spread this fix to all their affected servers using Puppet also helped them.

Related to this story, it was interesting to read this article on how Google's servers weren't affected. They had been bitten by leap second issues in the past, so on the days preceding the introduction of a leap second they introduced a 'leap smear' on their NTP servers, adding a few milliseconds to every NTP update so that the overall effect was to get in sync more slowly (of course, they had to run their own modified NTP code for this to work).
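
To make the smear idea concrete, here is a minimal sketch of my own. It assumes a simple linear ramp over a window of arbitrary length; it is not Google's code, and their actual smear function and window length may well differ.

```python
def smear_offset(seconds_into_window, window_seconds, leap_seconds=1.0):
    """Return how much of the leap second has been absorbed so far.

    Linear ramp: 0 at the start of the window, leap_seconds at the end.
    Values outside the window are clamped to its edges.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    clamped = min(max(seconds_into_window, 0.0), float(window_seconds))
    return leap_seconds * clamped / window_seconds
```

With a 20-hour window (72000 seconds, a number I picked for illustration), each second of real time absorbs about 14 microseconds of the leap second, so clients never see a sudden jump.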

The leapocalypse story contains elements of both uniformity in time (everybody was affected at the same time due to the nature of the leap second update) and, to a lesser degree, uniformity in setup (Linux servers were affected because of a bug in the Linux kernel; Solaris and its variants weren't affected).

So...what steps can we take to try to alleviate these issues? I think we can do a few things:

1) Don't restart all servers at the same time prior to putting them in production; instead, introduce a 'jitter' of a couple of days in between restarts. This will give you time to react to 'uniformity in time' bugs, so that not all your servers will exhibit the bug at the same time.

Same strategy applies to clearing caches -- memcached or other cache servers.

2) Don't buy all your servers from the same vendor. This is harder to do, since vendors like to sell you things in bulk, and provide incentives for you to do so. But this would avoid issues with uniformity/monoculture in setup. Of course, even if you do buy servers from a single vendor, you can still install different OSes on them, different versions of the same OS, etc. I think this is only feasible if you use a configuration management tool such as Puppet or Chef, which in theory at least abstracts away the OS from you.

3) Make sure your monitoring is up to par. Monitor every single component of your servers which can be monitored. Pay attention to RAID controller cards! And of course graph the resources you monitor as well, to see spikes or dips that maybe are not caught by your alerting thresholds.
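
The restart 'jitter' from item 1 can be sketched as follows. The function name, the 48-hour default window and the host names are assumptions made for illustration; the point is only that each server gets its own restart time spread over a couple of days.

```python
import random
from datetime import datetime, timedelta


def staggered_restart_times(hosts, start, max_jitter_hours=48, seed=None):
    """Assign each host a restart time spread randomly over the window."""
    rng = random.Random(seed)  # pass a seed to get a reproducible schedule
    schedule = {}
    for host in hosts:
        offset = timedelta(hours=rng.uniform(0, max_jitter_hours))
        schedule[host] = start + offset
    return schedule
```

Calling staggered_restart_times(["db01", "db02", "db03"], datetime(2012, 7, 1)) returns a dictionary of host name to restart time; the same approach works for staggering cache flushes.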

Monday, July 09, 2012

Installing Python scientific and statistics packages on Ubuntu

I tried to install the pandas Python library a while ago using easy_install/pip and I hit some roadblocks when it came to installing all the dependencies. So I tried it again, but this time I tried to install most of the required packages from source. Here are my notes, hopefully they'll be useful to somebody out there.

This is on an Ubuntu 12.04 machine.

Install NumPy

# wget http://downloads.sourceforge.net/project/numpy/NumPy/1.6.2/numpy-1.6.2.tar.gz
# tar xvfz numpy-1.6.2.tar.gz; cd numpy-1.6.2
# cat INSTALL.txt
# apt-get install libatlas-base-dev libatlas3gf-base
# apt-get install python-dev
# python setup.py install



Install SciPy


# wget http://downloads.sourceforge.net/project/scipy/scipy/0.11.0b1/scipy-0.11.0b1.tar.gz
# tar xvfz scipy-0.11.0b1.tar.gz; cd scipy-0.11.0b1/
# cat INSTALL.txt
# apt-get install gfortran g++
# python setup.py install


Install pandas


Prereq #1: NumPy 

- already installed (see above)

Prereq #2: python-dateutil

# wget http://labix.org/download/python-dateutil/python-dateutil-1.5.tar.gz
# tar xvfz python-dateutil-1.5.tar.gz; cd python-dateutil-1.5/
# python setup.py install



Prereq #3: pyTables (optional, needed for HDF5 support)

pyTables was the hardest package to install, since it has many dependencies of its own:

numexpr

# wget http://numexpr.googlecode.com/files/numexpr-1.4.2.tar.gz
# tar xvfz numexpr-1.4.2.tar.gz; cd numexpr-1.4.2/
# python setup.py install


Cython

# wget http://www.cython.org/release/Cython-0.16.tar.gz
# tar xvfz Cython-0.16.tar.gz; cd Cython-0.16/
# python setup.py install


HDF5

# wget http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.9.tar.gz
# tar xvfz hdf5-1.8.9.tar.gz; cd hdf5-1.8.9/
# ./configure --prefix=/usr/local
# make; make install


pyTables itself

# wget http://downloads.sourceforge.net/project/pytables/pytables/2.4.0b1/tables-2.4.0b1.tar.gz
# tar xvfz tables-2.4.0b1.tar.gz; cd tables-2.4.0b1/
# python setup.py install


Edit 07/10/12: statsmodels is not a prereq, see below.

Prereq #4: statsmodels


Wasn't able to install it, it said 'requires pandas' but this is what I tried:


# wget http://pypi.python.org/packages/source/s/statsmodels/statsmodels-0.4.3.tar.gz
# tar xvfz statsmodels-0.4.3.tar.gz; cd statsmodels-0.4.3/
# python setup.py install --> requires pandas?


Prereq #4: pytz

# wget http://pypi.python.org/packages/source/p/pytz/pytz-2012c.tar.gz
# tar xvfz pytz-2012c.tar.gz; cd pytz-2012c/
# python setup.py install


Prereq #5: matplotlib

This was already installed on my target host during the EC2 instance bootstrap via Chef: 

# apt-get install python-matplotlib

pandas itself

# git clone git://github.com/pydata/pandas.git
# cd pandas
# python setup.py install

NOTE: Ralf Gommers added a comment that statsmodels is not a prerequisite to pandas, but instead needs to be installed once pandas is there. So I did this:

Install statsmodels

# wget http://pypi.python.org/packages/source/s/statsmodels/statsmodels-0.4.3.tar.gz
# tar xvfz statsmodels-0.4.3.tar.gz; cd statsmodels-0.4.3/
# python setup.py install


Finally, if you also want to dabble in machine learning algorithms:

Install scikit-learn

# wget http://pypi.python.org/packages/source/s/scikit-learn/scikit-learn-0.11.tar.gz
# tar xvfz scikit-learn-0.11.tar.gz; cd scikit-learn-0.11/
# python setup.py install
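
Once everything is installed, a quick sanity check is to import each package and print its version. The helper below is a small sketch of mine (the function name is made up); note that some packages are imported under a different name than the one they are distributed under: python-dateutil is dateutil, pyTables is tables, and scikit-learn is sklearn.

```python
import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if not importable."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every module exposes __version__, so fall back to a marker.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

For the packages in this post you would call installed_versions(["numpy", "scipy", "dateutil", "numexpr", "Cython", "tables", "pytz", "matplotlib", "pandas", "statsmodels", "sklearn"]) and look for any None values.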

Thursday, June 28, 2012

A sweep through my Instapaper for June 2012


I'm not sure if I'll do this every month, but it does seem like a good way of recapitulating the last month in terms of interesting blog posts and articles that came my way. So here's my list for the month of June 2012:
  • Latency numbers every programmer should know -- from cache references to intercontinental network latency, some numbers that will help you do those back-of-the-envelope calculations when you need to speed things up in your infrastructure
  • Cynic -- test harness by Ruslan Spivak for simulating remote HTTP service behavior, useful when you want to see how your application reacts to various failures when interacting with 3rd party services
  • Amazon S3 performance tips and tricks -- some best practices for getting the maximum performance out of S3 from Doug Grismore, Director of Storage Operations at AWS
  • How to stop sucking and be awesome instead -- Jeff Atwood advises you to embrace failure, ship often, listen to feedback, and more importantly work on stuff that matters
  • Examining file system latency in production (PDF) -- Brendan Gregg from Joyent goes into more detail than you ever wanted regarding disk-level I/O latency vs file system I/O (dtrace is mentioned obviously)
  • Openstack in real life -- Paul Guth from Cloudscaling describes that elusive animal: a real life deployment of Openstack (a welcome alternative to the usual press-release-driven development of Openstack)
  • Peaches and pecans -- Theo Schlossnagle talks about the balance needed between doing challenging things that you are good at on one hand, and doing things that may seem boring but make you grow in unexpected ways on the other hand
  • What Facebook knows -- they pretty much know everything about you, and they want to use it to make money (but then you knew that already)
  • ACM Turing Centenary celebration -- Dave Pacheco reviews a celebration that gathered some of the brightest minds in Computer Science; great bits and pieces disseminated throughout this post, such as Ken Thompson's disappointment at the complexity of a modern Linux system
  • Embracing risk in career decisions -- it boils down to 'listen to your heart'
  • Flexible indexing in Hadoop via Elephant Twin -- Dmitriy Ryaboy from Twitter talks about a new tool that can be used to create indexes in Hadoop in order to speed up queries (and can also be integrated with Pig)
  • The interesting thing about cutting costs -- just an example of the mind-boggling posts that Simon Wardley consistently comes up with; I highly recommend his blog for those of you interested in long-term business and technology strategy and vision
  • Building websites with science -- another great post from Etsy regarding good and bad ways to do data science, and some caveats regarding A/B testing
  • 100 most influential programming books -- a good list from Stack Overflow; curious if there's anybody who read all of them?
  • Building resilient user experiences -- another gem from Mike Brittain at Etsy on how to offer a good user experience even in the face of backend errors; your mantra should be 'never say no to your customers' money'
  • Nobody ever got fired for using Hadoop on a cluster (PDF) -- interesting point of view from a team at Microsoft Research on how many 'Big Data' datasets can actually fit in (generous amounts of) RAM and how this can impact the architecture of your data analytics infrastructure
  • 9 beliefs of remarkably successful people -- I usually don't like lists of '10 things that will change your life' but this one is pretty good


Monday, June 25, 2012

Installing and using sysbench on Joyent SmartOS

If you read Percona's MySQL Performance blog (and if you run MySQL in production, you should!), then you know that one of their favorite load testing tools is sysbench. As it turns out, it's not trivial to install this tool, especially when you have to install from source, for example on Solaris-based systems such as the Joyent SmartOS machines. Here's what I did to get it to work.

Download source distribution for sysbench

I downloaded the latest version of sysbench (0.4.12) from the Sourceforge download page for the project.

Compile and install sysbench

If you launch a SmartOS machine in the Joyent cloud, you'll find out very quickly that it's lacking tools that you've come to take for granted when dealing with Ubuntu or Fedora. In this case, you need to install build tools such as gcc and gmake. Fortunately, SmartOS has its own package installer called pkgin, so this is not too bad.

To see what packages are available if you know the tool you want to install, you can run the 'pkgin available' command and grep for the tool name:

# pkgin available | grep gcc
gcc-compiler-4.5.2 GNU Compiler Collection 4.5
gcc-runtime-4.5.2 GNU Compiler Collection 4.5 Runtime libs
gcc-tools-0 Subset of binutils needed for GCC


To install gcc, I ran:

# pkgin install gcc-compiler-4.5.2 gcc-runtime-4.5.2 gcc-tools-0

Similarly, I installed gmake and automake:

# pkgin install gmake automake

When I ran ./configure for sysbench, I got hit with errors of the form

../libtool: line 838: X--tag=CC: command not found 
../libtool: line 871: libtool: ignoring unknown tag : command not found 
../libtool: line 838: X--mode=link: command not found
etc

A quick Google search revealed this life-saving blog post which made things work. So first of all I ran ./autogen.sh, got hit with more errors, and edited configure.ac per the blog post -- basically I commented out this line in configure.ac:

#AC_PROG_LIBTOOL

And added this line:

AC_PROG_RANLIB

Now running ./autogen.sh produces no errors.

At this point I was ready to run ./configure again. However, if you want to run sysbench against a MySQL server, you need to specify the MySQL header and library files when you run ./configure. This also means that you need to install a MySQL client package in order to satisfy those dependencies. If you install sysbench on a Joyent Percona SmartMachine, those packages are already there. On a plain SmartOS machine, you need to run:

# pkgin install mysql-client-5.0.92

At this point, on a Percona SmartMachine you have MySQL header files in /opt/local/include/mysql and MySQL libraries in /local/lib. On a vanilla SmartOS machine, the MySQL header files are in /opt/local/include/mysql and the MySQL libraries are in /opt/local/lib/mysql. So the configure command line will be as follows.

On a Percona SmartMachine:

# ./configure --with-mysql-includes=/opt/local/include/mysql --with-mysql-libs=/local/lib/

On a vanilla SmartOS machine where you installed mysql-client:

# ./configure --with-mysql-includes=/opt/local/include/mysql --with-mysql-libs=/opt/local/lib/mysql

Now you're ready to run the usual commands:

# make; make install

If everything goes well, the sysbench binary will be in /usr/local/bin. That directory is not in the default PATH on SmartOS, so you need to add it to your PATH environment variable in .bashrc or .bash_profile.

On a vanilla SmartOS machine, I also had issues when trying to run the sysbench tool -- I got an error message of the type 'ld.so.1: sysbench: fatal: libmysqlclient.so.18: open failed: No such file or directory'

To get past this, I had to do two things: 

1) Add "export LD_LIBRARY_PATH=/opt/local/lib/mysql" to .bashrc

2) symlink from the existing shared library file libmysqlclient.so.15 to libmysqlclient.so.18:

# ln -s /opt/local/lib/mysql/libmysqlclient.so.15 /opt/local/lib/mysql/libmysqlclient.so.18

If you've followed along this far, your reward is that you'll finally be able to run sysbench with no errors.

Running sysbench

It is recommended that you run sysbench from a remote host against your MySQL server, so that no resources on the server get taken by sysbench itself. I used two phases in my sysbench tests: a prepare phase, where the table I tested against was created by sysbench, and the proper load testing phase. For the prepare phase, I ran:

# sysbench --test=oltp --mysql-host=remotehost --mysql-user=mydbadmin --mysql-db=mydb --mysql-password=mypass --mysql-table-engine=innodb --oltp-table-size=1000000 --oltp-table-name=millionRowsA prepare

where
  • remotehost is the host running MySQL server
  • mydb is a database I created on the MySQL server
  • mydbadmin/mypass are the user name and password for a user which I granted all permissions for on the mydb database (with a MySQL statement like "GRANT ALL ON mydb.* TO 'mydbadmin'@'remoteip' IDENTIFIED BY 'mypass'" where remoteip is the IP address of the host I was running sysbench from)
This command will create a table called millionRowsA with 1 million rows, using InnoDB as the storage engine.

To perform a load test against this table, I ran:

# sysbench --test=oltp --mysql-host=remotehost --mysql-user=mydbadmin --mysql-db=mydb --mysql-password=mypass --mysql-table-engine=innodb --oltp-table-size=1000000 --oltp-table-name=millionRowsA --num-threads=16 run

This will run an OLTP-type test using 16 threads. Per the sysbench documentation, an OLTP-type test will perform advanced transactional operations against the test database, thus mimicking real-life scenarios to the best of its ability.
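
Since the prepare and run phases share almost all of their options, it can help to generate both command lines from one place. The helper below is a sketch of mine (its name and parameters are made up); the sysbench flags are exactly the ones used in the two commands above.

```python
def sysbench_args(action, host, user, password, db, table,
                  table_size=1000000, threads=None):
    """Build the sysbench argument list for the 'prepare' or 'run' phase."""
    if action not in ("prepare", "run"):
        raise ValueError("action must be 'prepare' or 'run'")
    args = [
        "sysbench",
        "--test=oltp",
        "--mysql-host=%s" % host,
        "--mysql-user=%s" % user,
        "--mysql-db=%s" % db,
        "--mysql-password=%s" % password,
        "--mysql-table-engine=innodb",
        "--oltp-table-size=%d" % table_size,
        "--oltp-table-name=%s" % table,
    ]
    if threads is not None:
        args.append("--num-threads=%d" % threads)
    args.append(action)  # the phase goes last, as in the commands above
    return args
```

The resulting list can be passed straight to subprocess.check_call, which avoids any shell quoting issues with the password.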

I would like to stress one thing at this point: I am not a big believer in benchmarks. Most of the time I find that they do not even remotely manage to model real-life scenarios that you see in production. In fact, there is nothing like production traffic to stress-test a component of your infrastructure, which is why techniques such as dark launching are so important. But benchmarks do give you at least a starting point for a conversation with your peers or your vendors about specific issues you might find. However, it's important to consider them starting points and not end points. Ovais Tariq from Percona agrees with me in a recent blog post on 'DROP TABLE and stalls':

I would also like to point out one thing about benchmarks – we have been always advising people to look beyond average performance numbers because they almost never really matter in production, it is not a question if average performance is bad but what stalls and pileups you have.
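
A tiny worked example shows why averages hide stalls. The latency numbers below are synthetic, invented purely for illustration (they are not measurements from my tests): 97 requests at 5 ms and 3 requests that stall for 2 seconds.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an integer)."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


# Synthetic latencies in milliseconds: mostly fast, with a few long stalls.
latencies_ms = sorted([5] * 97 + [2000] * 3)

mean_ms = sum(latencies_ms) / float(len(latencies_ms))
median_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

Here the mean is 64.85 ms and the median is 5 ms, while the 99th percentile is 2000 ms. Neither the mean nor the median tells you that 3% of requests stalled for two seconds, which is exactly the kind of pileup that hurts in production.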

So far, my initial runs of sysbench against a Percona SmartMachine with 16 GB of RAM and against an EC2 m1.xlarge instance running Percona (with RAID0 across ephemeral disks, no EBS) show pretty similar results. No huge advantage either way. (I tried 16 and 32 threads against 1 million row tables and 10 million row tables). One advantage of EC2 is that it's a known ecosystem and I can run Ubuntu. I am working with Joyent on maybe further tuning the Percona SmartMachine to squeeze more performance out of it.

Thursday, June 21, 2012

Using the Joyent Cloud API

Here are some notes I took while doing some initial experiments with provisioning machines in the Joyent Cloud. I used their CloudAPI directly, although in the future I also want to try the libcloud Joyent driver. The promise of the Joyent Cloud 'SmartMachines' is that they are really Solaris zones running on a SmartOS host, and that gives you more performance (especially I/O performance) than regular virtual machines such as the ones offered by most cloud vendors. I have yet to fully verify this performance increase, but it's next on my TODO list.

Installing the Joyent CloudAPI tools


I did the following on an Ubuntu 10.04 server:

  • installed node.js -- I downloaded it in tar.gz format from http://nodejs.org/dist/v0.6.19/node-v0.6.19.tar.gz then I ran the usual './configure; make; make install'
  • installed the Joyent smartdc node package by running 'npm install smartdc -g'
  • created new ssh RSA keypair: id_rsa_joyentapi (private key) and id_rsa_joyentapi.pub (public key)
  • ran the sdc-setup utility, pointing it to the US-EAST-1 region:
# sdc-setup https://us-east-1.api.joyentcloud.com
Username (login): (root) myjoyentusername
Password:
The following keys exist in SmartDataCenter:
   [1] grig
Would you like to use an existing key? (yes) no
SSH public key: (/root/.ssh/id_rsa.pub) /root/.ssh/id_rsa_joyentapi.pub

If you set these environment variables, your life will be easier:
export SDC_CLI_URL=https://us-east-1.api.joyentcloud.com
export SDC_CLI_ACCOUNT=myjoyentusername
export SDC_CLI_KEY_ID=id_rsa_joyentapi
export SDC_CLI_IDENTITY=/root/.ssh/id_rsa_joyentapi


  • added recommended environment variables (above) to .bash_profile, sourced the file

Using the Joyent CloudAPI tools

At this point I was able to use the various 'sdc' commands included in the Joyent CloudAPI toolset. For example, to list the available Joyent datacenters, I used sdc-listdatacenters:

# sdc-listdatacenters
{
 "us-east-1": "https://us-east-1.api.joyentcloud.com",
 "us-west-1": "https://us-west-1.api.joyentcloud.com",
 "us-sw-1": "https://us-sw-1.api.joyentcloud.com"
}


To list the operating system images available for provisioning, I used sdc-listdatasets (the following is just an excerpt of its output):

# sdc-listdatasets
[
 {
"id": "988c2f4e-4314-11e1-8dc3-2bc6d58f4be2",
"urn": "sdc:sdc:centos-5.7:1.2.1",
"name": "centos-5.7",
"os": "linux",
"type": "virtualmachine",
"description": "Centos 5.7 VM 1.2.1",
"default": false,
"requirements": {},
"version": "1.2.1",
"created": "2012-02-14T05:53:49+00:00"
 },
 {
"id": "e4cd7b9e-4330-11e1-81cf-3bb50a972bda",
"urn": "sdc:sdc:centos-6:1.0.1",
"name": "centos-6",
"os": "linux",
"type": "virtualmachine",
"description": "Centos 6 VM 1.0.1",
"default": false,
"requirements": {},
"version": "1.0.1",
"created": "2012-02-15T20:04:18+00:00"
 },
  {
"id": "a9380908-ea0e-11e0-aeee-4ba794c83c33",
"urn": "sdc:sdc:percona:1.0.7",
"name": "percona",
"os": "smartos",
"type": "smartmachine",
"description": "Percona SmartMachine",
"default": false,
"requirements": {},
"version": "1.0.7",
"created": "2012-02-13T19:24:17+00:00"
 },
etc
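
Because the tools emit JSON, their output is easy to post-process. Here is a small sketch (the function name is mine) that pulls the 'urn' values out of the sdc-listdatasets output, optionally keeping only one machine type. It assumes you pass it the complete JSON text produced by the command, not the excerpt shown above.

```python
import json


def dataset_urns(datasets_json, machine_type=None):
    """Return the 'urn' of each dataset, optionally filtered by its 'type'."""
    datasets = json.loads(datasets_json)
    return [
        d["urn"] for d in datasets
        if machine_type is None or d.get("type") == machine_type
    ]
```

For example, dataset_urns(output, "smartmachine") lists only the SmartMachine images, where output is the text captured from sdc-listdatasets.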

To list the machine sizes available for provisioning, I used sdc-listpackages (again, this is just an excerpt of its output):

# sdc-listpackages
[
 {
"name": "Large 16GB",
"memory": 16384,
"disk": 491520,
"vcpus": 3,
"swap": 32768,
"default": false
 },
 {
"name": "XL 32GB",
"memory": 32768,
"disk": 778240,
"vcpus": 4,
"swap": 65536,
"default": false
 },
 {
"name": "XXL 48GB",
"memory": 49152,
"disk": 1048576,
"vcpus": 8,
"swap": 98304,
"default": false
 },
 {
"name": "Small 1GB",
"memory": 1024,
"disk": 30720,
"vcpus": 1,
"swap": 2048,
"default": true
 },
etc
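
The package list lends itself to the same kind of post-processing. As a sketch (the helper name is made up), this picks the smallest package that satisfies a memory requirement, given the parsed sdc-listpackages output as a list of dictionaries:

```python
def smallest_package(packages, min_memory_mb):
    """Return the name of the smallest package with enough memory, or None."""
    candidates = [p for p in packages if p["memory"] >= min_memory_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p["memory"])["name"]
```

The returned name is what you pass to the --package option when provisioning.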

Provisioning and terminating machines

To provision a machine, you use sdc-createmachine and pass it the 'urn' field of the dataset (OS) you want, and the package name for the size you want. Example:

# sdc-createmachine --dataset sdc:sdc:percona:1.3.9 --package "Large 16GB"
{
 "id": "7ccc739e-c323-497a-88df-898dc358ea40",
 "name": "a0e7314",
 "type": "smartmachine",
 "state": "provisioning",
 "dataset": "sdc:sdc:percona:1.3.9",
 "ips": [
"A.B.C.D",
"X.Y.Z.W"
 ],
 "memory": 16384,
 "disk": 491520,
 "metadata": {
"credentials": {
  "root": "",
  "admin": "",
  "mysql": ""
}
 },
 "created": "2012-06-07T17:55:29+00:00",
 "updated": "2012-06-07T17:55:30+00:00"
}

The above command provisions a Joyent SmartMachine running the Percona distribution of MySQL in the 'large' size, with 16 GB RAM. Note that the output of the command contains the external IP of the provisioned machine (A.B.C.D) and also its internal IP (X.Y.Z.W). The output also contains the passwords for the root, admin and mysql accounts, in the metadata field.

Here's another example for provisioning a machine running Ubuntu 10.04 in the 'small' size (1 GB RAM). You can also specify a machine name when you provision it:

# sdc-createmachine --dataset sdc:sdc:ubuntu-10.04:1.0.1 --package "Small 1GB" --name ggtest
{
 "id": "dc856044-7895-4a52-bfee-35b404061920",
 "name": "ggtest",
 "type": "virtualmachine",
 "state": "provisioning",
 "dataset": "sdc:sdc:ubuntu-10.04:1.0.1",
 "ips": [
"A1.B1.C1.D1",
"X1.Y1.Z1.W1"
 ],
 "memory": 1024,
 "disk": 30720,
 "metadata": {
"root_authorized_keys": ""
 },
 "created": "2012-06-07T19:28:19+00:00",
 "updated": "2012-06-07T19:28:19+00:00"
}

For an Ubuntu machine, the 'metadata' field contains the list of authorized ssh keys (which I removed from my example above). Also, note that the Ubuntu machine is of type 'virtualmachine' (so a regular KVM virtual instance) as opposed to the Percona SmartMachine, which is of type 'smartmachine' and is actually a Solaris zone within a SmartOS physical host.

To list your provisioned machines, you use sdc-listmachines:

# sdc-listmachines
[
 {
"id": "36b50e4c-88d2-4588-a974-11195fac000b",
"name": "db01",
"type": "smartmachine",
"state": "running",
"dataset": "sdc:sdc:percona:1.3.9",
"ips": [
  "A.B.C.D",
  "X.Y.Z.W"
],
"memory": 16384,
"disk": 491520,
"metadata": {},
"created": "2012-06-04T18:03:18+00:00",
"updated": "2012-06-07T00:39:20+00:00"
 },
  {
    "id": "dc856044-7895-4a52-bfee-35b404061920",
    "name": "ggtest",
    "type": "virtualmachine",
    "state": "running",
    "dataset": "sdc:sdc:ubuntu-10.04:1.0.1",
    "ips": [
      "A1.B1.C1.D1",
      "X1.Y1.Z1.W1"
    ],
    "memory": 1024,
    "disk": 30720,
    "metadata": {
      "root_authorized_keys": ""
    },
    "created": "2012-06-07T19:30:29+00:00",
    "updated": "2012-06-07T19:30:38+00:00"
  }
]

Note that immediately after provisioning a machine, its state (as indicated by the 'state' field in the output of sdc-listmachines) will be 'provisioning'. The state will change to 'running' once the provisioning process is done. At that point you should be able to ssh into the machine using the private key you created when installing the CloudAPI tools.

To terminate a machine, you first need to stop it via sdc-stopmachine, then delete it via sdc-deletemachine. Both of these tools take the id of the machine as a parameter. If you try to delete a machine without first stopping it, or without waiting long enough for the machine to go into the 'stopped' state, you will get a message similar to 'Requested transition is not acceptable due to current resource state'.
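
Both waits (for 'running' after provisioning, and for 'stopped' before deleting) are simple polling loops over the sdc-listmachines output. Here is a sketch of how they could be scripted. The function names, timeout and polling interval are my own choices; the only things taken from the tools are the command names, the machine id argument and the 'state' field.

```python
import json
import subprocess
import time


def sdc_listmachines():
    """Run sdc-listmachines and return its parsed JSON output."""
    return json.loads(subprocess.check_output(["sdc-listmachines"]).decode("utf-8"))


def machine_state(machine_id, list_machines):
    """Return the 'state' of machine_id, or None if it is not listed."""
    for machine in list_machines():
        if machine["id"] == machine_id:
            return machine["state"]
    return None


def wait_for_state(machine_id, wanted, list_machines,
                   timeout=600, interval=10, sleep=time.sleep):
    """Poll until the machine reaches the wanted state; False on timeout."""
    waited = 0
    while True:
        if machine_state(machine_id, list_machines) == wanted:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


def terminate(machine_id, list_machines=sdc_listmachines,
              run=subprocess.check_call):
    """Stop a machine, wait until it is 'stopped', then delete it."""
    run(["sdc-stopmachine", machine_id])
    if not wait_for_state(machine_id, "stopped", list_machines):
        raise RuntimeError("machine %s did not stop in time" % machine_id)
    run(["sdc-deletemachine", machine_id])
```

The list_machines, run and sleep parameters exist only so that the logic can be exercised without touching a real account; in normal use you would call terminate(machine_id) or wait_for_state(machine_id, "running", sdc_listmachines).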

Bootstrapping a machine with user data

In my opinion, a cloud API for provisioning instances/machines is only useful if it offers a bootstrapping mechanism for running user-specified scripts upon the first run. This would enable an integration with configuration management tools such as Chef or Puppet. Fortunately, the Joyent CloudAPI does support this bootstrapping via its Metadata API.
For a quick example of a customized bootstrapping action, I changed the hostname of an Ubuntu machine and also added it to /etc/hostname. This is a toy example. In a real-life situation, you would instead download a script from one of your servers and run it in order to install whatever initial packages you need, then to configure the machine as a Chef or Puppet client, etc. In any case, you need to actually spell out the commands you need the machine to run during its initial provisioning boot process. You do that by defining the metadata 'user-script' variable:

# sdc-createmachine --dataset sdc:sdc:ubuntu-10.04:1.0.1 --package "Small 1GB" --name ggtest2 --metadata user-script='hostname ggtest2; echo ggtest2 > /etc/hostname'
{
 "id": "379c0cad-35bf-462a-b680-fc091c74061f",
 "name": "ggtest2",
 "type": "virtualmachine",
 "state": "provisioning",
 "dataset": "sdc:sdc:ubuntu-10.04:1.0.1",
 "ips": [
"A2.B2.C2.D2",
"X2.Y2.Z2.W2"
 ],
 "memory": 1024,
 "disk": 30720,
 "metadata": {
"user-script": "hostname ggtest2; echo ggtest2 > /etc/hostname",
"root_authorized_keys": ""
 },
 "created": "2012-06-08T23:17:44+00:00",
 "updated": "2012-06-08T23:17:44+00:00"
}

Note that the metadata field now contains the user-script variable that I specified.
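
If you script the provisioning step, it is convenient to assemble the user-script from a list of commands. The sketch below uses helper names I made up; the sdc-createmachine options are the ones shown in the examples above.

```python
def user_script(commands):
    """Join shell commands into a single user-script string."""
    return "; ".join(commands)


def createmachine_args(dataset, package, name=None, script=None):
    """Build the argument list for sdc-createmachine."""
    args = ["sdc-createmachine", "--dataset", dataset, "--package", package]
    if name is not None:
        args += ["--name", name]
    if script is not None:
        # One metadata key=value pair. No shell quoting is needed when the
        # list is passed directly to subprocess.
        args += ["--metadata", "user-script=" + script]
    return args
```

Passing the resulting list to subprocess.check_output returns the JSON description of the new machine, which you can parse to get its id and IP addresses.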

Collecting performance metrics with Joyent Cloud Analytics

The Joyent Cloud Analytics API lets you define metrics that you want to query for on your machines in the Joyent cloud. Those metrics are also graphed on the Web UI dashboard as you define them, which is a nice touch. For now there aren't that many such metrics available, but I hope their number will increase.

Joyent uses a specific nomenclature for the Analytics API. Here are some definitions, verbatim from their documentation (CA means Cloud Analytics):

A metric is any quantity that can be instrumented using CA. For example:

  • Disk I/O operations
  • Kernel thread executions
  • TCP connections established
  • MySQL queries
  • HTTP server operations
  • System load average


When you want to actually gather data for a metric, you create an instrumentation. The instrumentation specifies:
  • which metric to collect
  • an optional predicate based on the metric's fields (e.g., only collect data from certain hosts, or data for certain operations)
  • an optional decomposition based on the metric's fields (e.g., break down the results by server hostname)
  • how frequently to aggregate data (e.g., every second, every hour, etc.)
  • how much data to keep (e.g., 10 minutes' worth, 6 months' worth, etc.)
  • other configuration options
To get started with this API, you need to first see what analytics/metrics are available. You do that by calling sdc-describeanalytics (what follows is just a fragment of the output):

# sdc-describeanalytics
 "metrics": [
{
  "module": "cpu",
  "stat": "thread_samples",
  "label": "thread samples",
  "interval": "interval",
  "fields": [
    "zonename",
    "pid",
    "execname",
    "psargs",
    "ppid",
    "pexecname",
    "ppsargs",
    "subsecond"
  ],
  "unit": "samples"
},
{
  "module": "cpu",
  "stat": "thread_executions",
  "label": "thread executions",
  "interval": "interval",
  "fields": [
    "zonename",
    "pid",
    "execname",
    "psargs",
    "ppid",
    "pexecname",
    "ppsargs",
    "leavereason",
    "runtime",
    "subsecond"
  ],
etc

You can create instrumentations either via the Web UI (go to the Analytics tab) or via the command line API.

Here's an example of creating an instrumentation for file system logical operations via the sdc-createinstrumentation API:


# sdc-createinstrumentation -m fs -s logical_ops

{
  "module": "fs",
  "stat": "logical_ops",
  "predicate": {},
  "decomposition": [],
  "value-dimension": 1,
  "value-arity": "scalar",
  "enabled": true,
  "retention-time": 600,
  "idle-max": 3600,
  "transformations": {},
  "nsources": 0,
  "granularity": 1,
  "persist-data": false,
  "crtime": 1340228876662,
  "value-scope": "interval",
  "id": "17",
  "uris": [
    {
      "uri": "/myjoyentusername/analytics/instrumentations/17/value/raw",
      "name": "value_raw"
    }
  ]
}

To list the instrumentations you have created so far, you use sdc-listinstrumentations:


# sdc-listinstrumentations

[
  {
    "module": "fs",
    "stat": "logical_ops",
    "predicate": {},
    "decomposition": [],
    "value-dimension": 1,
    "value-arity": "scalar",
    "enabled": true,
    "retention-time": 600,
    "idle-max": 3600,
    "transformations": {},
    "nsources": 2,
    "granularity": 1,
    "persist-data": false,
    "crtime": 1340228876662,
    "value-scope": "interval",
    "id": "17",
    "uris": [
      {
        "uri": "/myjoyentusername/analytics/instrumentations/17/value/raw",
        "name": "value_raw"
      }
    ]
  }
]

To retrieve the actual metrics captured by a given instrumentation, call sdc-getinstrumentation and pass it the instrumentation id:

# sdc-getinstrumentation -v 17
{
  "value": 1248,
  "transformations": {},
  "start_time": 1340229361,
  "duration": 1,
  "end_time": 1340229362,
  "nsources": 2,
  "minreporting": 2,
  "requested_start_time": 1340229361,
  "requested_duration": 1,
  "requested_end_time": 1340229362
}

You can see how this can be easily integrated with something like Graphite in order to keep historical information about these metrics.
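
As a rough sketch of that integration: Graphite's Carbon daemon accepts lines of the form 'metric.path value timestamp' over TCP, by default on port 2003. The metric path and function names below are my own assumptions; the input is the parsed JSON returned by sdc-getinstrumentation for a scalar instrumentation.

```python
import socket


def graphite_line(metric_path, instrumentation_value):
    """Format one scalar sample as a Graphite plaintext-protocol line."""
    return "%s %s %d\n" % (
        metric_path,
        instrumentation_value["value"],
        instrumentation_value["start_time"],
    )


def send_to_graphite(lines, host, port=2003):
    """Send plaintext-protocol lines to a Carbon listener over TCP."""
    sock = socket.create_connection((host, port))
    try:
        sock.sendall("".join(lines).encode("ascii"))
    finally:
        sock.close()
```

Run from cron or a small loop, this gives you history beyond the instrumentation's 600-second retention time.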

You can dig deeper into a specific metric by decomposing it by different fields, such as the application name. For example, to see file system logical operations by application name, you would call:


# sdc-createinstrumentation -m fs -s logical_ops --decomposition execname

{
  "module": "fs",
  "stat": "logical_ops",
  "predicate": {},
  "decomposition": [
    "execname"
  ],
  "value-dimension": 2,
  "value-arity": "discrete-decomposition",
  "enabled": true,
  "retention-time": 600,
  "idle-max": 3600,
  "transformations": {},
  "nsources": 0,
  "granularity": 1,
  "persist-data": false,
  "crtime": 1340231734049,
  "value-scope": "interval",
  "id": "18",
  "uris": [
    {
      "uri": "/myjoyentusername/analytics/instrumentations/18/value/raw",
      "name": "value_raw"
    }
  ]
}

Now if you retrieve the value for this instrumentation, you see several values in the output, one for each application that performs file system logical operations:



# sdc-getinstrumentation -v 18
{
  "value": {
    "grep": 4,
    "ksh93": 5,
    "cron": 7,
    "gawk": 15,
    "svc.startd": 2,
    "mysqld": 163,
    "nscd": 27,
    "top": 159
  },
  "transformations": {},
  "start_time": 1340231762,
  "duration": 1,
  "end_time": 1340231763,
  "nsources": 2,
  "minreporting": 2,
  "requested_start_time": 1340231762,
  "requested_duration": 1,
  "requested_end_time": 1340231763
}
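
With a decomposed instrumentation, the 'value' field is a dictionary, so finding the busiest applications is a one-liner. A small sketch (the function name is mine), operating on the parsed JSON output:

```python
def top_consumers(instrumentation_value, n=3):
    """Return the n (name, count) pairs with the highest counts."""
    pairs = instrumentation_value["value"].items()
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)[:n]
```

For the output above, this returns mysqld (163), top (159) and nscd (27), in that order.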

Another useful technique is to isolate metrics pertaining to a specific host (or 'zonename' in Joyent parlance). For this, you need to specify a predicate that will filter only the host with a specific id (you can see the id of a host when you call sdc-listmachines). Here's an example that captures the CPU wait time for a Percona SmartMachine which I provisioned earlier:


# sdc-createinstrumentation -m cpu -s waittime -p '{"eq": ["zonename","36b50e4c-88d2-4588-a974-11195fac000b"]}'

{
  "module": "cpu",
  "stat": "waittime",
  "predicate": {
    "eq": [
      "zonename",
      "36b50e4c-88d2-4588-a974-11195fac000b"
    ]
  },
  "decomposition": [],
  "value-dimension": 1,
  "value-arity": "scalar",
  "enabled": true,
  "retention-time": 600,
  "idle-max": 3600,
  "transformations": {},
  "nsources": 0,
  "granularity": 1,
  "persist-data": false,
  "crtime": 1340232271092,
  "value-scope": "interval",
  "id": "19",
  "uris": [
    {
      "uri": "/myjoyentusername/analytics/instrumentations/19/value/raw",
      "name": "value_raw"
    }
  ]
}
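
The predicate is a small JSON document, and hand-quoting it on the command line is easy to get wrong. Below is a rough Python sketch that builds the predicate with json.dumps and assembles the command line with shell-safe quoting. It only uses the flags shown in this post (-m, -s, -n, -p); the helper names are hypothetical, and the sketch prints the command rather than running it:

```python
import json
import shlex


def zonename_predicate(machine_id):
    """Build the JSON predicate that matches a single zone (machine id)."""
    return json.dumps({"eq": ["zonename", machine_id]})


def create_instrumentation_command(module, stat, machine_id, decomposition=None):
    """Assemble an sdc-createinstrumentation command line as a string.

    Only the flags that appear in this post are used; nothing is executed.
    """
    args = ["sdc-createinstrumentation", "-m", module, "-s", stat]
    if decomposition:
        args += ["-n", decomposition]
    args += ["-p", zonename_predicate(machine_id)]
    # shlex.quote wraps the JSON predicate in single quotes for the shell.
    return " ".join(shlex.quote(arg) for arg in args)


machine_id = "36b50e4c-88d2-4588-a974-11195fac000b"
command = create_instrumentation_command("cpu", "waittime", machine_id)
print(command)
```

Passing decomposition="cpumode" to the same helper adds the -n flag, which is how a predicate and a decomposition can be requested together.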

You can combine decomposition with predicates. For example, here's how to create an instrumentation for CPU usage time decomposed by CPU mode (user, kernel):

# sdc-createinstrumentation -m cpu -s usage -n cpumode -p '{"eq": ["zonename","36b50e4c-88d2-4588-a974-11195fac000b"]}'

{
  "module": "cpu",
  "stat": "usage",
  "predicate": {
    "eq": [
      "zonename",
      "36b50e4c-88d2-4588-a974-11195fac000b"
    ]
  },
  "decomposition": [
    "cpumode"
  ],
  "value-dimension": 2,
  "value-arity": "discrete-decomposition",
  "enabled": true,
  "retention-time": 600,
  "idle-max": 3600,
  "transformations": {},
  "nsources": 0,
  "granularity": 1,
  "persist-data": false,
  "crtime": 1340232361944,
  "value-scope": "point",
  "id": "20",
  "uris": [
    {
      "uri": "/myjoyentusername/analytics/instrumentations/20/value/raw",
      "name": "value_raw"
    }
  ]
}

Now when you retrieve the values for this instrumentation, you can see them separated by CPU mode:

# sdc-getinstrumentation -v 20
{
  "value": {
    "kernel": 24,
    "user": 28
  },
  "transformations": {},
  "start_time": 1340232390,
  "duration": 1,
  "end_time": 1340232391,
  "nsources": 2,
  "minreporting": 2,
  "requested_start_time": 1340232390,
  "requested_duration": 1,
  "requested_end_time": 1340232391
}
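
If you care more about the split between modes than about the raw numbers, a few lines of Python turn the decomposed value into percentages. This is only a sketch: the sample is copied from the output above (trimmed), the function name mode_shares is made up, and I am simply treating the two numbers as comparable quantities in whatever unit the API reports:

```python
import json

# Sample output copied from the sdc-getinstrumentation call above, trimmed.
raw = '{"value": {"kernel": 24, "user": 28}, "start_time": 1340232390, "duration": 1}'


def mode_shares(payload):
    """Return each CPU mode's share of the total, as a percentage."""
    values = json.loads(payload)["value"]
    total = sum(values.values())
    if total == 0:
        # Avoid dividing by zero when the zone was idle for the interval.
        return {mode: 0.0 for mode in values}
    return {mode: 100.0 * count / total for mode, count in values.items()}


shares = mode_shares(raw)
for mode, share in sorted(shares.items()):
    print(f"{mode:<8} {share:5.1f}%")
```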


Finally, here's a MySQL-specific instrumentation that you can create on a machine running MySQL, such as a Percona SmartMachine. This one is for capturing MySQL queries:

# sdc-createinstrumentation -m mysql -s queries -p '{"eq": ["zonename","36b50e4c-88d2-4588-a974-11195fac000b"]}'

{
  "module": "mysql",
  "stat": "queries",
  "predicate": {
    "eq": [
      "zonename",
      "36b50e4c-88d2-4588-a974-11195fac000b"
    ]
  },
  "decomposition": [],
  "value-dimension": 1,
  "value-arity": "scalar",
  "enabled": true,
  "retention-time": 600,
  "idle-max": 3600,
  "transformations": {},
  "nsources": 0,
  "granularity": 1,
  "persist-data": false,
  "crtime": 1340232562361,
  "value-scope": "interval",
  "id": "22",
  "uris": [
    {
      "uri": "/myjoyentusername/analytics/instrumentations/22/value/raw",
      "name": "value_raw"
    }
  ]
}

Overall, I found the Joyent Cloud API and its associated Analytics API fairly easy to use, once I got past some nomenclature quirks. I also want to mention that the support I got from Joyent was very good: my questions about some of the topics discussed here were answered promptly and knowledgeably. My next step is to gauge the performance of MySQL on a SmartMachine compared to a similar-sized instance running in the Amazon EC2 cloud. Stay tuned.
