Thursday, December 22, 2011

Load balancing and SSL in EC2

Here is another post I wrote for this year's Sysadvent blog. It briefly mentions some ways you can do load balancing in EC2, and focuses on how to upload SSL certificates to an Elastic Load Balancer using command-line tools. Any comments appreciated!

Monday, December 12, 2011

Analyzing logs with Pig and Elastic MapReduce

This is a blog post I wrote for this year's Sysadvent series. If you're not familiar with the Sysadvent blog, you should be, if you are at all interested in system administration/devops topics. It is maintained by the indefatigable Jordan Sissel, and it contains posts contributed by various authors, 25 posts per year, from Dec 1st through Dec 25th. The posts cover a variety of topics, and to give you a taste, here are the articles so far this year:

Day 1: "Don't bash your process outputs" by Phil Hollenback
Day 2: "Strategies for Java deployment" by Kris Buytaert
Day 3: "Share skills and permissions with code" by Jordan Sissel
Day 4: "A guide to packaging systems" by Jordan Sissel
Day 5: "Tracking requests with Request Tracker" by Christopher Webber
Day 6: "Always be hacking" by John Vincent
Day 7: "Change and proximity of communication" by Aaron Nichols
Day 8: "Running services with systemd" by Jordan Sissel
Day 9: "Data in the shell" by Jordan Sissel
Day 10: "Analyzing logs with Pig and Elastic MapReduce" by yours truly
Day 11: "Simple disk-based server backups with rsnapshot" by Phil Hollenback
Day 12: "Reverse-engineer servers with Blueprint" by Richard Crowley

Jordan needs more articles for this year, so if you have something to contribute, please propose it on the mailing list.

Wednesday, December 07, 2011

Crowd Mood - an indicator of health for products/projects

I thought I just coined a new term -- Crowd Mood -- but a quick Google search revealed a 2009 paper on "Crowd Behavior at Mass Gatherings: A Literature Review" (PDF) which says:

In the mass-gathering literature, the use of terms “crowd behavior”, “crowd type”, “crowd management”, and “crowd mood” are used in variable contexts. More practically, the term “crowd mood” has become an accepted measure of probable crowd behavior outcomes. This is particularly true in the context of crowds during protests/riots, where attempts have been made to identify factors that lead to a change of mood that may underpin more violent behavior.

Instead of protests and riots, the crowd behavior I'm referring to is the reaction of users of software products or projects. I think the overall Crowd Mood of these users is a good indicator of the health of those products/projects. I may state the obvious here, and maybe it's been done already, but I'm not aware of large-scale studies that try to correlate the success or failure of a particular software project with the mood of its users, as expressed in messages to mailing lists, in blog posts, and of course Twitter.

I'm aware of Sentiment Analysis and I know there are companies who offer this service by mining Twitter. But a more rigorous study would include other data sources. I have in mind something similar to this study by Kaleev Leetaru: "Culturonomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space". This study mined information primarily from the archives of the "Summary of World Broadcasts (SWB)" global news monitoring service. It analyzed the tone or mood of the news regarding a particular event/person/place, and it established correlations between the sentiment it mined (for example negative news regarding Egypt's then-president Mubarak) and events that happen shortly afterwards, such as the Arab Spring of 2011. The study actually talks not only about correlation, but also about forecasting events based on current news tone.

I believe similar studies mining the Crowd Mood would be beneficial to any large-scale software product or project. For example, Canonical would be well-advised to conduct such a study in order to determine whether their decision to drop Gnome in favor of Unity was good or not (my feeling? it was BAD! -- and I think the Crowd Mood surrounding this decision would confirm it).

Another example: Python 3 and the decision not to continue releasing Python 2 past 2.8. Good or bad? I say BAD! (see also Armin Ronacher's recent blog post on the subject, which brilliantly expresses the issues around this decision).

Yet one more example: the recent changes in the Google UI, especially GMail. BAD!

These examples have a common theme in my opinion: the unilateral decision by a company or project to make non-backwards-compatible changes without really consulting its users. I know Apple can pull this off, but they're the exception, not the rule. The attitude of "trust us, we know what's good for you" leads to failure in the long run.

Somebody should build a product (or even better, an Open Source project) around Crowd Mood analysis. Maybe based on Google's Prediction API, which has functionality for sentiment analysis.

Thursday, November 17, 2011

Seven years of blogging in less than 500 words


In a couple of days, this blog will be 7 years old. Hard to believe so much time has passed since my first Hello World post.

I keep reading that blogging is on the wane, and it's true, mainly because of the popularity of Twitter. But I strongly believe that blogging is still important, and that more people should do it. For me, it's a way to give back to the community. I can't even remember how many times I found solutions to my technical problems by reading a blog post. I personally try to post something on my blog every single time I solve an issue that I've struggled with. If you post documentation to a company wiki (assuming it's not confidential), I urge you to try to also blog publicly about it – think of it as a public documentation that can help both you and others.

Blogging is also a great way to further your career. Back in September 2008 I blogged about my experiences with EC2. I had launched an m1.small instance and I had started to play with it. Lo and behold, I got a job offer based on that post. I accepted the offer, and that job allowed me to greatly enhance my experience with cloud computing. This brings me to a more general point: if you want to work on something you know nothing about, start small on your own, persevere, and yes, blog about it! In my experience, you'll invariably find yourself in a position to be paid to work on it.

It's also pretty interesting for me to look at my blog posts and to recognize in them the evolution of my own career. I started to blog when I was working as a tester. I was passionate about automated testing, Python and agile processes and tools. This led to digging more into automation tools, then into automated deployments and configuration management in the context of system architecture. This led into cloud computing, and these days this is leading into Big Data analytics. Who knows what's next? Whatever it is, I'm sure it will be exciting, because I make an effort to make it so!

I think it's also important to recognize that learning a tool or a technique is necessary but not sufficient: at the end of the day it's what you do with it that matters. For me the progression has been testing->automation->cloud computing->data analytics->??? (I actually started my career as a programmer, but here I include only the period coinciding with my blogging years.)

This may be self-help-kind-of-corny, but a quote which is attributed (maybe wrongly) to Goethe really applies to everybody:

“Whatever you can do, or dream you can do, begin it. Boldness has genius, power, and magic in it. Begin it now."

Wednesday, November 09, 2011

Troubleshooting memory allocation errors in Elastic MapReduce

Yesterday we ran into an issue with some Hive scripts running within an Amazon Elastic MapReduce cluster. Here's the error we got:


Caused by: java.io.IOException: Spill failed
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:877)
 at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)
 at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:289)
 ... 11 more
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:176)
 at org.apache.hadoop.util.Shell.run(Shell.java:161)
 at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
 at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
 at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
 at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1238)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:703)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1190)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
 at java.lang.UNIXProcess.(UNIXProcess.java:148)
 at java.lang.ProcessImpl.start(ProcessImpl.java:65)
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)



Googling around for java.io.IOException: java.io.IOException: error=12, Cannot allocate memory, it seems it's a common problem. See this AWS Developer Forums thread, this Hadoop core-user mailing list thread, and this explanation by Ken Krugler from Bixo Labs.

Basically, it boils down to the fact that when Java tries to fork a new process (in this case a bash shell), Linux will try to allocate as much memory as the current Java process, even though not all that memory will be required. There are several workarounds (read in particular the AWS Forum thread), but a solution that worked for us was to simply add swap space to the Elastic MapReduce slave nodes.

You can ssh into a slave node from the EMR master node by using the same private key you used when launching the EMR cluster, and by targeting the internal IP address of the slave node. In our case, the slaves are m1.xlarge instances, and they have 4 local disks (/dev/sdb through /dev/sde) mounted as /mnt, /mnt1, /mnt2 and /mnt3, with 414 GB available on each file system. I ran this simple script via sudo on each slave to add 4 swap files of 1 GB each, one on each of the 4 local disks.


$ cat make_swap.sh 
#!/bin/bash


SWAPFILES='
/mnt/swapfile1
/mnt1/swapfile1
/mnt2/swapfile1
/mnt3/swapfile1
'
for SWAPFILE in $SWAPFILES; do
dd if=/dev/zero of=$SWAPFILE bs=1024 count=1048576
mkswap $SWAPFILE
swapon $SWAPFILE
echo "$SWAPFILE swap swap defaults 0 0" >> /etc/fstab
done

This solved our issue. No more failed Map tasks, no more failed Reduce tasks. Maybe this will be of use to some other frantic admins out there (like I was yesterday) who are not sure how to troubleshoot the intimidating Hadoop errors they're facing.



Friday, November 04, 2011

Experiences with Amazon Elastic MapReduce


We started to use AWS Elastic MapReduce (EMR) in earnest a short time ago, with the help of Bradford Stephens from Drawn to Scale. We needed somebody to jumpstart our data analytics processes and workflow, and Bradford's help was invaluable. At some point we'll probably build our own Hadoop cluster either in EC2 or in-house, but for now EMR is doing the job just fine.

We started with an EMR cluster containing the master + 5 slave nodes, all m1.xlarge. We still have that cluster up and running, but in the mean time I've also experimented with launching clusters, running our data analytics processes on them, then shutting them down -- which is the 'elastic' type of workflow that takes full advantage of the pay-per-hour model of EMR.

Before I go into the details of launching and managing EMR clusters, here's the general workflow that we follow for our data analytics processes on a nightly basis:

  1. We gather data from various sources such as production databases, ad impression reports, mail server logs, and other 3rd party sources. All this data is in CSV format, mostly comma-separated or tab-separated. Each CSV file is timestamped with YYYY-MM-DD and corresponds to a 'table' or 'entity' that we want to analyse later.
  2. We gzip all CSV files and upload them to various S3 buckets.
  3. On the EMR Hadoop master node, we copy the csv.gz files we need from S3 into HDFS.
  4. We create Hive tables, one table for each type of 'entity'.
  5. We run Hive queries against these tables and save the results in HDFS.
  6. We export the results of the Hive queries to MySQL so we can further analyse them when we need to, and so we can visualize them in a dashboard.
I think this is a fairly common workflow for doing data analytics with Hadoop. It can definitely be optimized. Currently we create the Hive tables from scratch in step 4. Ideally we'll want to save them to S3, and only append new data to them, but the append operation seems to only exist in Hive 0.8 which is not yet available in EMR. But as suboptimal as it is, even if it takes a few hours each night, this process allows us to run queries that were simply impossible to execute outside of Hadoop.

Here are some experiments I've done with using EMR in its truly 'elastic' mode.

Installing the EMR Ruby CLI

Eventually we'll use something like boto's EMR bindings to manage our EMR clusters, but for quick experimentation I preferred a purely command-line tool, and the Ruby-based elastic-mapreduce tool seemed to be the only one available. To install, download the zip file from here, then unzip it somewhere on an EC2 instance where you can store your AWS credentials (a management-type instance usually). I installed in /opt/emr on one of our EC2 instances. At this point it's also a good idea to become familiar with the EMR Developer Guide, which has examples of various elastic-mapreduce use cases. I also found a good README on GitHub.

Next, create a credentials.json file containing some information about your AWS credentials and the keypair that will be used when launching the EMR cluster. The format of this JSON file is:

{
  "access-id": "YOUR_AWS_ACCESS_ID",
  "private-key": "YOUR_AWS_SECRET_KEY",
  "key-pair": "YOUR_EC2_KEYPAIR_NAME",
  "key-pair-file": "PATH_TO_PRIVATE_SSH_KEY_CORRESPONDING_TO_KEYPAIR",
  "region": "us-east-1",
  "log-uri": "s3://somebucket.yourcompany.com/logs"
}

Launching an EMR cluster

# ./elastic-mapreduce -c /opt/emr/credentials.json --create --name "test1" --alive --num-instances 3 --master-instance-type m1.small --slave-instance-type m1.small --hadoop-version 0.20 --hive-interactive --hive-versions 0.7.1

The command line above launches an EMR cluster called test1 with 3  instances (1 Hadoop master and 2 slave nodes), installs Hadoop 0.20 and Hive 0.71 on it, then keeps the cluster up and running (because we specified --alive). For experimentation purposes I recommend using m1.small instances.

This command returns a jobflow ID, which you'll need for all other commands that reference this specific cluster.

Getting information about a specific jobflow

# ./elastic-mapreduce --describe --jobflow JOBFLOWID

This command returns a JSON document containing a wealth of information about the EMR cluster. Here's an example output.

Listing the state of a jobflow

# ./elastic-mapreduce --list --jobflow JOBFLOWID
JOBFLOWID     WAITING        ec2-A-B-C-D.compute-1.amazonaws.com         test1
   COMPLETED      Setup Hive                    

This command is useful when you're trying to ascertain whether the initial configuration of the cluster is done. The state of the cluster immediately after you launch it will change from STARTING to RUNNING (during which step Hadoop and Hive are installed) and finally to WAITING.

Enabling and disabling termination protection for a jobflow

If you want to make sure that the jobflow won't be terminated, you can turn the termination protection on (it's off by default):

# ./elastic-mapreduce --set-termination-protection true --jobflow JOBFLOWID

To disable the termination protection, set it to false.

Adding core nodes to the EMR cluster

There are 2 types of Hadoop slave nodes: core (which contribute to the HDFS cluster) and task (which run Hadoop tasks but are not part of the HDFS cluster). If you want to add core nodes, you can use this command:

# ./elastic-mapreduce --modify-instance-group CORE --instance-count NEW_COUNT

where NEW_COUNT is the new overall count of core nodes that you are targetting (so for example if you had 5 core nodes and you wanted to add 2 more, the NEW_COUNT will be 7).

Note that you can only request an increased NEW_COUNT for core nodes, never a decreased count.

Also note that if one or more slave nodes are misbehaving (more on that in a bit), it's better to terminate them (via ElasticFox or the AWS console for example) than to add new core nodes. When you terminate them, the EMR jobflow will automatically launch extra nodes so that the core node count is kept the same.

Accessing the Hadoop master node and working with the HDFS cluster

You can ssh into the Hadoop master by using the private key you specified in credentials.json. 

# ssh -i /path/to/private_ssh_jey hadoop@public_ip_or_dns_of_master

Or you can use:

# ./elastic-mapreduce --jobflow JOBFLOWID --ssh

Once you're logged in as user hadoop on the master node, you can run various HDFS commands such as:
  • creating an HDFS directory: $ hadoop fs -mkdir /user/hadoop/mydir
  • copying files from S3 into HDFS: $ hadoop fs -cp s3://somebucket.yourcompany.com/dir1/data*gz /user/hadoop/mydir
  • listing files in an HDFS directory: $ hadoop fs -ls /user/hadoop/mydir
  • deleting files in an HDFS directory: $ hadoop fs -rm /user/hadoop/*.gz
  • copying files from HDFS to a local file system on the master node: $ hadoop fs -copyToLocal /user/hadoop/mydir/*.gz /home/hadoop/somedirTh
There is also a very useful admin-type command which allows you to see the state of the slave nodes, and the HDFS file system usage. Here's an example which I ran on a cluster with 3 slave nodes (I added the 3nd node at a later time, which is why its DFS usage is less than the usage on the first 2 nodes):

$ hadoop dfsadmin -report
Configured Capacity: 5319863697408 (4.84 TB)
Present Capacity: 5052263763968 (4.6 TB)
DFS Remaining: 4952298123264 (4.5 TB)
DFS Used: 99965640704 (93.1 GB)
DFS Used%: 1.98%
Under replicated blocks: 3
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 10.68.125.161:9200
Decommission Status : Normal
Configured Capacity: 1773287899136 (1.61 TB)
DFS Used: 47647641600 (44.38 GB)
Non DFS Used: 89327800320 (83.19 GB)
DFS Remaining: 1636312457216(1.49 TB)
DFS Used%: 2.69%
DFS Remaining%: 92.28%
Last contact: Fri Nov 04 20:31:31 UTC 2011


Name: 10.191.121.76:9200
Decommission Status : Normal
Configured Capacity: 1773287899136 (1.61 TB)
DFS Used: 49191796736 (45.81 GB)
Non DFS Used: 89329020928 (83.19 GB)
DFS Remaining: 1634767081472(1.49 TB)
DFS Used%: 2.77%
DFS Remaining%: 92.19%
Last contact: Fri Nov 04 20:31:29 UTC 2011


Name: 10.68.105.36:9200
Decommission Status : Normal
Configured Capacity: 1773287899136 (1.61 TB)
DFS Used: 3126202368 (2.91 GB)
Non DFS Used: 88943112192 (82.83 GB)
DFS Remaining: 1681218584576(1.53 TB)
DFS Used%: 0.18%
DFS Remaining%: 94.81%
Last contact: Fri Nov 04 20:31:30 UTC 2011

When there's something wrong with any slave data node, e.g. it can't be contacted by the master, then the number of dead nodes will be non-zero, and the 'Last contact' date for that slave will be off compared to the healthy nodes.

Another useful admin command is 

$ hadoop fsck /

which should report HEALTHY:


Status: HEALTHY
 Total size: 99648130222 B
 Total dirs: 76
 Total files: 5044 (Files currently being written: 1)
 Total blocks (validated): 5215 (avg. block size 19107982 B) (Total open file blocks (not validated): 1)
 Minimally replicated blocks: 5215 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 3 (0.057526365 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 1
 Average block replication: 1.0011505
 Corrupt blocks: 0
 Missing replicas: 21 (0.4022218 %)
 Number of data-nodes: 3
 Number of racks: 1




The filesystem under path '/' is HEALTHY

Terminating a jobflow

As you would expect, the command is:

# ./elastic-mapreduce --terminate --jobflow JOBFLOWID

Inspecting Hive logs

After running a Hive query on the Hadoop master node, you can inspect the logs created in this directory (this is for Hive 0.71): /mnt/var/lib/hive_07_1/tmp/history

Putting it all together

I put many of these pieces together in one script (see gist here) that I run every night from the EC2 management instance, and that does the following:
  1. Launches EMR cluster with desired instance types for the master node and the slave nodes, and with the desired instance count (this count includes the master node); the command line also specified that Hadoop 0.20 and Hive 0.71 need to be installed, and it keeps the cluster up and running via the --alive option.
  2. Waits in a loop for the state of the jobflow to be WAITING and sleeps 10 seconds in between checks.
  3. Retrieves the public DNS name of the Hadoop master from the JSON description of the new jobflow.
  4. Copies from the management node to the Hadoop master a local directory containing scripts that will be run on the master.
  5. Runs a script (called run_hive_scripts.sh) on the Hadoop master via ssh; this script does all the further Hadoop and Hive processing, which includes
    • creation of HDFS directories
    • copying of csv.gz files from S3 into HDFS
    • creation of Hive files
    • running of Hive queries
    • saving of Hive queries to a local directory on the Hadoop master
  6. Retrieves the Hive output files by scp-ing them back from the Hadoop master
  7. Terminates the EMR cluster
At this point, the Hive output files are processed and the data is inserted into a MySQL instance for further analysis and visualization.

That's about it for now. We have a lot more work to do before we declare ourselves satisfied with the state of our data analytics platform, and I'll blog more about it as soon as we cross more things off our todo list.

Friday, October 28, 2011

More fun with the LSI MegaRAID controllers

As I mentioned in a previous post, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodical slow-downs of the databases related to decreased I/O throughput. It turned out it was the LSI RAID battery going through its relearning cycle.

Last night we got paged again by increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turns out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be OK even with a bad drive, but the drive didn't go completely offline, so the controller was busy servicing it and failing. This had the effect of decreasing the I/O throughput on the server, and making our database slow.

For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).

Check RAID event log for errors

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL

This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:

===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3

seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00001655

Time: Fri Oct 28 04:45:26 2011

Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:



Check state of particular physical drive

You need the enclosure index (12 in the output above) and the device ID (11 in the output above). Then you run:

./MegaCli64 -pdInfo -PhysDrv [12:11] aALL

In our case, we saw that Media Error Count was non-zero.

Mark physical drive offline

We decided to mark the faulty drive offline, to see if that takes work off of the RAID controller and improves I/O throughput in the system.

# ./MegaCli64 -PDOffline  -PhysDrv [12:11] aALL

Indeed, we noticed how the I/O wait time and the load average on the system dropped to normal levels.

Hot swap faulty drive

We had a spare drive which we hot-swapped with the faulty drive. The new drive started to rebuild. You can see its status if you run:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] aALL

Enclosure Device ID: 12
Slot Number: 11
Device Id: 13
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 931.0 GB [0x74600000 Sectors]
Firmware state: Rebuild
SAS Address(0): 0x5000c500347da135
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST31000424SS    KS689WK4Z08J            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: Unknown 
Link Speed: Unknown 
Media Type: Hard Disk Device


Exit Code: 0x00

Check progress of the rebuilding process

# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL
                                     
Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.

Exit Code: 0x00

Check event log again

Once the rebuilding is done, you can check the event log by specifying for example that you want to see the last 10 events:

# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL

In our case we saw something like this:


# cat t1.log 



seqNum: 0x00001720
Time: Fri Oct 28 12:48:27 2011

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0


seqNum: 0x0000171f
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x0000171e
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 20
New state: 24


seqNum: 0x0000171d
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x0000171c
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0


seqNum: 0x000016b7
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 16
New state: 20


seqNum: 0x000016b6
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x000016b5
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 0
New state: 16


seqNum: 0x000016b4
Time: Fri Oct 28 08:55:42 2011

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000
Event Data:
===========
Device ID: 13
Enclosure Device ID: 12
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 5000c500347da135
SAS Address 2: 0


seqNum: 0x000016b3
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11

Here are some other useful troubleshooting commands:

Check firmware state across all physical drives


This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:

# ./MegaCli64 -PDList -a0 | grep Firmware

In the normal case, all drives should be ONLINE. If any drive reports as FAILED you have a problem.

Check virtual drive state

You can also check the virtual drive state:

# ./MegaCli64 -LDInfo -L0 -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None


Exit Code: 0x00

Here you can see that the state is degraded. Running the previous PDlist command shows that one drive is failed:

# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online


Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:


Previous state: 0

New state: 16

Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.

Resources



Thursday, October 06, 2011

Good email sending practices


I'm not going to call these 'best practices' but I hope they'll be useful if you're looking for ways to improve your email sending capabilities so as to maximize the odds that a message intended for a given recipient actually reaches that recipient's inbox.

  • Make sure your mail servers are not configured as open relays
    • This should go without saying, but it should still be your #1 concern
    • Use ACLs and only allow relaying from the IPs of your application servers
    • Check your servers by using a mail relay testing service
  • Make sure you have reverse DNS entries for the IP addresses you're sending mail from
    • This is another one of the oldies but goldies that you should double-check
  • Use DKIM
    • From the Wikipedia entryDomainKeys Identified Mail (DKIM) is a method for associating a domain name to an email message, thereby allowing a person, role, or organization to claim some responsibility for the message. The association is set up by means of a digital signature which can be validated by recipients. Responsibility is claimed by a signer —independently of the message's actual authors or recipients— by adding a DKIM-Signature: field to the message's header. The verifier recovers the signer's public key using the DNS, and then verifies that the signature matches the actual message's content.
    • Check out dkim.org
    • Check out my blog post on setting up OpenDKIM with postfix
  • Use SPF
    • From the Wikipedia entrySender Policy Framework (SPF) is an email validation system designed to prevent email spam by detecting email spoofing, a common vulnerability, by verifying sender IP addresses. SPF allows administrators to specify which hosts are allowed to send mail from a given domain by creating a specific SPF record (or TXT record) in the Domain Name System (DNS). Mail exchangers use the DNS to check that mail from a given domain is being sent by a host sanctioned by that domain's administrators.
    • Some SPF-testing tools are available at this openspf.org page
  • Use anti-virus/anti-spam filtering tools as a precaution for guarding against sending out malicious content
    • list of such tools for postfix
    • for sendmail there are 'milters' such as MIMEDefang 
  • Monitor the mail queues on your mail servers
    • Watch for out-of-ordinary spikes or drops which can mean that somebody might try to exploit your mail subsystem
    • Use tools such as nagios, munin and ganglia to monitor and alert on your mail queues
  • Monitor the rate of your outgoing email
    • Again spikes or drops should alert you to suspicious activity or even errors in your application that can have an adverse effect on your mail subsystem
    • Check out my blog post on visualizing mail logs although these days we are using Graphite to store and visualize these data points
  • If you have the budget, use deliverability and reputation monitoring services such as ReturnPath
    • These services typically monitor the IP addresses of your mail servers and register them with various major ISPs
    • They can alert you when users at major ISPs are complaining about emails originating from you (most likely marking them as spam)
    • They can also whitelist your mail server IPs with some ISPs
    • They can provide lists of mailboxes they maintain at major ISPs and allow you to send email to those mailboxes so you can see the percentage of your messages that reach the inbox, or are missing, or go to a spam folder
  • Again if you have the budget (it's not that expensive), use a service such as Litmus which shows you how your email messages look in various mail clients on a variety of operating systems and mobile devices
  • If you don't have the budget, at least check that your mail server IPs are not blacklisted by various RBL sites
    • You can use a multi-RBL check tool such as this one
  • Make sure the headers of your messages match the type of email you're sending
    • For example, if your messages are MIME multipart, make sure you mark them as such in the Content-type headers. You should have a Content-type header showing 'multipart/related' followed by other headers showing various types of content such as 'multipart/alternative', 'text/plain' etc.
    • This is usually done via an API; if you're using Python, this can be done with something like
    • message = MIMEMultipart("related")
      message_alternative = MIMEMultipart("alternative")
      message.attach(message_alternative)
      message_alternative.attach(MIMEText(plaintext_body.encode('utf-8'), "plain", 'utf-8'))
      message_alternative.attach(MIMEText(html_body.encode('utf-8'), 'html', 'utf-8'))
      

    Perhaps the most important recommendation I have: unless you really know what you're doing, look into outsourcing your email sending needs to a 3rd party service which preferably offers an API:

Friday, September 16, 2011

Slow database? Check RAID battery!

Executive Summary: 

If your Dell database servers get slow suddenly, and I/O seems sluggish, do yourself a favor and check if the RAID battery is currently going through its 'relearning' cycle. If this is so, then the Write-Back policy is disabled and Write-Through is enabled -- as a result writes become very slow compared to the standard operation.

Details:

This turns out to be a fairly well known problem with RAID controllers in Dell servers, specifically LSI controllers. The default mode of operation for the RAID battery is to periodically go through a so-called 'relearn cycle', where it discharges, then charges and recalibrates itself by finding the current charge. In this timeframe, as I mentioned, Write-Back is disabled and Write-Through is enabled.

For our MySQL servers, we have innodb_flush_log_at_trx_commit set to 1, which means that every commit if flushed to disk. In consequence, the Write-Through mode will severely impact the performance of the writes to the database. A symptom is that CPU I/O wait is high, and the database gets sluggish. Pain all around.

We started to experience this database slowness on 3 database server at almost the same time. Two of them were configured as slaves, and one as master. The symptoms included high CPU I/O wait, slow queries on the master, and replication lag on the slaves. Nothing pointed to something specific to MySQL. We opened an emergency ticket with Percona and were fortunate to be assigned to Aurimas Mikalauskas, a Percona principal consultant and a MySQL/RAID hardware guru. It took him less than a minute to correctly diagnose the issue based on these symptoms. Now that we knew what the issue was, some Google searches turned out other articles and blog posts talking about it. It turns out one of the most frequently cited posts belongs to Robin Bowes, my ex-coworker from RIS Technology/Reliam! It also turns out Percona engineers blogged about this issue extensively (see this post which references other posts).

In any case, for future reference, here is what we did on all the servers that have the LSI MegaRaid controller (these servers are Dell C2100s in our case):

1) Install MegaCli utilities

I had a hard time finding these utilities, since the LSI support site doesn't seem to have them anymore. I found this blog post talking about a zip file containing the tools, then I googled the zip filename and I found an updated version on this Gentoo-related site. Then I followed the steps in the blog post above to extract the statically-linked binaries:

# apt-get install rpm2cpio
# mkdir megacli; cd megacli

# wget http://download.gocept.com/gentoo/mirror/distfiles/4.00.11_Linux_MegaCLI.zip
# unzip 4.00.11_Linux_MegaCLI.zip 
# unzip MegaCliLin.zip 
# rpm2cpio MegaCli-4.00.11-1.i386.rpm| cpio -idmv

At this point I had 2 statically-linked binaries called MegaCli and MegaCli64 in megacli/opt/MegaRAID/MegaCli.

2) Inspect event log for RAID controller to figure out what has been going on in that subsystem (this command was actually run by Aurimas during the troubleshooting he did):

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL
# cat events.log
...
Event Description: Time established as 09/09/11 15:27:18; (48 seconds since power on)
--
Time: Fri Sep 9 16:27:36 2011
Event Description: Battery temperature is normal
--
Time: Fri Sep 9 16:56:51 2011
Event Description: Battery started charging
--
Time: Fri Sep 9 17:08:46 2011
Event Description: Battery charge complete
--
Time: Sat Sep 10 19:54:16 2011
Event Description: Battery relearn will start in 4 days
--
Time: Mon Sep 12 19:53:46 2011
Event Description: Battery relearn will start in 2 day
--
Time: Tue Sep 13 19:54:36 2011
Event Description: Battery relearn will start in 1 day
--
Time: Wed Sep 14 14:54:16 2011
Event Description: Battery relearn will start in 5 hours
--
Time: Wed Sep 14 19:55:26 2011
Event Description: Battery relearn pending: Battery is under charge
--
Time: Wed Sep 14 19:55:26 2011
Event Description: Battery relearn started
--
Time: Wed Sep 14 19:55:29 2011
Event Description: BBU disabled; changing WB virtual disks to WT, Forced WB VDs are not affected
--
Time: Wed Sep 14 19:55:29 2011
Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=00,ap=0,dc=0] from [ID=00,dcp=01,ccp=01,ap=0,dc=0]
Previous LD Properties
Current Cache Policy: 1
Default Cache Policy: 1
New LD Properties
Current Cache Policy: 0
Default Cache Policy: 1
--
Time: Wed Sep 14 19:56:31 2011
Event Description: Battery is discharging
--
Time: Wed Sep 14 19:56:31 2011
Event Description: Battery relearn in progress
--
Time: Wed Sep 14 22:43:21 2011
Event Description: Battery relearn completed
--
Time: Wed Sep 14 22:44:26 2011
Event Description: Battery started charging
--
Time: Wed Sep 14 23:53:46 2011
Event Description: BBU enabled; changing WT virtual disks to WB
--
Time: Wed Sep 14 23:53:46 2011
Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=01,ap=0,dc=0] from [ID=00,dcp=01,ccp=00,ap=0,dc=0]
Previous LD Properties
Current Cache Policy: 0
Default Cache Policy: 1
New LD Properties
Current Cache Policy: 1
Default Cache Policy: 1


So as you can see, the battery relearn started at 19:55:26, then 3 seconds later the Write-Back policy was changed to Write-Through, and it stayed like this until 23:53:46, when it was changed back to Write-Back. This shows that the I/O was impacted for 4 hours. Luckily for us it was outside of our high  traffic period for the day, but it was still painful.

3) Disable autoLearnMode for the RAID battery

This is so we don't have this type of surprise in the future. The autoLearnMode variable is ON by default. You can see its current setting if you run this command:

# ./MegaCli64 -AdpBbuCmd -GetBbuStatus -a0

We followed Robin's blog post (thanks, Robin!) and did:

# echo "autoLearnMode=1" > tmp.txt
# ./MegaCli64 -AdpBbuCmd -SetBbuProperties -f tmp.txt -a0

4) Force battery relearn cycle

It is still recommended to run the battery relearn cycle manually periodically, so we did it on all servers that are not yet in production. For the rest of the servers we'll do it at night, during a time frame when traffic is lowest. In the future, we'll take maintenance windows every N months (where N is probably 6 or 12) and force the relearn cycle.

Here's the command to force the relearn:

# ./MegaCli64 -AdpBbuCmd -BbuLearn -a0

For reference, LSI has good documentation for the MegaCli utilities on one of their KB sites. Another good reference is this Dell PERC cheatsheet.

I hope this will be a good troubleshooting guide for people faced with mysterious I/O slowness. Thanks again to Aurimas from Percona for his help. These guys are awesome!

Using AWS CloudWatch Logs and AWS ElasticSearch for log aggregation and visualization

If you run your infrastructure in AWS, then you can use CloudWatch Logs and AWS ElasticSearch + Kibana for log aggregation/searching/visuali...