Wednesday, September 19, 2007

Beware of timings in your tests

Finally I get to write a post about testing. Here's the scenario I had to troubleshoot yesterday: a client of ours has a Web app that uses a Java applet for FTP transfers to a back-end server. The Java applet presents a nice GUI to end users, allowing them to drag and drop files from their local workstation to the server.

The problem was that some file transfers were failing in a mysterious way. We obviously looked at the network connectivity between the user who initially reported the problem and our data center, then we looked at the size of the files he was trying to transfer (he thought files over 10 MB were the culprit). We also looked at the number of files transferred, both multiple files in one operation and single files in consecutive operations. We tried transferring files using both a normal FTP client and the Java applet. Everything seemed to point in the direction of 'works for me' -- a stance well known to testers around the world.

All of a sudden, around an hour after I started using the Java applet to transfer files, I got the error 'unable to upload one or more files', followed by the message 'software caused connection abort: software write error'. I thought OK, this may be due to web sessions timing out after an hour. I did some more testing, and the second time I got the error after half an hour. I also noticed that I had let some time pass between transfers. This gave me the idea of investigating timeout settings on the FTP server side (which was running vsftpd). And lo and behold, here's what I found in the man page for vsftpd.conf:

idle_session_timeout
The timeout, in seconds, which is the maximum time a remote client may spend between FTP commands. If the timeout triggers, the remote client is kicked off.

Default: 300

My next step was of course to wait 5 minutes between file transfers, and sure enough, I got the 'unable to upload one or more files' error.
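For the curious, here's a rough sketch of how that kind of corner case could be reproduced deliberately with Python's ftplib -- not the actual applet or test code; the host, credentials and file names are placeholders, and 300 seconds is simply the vsftpd default quoted above:

import time
from ftplib import FTP, all_errors
from io import BytesIO

HOST, USER, PASSWORD = 'ftp.example.com', 'testuser', 'secret'  # placeholders
IDLE_TIMEOUT = 300  # vsftpd default for idle_session_timeout, in seconds

ftp = FTP(HOST)
ftp.login(USER, PASSWORD)

# The first transfer goes through while the session is still fresh.
ftp.storbinary('STOR first.txt', BytesIO(b'first file'))

# Deliberately idle past the server-side timeout...
time.sleep(IDLE_TIMEOUT + 30)

# ...then try again; by now vsftpd should have dropped the session.
try:
    ftp.storbinary('STOR second.txt', BytesIO(b'second file'))
    print('unexpected: second transfer succeeded')
except all_errors as exc:
    print('second transfer failed as expected:', exc)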

Lesson learned: pay close attention to the timing of your tests. Also look for timeout settings on both the client side and the server side, and write corner-case tests accordingly.

In the end, it was by luck that I discovered the cause of the problems we had, but as Louis Pasteur said, "Chance favors the prepared mind". I'll surely be better prepared next time, timing-wise.

3 comments:

tp said...

Definite agreement.

@GTAC there was a lot of discussion about how sleep() or pause() commands can be the undoing of a nice functional/regression test.

In my own experience, when I started writing functional tests I would say to myself, "Oh, a little sleep() function here won't hurt anything." But then, over time, the smell of slow tests begins to permeate. Oh the reek!

Selenium has some nice functions like 'ClickAndWait' or the 'WaitFor*' family, which give you a chance to work around the use of pause().

In Twill and Python, writing up a nice little recursive function that looks for elements on the page corresponding to a new condition/state sped up my functional tests by about 50%.
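An iterative sketch of the same idea (not the commenter's actual code; the condition callable and get_current_page_html() are made-up stand-ins for whatever page check Twill or your framework provides):

import time

def wait_for(condition, timeout=30, poll_interval=0.5):
    """Poll `condition` (any callable returning True/False) until it holds,
    instead of sleeping for a fixed amount of time."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    raise AssertionError('condition not met within %s seconds' % timeout)

# Example usage:
# wait_for(lambda: 'Upload complete' in get_current_page_html())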

Grig Gheorghiu said...

Hi, Terry -- thanks for the comment. You're right about apparently harmless sleep commands that end up ruining your day. I had that happen to my Selenium tests until I discovered the WaitForCondition command.
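With the RC-era Python client, that might look roughly like this (assuming sel is an already-started selenium session; the locator, JavaScript condition and timeout value are illustrative):

# Click, then block until the app signals completion instead of sleeping
# for a fixed amount of time; the timeout is given in milliseconds.
sel.click('upload_button')  # placeholder locator
sel.wait_for_condition(
    "selenium.browserbot.getCurrentWindow().document.title == 'Upload complete'",
    '30000')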

However, I think in the case I blogged about it's more of a question of introducing delays in your tests on purpose, so that you hit certain corner cases that might not be otherwise apparent. I'm not sure this is the best tactic to use in automated tests, but it's useful for sure in exploratory-type tests.

Grig

Anonymous said...

Hmmm. Something to keep at the back of my mind next time I do testing...
Thanks.
