Monday, July 21, 2008

All of 9 hours

Worked a long day today: 9 consecutive hours. I did eat lunch, but it took about 5 minutes and I ate it at my desk. Today was an interesting day. One of the first projects I had when I started was updating a number of our test terminals to newer firmware. However, with the new firmware installed, terminal performance decreased drastically.

Our terminals collect sensor data and send it in discrete (and highly compact) messages to a satellite. The satellite then transmits the data back down to a datacenter on earth, where the message is stored until we retrieve it. This is an oversimplification, of course, but it isn't too far off. Our system performance is mostly a reliability metric: we count how many messages that should have been sent never arrived. For the past week, our quality of service has been lousy: only 70-90%. 70% might not seem bad, but many of our systems send out only one message every few days, and missing a message is clearly unacceptable. Ideally, we like to be much closer to the 100% mark. We've spent that week running a lot of tests to try to resolve the issues, including testing firmware (the old firmware actually performs worse now than it did two weeks ago), hardware (including making several invasive, exploratory alterations), and software. We've been collecting data like crazy trying to make sense of everything.
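To make the reliability metric above concrete, here is a minimal sketch in Python. The function names are hypothetical and are not taken from our actual tooling, and the second function assumes that each message is lost independently of the others, which is my simplification and probably not true of a congested network.

```python
def quality_of_service(expected: int, received: int) -> float:
    """Percentage of expected messages that actually arrived.

    Hypothetical helper: `expected` is how many messages the terminal
    should have sent, `received` is how many showed up at the datacenter.
    """
    if expected <= 0:
        raise ValueError("expected must be a positive message count")
    if not 0 <= received <= expected:
        raise ValueError("received must be between 0 and expected")
    return 100.0 * received / expected


def chance_of_any_loss(qos_percent: float, messages: int) -> float:
    """Probability that at least one of `messages` messages is lost.

    Assumes every message is delivered independently with probability
    qos_percent / 100. That independence is an assumption of this sketch.
    """
    if not 0.0 <= qos_percent <= 100.0:
        raise ValueError("qos_percent must be between 0 and 100")
    if messages < 0:
        raise ValueError("messages must not be negative")
    delivery_probability = qos_percent / 100.0
    return 1.0 - delivery_probability ** messages


if __name__ == "__main__":
    # A terminal that should have sent 10 messages but only got 7 through.
    print(quality_of_service(expected=10, received=7))
    # Chance of losing at least one of 3 messages at 70% quality of service.
    print(chance_of_any_loss(70.0, 3))
```

Under that independence assumption, a terminal that sends only a handful of messages at 70% is more likely than not to lose at least one of them, which is why the number is worse than it sounds.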

So we had a conference call today between our engineers and those of the satellite company, and they laid a revelation on us: their network traffic has increased dramatically in the past few days, and network congestion might be causing at least a 20% failure rate. All our data, all our tests, all our problems might come down to something a simple network capacity increase would resolve (and that is far outside our responsibilities).

In the world of software, it's generally accepted as a rule that if you find a problem, the problem lies in your code, not in the OS or the relevant libraries. This is not to say that libraries and operating systems never have bugs; they certainly do, and sometimes they're terrible. However, the odds are that any given bug lives in the software I'm developing, not in the platform it runs on.

Following this logic, it never really occurred to us that the underlying platform might be the cause of so many failures. Quite the contrary, we assumed something was going wrong on our end, and ripped a lot of things apart looking for the solution. We tried everything from modifying antenna geometry and searching for cold solder joints (we actually found a few) to orienting the terminal in different ways with respect to the satellite (which actually does have an effect). In our mad search to resolve the major problem, we found and fixed several smaller problems in our device. By tomorrow, we should know if network congestion is indeed the culprit.

For tonight, I'm going to put my head down and try to get some serious work done on Parrot. I've got an extra hour tomorrow, and I'm going to try to make more progress then as well. I feel like I'm getting close to a working system; I just need to put in a little bit of extra elbow grease to get there.
