Rangachari Anand
August 26 2004
This is the second in a series of articles about things I have learned while developing software at Reefedge Networks. The first article was about the tools we used to manage the development process itself.
I recently came across a list of eight common mistakes that people make when developing distributed applications (compiled by L. Peter Deutsch). With the benefit of hindsight, I can say that even if you have read this list, there is a good chance that you will repeat the kinds of mistakes that Deutsch describes.
1. The network is reliable
Our main product, the Reefswitch, is primarily used to secure wireless LANs. The software is implemented mainly as a set of daemons that run on each Reefswitch and send messages to each other. In addition to intra-box communication, there is also a substantial amount of message chatter between Reefswitches.
We built our own messaging middleware because, when the code was being developed a few years ago, we found all commercial messaging middleware too heavyweight and no open source tools seemed appropriate. The decision to create your own middleware should not be taken lightly. While we did get our messaging system to work well, based on our experience, I would strongly suggest using existing tools in this space whenever possible.
In this context, here is a problem that we encountered relatively early in the development cycle. Although messages are sent via TCP in our messaging system, message delivery is not assured: a successful write only means that the local TCP stack accepted the data, not that the peer application received it. If the network connecting two of our boxes dies at the wrong moment, such as when a message is in transit, the message is dropped. There are two possible approaches to this problem: one can either make messaging reliable or make the end-points robust to message delivery failure.
We took a two-pronged approach to solving this problem. For all cases where RPC-style messaging is performed, we added a timeout mechanism. For cases where one-way messaging is performed, we added a facility in our messaging system to notify the sender about delivery failure.
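The two prongs can be illustrated with a small Python sketch. This is not our actual middleware: the `Messenger` class, its `call` and `send_one_way` methods, and the `transport` callable are all hypothetical names invented for the example, and the transport is an in-process stand-in for a real network link.

```python
import queue
import threading


class RpcTimeout(Exception):
    """Raised when no reply arrives before the deadline."""


class Messenger:
    """Toy stand-in for a messaging layer (hypothetical API, not the real one)."""

    def __init__(self, transport):
        # transport(message) returns a reply, or raises ConnectionError
        # when the link is down. This is an assumption of the sketch.
        self._transport = transport

    def call(self, message, timeout):
        """RPC-style send: wait at most `timeout` seconds for the reply."""
        replies = queue.Queue(maxsize=1)

        def worker():
            try:
                replies.put(self._transport(message))
            except ConnectionError:
                pass  # a lost message simply looks like silence to the caller

        threading.Thread(target=worker, daemon=True).start()
        try:
            return replies.get(timeout=timeout)
        except queue.Empty:
            raise RpcTimeout(f"no reply to {message!r} within {timeout}s") from None

    def send_one_way(self, message, on_failure):
        """One-way send: tell the sender about a delivery failure via a callback."""
        try:
            self._transport(message)
            return True
        except ConnectionError as exc:
            on_failure(message, exc)
            return False
```

The caller of `call` has to decide what a timeout means (retry, fail over, or give up), and the `on_failure` callback gives one-way senders the same opportunity. That is the sense in which the end-points, rather than the messaging layer, are made robust.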
2. Latency is zero
Our Multi-Site Manager (MSM) product allows a large number of Reefswitches to be managed centrally from a network operations center (NOC). The communication between the MSM and the Reefswitches has to be secured. Since the code was already available, we reused the mechanism that we had developed to secure traffic between Reefswitches. This code assumed that Reefswitches talk to each other over a high-bandwidth, low-latency network, as would typically be found in a campus setting.
When we started talking to our multi-site customers, we discovered that some of them were planning to use slow satellite links, with a round trip time of over a second, between the NOC and the sites where Reefswitches were deployed. When we simulated the effect of a satellite link using the NIST network emulation system, we discovered a number of problems. Fortunately, these problems were relatively easy to fix. We eliminated some costly network liveness tests and increased timeouts to compensate for the long round trips. This did, however, have the unavoidable side effect of making the system detect and recover from failures a little more slowly.
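We simply increased the timeouts. An alternative, shown here only as an illustration and not as what we shipped, is to derive the timeout from measured round trip times, in the style of the retransmission timer that TCP uses (RFC 6298). The class name `RttEstimator` and the clamp values are assumptions made for this sketch.

```python
class RttEstimator:
    """Timeout derived from smoothed RTT samples, in the style of RFC 6298."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variation

    def __init__(self, min_timeout=1.0, max_timeout=60.0):
        # The clamp values are illustrative assumptions.
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout
        self.srtt = None
        self.rttvar = None

    def observe(self, rtt):
        """Feed one measured round trip time, in seconds."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the variation first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def timeout(self):
        """Current timeout: smoothed RTT plus four times the variation, clamped."""
        if self.srtt is None:
            return self.min_timeout
        raw = self.srtt + 4 * self.rttvar
        return min(max(raw, self.min_timeout), self.max_timeout)
```

On a campus link with a 2 ms round trip the timeout stays at the configured minimum, while a first sample of 1.2 seconds on a satellite link pushes it to 3.6 seconds. The trade-off is the same one described above: a longer timeout means slower failure detection.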
3. Bandwidth is infinite
As previously mentioned, some customers use high-latency satellite networks to connect the MSM with Reefswitches. To our surprise, however, we discovered that other customers still use very low bandwidth links such as 56k frame relay! Now, the system software running on Reefswitches can be updated from the Multi-Site Manager. The size of the system image for a Reefswitch is on the order of 15 megabytes. Since the customer could only spare about 16 Kbit/sec on the link for software updates, it is clear that it could take several days to download a new software image to a Reefswitch! We overcame this problem in two ways:
- We added a bandwidth throttling mechanism to control the rate at which the system image is sent to the Reefswitches.
- We added the ability to resume the download of an image if it was interrupted for any reason.
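Both mechanisms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our implementation: the token bucket is one common way to throttle (the article does not say which scheme we used), and `TokenBucket`, `send_image`, `send_chunk` and `transfer_seconds` are hypothetical names. In a real system the receiver would report how many bytes it already holds; here the resume offset is simply passed in.

```python
import time


class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `burst` bytes."""

    def __init__(self, rate, burst, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = clock()

    def consume(self, nbytes):
        """Block until `nbytes` tokens are available, then spend them."""
        if nbytes > self.burst:
            raise ValueError("chunk is larger than the bucket's burst size")
        while True:
            now = self._clock()
            self.tokens = min(self.burst, self.tokens + (now - self._last) * self.rate)
            self._last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            self._sleep((nbytes - self.tokens) / self.rate)


def send_image(image, send_chunk, bucket, offset=0, chunk_size=1024):
    """Send image[offset:] in throttled chunks and return the offset reached.

    send_chunk(offset, data) is a hypothetical callable that raises
    ConnectionError when the link drops; the returned offset can be passed
    back in later to resume instead of starting over.
    """
    while offset < len(image):
        chunk = image[offset:offset + chunk_size]
        bucket.consume(len(chunk))
        try:
            send_chunk(offset, chunk)
        except ConnectionError:
            return offset
        offset += len(chunk)
    return offset


def transfer_seconds(size_bytes, bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and retransmissions."""
    return size_bytes * 8 / bits_per_second
```

A budget of 16 Kbit/sec corresponds to `TokenBucket(rate=2000, burst=...)`, since the rate is in bytes per second. As a rough check, `transfer_seconds(15_000_000, 16_000)` gives 7500 seconds, or about two hours for one image under ideal conditions, so the total grows quickly once several Reefswitches share the same link or an interrupted transfer has to restart from the beginning.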
4. The network is secure
Since our product is primarily intended to secure wireless networks, we did not place great emphasis on securing inter-box communication. Indeed, many of the initial customers operated very secure networks, so it was not considered a major issue. It was not until we learned that a customer planned to deploy a collection of Reefswitches on a network directly connected to the Internet that we got around to adding this feature, and in a hurry!
5. Topology doesn't change
We had initially assumed that the network topology would be relatively static. In particular, we assumed that the IP addresses of Reefswitches would not change very often. We discovered, however, that network administrators do in fact reconfigure networks more often than we had assumed. We subsequently added the ability for a Reefswitch to recover gracefully from address changes.
6. There is one administrator
We had initially assumed, rather naively it appears in retrospect, that a corporate network would be managed by a single department. We have discovered that this is not the case in many companies. There are often several departments involved: for example, one that handles DHCP, one that handles DNS, one that handles the routers, and so on! Since our product is at the heart of a network, it is often a tough sell with some customers due to the fragmented nature of their IT department.
7. Transport cost is zero
As previously mentioned, some customers still use low-bandwidth leased lines to connect the MSM with the Reefswitches.
8. The network is homogeneous
We have encountered this first hand at a number of customers. Although one often hears about companies trying to standardize on one vendor for their networking equipment, we have often encountered customers that have a mix of products from various vendors such as Cisco, Foundry Networks, 3Com etc. These have all been challenging to work with in different ways. An added problem for us is that, being the smallest and youngest vendor, we often get blamed for strange behavior that can often be attributed to other vendors' products!
Copyright © 2004 Rangachari Anand, All Rights Reserved.