It’s The Little Things That Get You Every Time…

Even on the smallest of networks, the failure to pay attention to every detail can make a routine upgrade or migration far more complicated than it needs to be.

This year, I’ve had an above average number of mail server related incidents, but in many cases I made things worse by overlooking something small.

For instance, there were at least three times when something had gone wrong with my mailing list server, and I could have solved it in a few minutes before I left home, but I didn’t check until I had gotten to work.  (It doesn’t help that my monitoring server is currently being rebuilt, and so is not doing its job.)

On my network, I have an Exchange 2007 server and a separate server that handles my mailing lists.  The way they work together is probably more complicated than it has to be, but that’s how it evolved.  When I upgrade to Exchange 2010, I’ll change that around a bit.


Mail Server Recovery

Anyway, back in July of this year, the server handling the mailing list duties started to get flaky.  This didn’t bother me too much, because for a few years, I had been backing up the mail once a day to another server and to my NAS.  I had this system down pat.  I had even used this method to migrate from one server to another in late 2007, and in 2008, I had modified the daily script to create a backup of the registry entries pertaining to the application.

Mistake #1:  Not verifying the test plan.

I knew the test plan worked because I had used it before.  The problem was that after I migrated to the new server (circa 2007), I kept the old one as a backup.  Later, that old one died, and I selected another machine as the backup, but I didn’t actually install the application there!

On Friday, July 23, the main list server keeled over early in the morning, but I didn’t check it before I left for work.  As a result, I was informed that it was down when I was nowhere near the scene of the crime.  Since it was down, I figured I could get the “backup” online fairly quickly.  That’s when I discovered two things:

1 – The application wasn’t installed as yet on the new “backup” system.
2 – I didn’t have immediate access to all the necessary configuration settings.

The first problem was minor.  I maintain all my install files for all applications on a network share, so getting the install done took mere minutes.  The second problem, however, was far more serious.  The application stores its configuration across a set of files and also within the registry – some vital settings in each location.  I had all the files.  I didn’t have the registry settings.

(At least not recent registry settings.  I did initially find registry backups from 2001 and 2002, but these were useless, being configured for older versions of the software and for different ISPs and such.)

Why, you ask, didn’t my disaster recovery plan include the most recent registry files, which were clearly being backed up?  Because, although I was carefully backing up all the application and data files to two separate network locations, for some inexplicable reason, the registry backup was being written ONLY to the local server.  So, when it died, my access to the backups died with it.  LOL.
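A check along the following lines would have flagged the gap long before the server died. This is only a sketch, not the script I actually run: the artifact names, the directories, and the 48-hour freshness window are all assumptions made up for illustration. The idea is simply that every backup artifact must have at least one recent copy somewhere other than the machine it protects.

```python
import time
from pathlib import Path


def fresh_copies(artifact, off_host_dirs, max_age_hours=48, now=None):
    """Return the off-host directories that hold a recent copy of `artifact`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    found = []
    for directory in off_host_dirs:
        candidate = Path(directory) / artifact
        # A copy only counts if it exists and was modified recently enough.
        if candidate.is_file() and candidate.stat().st_mtime >= cutoff:
            found.append(str(directory))
    return found


def audit(artifacts, off_host_dirs, max_age_hours=48, now=None):
    """Return the artifacts that have NO recent off-host copy."""
    return [
        name
        for name in artifacts
        if not fresh_copies(name, off_host_dirs, max_age_hours, now)
    ]
```

Calling `audit(["files.zip", "registry.reg"], [nas_dir, second_server_dir])` (hypothetical names) returns whatever is missing or stale in both places. An empty list is the only acceptable answer.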

Thanks to some heroics by my daughter, who is pretty good in the “remote hands” capacity, I was able to get the primary server up and running for a few hours.  I immediately pulled off all the registry backups, and applied them to the “backup” system.

(For those who don’t know, there was another fun moment in late August, when several of my domains expired, even though I knew about them and got an email that said, “You need to take care of this in xx days.”  They must have meant xx minutes.)


IP Address Changes

For quite some time, I’ve been desirous of changing my IP address scheme from a crowded 192.168.x.x range to something a little less common.  I’m also in the middle of a DD-WRT router replacement for my venerable Netscreen 5XT, so I figured I’d tackle them both at once.   Yay for expanded project scope!!

So, I planned what I needed to change and mapped it out for an evening.  It is advisable that you do these things in the evening, and then check on them in the morning vs implementing them first thing in the morning.  (This is more true for a home network than for an enterprise network, where I usually prefer to handle migrations the other way around.)

Mistake #2:  Not planning out the test cases for QA purposes.

Yes, it’s my network.  I’m intimately familiar with *most* of the details, and I do have some of it documented – but not enough to cover all bases.  Here’s what I did:

I backed up my Netscreen firewall, made a copy of the config, and did a search and replace for the first 3 octets of the network address.  Then I imported it back into the firewall and completely overwrote the old config. Perfect.
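A plain search and replace worked for me here, but it can bite when one prefix is the start of another (replacing `192.168.1` blindly would also mangle `192.168.10.5`). A slightly safer version might look like the sketch below. The prefixes and the sample lines are invented for illustration and are not real firewall syntax. The function only rewrites the prefix when it begins a complete IPv4 address, and it returns a count so the number of changes can be compared against what was expected.

```python
import re


def renumber(config_text, old_prefix, new_prefix):
    """Swap the first three octets of matching IPv4 addresses.

    Returns (new_text, number_of_replacements).
    """
    pattern = re.compile(
        r"(?<![\d.])"             # not in the middle of a longer number or address
        + re.escape(old_prefix)   # e.g. "192.168.1", dots matched literally
        + r"\.(\d{1,3})"          # the final octet, which is kept as-is
        + r"(?!\d)(?!\.\d)"       # the address must end here
    )
    return pattern.subn(lambda m: f"{new_prefix}.{m.group(1)}", config_text)


# Made-up config lines, purely for illustration.
sample = "host-a 192.168.1.1/24\nhost-b 192.168.10.5\n"
new_text, count = renumber(sample, "192.168.1", "10.20.30")
```

In this example only `host-a` is rewritten; `host-b` lives in a different /24 and is left alone.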

Next, I RDP’d into each of my servers, changed their IP addresses to the new IP address, made the necessary changes to DNS and WINS (yes, I’m still using it for fun),  and then added the old address back as a secondary IP.  Most of the RDP sessions only blinked.

I made changes to the DHCP scope, I restarted web services and checked DNS for all the new entries.  I even watched my mail go out successfully.

But I didn’t actually keep looking at email.  More precisely, I watched emails come in through the SPAM filter and get processed successfully, and I saw emails leave successfully, but I didn’t pay enough attention to notice that the outgoing emails all originated from inside the network, and that none of the new incoming emails was making it all the way through the system.

Mail being received by the email security server?  Check!
Mail going out from the mail server to the ISP?  Check!
Mail moving from either system to the Exchange server?  I guess so…

The lack of a detailed test plan allowed me to conclude that all was well, even though I had only verified that *some* was well.
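Even a trivially small test plan helps, as long as “I guess so” is not allowed to count as a pass. Here is a minimal sketch of that idea. The three entries mirror the checks above, and the representation (a dictionary with `True`, `False`, or `None` for “never actually checked”) is just an assumption for illustration.

```python
def unverified(plan):
    """Return every test case that has not been explicitly confirmed."""
    # Anything other than a literal True (a failure, or simply never
    # checked) is treated as not verified.
    return [name for name, result in plan.items() if result is not True]


plan = {
    "mail received by the email security server": True,
    "mail going out from the mail server to the ISP": True,
    "mail moving from either system to the Exchange server": None,  # "I guess so..."
}

remaining = unverified(plan)
migration_done = not remaining
```

With the plan filled in the way I actually tested that night, `remaining` still holds the Exchange hop and `migration_done` is `False`.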

Mistake #3:  Not investigating weird behavior immediately.

When something you expect to work suddenly starts failing, you should check it out *immediately*.  This has gotten me more times than I care to admit.  I was pushing around some scripts on the network, getting MRTG back in place and attending to other oft-neglected elements of the network.  At some point, I noticed that my synchronization script was failing on my virtual host server.

“That’s weird,” I thought, and carried right on with trying to get the SNMP configuration done on the other servers.  If I had only stopped to look into this issue right away, it would have been addressed in under 10 minutes, and much of the subsequent chaos would have been averted.  (The small percentage of problems that would still have occurred because of mistake #2 would have taken only another 10 or 15 minutes to troubleshoot and fix.)

The root cause of this particular issue was that while I had correctly made the IP address changes on eight (8) of the servers on my network (physical and virtual), I had made an itty bitty mistake on the last server.  On this server – the one hosting all my virtual systems – I had made the changes to the wrong network card.  It’s only because I kept the old IP as a backup that any of the virtual servers continued to be accessible.  Ironically, if I had not dragged out the maintenance into two phases, but had moved completely over to the new IP address range in one shot, I would have discovered this problem with the host server immediately.  And, of course, if I had used a robust set of test cases for validating all of the changes, I would have identified and resolved this issue before it caused other problems.
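One of those test cases could be as simple as confirming that the new address landed on the interface it was meant for. The sketch below assumes a hand-built inventory. The host names, interface names, and the `10.20.30.0/24` range are all hypothetical, and gathering the real data from each server is left out.

```python
import ipaddress


def hosts_missing_new_address(inventory, new_network):
    """Return hosts whose intended NIC has no address in `new_network`.

    `inventory` maps host -> (intended_nic, {nic_name: [addresses]}).
    """
    net = ipaddress.ip_network(new_network)
    missing = []
    for host, (intended_nic, nics) in inventory.items():
        addresses = nics.get(intended_nic, [])
        if not any(ipaddress.ip_address(a) in net for a in addresses):
            missing.append(host)
    return missing


# Hypothetical inventory: the second host has the new address on the wrong card.
inventory = {
    "mail01": ("lan", {"lan": ["10.20.30.11", "192.168.1.11"]}),
    "vhost01": ("lan", {"lan": ["192.168.1.20"], "spare": ["10.20.30.20"]}),
}
print(hosts_missing_new_address(inventory, "10.20.30.0/24"))  # ['vhost01']
```

Note that `vhost01` would still answer on its old address, which is exactly why the problem stayed hidden for as long as it did.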

Mistake #4:  Relying on memory instead of documentation.

Because mistake #3 compounded the issues with mistake #2, I didn’t realize that there was a real problem with email flow until after I had gotten to work, and even then only because a good friend and colleague told me.  Hmmm… That’s weird, because I can telnet into the email security server.  Oh, well…  No big deal. I’ll just RDP into the virtual host server – as I have done 1 million times before – and troubleshoot the issue.

Er…  Nope.  It turns out that the scope of mistake #3 was a bit more extensive than I had first imagined.  Not only could I not copy some files to the server, but it was completely isolated from the outside.  Of course, at this point, I wasn’t sure what the real problem was.  I was thinking that possibly I had done something wrong with the firewall portion of the IP conversion.  So, I got my wife involved for “remote hands” duty.

After some difficulty in trying to assess what was going on (details obscured to protect the guilty and innocent alike), I had her reboot a couple of things, but unfortunately, I couldn’t get a good sense of what was broken.  I did suspect an Exchange setting was the root of my email routing woes, but it’s hard enough trying to talk an Exchange engineer through that sort of configuration change, much less a spouse who would rather be doing non-technical work.  🙂

So, I settled for dealing with it when I got home.

As I suspected, it turned out that the Exchange server was ready and willing to accept traffic from devices on the old subnet, but I had conveniently forgotten to add the new subnet to the approved list – even though I *had* remembered to make the corresponding change on the mailing list server!
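Since the same subnet list has to exist in two places, one cheap safeguard is to write both lists down and compare them. The sketch below does that with Python’s `ipaddress` module. The subnets shown are placeholders, and copying the real lists out of each server’s configuration would still be a manual step.

```python
import ipaddress


def missing_from(reference, other):
    """Return subnets present in `reference` but absent from `other`."""
    ref = {ipaddress.ip_network(n) for n in reference}
    oth = {ipaddress.ip_network(n) for n in other}
    return sorted(str(n) for n in ref - oth)


# Placeholder lists, standing in for what each server is configured to accept.
list_server_allowed = ["192.168.1.0/24", "10.20.30.0/24"]
exchange_allowed = ["192.168.1.0/24"]

gaps = missing_from(list_server_allowed, exchange_allowed)
```

Here `gaps` contains the new subnet, which is the entry I forgot on the Exchange side.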

This problem would have been caught if I had either documented everything, or tested thoroughly after the fact. And especially if I had done both.

None of these issues was so major or mind-boggling that it couldn’t have been avoided. Thankfully, I built enough robustness into the environment that no mail was lost.  Mail was received by my email security server, and queued up for Exchange, which didn’t want to hear about it from someone on a seemingly strange subnet, so it didn’t go anywhere.  My queue is designed to hold mail for 6 days, in the event that vacations or long weekends catch me far enough away from the environment to fix things.


Lessons Learned

  • Don’t cut corners or break your own established rules of operation
  • Simple and straightforward plans are preferred
  • Validate changes with thorough tests; partial testing ensures partial success
  • Document it if you want to remember it when it counts

The primary lesson learned is, whenever you break your own rules for process and protocol, bad things are more likely to happen.  Never get lulled to sleep by a string of good results from following the process, such that you are tempted to skip steps in subsequent migrations or deployments.

We’re always inclined to say, if I cut this corner here or there, what’s the big deal?  Everything went fine the last 5 times…  Yeah, well, everything went fine *because* you were following the process properly.

Another lesson is that the cleaner and more straightforward the migration or deployment, the easier it is to identify, troubleshoot and remediate issues that arise.  In retrospect, I needlessly complicated the migration by keeping the old IP address on the network, and splitting the work over two days.  (If you don’t have time to perform a migration in one fell swoop, then get enough time, or remove the dependencies between the different phases of the migration.)  Keeping the old addresses allowed enough things to work that shouldn’t have under the circumstances, and it made troubleshooting harder.

People often wonder why I’m such a stickler for process when I’m at work.  It’s because I feel the pain whenever I deviate from the process.  Certainly, these are great experiences and help to keep me on my toes, but I’d much rather have them on my home network, if I must experience them at all.

I’ll be updating my documentation and internal DR plans this winter, and I’ll be capturing lots of details – more than normal for even my home network. 

The only thing cut corners are good for is drawing blood.
