I’ve been given flak in the past for my belief that an engineering degree gives you an advantage in the tech space, but I’m going to say it again. First of all, before you write your hate mail, I do believe it’s possible to gain the tools necessary to be quite good in the field without an engineering degree, and many people DO gain that ability. However, I still believe that because engineering schools pound certain ways of looking at things into their students, especially how to do a controlled experiment, engineers get a leg up on diagnosing problems.
So why am I bringing this up again? This week Steve and I solved an interesting problem and it was fun how we solved it. A couple of years ago we switched from Time Warner cable over to Verizon FiOS. I had no problems with Time Warner, they just didn’t offer the speeds of FiOS. About six months ago I noticed a problem with our connectivity: we lose our Internet connection whenever the phone rings. This wouldn’t be an issue for normal people, but if you’re on a Skype call and the phone rings, you’re gone. Not exactly what a podcaster is looking for. Luckily it never happened when I was on anyone else’s show, but a few times when Bart and I have been chatting I’ve lost him for a brief while, maybe 10-15 seconds, while my Internet took a coffee break. I should point out that Skype recovers very gracefully when this happens; we don’t have to reconnect because it does it automatically. Pretty impressive actually.
I called Verizon a few months ago and they asked me to check the frequency band for our wireless phones and sure enough it was 5.8GHz, which means they’re likely interrupting the 5GHz band of our wireless router. I did have to confess to them that I had circumvented their router and was using my Airport Extreme to provide wireless and wired DHCP access to my network. They didn’t freak out which was good but didn’t offer a solution other than getting rid of our phones. Just replacing them didn’t seem smart because without isolating the problem how would you be sure you were solving it?
This week Steve and I started noodling the problem. We have two base station phone units in the house, one downstairs in the kitchen that talks to the downstairs phones, and one upstairs right next to my wireless router in my office that controls the phone in Steve’s office. In addition there’s a 3rd base station unit in one of the upstairs bedrooms only servicing itself.
To do our experiments, we have to make the phone ring. This introduces another problem: we can’t use the home phones because they’re part of the problem, right? We have to use a cell phone, but our cell phones rely on our AT&T Microcell to get a good signal inside the house. That meant we had to turn off wifi on the cell phone before making the call to be sure that we had a good test.
We also needed a reliable way to tell if the network was getting interrupted. It’s hard to notice if you’re just surfing the net or reading email. Even watching a video online isn’t a good test because videos buffer up and might not reveal an interruption. Well, remember last week when we talked about using screen sharing to see the Mac Mini as it runs our Drobo to Drobo backups? It turns out that if the phone rings, that screen sharing connection gets dropped and we get a “connecting” message with a giant spinning wheel on screen. This also provided another piece of information. It’s not just that we’re getting disconnected from the Internet when the phone rings; our wifi is completely stopped for those few seconds, because screen sharing runs over our local network and never touches the Internet. Very interesting.
Now that we have a reliable way to measure whether the wifi has been stopped, we can start changing things. To do controlled experiments, you have to change one thing at a time and then retest or you’ll never know which of your changes fixed the problem.
Steve suggested that perhaps the phone base station right next to the wireless router was the problem, so our first test was to swap the base station into Steve’s office, putting the satellite phone in my office. Steve rang our home line from his cell phone while I watched my screen share to the Mac Mini, and again we were disconnected. That eliminated the proximity of the base station to the wireless router as the root cause.
Next he suggested maybe it’s the fact that we’ve even got the phones upstairs that’s the problem, or perhaps it’s the phone system itself that’s flawed in the way it’s handling the signal. We unplugged both the upstairs base station and satellite phone and rang the house again. And again my screen share was interrupted.
The good news is that meant it’s not the upstairs phone system so replacing it would have been a waste of money. Now Steve suggested we unplug the base station and units downstairs. We both sort of groaned because the base station is mounted to the wall and it’s kind of a hassle to disconnect.
Then I remembered that 3rd base station in the spare bedroom. That one was pretty old, so maybe it was created before proper shielding was implemented in phones? Plus as I told Steve, let’s test that one first because I WANT it to be the root cause since it’s kind of cruddy. Not very engineering disciplined of me but hey, some emotion gets in sometimes, right? We unplugged the 3rd base unit, rang the phone…and my screen share stayed connected! Off that phone goes to Goodwill and we won’t even replace it because what self-respecting guest would need to use a landline anyway?
The bottom line is that by doing controlled experiments, we were able to determine the root cause of a network-based problem. Had we flung around unplugging random devices, not paying attention to how the cell phone was connected, or just buying new equipment, we likely never would have found the root cause. Again, I do believe non-engineering-trained people can learn to do this but it’s ingrained in our whole way of thinking over at Chez Sheridan.
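For the programmers in the audience, the one-change-at-a-time idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the simulated ring test are made up, and the real "test" was a person ringing the phone while someone else watched the screen share.

```python
from typing import Callable, FrozenSet, Iterable, List


def find_culprits(suspects: Iterable[str],
                  problem_occurs: Callable[[FrozenSet[str]], bool]) -> List[str]:
    """Unplug exactly one suspect per trial and note which trials make the problem vanish."""
    culprits = []
    for suspect in suspects:
        unplugged = frozenset([suspect])  # change one thing only
        if not problem_occurs(unplugged):
            culprits.append(suspect)
        # Everything is plugged back in before the next trial.
    return culprits


def simulated_ring_test(unplugged: FrozenSet[str]) -> bool:
    """Hypothetical stand-in for 'ring the phone and watch the screen share'.

    Returns True if the connection drops. In this simulation the drop
    happens whenever the spare bedroom unit is still plugged in.
    """
    return "spare bedroom base station" not in unplugged


SUSPECTS = [
    "upstairs base station",
    "downstairs base station",
    "spare bedroom base station",
]

print(find_culprits(SUSPECTS, simulated_ring_test))
```

Because only one device changes per trial, a trial that makes the problem disappear points at exactly one device.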
Michigan Engineering taught the same process.
CJ from Ann Arbor
Hi Allison. We computer programmers learn the same techniques when fixing bugs. Change only one thing at a time or you won’t know which fixed it. I suppose we are engineers of a sort without the fancy title and awe and respect from our friends. 😉 But the first rule is always check the cables and the second rule is check the cables. Then you can look for other problems.
For your connectivity test you could go the route of the terminal. There is a command called ping that will just send little blips of data to another computer and report the time it took. You can type man ping to see all the fancy options but you don’t need any of them for this exercise. If you say ping google.com you will get something like:
64 bytes from 173.194.37.142: icmp_seq=1 ttl=55 time=16.780 ms
or, if it can’t connect
Request timeout for icmp_seq 128
The first line tells you that it took 16+ ms to get the answer back to your ping. I guess it was named after the sonar pings in submarines (“Give me a ping, Vasili. One ping only, please.”) that go out and see if anything reflects back.
Anyway, this will continue until you type Ctrl-C and then you get a nice summary of total packets sent, received, percentage packet loss and min/avg/max/stddev on the times.
If you were doing this during your test you would see the replies coming in just fine, and then when you called your home phone you would get a few Request timeout lines until the connection sorted itself out again. The replies pick back up when the connection comes back, so it’s an easy way to see if your connection is slow or being interrupted. It can also show something going on that slows the network down but doesn’t break the connection. I’ve had electrical interference from a breaker box malfunction slow the network down to a crawl for a few seconds and then go back to normal, which drove me nearly crazy. I have also seen a hard drive malfunction cause the same result for a different reason, which is crazy, but that’s what it was. When we eliminated all the other possibilities that is what was left, and a new drive fixed it.
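If you save the ping output to a file, the interruptions can be picked out automatically instead of by eye. Here is a minimal Python sketch, assuming the two line formats quoted above; the function name find_outages and the sample log (including its second reply time) are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Patterns for the two kinds of lines shown above.
REPLY = re.compile(r"bytes from .*icmp_seq=(\d+)")
TIMEOUT = re.compile(r"Request timeout for icmp_seq (\d+)")


def find_outages(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (first_seq, last_seq) for each unbroken run of timeouts."""
    outages = []
    start = last = None
    for line in lines:
        timeout = TIMEOUT.search(line)
        if timeout:
            seq = int(timeout.group(1))
            if start is None:
                start = seq  # a new run of timeouts begins
            last = seq
        elif REPLY.search(line) and start is not None:
            outages.append((start, last))  # a reply ends the run
            start = last = None
    if start is not None:
        outages.append((start, last))  # the log ended mid-outage
    return outages


# Invented sample log in the same format as the lines above.
SAMPLE_LOG = """\
64 bytes from 173.194.37.142: icmp_seq=1 ttl=55 time=16.780 ms
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
64 bytes from 173.194.37.142: icmp_seq=4 ttl=55 time=17.102 ms
""".splitlines()

print(find_outages(SAMPLE_LOG))  # [(2, 3)]
```

Since ping sends roughly one packet per second by default, the length of each run is a rough estimate of how many seconds the connection was gone.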
The speed, of course, depends on your connection and other factors. Also, not every site out there is configured to answer pings, for security reasons. They figure you can get too much information or something, or maybe they are just paranoid. When a ping is answered, at the very least it confirms a computer exists at that address.
You can also ping an IP address so you could ping your router or another computer on your network from within your network unless you have firewalls or the like in between. It’s also fun to ping google and bing to see if one of them is faster to answer you. Remember, though, this is just answering a blip of data at the networking level – if they are under a massive denial of service attack, for instance, finding your favorite wireless phone store may take a long time even if the pings answer back quickly.
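Pinging the router and an outside site at the same time also tells you where a break is. A small Python sketch of that reasoning, with a made-up function name and made-up wording for the verdicts:

```python
def locate_outage(router_answers: bool, internet_answers: bool) -> str:
    """Interpret two simultaneous ping results: one to the router, one to an outside site."""
    if router_answers and internet_answers:
        return "no outage"
    if router_answers:
        return "local network is up; the problem is beyond the router"
    if internet_answers:
        # Odd but possible: a firewall may stop the router from answering pings.
        return "router is not answering pings, but traffic is getting through"
    return "local network itself is down"


# Allison's case: even local screen sharing dropped, so nothing would answer.
print(locate_outage(router_answers=False, internet_answers=False))
```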
Hope this is helpful. Hope you don’t ever need it but if you do it’s a useful tool.
Jim
We did the same thing only ours was easier cuz we have just one base. Yep, the phone zaps the wireless, and not just when it’s ringing. It does it if you use a wireless handset too. Will replace someday…
Engineering solution? Programming solution? How about the lawyer solution? Find out who’s to blame for making and selling the bad gear — and sue ’em!
Well, maybe not. Since our (delete scathing rant here) U.S. Supreme Court has obliterated the class action rights of consumers.
Anyway, here’s a serious tip for Allison. Before sending your phone to Goodwill, you’ll want to wipe its memory. And from a Continuing Legal Education class I took on computer forensics for lawyers, that may not be truly effective on flash memory gear like phones, fax machines, and printers with embedded or expanded memory.
I learned this method back in the System 7 days before I got Conflict Catcher. Disable half your system inits, reboot. See if problem still exists, keep narrowing down. That’s faster if you know exactly how many items you will be changing.
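Here is that halving method as a rough Python sketch. It assumes there is exactly one bad item and that the problem shows up whenever that item is enabled; the extension names and the problem_with check are hypothetical stand-ins for rebooting and looking.

```python
from typing import Callable, List, Sequence


def bisect_culprit(items: Sequence[str],
                   problem_with: Callable[[List[str]], bool]) -> str:
    """Narrow down a single bad item by testing half of the candidates at a time."""
    if not items:
        raise ValueError("need at least one item to test")
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Enable only this half, "reboot", and see if the problem is still there.
        if problem_with(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Hypothetical extension names; "F" is the bad one in this simulation.
EXTENSIONS = ["A", "B", "C", "D", "E", "F", "G", "H"]


def simulated_reboot(enabled: List[str]) -> bool:
    return "F" in enabled


print(bisect_culprit(EXTENSIONS, simulated_reboot))  # F
```

With eight items this takes three reboots instead of up to eight, which is why halving beats testing items one by one when the list is long.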
Lewis – what a coincidence – I JUST suggested that very method to the Mac Geek Gab, learned from exactly the same problem with System 7 extensions! Great minds…