If I had a dollar for every time I declared, “it’s not the network”, I’d be a millionaire. Of course, I’ve always felt sorry for my Security Engineer compatriots with their firewalls, because they’d be billionaires. I think that’s the actual flowchart: first blame the firewall, then blame the network.
What’s so ironic is that it’s been my experience (as I’m sure it has been for most every other network engineer out there) that it’s usually not the firewall or the network. As network engineers, we’ve sort of accepted this accusation as part of the troubleshooting process when something does go haywire. And the vindication when it’s discovered not to be the network can be very sweet indeed. At least, until the next time…
So why do the accusations fly then? What makes the network an easy target?
First off, don’t take it personally. It’s not personal. You’re simply the face of their perceived problem. A good analogy is a drowning swimmer: they’re in panic mode, and in that panic they will often attack and lash out at the very lifeguard trying to save them.
Well, and to be honest, the network is an easy target. It’s at the core of every transaction a user makes. Servers, databases, and other systems can be so opaque to users that they simply go by what they can see, and what they see looks like a network issue. When you think about it, I can hardly blame them. Well, almost.
And please don’t get me wrong, this isn’t a “server vs. network” thing, or me bashing the server folks. I’ve got a lot of respect for them, definitely. But they do kind of have it a little easier (I did say a little easier, not a lot!) when it comes to troubleshooting an initial issue.
But that’s when the fun begins. Our job as network engineers is to certify the network: run the pings and traceroutes; check the logs, interface stats, CPU stats, routing tables, spanning tree, and so on. Once that’s done, we can pass the baton to the other folks. But if we do, we’d better be damn sure we haven’t missed anything. And unfortunately, that takes time. And with every outage, the clock is ticking.
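As a rough illustration of that certification pass, here’s a minimal Python sketch. The host names and the check list are hypothetical, and the `ping` flags assume a Linux box; in practice you’d fold in traceroutes, interface stats, routing tables, and the rest.

```python
import subprocess

def ping_ok(host, count=2, timeout=2):
    """Return True if the host answers ICMP echo (Linux `ping` flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), host],
        capture_output=True,
    )
    return result.returncode == 0

def certify(results):
    """Summarize named pass/fail checks into a single verdict string.

    `results` maps a check name -> bool (True means the check passed).
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        return "investigate: " + ", ".join(failed)
    return "not the network"

# Hypothetical hand-off (core-sw1/core-sw2 are made-up device names):
# verdict = certify({"ping core-sw1": ping_ok("core-sw1"),
#                    "ping core-sw2": ping_ok("core-sw2")})
```

The point isn’t the specific checks; it’s that a scripted pass produces a consistent, fast answer you can hand off with confidence instead of re-deriving it under pressure.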
Hopefully, the server/system folks are troubleshooting simultaneously instead of serially, so the mean time to recovery is shorter. I’ve had that challenge in the past, where there’s a “wait and see” mentality. That’s a political struggle I’m sure a lot of you are familiar with: the whole, “my system is fine, call me when you prove it’s not the network”.
The better the teams work with each other, of course, the faster the issue gets resolved, which is and always will be job one. I always say, “we’re going to succeed or fail together, so we might as well fight together.”
Unfortunately, it happens more often than it should. It happens when people work in silos. It happens when fear takes over. The fear that you’ll fail. The fear that one human mistake will somehow get you fired. There’s a lot of pressure in these situations, and it makes the mind wander and makes you doubt yourself and your abilities. That’s when you get into trouble. Insecurity is the enemy. The best advice I can give is to focus on the work, do your best, and you’re going to be just fine. The quicker it’s over, the quicker you can move on to the next outage.
Also, don’t fall for the “it’s my system” mentality. This is when pride gets in the way. “Nothing is wrong with my system, it must be your system that’s the problem”. The mentality of “mine” is not even a realistic viewpoint. “This is my gear”. Nope, it’s the company’s gear. You’ve just been hired to look after it.
But it’s understandable why some folks might feel this way: they surmise that if there’s something wrong with the system, there’s something wrong with them. You have to fight that attitude, because at the end of the day, it really has nothing to do with you. Things break, things get horked. We’re firefighters. We can’t be blamed or held accountable for the fire itself.
So what can we do as the stewards of networking to minimize these accusations? Is there anything we can do? To be honest, not much. We have no control over what others will think, feel, or do. But that’s OK. What we can do is minimize the outage by being proactive about the network in general. Monitoring the health of the network and certifying that it’s good in an expeditious manner is the best we can do. That way, when we report to management that it’s not the network, we haven’t wasted much time.
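As one hedged example of what “proactive” can look like, here’s a small Python sketch that diffs cumulative interface error counters between two polls, so a degrading link shows up in monitoring before the accusations start flying. The interface names and the alert threshold are made up for illustration.

```python
def error_deltas(prev, curr):
    """Per-interface increase in cumulative error counters between two polls.

    `prev` and `curr` map interface name -> cumulative error count.
    Interfaces that first appear in `curr` are treated as starting from zero.
    """
    return {ifc: curr[ifc] - prev.get(ifc, 0)
            for ifc in curr
            if curr[ifc] - prev.get(ifc, 0) > 0}

def alerts(deltas, threshold=100):
    """Names of interfaces whose errors grew by at least `threshold` this poll."""
    return sorted(ifc for ifc, delta in deltas.items() if delta >= threshold)
```

Feed it two snapshots (from SNMP, an API, or scraped CLI output, whatever your gear offers) and you get a short list of interfaces worth a look, which is exactly the kind of quick, defensible answer that keeps the certification step fast.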
After all, the whole point is to restore the entire system for the customer, not to debate whose issue it actually is.