It is interesting to think about what Don Draper, the famed creative director on AMC’s Mad Men (played by Jon Hamm), would do in the 21st century if his advertising firm’s delivery of time-sensitive media were disrupted.
How would he handle cyber-crime and other critical IT failures when the advertising of his Fortune 500 clients was on the line?
Today’s creative agencies depend upon successful delivery of their electronic media across the world. We can scoff at Draper and his “three-martini root cause assessment,” but recent work with an advertising firm client showed me that, in terms of network security, we may be closer to the ’60s than we realize.
The client was preparing for a major media release. They were burning media to DVDs and hard drives and putting them on airplanes to get it to market on time because, back at headquarters, the network team was struggling to diagnose a critical failure in their distribution system. That was when they called us.
The problem was with high-speed file transfers. Some of the transfers were aborting for no apparent reason. Not all transfers failed, but those that did would do so consistently.
Consistent failure provides the opportunity to discover root cause. However, this opportunity is often squandered in an attempt to solve the problem hastily, using reactionary quick-fix measures. Many of these measures exacerbate the problem, and the undiagnosed problems they leave behind tend to resurface at the least convenient times. Don Draper, despite his vices, is a man who delivers. He would demand root cause analysis and effective mitigation from his team.
For this client, the aborted file transfers between distant locations provided a very interesting challenge. How does one analyze traffic inside or across the Internet? It’s done by capturing data, where possible, at one or more controlled test points.
But what if you don’t have test points?
Many organizations do not build test points into their network designs, and when problems arise they discover, much to their frustration, that they have no way to monitor or analyze them. If this is your case, believe me, you’re not alone. Perhaps you believed the sales pitch that your network infrastructure was perfect. Maybe it was just honest inexperience and naiveté.
Fortunately, this client did design test points inside their firewalls, servers, and client machines, so we were able to capture packet trace evidence of the problem. Upon examining packets at one location, we found that TCP (the layer 4 protocol responsible for session setup and reliable data delivery) was sending a session Reset (RST) before the transfer completed.
A native TCP Reset is sent when a host wants to abruptly abort a session; for example, a server resets all of its outstanding TCP sessions before it drops offline to reboot.
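To see what a deliberate abort looks like from the host’s side, here is a minimal sketch in Python (a demonstration on a Linux loopback, not anything from the client’s environment). Setting `SO_LINGER` with a zero timeout makes `close()` abort the session with a RST instead of performing the orderly FIN/ACK teardown, and the peer sees a connection reset:

```python
import socket
import struct
import time

# A listener and a client on localhost, for demonstration only.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # bind to any free port
server.listen(1)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
conn, _ = server.accept()

# SO_LINGER with l_onoff=1, l_linger=0: close() aborts the session
# with a RST rather than the normal orderly shutdown.
client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                  struct.pack("ii", 1, 0))
client.close()
time.sleep(0.1)  # let the RST arrive before we read

# The peer experiences the abort as a connection reset.
try:
    conn.recv(1024)
    outcome = "orderly close"
except ConnectionResetError:
    outcome = "connection reset by peer"
finally:
    conn.close()
    server.close()

print(outcome)
```

On a normal `close()` the peer’s `recv()` would simply return empty bytes; the RST path raises an error instead, which is exactly the abrupt abort the protocol intends.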
However, in this situation there was no reason for such a reset.
We analyzed the data collected at both locations and found a troubling result. Each side’s capture indicated that the Reset had been initiated by the other end, yet neither side’s capture showed itself sending one, which is impossible. The only remaining conclusion was that the Reset was being injected by an MITM (man in the middle).
The client was sure the MITM was nefarious — a hacker, a “bad guy.” We weren’t convinced that “foul play” was to blame, so we dug deeper. Sometimes a security device such as a firewall can detect suspicious activity and send out a TCP Reset to stop the activity.
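This is standard, documented firewall behavior. As an illustration (the rule and port below are hypothetical, not the client’s configuration), iptables can answer matching traffic with a forged TCP Reset instead of silently dropping it:

```shell
# Hypothetical example: abort matching TCP sessions by sending a
# forged RST back to the sender, rather than silently dropping the
# traffic as -j DROP would.
iptables -A FORWARD -p tcp --dport 8080 \
         -j REJECT --reject-with tcp-reset
```

From the endpoints’ point of view, a Reset injected this way is indistinguishable from one sent by the remote host, which is precisely what makes the next step of the analysis necessary.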
As we suspected, the offending party was indeed much closer to home: a TCP Reset from Linux.
We took a look at the IP header’s Time To Live (TTL) field, which is decremented by each router a packet crosses, to check that the number of router hops between the two nodes was consistent. Normal packet exchanges between the two nodes showed the same number of router hops, but the TCP Reset exchanges showed differing hop counts in both directions. This incongruence led to more questions.
Had the MITM exposed itself by keeping its own hop count rather than surreptitiously matching the TTL of the node it was impersonating? Yes. The MITM used not only its own TTL but also its own IP Identification (IP ID) values, the header field that uniquely numbers the datagrams a node sends.
On analyzing the evidence, we found that a device somewhere in the path was injecting TCP Resets toward both ends simultaneously, making each node believe the other wanted to reset. By tracing the number of router hops, we were able to locate the device at fault. It was nothing nefarious, but the client’s own Linux IPTABLES firewall.
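The heuristic we applied can be sketched in a few lines of plain Python. The packet records below are illustrative stand-ins for header fields parsed from a real capture, not actual client data: a genuine RST from a host traverses the same path as that host’s data packets, so its TTL should match; a middlebox injecting RSTs sits at a different hop distance, so its TTL arrives incongruent.

```python
# Minimal sketch of the TTL-incongruence check. Each dict stands in
# for one captured packet's IP/TCP header fields; a real analysis
# would parse these from a pcap file.
packets = [
    {"src": "10.0.0.5", "ttl": 52, "flags": "ACK"},  # normal data
    {"src": "10.0.0.5", "ttl": 52, "flags": "ACK"},
    {"src": "10.0.0.5", "ttl": 52, "flags": "ACK"},
    {"src": "10.0.0.5", "ttl": 57, "flags": "RST"},  # injected reset
]

def suspicious_resets(packets, src):
    """Flag RSTs whose TTL differs from the sender's normal traffic."""
    # Baseline: TTLs observed on this sender's non-RST packets.
    baseline = {p["ttl"] for p in packets
                if p["src"] == src and p["flags"] != "RST"}
    # Any RST arriving with a TTL outside that baseline likely came
    # from a device at a different hop distance, i.e. an MITM.
    return [p for p in packets
            if p["src"] == src and p["flags"] == "RST"
            and p["ttl"] not in baseline]

flagged = suspicious_resets(packets, "10.0.0.5")
```

The same comparison works on the IP ID field: an injector stamps packets from its own ID counter, so injected Resets fall outside the sequence the real host is producing.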
(In a noble effort to save the company money, the Network Security Manager had opted to employ an open-source firewall instead of a paid solution from a vendor.)
Nothing in the Linux IPTABLES firewall application or platform admitted to the behavior: neither the logs, verbose or otherwise, nor advanced debug levels indicated the device was responsible.
The entire core IT team had been sequestered for two straight weeks, and we had now been onsite for two days, all to solve this problem. It reminded me of a Mad Men episode where Chevy wants new concepts for their ad campaign by Monday, and they bring in a doctor to give the team a bit of “pep” for their 72-hour marathon weekend session.
Our team was a little more productive than the hopped-up Mad Men team, quickly replacing the open-source firewall with a paid vendor solution.
In the interest of root cause analysis, we found the same behavior on a clean install of plain Linux (without IPTABLES installed) on another hardware platform, implicating the kernel rather than the firewall application.
At this point, the new commercial solution was acceptable, in place, and operating successfully, so it was not important for the client to continue using resources to dig deeper into the Linux problem. The Network Security Manager was forgiven, and everyone went home for some well-deserved rest.
Don Draper would have been proud, signed bonus checks, and thrown a wild office party. We left quietly, satisfied that a mystery (and a serious TCP Reset problem) had been solved.