Debugging for slow/faulty network traffic

by Uwe Meding

Many cloud-based application that are distributed across the internet to accomplish tasks like

energy monitoring
security applications
building management

The internet connections are usually very easy to setup by using existing wired services or wireless air card services. In reality though none of these services gives you a 100% guarantee of a clean and good connection. This results are connections dropping intermittently, data missing, possibly bad data. A lot of errors are actually corrected by the network transportation protocols (TCP/IP) and do not interfere with the application. At most we perceive the connection to be slow. Errors that cannot be dealt with are sent to the application. In practice, these errors are very hard to debug, as they tend be intermittent, random, and not repeatable. For example, a Java stack trace like this gives very little hints about the cause:

[SEVE] 
 | java.io.IOException: Asynchronous failure: Inbound closed before receiving p
 | eers close_notify: possible truncation attack?
 |      at os.hive.channels.ssl.SSLChannel.checkChannelStillValid(SSLChannel.java:127)
 |      at os.hive.channels.ssl.SSLChannel.read(SSLChannel.java:134)
 |      at os.hive.io.impl.ServerPacketChannel.handleRead(ServerPacketChannel.java:112)
 |      at os.hive.channels.ssl.SSLChannel.fireReadEvent(SSLChannel.java:392)
 |      at os.hive.channels.ssl.SSLChannelManager.fireEvents(SSLChannelManager.java:40)
 |      at os.hive.io.SelectorThread.run(SelectorThread.java:504)
 |      at java.lang.Thread.run(Thread.java:722)
 | Caused by: javax.net.ssl.SSLException: Inbound closed before receiving peers
 |  close_notify: possible truncation attack?
 |      at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
 |      at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1639)
 |      at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1607)
 |      at sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:1537)
 |      at os.hive.channels.ssl.SSLChannel.readAndUnwrap(SSLChannel.java:177)
 |      at os.hive.channels.ssl.SSLChannel.doHandshake(SSLChannel.java:525)
 |      at os.hive.channels.ssl.SSLChannel.handleRead(SSLChannel.java:584)
 |      at os.hive.io.SelectorThread.run(SelectorThread.java:585)
 |      ... 1 more

Traffic control to the rescue
Traffic control (tc) is a program that lets you control and shape the traffic flow on a Linux system, by policing, classifying, scheduling, shaping and dropping. We can use this to create a development/debugging environment with which we can simulate various transmit errors and application behavior. Remember that these networking errors cannot be fixed in the application, however, we must be able to cope with the errors in the application and either restart a connection, or rollback a transaction etc.

Higher latency
Simulate the higher latency on the remote systems of well-behaved aircard-like behavior

# tc qdisc add dev eth0 root netem delay 120ms 15ms 50%

This delays all traffic 120ms, and varies this by 15ms about 50% of the time. This is pretty powerful already in that it delays the traffic like you would find in the real world. By the way you can have different settings on different systems.

Higher latency with some errors
Simulate a higher latency on the remote systems, we are also dropping in some random errors. Any traffic is delayed by 180ms with 30ms spread (50% of the time), 5% of all a packages are dropped, 1% is corrupt, and 2% of the packets will get shuffled.

# tc qdisc add dev eth0 root netem delay 180ms 30ms 50% drop 5% corrupt 1% reorder 2%

The interesting aspect of this is that most errors will get corrected by the low level protocol handlers. Every now and again we will see “weird” application errors happening just as you would in a full deployed setup.

Higher latency with a lot of errors

# tc qdisc add dev eth0 root netem delay 280ms 60ms 50% drop 15% corrupt 5% reorder 5%

Again, a lot of errors will get repaired on the transport level. However, this setup will cause application faults fairly consistently.

Resetting the network to normal operation
When we are done with debugging:

# tc qdisc del dev eth0 root

This removes any traffic control parameters we have setup.