Adventures in TCP latency measurement


There and back in quantifiable time.

Recently, Google have published an article on BBR, an algorithm designed to optimise throughput and latency over wide-area networks. It does this by explicitly measuring the round-trip latency and bandwidth capacity of the link between two machines (be that in a datacenter or on a mobile phone) to avoid sending more traffic than is useful, which causes queues to build up in the network that needlessly increase latency. So I thought I’d dig into some of the mechanisms in use, especially as they’re also used in general performance monitoring.

One of the ways to measure round-trip latency is described in TCP Extensions for High Performance, amongst other extensions to TCP.

In normal operation, each packet that a computer sends will have two timestamp values attached: a value from the local clock, and an echo of the latest timestamp value seen from the remote side of the connection.
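
For reference, the two values travel in the TCP timestamps option: kind 8, length 10, followed by four bytes of TSval and four bytes of TSecr. A rough sketch of extracting them from a packet’s TCP options (Python, with invented names; not the code from my tool):

    import struct

    TCP_OPT_TIMESTAMPS = 8  # RFC 7323 timestamps option kind

    def extract_timestamps(options: bytes):
        """Walk a TCP options blob; return (TSval, TSecr) or None."""
        i = 0
        while i + 1 < len(options):
            kind = options[i]
            if kind == 0:        # end-of-option-list
                break
            if kind == 1:        # single-byte no-op padding
                i += 1
                continue
            length = options[i + 1]
            if length < 2:       # malformed option; give up
                break
            if kind == TCP_OPT_TIMESTAMPS and length == 10 and i + 10 <= len(options):
                return struct.unpack_from("!II", options, i + 2)
            i += length
        return None

    # e.g. two NOPs, then a timestamps option with TSval=5, TSecr=127
    print(extract_timestamps(b"\x01\x01\x08\x0a" + struct.pack("!II", 5, 127)))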

An example from the RFC looks like this:

             TCP  A                                     TCP B
                             <A,TSval=1,TSecr=120> -->
                  <-- <ACK(A),TSval=127,TSecr=1>
                             <B,TSval=5,TSecr=127> -->
                  <-- <ACK(B),TSval=131,TSecr=5>
               . . . . . . . . . . . . . . . . . . . . . .
                             <C,TSval=65,TSecr=131> -->
                  <-- <ACK(C),TSval=191,TSecr=65>

Each packet in this scheme ends up being annotated with two timestamp values: the TSval (the outgoing timestamp) and the TSecr (the echo value). So, to measure the round-trip time on a connection, you can record the time between seeing a TSval on the outgoing stream and the same value being echoed back on the incoming stream.
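
The bookkeeping for that is pleasantly small: note when each TSval is first seen leaving the machine, and look it up again when it comes back as a TSecr. A minimal sketch of the idea (hypothetical names; a real tool keeps one of these per flow and copes with timestamp wraparound):

    import time

    class RttTracker:
        """Match outgoing TSvals against TSecr values echoed by the peer."""

        def __init__(self):
            self._sent = {}  # TSval -> monotonic time when first seen

        def on_outgoing(self, tsval: int) -> None:
            # Record only the first time we see a given TSval.
            self._sent.setdefault(tsval, time.monotonic())

        def on_incoming(self, tsecr: int):
            # The peer echoed one of our timestamps; the elapsed time
            # since we sent it approximates one round trip.
            sent_at = self._sent.pop(tsecr, None)
            if sent_at is None:
                return None
            return time.monotonic() - sent_at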

My quick hack records these for each TCP flow, accumulates them in a histogram bucketed by latency, and then exports those into Prometheus.
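
If the collector were written in Python, the official prometheus_client library does most of the export work; here is a sketch with a metric name and bucket boundaries of my own choosing (note that a per-flow label can get expensive, cardinality-wise, on a busy host):

    from prometheus_client import Histogram, start_http_server

    # Buckets picked to resolve LAN latencies up through slow WAN paths.
    TCP_RTT_SECONDS = Histogram(
        "tcp_rtt_seconds",
        "Round-trip time per TCP flow, from timestamp echoes",
        ["flow"],
        buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
    )

    def record_rtt(flow: str, rtt_seconds: float) -> None:
        TCP_RTT_SECONDS.labels(flow=flow).observe(rtt_seconds)

    if __name__ == "__main__":
        start_http_server(9100)  # expose /metrics for Prometheus to scrape
        record_rtt("10.0.0.1:22->192.0.2.7:51234", 0.020)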

However, my initial implementation contained a rather unfortunate bug, and one that is specifically mentioned in the RFC linked above. So, for example, I’m running SSH to a machine hosted in Europe, and it’s sending an update every second. Graphing the results in Grafana produces something like the following:

Graph with erroneous 1sec round trip time

For this graph, the green line represents the time taken to process packets on the local machine, and the orange is the ostensible time taken for the remote host to send and respond to packets. However, this is somewhat less than useful.

I know for sure that the round-trip time is not almost exactly a second; it’s more like ~20ms. It’s also quite telling that the reported round-trip time is the same as the frequency of updates from the remote machine.

In this case, the dotted line in the RFC diagram above represents a pause in communication of about a minute. So, packet C is the first packet that A sends after the pause, and hence contains the echoed timestamp value from before the pause. Unfortunately, that implies that the round-trip time between packets ACK(B) and C is 60s, which clearly isn’t right. To rectify this, the RFC suggests:

RTTM Rule: A TSecr value received in a segment MAY be used to update the averaged RTT measurement only if the segment advances the left edge of the send window, i.e., SND.UNA is increased.

SND.UNA in this case means the oldest unacknowledged sequence number: the first byte that we don’t yet know the receiver has confirmed receipt of. In other words, only use echo times from packets where the remote end acknowledges receipt of some previously unacknowledged data. Whilst this does mean we have fewer samples, being able to ask questions of the wrong data doesn’t really buy you much at all.
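
Folding that rule into the earlier sketch is a small change: track the highest cumulative ACK seen from the peer, and only derive an RTT from a segment whose ACK moves that edge forward. Again hypothetical, and glossing over the fact that sequence numbers wrap modulo 2**32:

    class RttmTracker(RttTracker):
        """Only trust TSecr values on segments that advance SND.UNA."""

        def __init__(self):
            super().__init__()
            self._snd_una = None  # highest cumulative ACK seen so far

        def on_incoming(self, tsecr: int, ack: int):
            if self._snd_una is not None and ack <= self._snd_una:
                # The segment doesn't advance the left edge of the send
                # window, so the echo may be stale (e.g. from before a
                # pause in the conversation): don't measure from it.
                return None
            self._snd_una = ack
            return super().on_incoming(tsecr)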

Corrected Graph

In this case, I’m just typing a command in at the prompt (hence the short period), and we end up with the rather more realistic RTT of ~20ms across the internet and back.

So, why would I care about any of this, when BBR is designed to be used in the kernel? Well, it’s quite common for network glitches to happen from time to time, especially when you rent space from a cloud provider. So having a tool to track the performance of individual network flows can be very useful when trying to debug a performance issue.

And hey, graphs are cool.