Introduction to TCP Windows and Window Shifting/Scaling
Some years ago when I was at the National Center for Atmospheric Research in Boulder, Colorado, some colleagues and I worked on a DARPA-funded project in which we had two supercomputers a couple of thousand miles apart running a distributed application communicating using TCP/IP stream sockets. The physical layer was a 155 megabit per second (OC-3) SONET link via a geosynchronous satellite. The link was capable of 622 megabit per second (OC-12) speeds, but the microwave power required would have cooked any birds flying in front of the satellite dish. We had to use TCP window shifting and nine megabyte socket buffers to make effective use of the TCP/IP pipeline. That was my introduction to the need for windowing protocols.
More recently I had reason to revisit what I had learned regarding TCP windows and window shifting. These notes are the result.
TCP sockets are bidirectional byte streams connecting two end points. Bytes may flow in either direction. Bytes are packaged for transmission into TCP segments, which are in turn packaged into IP packets. (I tend to use the terms segment and packet interchangeably, but they are different things. Things get even more complicated as packets may be combined or even split up into frames for final transmission on the physical layer.)
When sockets are established, there is an originating side and a terminating side. On systems based on the BSD IP protocol stack (which is most of them), the originating side calls the connect() function to begin the establishment of a socket connection to the terminating side. The terminating side has called the listen() function to indicate its willingness to accept incoming socket connections, and calls the accept() function to accept an incoming connection when it arrives.
Once a socket connection is established, either or both sides may function in the role of either sender or receiver or both. The sender transmits data to the receiver. Whether you send or receive data over a socket has nothing to do with whether you originated or terminated the connection.
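The roles described above can be tried out on a single machine. The following is a minimal sketch in Python, whose socket module wraps the same connect(), listen(), and accept() calls. It assumes the loopback address is usable, and the helper names recv_exact and loopback_exchange are made up for illustration. Both ends send and both ends receive, regardless of which one originated the connection.

```python
import socket


def recv_exact(sock, count):
    """Read exactly count bytes, since recv() may return fewer than asked."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def loopback_exchange():
    """Set up one connection over loopback and send data in both directions."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    originator = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    terminator = None
    try:
        listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a port
        listener.listen(1)                # terminating side is willing to accept
        originator.connect(listener.getsockname())  # originating side
        terminator, _ = listener.accept()           # terminating side
        # Either end may act as sender, receiver, or both.
        terminator.sendall(b"from terminator")
        originator.sendall(b"from originator")
        return recv_exact(originator, 15), recv_exact(terminator, 15)
    finally:
        for sock in (terminator, originator, listener):
            if sock is not None:
                sock.close()


if __name__ == "__main__":
    print(loopback_exchange())
```

The recv_exact helper is there because TCP delivers a byte stream, not messages: a single recv() call is allowed to return only part of what the other side sent.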
TCP is a reliable byte stream protocol. The protocol is reliable because all bytes sent must be acknowledged by the receiver. After a timeout period, unacknowledged bytes are resent until they are acknowledged. Byte order is preserved.
TCP is a sliding window protocol. The window size in sliding window protocols specifies the amount of data that can be sent before the sender has to pause and wait for the receiver to acknowledge it. This limit accomplishes several things. First, it is a form of flow control, preventing the sending side from overrunning the receive buffer on the receiving side. Second, it is a form of speed matching, allowing the sending side to keep sending at its own pace without having to stall and wait for the receiving side to acknowledge the sent bytes. The window size specifies how far the sender can get ahead of the receiver. (Students of producer-consumer queuing systems will recognize this immediately.) Finally, as we will see below, it is a performance mechanism to take best advantage of the characteristics of the underlying network.
(Another example of a sliding window protocol is the Link Access Protocol for the D channel. LAP-D is an ITU standard protocol used in ISDN signaling for digital telephony.)
The number of bytes that may be sent at any time before the sender must pause and wait for acknowledgement is limited by two factors: the size of the receiver’s buffer, and the size of the sender’s buffer. The size of the receiver’s buffer matters because the sender cannot send more bytes than the receiver has room to buffer; otherwise data is lost. The size of the sender’s buffer matters because the sender cannot recycle its own buffer space until the receiver has acknowledged the bytes in the send buffer, in case the network loses the data and the bytes must be resent.
The sender knows the receiver’s remaining buffer size because the receiver advertises this value as the TCP window size in each acknowledgement it sends back to the sender. The sender always knows its own send buffer size. The effective window size used by the sender is the minimum of two values: the TCP window size advertised by the receiver, which is based on the unused space in its receive buffer, and the sender’s own send buffer size. To change the effective window size for best performance, both buffer sizes, one at either end of the connection, must be tuned.
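The minimum rule can be written down in a couple of lines. This is only an illustrative model of the bookkeeping, not code from any real stack, and the function names are invented for the example.

```python
def effective_window(advertised_window, send_buffer_size):
    """The sender is limited by whichever of the two buffers is smaller."""
    return min(advertised_window, send_buffer_size)


def bytes_sendable_now(advertised_window, send_buffer_size, unacked_bytes):
    """How many more bytes may be sent before the sender must pause."""
    window = effective_window(advertised_window, send_buffer_size)
    return max(0, window - unacked_bytes)


if __name__ == "__main__":
    # A 65535 byte advertised window is of little use with a 16384 byte send buffer.
    print(effective_window(65535, 16384))
    print(bytes_sendable_now(65535, 16384, 16384))
```

With a 16384 byte send buffer the sender stalls after 16384 unacknowledged bytes, no matter how much room the receiver advertises, which is why both ends have to be tuned.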
The TCP window size specifies the number of unacknowledged bytes that may be outstanding from the sender to the receiver. The window size field in the TCP header is an unsigned sixteen-bit value. This provides for a maximum TCP window size of 0xffff or 65535 bytes, although as will be explained below, this can be circumvented. A socket will have two window sizes, one in each direction. They can be different sizes.
The receiver advertises its window size in each acknowledgement it sends back to the sender. Acknowledgements may be standalone segments, called pure acknowledgements, or they may be piggybacked on data segments being sent in the other direction. The advertised window size is the space remaining in the receiver’s buffer. This is the flow control aspect of the sliding window. The window size is also the largest number of bytes that may be sent before the sender has to wait for the receiver to reply with an acknowledgement. Sent bytes must be buffered by the sender until they are acknowledged by the receiver, in case the sender must resend them. This is the reliability aspect of TCP. The sender can run at its own rate until the receiver advertises a window size of zero. This is the speed matching aspect of TCP.
The initial TCP window size advertised by the receiver is based on the receive buffer size. It has a default size which can be different for different systems, for example Linux versus VxWorks. The default size typically isn’t optimal for any particular network (more on that later). On systems based on the BSD IP protocol stack (which includes both Linux and VxWorks), the receive buffer size may be set on a per socket basis using the setsockopt() function and the socket option SO_RCVBUF. The buffer size is specified in units of bytes.
The effective window size also depends on the send buffer size. The send buffer size may be set on a per socket basis using the same setsockopt() function and the socket option SO_SNDBUF. As before, the buffer size is in units of bytes.
In Linux and other UNIXen based on the BSD IP protocol stack, the window size computation uses the SO_SNDBUF and SO_RCVBUF of the listen socket on the terminating end at the time when it calls accept(), and of the socket on the originating end at the time when it calls connect(). The SO_SNDBUF and SO_RCVBUF socket options can be set on a per socket basis using the setsockopt() call. Not only can the terminating-side socket inherit the SO_SNDBUF and SO_RCVBUF options from the listen socket from which it is accepted, it must. The setsockopt() call must be done on the listen socket on the terminating side before the accept() call is made. Likewise, the setsockopt() on the originating side must be done before the connect() call is made. Otherwise it has no effect because the window size establishment has already been completed.
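The ordering described above looks like this in practice. The sketch below uses Python's socket module, which exposes setsockopt() with the same SO_RCVBUF and SO_SNDBUF options. The 256 KB size and the helper names are arbitrary choices for illustration, and the addresses assume loopback is available.

```python
import socket

BUFFER_BYTES = 256 * 1024  # hypothetical size; tune for the actual network


def set_buffers(sock, buffer_bytes):
    """Request receive and send buffer sizes, in bytes, for one socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)


def make_listener(buffer_bytes=BUFFER_BYTES):
    """Terminating side: set the options on the listen socket first."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    set_buffers(listener, buffer_bytes)   # before listen() and accept()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    return listener


def make_originator(address, buffer_bytes=BUFFER_BYTES):
    """Originating side: set the options before calling connect()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    set_buffers(sock, buffer_bytes)       # before connect()
    sock.connect(address)
    return sock


def buffer_sizes(sock):
    """Return (receive, send) buffer sizes as reported by the kernel."""
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```

The values read back with getsockopt() will not necessarily match what was requested. The kernel may clamp a request to a system-wide maximum, and Linux reports double the requested value to account for its own bookkeeping overhead.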
Bandwidth * Delay Product
On a perfectly reliable network, the optimal effective window size for maximum throughput is ideally the bandwidth * delay product. The bandwidth is the speed of the physical layer over which the connection runs. The delay is the round trip time or RTT of a typical data segment on that network. A long RTT can be due to either propagation delays or latency introduced by network devices. Over LAN connections, round trip times are on the order of microseconds or milliseconds. Over geosynchronous satellite connections, they are more than half a second. Over telemetry links to cometary probes in the Kuiper Belt, they are much longer.
For example, given a 100 megabit per second Ethernet and a round trip time of 2 milliseconds, the bandwidth * delay product is 25,000 bytes: (100 * 1000000 / 8) * (2 / 1000).
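The same arithmetic as a small Python function, for trying other links. The function name is made up for this sketch; it takes the bandwidth in bits per second and the RTT in seconds, and rounds to whole bytes.

```python
def bandwidth_delay_product(bits_per_second, rtt_seconds):
    """Bytes that must be in flight to keep the pipe full."""
    bytes_per_second = bits_per_second / 8
    return round(bytes_per_second * rtt_seconds)


if __name__ == "__main__":
    # 100 megabit per second Ethernet with a 2 millisecond round trip time.
    print(bandwidth_delay_product(100 * 1000000, 2 / 1000))
```

For the Ethernet case this gives the 25,000 bytes worked out above.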
The Linux et al. ping(8) command displays the RTT for each sent ICMP packet it receives back. It does this by embedding a timestamp in each sent packet and comparing it to the current time when the reflected packet is received. The Linux et al. traceroute(8) command works similarly.
> ping 192.168.1.110
64 bytes from 192.168.1.110: icmp_seq=1 ttl=127 time=0.184 ms
64 bytes from 192.168.1.110: icmp_seq=2 ttl=127 time=0.202 ms
64 bytes from 192.168.1.110: icmp_seq=3 ttl=127 time=0.207 ms
64 bytes from 192.168.1.110: icmp_seq=4 ttl=127 time=1.94 ms
The distributed supercomputer project had a bandwidth of 155 megabits per second and an RTT of half a second. The bandwidth * delay product was (155 * 1000000 / 8) * 0.5; that works out to more than nine million bytes. That may not sound like much, but that was nine megabytes of non-virtual high-speed SRAM.
Bandwidth versus Latency
Large window sizes may be necessary for networks which have high bandwidth and large latencies (due to either network or propagation delays) in order to keep the connection “pipe” full. Failure to keep the pipe full results in the end points being able to make use of only a fraction of the available bandwidth. The sending end must pause once the window size is reached, wait for the receiver to send an acknowledgement, and receive and process it. A significant percentage of the pipe remains empty, and the sender and receiver are both idle or stalled much of the time. On high latency links, this can result in a significant loss in performance.
For any link, the speed of light places a hard limit on how short the end-to-end propagation latency can be. Even on an “infinite bandwidth” link, if such a thing existed, the end-to-end propagation delay will never be zero.
Think of it this way: given a geosynchronous satellite link with an RTT of half a second and a window size of one byte, it takes at least half a second to send every byte, regardless of the bandwidth of the link. You send a byte, it takes a quarter of a second to reach the receiver, and the acknowledgement takes another quarter of a second to reach the sender, before another byte is sent. This reduces the effective throughput to no better than two bytes per second, regardless of the bandwidth of the link, unless a larger window size is used.
This is similar to the need to use larger block sizes to increase performance on I/O devices like disk drives. It reduces the per-byte overhead by amortizing the latency over a larger number of bytes.
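To put numbers on this, a sender can push at most one window of bytes per round trip. The sketch below computes that ceiling and the fraction of the link it represents. It is a simplified model that ignores packet loss, slow start, and header overhead, and the function names are invented for the example.

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound in bytes per second: one full window per round trip."""
    return window_bytes / rtt_seconds


def link_utilization(window_bytes, rtt_seconds, link_bits_per_second):
    """Fraction of the link a window-limited sender can fill, at most 1.0."""
    ceiling_bits = window_limited_throughput(window_bytes, rtt_seconds) * 8
    return min(1.0, ceiling_bits / link_bits_per_second)


if __name__ == "__main__":
    # Largest unscaled window over a half second RTT on a 155 megabit link.
    print(window_limited_throughput(65535, 0.5))
    print(link_utilization(65535, 0.5, 155 * 1000000))
```

With the largest unscaled window of 65535 bytes and a half second RTT, the ceiling is 131070 bytes per second, which is well under one percent of a 155 megabit per second link.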
RFC1323 window shifting, called window scaling by the BSD stack, allows window sizes larger than the 65535 byte maximum. Recall that the TCP window size is the most bytes the sender can have unacknowledged by the receiver before the sender stalls. Window shifting is used automatically (a vast improvement over when I first used it) when the SO_SNDBUF and SO_RCVBUF values result in a window size exceeding the unsigned sixteen-bit maximum. Each end point announces a shift count when the connection is established, and the sixteen-bit window size value is shifted left by that count to recover the actual window size. The scale factor is therefore always a power of two: you can double it, or halve it, but no values in between. This also makes the granularity of the advertised window coarser, since a window scaled by a shift count of n can only be expressed in units of 2^n bytes.
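The shift arithmetic is easy to model. The sketch below picks the smallest shift count that lets a desired window fit in the sixteen-bit field, up to the limit of 14 that RFC1323 allows. This is an illustration of the encoding only; real stacks choose the shift count from their buffer sizes in their own ways, and the function names here are made up.

```python
MAX_UNSCALED_WINDOW = 0xFFFF  # largest value of the sixteen-bit window field
MAX_SHIFT = 14                # largest shift count RFC1323 permits


def window_shift(window_bytes):
    """Smallest shift count that fits window_bytes into sixteen bits."""
    shift = 0
    while (window_bytes >> shift) > MAX_UNSCALED_WINDOW and shift < MAX_SHIFT:
        shift += 1
    return shift


def advertised_field(window_bytes, shift):
    """Value carried in the TCP header: the window shifted right."""
    return min(window_bytes >> shift, MAX_UNSCALED_WINDOW)


def represented_window(field, shift):
    """Window the peer reconstructs: the header value shifted left."""
    return field << shift


if __name__ == "__main__":
    wanted = 9687500  # the satellite link's bandwidth * delay product
    shift = window_shift(wanted)
    field = advertised_field(wanted, shift)
    print(shift, field, represented_window(field, shift))
```

The reconstructed window can fall short of the wanted one by up to 2^shift - 1 bytes, because the low-order bits are lost in the right shift.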
The ability to keep the pipe full is affected directly by the size of the send buffer; the sender must be able to buffer the bandwidth * delay product number of bytes pending acknowledgement by the receiver. The ability to recover from lost packets is affected by both the size of the send buffer and of the receive buffer; the sender must be able to resend unacknowledged bytes, and the receiver must buffer received bytes until an ordered TCP byte sequence can be reconstructed and delivered to the application.
Setting SO_SNDBUF and SO_RCVBUF to honkin’ big values, whether window scaling is used or not, is not the no-brainer it might seem, even if memory is not a constraint. The larger the TCP window size, the more bytes must be retransmitted in the event of the loss of a single TCP data segment. This consumes bandwidth and time resending bytes that would have been received and acknowledged successfully had a smaller window size been used. New bytes to be sent must wait behind the bytes being resent, adding a lot of latency for both the resent and new bytes. This can lead to jitter in constant rate byte streams, to processes on both the sending and receiving sides being blocked, to missed real-time responses; all sorts of wackiness may ensue. Tuning the socket buffer sizes for sensitive applications may be a non-trivial matter.
The Linux TCP minimum, default, and maximum SO_SNDBUF and SO_RCVBUF values are displayed by the sysctl command.
> sysctl -a
net.ipv4.tcp_rmem = 4096 87380 174760
net.ipv4.tcp_wmem = 4096 16384 131072
So in this example, unless setsockopt() is used, SO_RCVBUF will be 87380 bytes and SO_SNDBUF will be 16384 bytes. The ip-sysctl.txt documentation states that the default value of 87380 bytes for SO_RCVBUF “results in window of 65535 with default setting of tcp_adv_win_scale and tcp_app_win:0”, indicating that the use of window scaling (which appears to be what Linux calls the window shifting described in RFC1323) will not be needed if SO_RCVBUF is 87380 or smaller.
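The 65535 figure can be reproduced from the 87380 byte default. The sketch below assumes tcp_adv_win_scale is 2, which I believe was the default in kernels of that era, and follows the rule given in ip-sysctl.txt for counting buffering overhead. The function names are invented for the example.

```python
def parse_sysctl_line(line):
    """Split 'name = min default max' into the name and three integers."""
    name, _, values = line.partition("=")
    minimum, default, maximum = (int(field) for field in values.split())
    return name.strip(), (minimum, default, maximum)


def window_from_buffer(buffer_bytes, adv_win_scale=2):
    """Window left after overhead is subtracted (assumed default scale of 2)."""
    if adv_win_scale > 0:
        # Overhead is counted as buffer_bytes / 2^adv_win_scale.
        return buffer_bytes - (buffer_bytes >> adv_win_scale)
    # For zero or negative values, only buffer_bytes / 2^(-scale) is usable.
    return buffer_bytes >> (-adv_win_scale)


if __name__ == "__main__":
    name, (_, default, _) = parse_sysctl_line(
        "net.ipv4.tcp_rmem = 4096 87380 174760")
    print(name, window_from_buffer(default))
```

One quarter of 87380 is 21845, and 87380 minus 21845 is exactly 65535, the largest window that fits in the sixteen-bit field without scaling.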
The VxWorks Reference Manual 5.4 (setsockopt(), pp. 2.736-737) indicates that the default SO_SNDBUF and SO_RCVBUF are both 8192 bytes for TCP sockets unless set otherwise by setsockopt().
The Linux documentation suggests that this value is not enough to induce RFC1323 window shifting because of overhead subtracted from the receive buffer space. Since the IP stacks in both Linux and VxWorks are based on the BSD stack, I’d guess the VxWorks stack works similarly.
Linux 2.4, socket(7)
Linux 2.4, tcp(7)
Linux 2.4, sysctl(8)
Linux 2.4, linux/Documentation/networking/ip-sysctl.txt
Linux 2.4, linux/include/net/tcp.h and other source files
Wind River Systems, http://secure.windriver.com/windsurf
V. Jacobson et al., “TCP Extensions for High Performance”, RFC1323, 1992
V. Welch, “A User’s Guide to TCP Windows”, NCSA, 1996
J. Mahdavi, “Enabling High Performance Data Transfers on Hosts”, PSC, 1996
W. Stevens, TCP/IP Illustrated Volume 1: The Protocols, Addison-Wesley, 1994
G. Wright et al., TCP/IP Illustrated Volume 2: The Implementation, Addison-Wesley, 1995
J. L. Sloan firstname.lastname@example.org 2005-04-29
© 2005 by the Digital Aggregates Corporation. All rights reserved.