Why speed tests aren’t always the answer when troubleshooting Wi-Fi networks


    Background

    In my current role, we sometimes receive complaints about the Wi-Fi being slow or not working properly. When we ask what the issue is, we’re often sent nothing but speed test results, which are supposed to serve as definitive proof that something’s wrong with the Wi-Fi. What our user base often doesn’t understand is that speed tests in general involve many variables, and running them while connected to Wi-Fi adds even more. Let me try to explain.

    Whether it be wired or Wi-Fi, there are theoretical and real-world throughput maximums in networking that are affected by a number of things. For example, even when you have a 1 Gbps wired connection, chances are you’ll never get full 1 Gbps line-rate speeds in a raw throughput test. At a minimum, there is the overhead needed to put bits onto the wire, not to mention whether the latency and TCP Window Size (if using TCP) can support the line-rate speed. Latency and the TCP Window Size are two things I’ll come back to in more detail later.

    Wi-Fi considerations

    With Wi-Fi, your theoretical max speed is governed by several variables, and the biggest limitations usually fall on the client side rather than the infrastructure (AP). Those variables include the number of antennas and spatial streams (SS) supported, the bandwidth of the Wi-Fi channel being used, and the guard interval (GI). I’ll take the GI out of the equation because we typically configure the infrastructure to support the lowest possible value, and it only accounts for an approximate 10-11% increase when using the short GI of 0.4µs vs the long GI of 0.8µs. Of the variables I just mentioned, the GI has the smallest effect on the theoretical max; the number of supported SS and the channel bandwidth are the two biggest factors. For a number of reasons including battery savings and cost, most client devices (including the newer Apple Silicon MacBooks) only support 1 or 2 spatial streams. When it comes to channel bandwidth in enterprise environments where APs are deployed in large numbers, we typically use 20 MHz channels as the recommended best practice to prevent interference from neighboring APs and clients potentially communicating on the same channel. At home, where only one or a few APs exist, it’s much easier to increase the channel bandwidth to 40 or even 80 MHz, and each step up roughly doubles the maximum theoretical data rate.

    Fig. 1 MCS data rates and index table courtesy of https://www.mcsindex.com

    As you can see in the above screenshot (Fig. 1), the maximum theoretical data rate on a 2SS, Wi-Fi 5 (802.11ac) device using a 20 MHz channel is 173.3 Mbps; compared to a 1SS Wi-Fi 5 device with a max data rate of 86.7 Mbps, that’s a 2X increase just by supporting one additional spatial stream. That number bumps up to 286 Mbps if Wi-Fi 6 (802.11ax) is in use by both the client and infrastructure which should be the case in most enterprise networks at the time this blog post is published. Now look at what happens when you increase the channel bandwidth – it more than doubles each time you move up going from 173.3 Mbps to 400 Mbps to 866.7 Mbps for 20, 40, and 80 MHz bandwidths respectively when referring to that same 2SS, Wi-Fi 5 device.
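
    The figures in that table can be reproduced from the basic OFDM relationship: bits carried per symbol divided by the symbol duration (including the GI). The sketch below is only an illustration and not a substitute for the MCS table. The function name `phy_rate_mbps` is my own, and the subcarrier counts, modulation, and timing values are the 802.11ac/802.11ax parameters for the highest MCS that is valid at each channel width.

```python
def phy_rate_mbps(data_subcarriers, bits_per_subcarrier, coding_rate,
                  spatial_streams, symbol_us, guard_us):
    """Theoretical PHY data rate: bits per OFDM symbol / symbol time.

    Bits per microsecond is numerically the same as Mbps.
    """
    bits_per_symbol = (data_subcarriers * bits_per_subcarrier
                       * coding_rate * spatial_streams)
    return bits_per_symbol / (symbol_us + guard_us)


# Wi-Fi 5 (802.11ac): 3.2 us symbol plus 0.4 us short GI.
# At 20 MHz, 1SS and 2SS top out at MCS 8 (256-QAM = 8 bits, rate 3/4).
vht20_1ss = phy_rate_mbps(52, 8, 3 / 4, 1, 3.2, 0.4)   # 86.7
vht20_2ss = phy_rate_mbps(52, 8, 3 / 4, 2, 3.2, 0.4)   # 173.3
# At 40 and 80 MHz, MCS 9 (256-QAM, rate 5/6) becomes available.
vht40_2ss = phy_rate_mbps(108, 8, 5 / 6, 2, 3.2, 0.4)  # 400.0
vht80_2ss = phy_rate_mbps(234, 8, 5 / 6, 2, 3.2, 0.4)  # 866.7

# Wi-Fi 6 (802.11ax): 12.8 us symbol plus 0.8 us GI, MCS 11 (1024-QAM = 10 bits).
he20_2ss = phy_rate_mbps(234, 10, 5 / 6, 2, 12.8, 0.8)  # 286.8

# Effect of the guard interval alone on the 2SS, 20 MHz Wi-Fi 5 rate.
vht20_2ss_long_gi = phy_rate_mbps(52, 8, 3 / 4, 2, 3.2, 0.8)
gi_gain_pct = (vht20_2ss / vht20_2ss_long_gi - 1) * 100  # about 11.1

for label, rate in [("Wi-Fi 5, 20 MHz, 1SS", vht20_1ss),
                    ("Wi-Fi 5, 20 MHz, 2SS", vht20_2ss),
                    ("Wi-Fi 5, 40 MHz, 2SS", vht40_2ss),
                    ("Wi-Fi 5, 80 MHz, 2SS", vht80_2ss),
                    ("Wi-Fi 6, 20 MHz, 2SS", he20_2ss)]:
    print(f"{label}: {rate:.1f} Mbps")
print(f"Short GI vs long GI: +{gi_gain_pct:.1f}%")
```

    Notice that the spatial stream count and the subcarrier count (channel width) multiply the result directly, while the GI only nudges the denominator.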

    At this point, it’s important to note that Wi-Fi data rates do NOT equal actual throughput. That’s a common mistake and I wanted to call it out. Additionally, that max data rate is a best-case scenario. It depends on a number of things, including a clean channel with strong signal quality, so that the transmitter, which is in charge of choosing the data rate for each frame it sends, will choose a higher data rate. There is also overhead involved with using Wi-Fi, so even if you achieved 95% of that maximum (173.3 Mbps), which would be considered extremely high and probably unheard of in a busy enterprise WLAN, you would expect at most 165 Mbps of real-world throughput. As you add more clients to the WLAN, the overall throughput goes down due to the overhead of management traffic, checking and clearing the channel, collisions, etc., since only one device can speak at a time in Wi-Fi, including the AP itself. Is it starting to become clearer? Even in perfect conditions, your maximum throughput on most client devices (assuming Wi-Fi 6 is supported) in a typical enterprise office is going to be roughly 272 Mbps with just Wi-Fi in mind.

    The effects of latency and TCP Window Size

    I could stop here. I think it’s safe to say that in most enterprise environments, the limited number of spatial streams and the 20 MHz channel widths are going to be the most common caps for your max throughput. However, I still want to circle back around to latency and TCP window size because they can both play a part, especially when you start to think about remote office locations that have to traverse a WAN to tunnel traffic back to a distant, centralized location, which adds latency into the mix. Most application or web-based speed tests use TCP. Tools like iPerf are the exception: TCP is the default there too, but iPerf also allows testing with UDP, which typically results in faster speeds due to its connectionless behavior. And if you weren’t already aware, the maximum throughput that a device can expect to see using TCP as the transport protocol can be easily determined with a formula:

    TCP Throughput (bits/second) = TCP Window Size (in bits) / Latency (in seconds)
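
    As a quick sketch, here is that formula in Python, along with its inverse (the window size needed to sustain a given rate). The function names and the sample values are hypothetical and chosen only for illustration: 65,535 bytes is the largest window TCP can advertise without window scaling, and latency is taken to mean round-trip time.

```python
def tcp_max_throughput_bps(window_bytes, latency_seconds):
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / latency_seconds


def window_needed_bytes(target_bps, latency_seconds):
    """Window size required to keep a link of target_bps full."""
    return target_bps * latency_seconds / 8


# Hypothetical example: a 65,535-byte window with 20 ms of latency.
limit_bps = tcp_max_throughput_bps(65_535, 0.020)
print(f"Max throughput: {limit_bps / 1e6:.1f} Mbps")  # 26.2 Mbps

# Hypothetical example: filling a 1 Gbps link at the same 20 ms.
needed = window_needed_bytes(1e9, 0.020)
print(f"Window needed: {needed:,.0f} bytes")  # 2,500,000 bytes
```

    The takeaway is that the window size and the latency set a ceiling on their own, regardless of how fast the underlying link is.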

    The further you are away from a destination, the more latency you incur. It is common to assume 1ms of latency for every 60 miles to the destination. This doesn’t factor in the type of connection you are using (e.g. cable, DSL) which also adds latency to the equation (see below).

    • Cable modem: 5-40ms
    • DSL modem: 10-70ms
    • Dial-up modem: 100-220ms
    • Cellular: 200-600ms
    • T1: 0-10ms

    The above numbers courtesy of https://www.pingplotter.com/wisdom/article/is-my-connection-good/
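
    To put the rule of thumb and the list above together, here is a rough sketch that estimates a latency range from distance and connection type. The function name and the 1,200-mile example are hypothetical, and the result is a back-of-the-envelope estimate rather than a measurement.

```python
# Added latency ranges in milliseconds, taken from the list above.
ACCESS_LATENCY_MS = {
    "cable": (5, 40),
    "dsl": (10, 70),
    "dial-up": (100, 220),
    "cellular": (200, 600),
    "t1": (0, 10),
}

MILES_PER_MS = 60  # rule of thumb: 1 ms of latency per 60 miles


def estimated_latency_ms(distance_miles, access_type):
    """Return a (low, high) latency estimate in milliseconds."""
    distance_ms = distance_miles / MILES_PER_MS
    low, high = ACCESS_LATENCY_MS[access_type]
    return distance_ms + low, distance_ms + high


# Hypothetical example: a speed test server 1,200 miles away over cable.
low, high = estimated_latency_ms(1200, "cable")
print(f"Estimated latency: {low:.0f}-{high:.0f} ms")  # 25-60 ms
```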

    Then you’ll need to factor in the TCP Window Size. For the download portion of a speed test, the client’s window size is the number to focus on. Every operating system’s defaults are different and it also depends on the capabilities of the NIC and machine itself. Oh, and don’t forget about TCP Slow Start which can cause the numbers to be skewed during a short test as the window size is increased over time. So let’s take a look at some examples from my 2023 M2 MacBook Air that has 2SS and supports Wi-Fi 6 when connected to a 20 MHz channel:

    Note – While performing these tests, I had Wireshark running to capture the traffic so I could analyze the TCP window size, latency, and perceived throughput as these web-based tests are known for not reporting the most accurate speeds.
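
    To get a feel for how much TCP Slow Start can skew a short test, here is a deliberately simplified model. It assumes the sender starts with 10 segments of 1,460 bytes each and doubles the amount in flight every round trip with no packet loss. Real TCP stacks are more nuanced, and the 1,000,000-byte target window is a hypothetical value.

```python
def rtts_to_fill_window(target_window_bytes, mss_bytes=1460,
                        initial_segments=10):
    """Count round trips until an idealized slow start reaches the target.

    Simplified model: the congestion window starts at
    initial_segments * mss_bytes and doubles once per round trip.
    """
    cwnd = mss_bytes * initial_segments
    round_trips = 0
    while cwnd < target_window_bytes:
        cwnd *= 2
        round_trips += 1
    return round_trips


round_trips = rtts_to_fill_window(1_000_000)  # 7 round trips
for latency_ms in (2, 35, 70):
    ramp_ms = round_trips * latency_ms
    print(f"{latency_ms} ms latency: about {ramp_ms} ms before "
          f"the full window is in use")
```

    In this model the ramp-up takes the same number of round trips at any latency, so the time lost grows in direct proportion to the latency.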

    Topology and benchmark tests

    Fig. 2 Simplified topology of network used for testing

    The first test shows the client performing a speed test against the OpenSpeedTest server running on the WLANPi. The AP and the WLANPi were both connected to the same L2 switch at 1 Gbps so the absolute max throughput you could get would be 1 Gbps. Based on what we know already, our throughput will be much lower due to testing over Wi-Fi and the configuration of the WLAN, probably 25% or less of that number. By using Adrian’s WiFi Signal app, I was able to verify that my MacBook had both great RSSI and SNR, and was using the max data rate of 286.8 Mbps while performing this test; the same appeared to be the case for the AP as verified from the Mist UI, but it’s hard to get accurate numbers due to the delay of client data populating. The latency was extremely low since again, the AP and the WLANPi are connected to the same L2 switch (Fig. 2). Understanding that 286 Mbps is the bottleneck, we saw about 85% (247 Mbps) of the 286.8 Mbps with these 2 devices being the only 2 on the channel (see Fig. 4 from Adrian’s WiFi Explorer Pro 3). I’d say 85% is very good!

    Fig. 3 OpenSpeedTest on a clean channel with just the AP and client
    Fig. 4 MacBook Air connected to the test WLAN “Mist-5GHz-Only” using 20 MHz channel BW and 2SS.
    No other APs are broadcasting on channel 112 and no other clients were connected.

    Just to set the stage a bit, what would have been the max TCP throughput of the download test in a perfect world if we just looked at latency and the window size? I took the average window size reported by my MacBook Air (roughly 830,000 bytes, or 6,640,000 bits) along with the latency of 2ms to use in the formula and came up with this:

    3.32 Gbps (3320000000 bits/second) = 6640000 bits / 0.002 seconds


    Fig. 5 I/O graph of Wireshark filtered on TCP/3000 which is used by OpenSpeedTest

    With 3.32 Gbps being the max TCP throughput, our wired connections to the AP and WLANPi would have become the bottleneck. We’d need multi-gig (802.3bz) to take advantage of those speeds.

    That’s great, but most speed test servers aren’t going to be 2ms away, even at home… Best case, you’ll probably see 20-40ms of latency based on the factors I mentioned above. How can we test this though? Great question! tc, or traffic control, is a Linux utility that allows you to manipulate the kernel packet scheduler. You can do things like add artificial latency or even bandwidth limitations to simulate distance or lower speed WAN links. Even better, I found a wrapper online called tcconfig that makes configuring tc (Fig. 6) even easier.

    Fig. 6 tcconfig even allows you to add latency to a Docker container instead of the host
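
    For reference, this kind of artificial latency is typically added with tc’s netem queueing discipline. The sketch below only builds and prints the underlying tc commands rather than running them, since they need root on a Linux host. The interface name `eth0` and the helper names are assumptions for illustration, and these are not the exact commands from my test setup.

```python
import shlex


def netem_add_cmd(dev, delay_ms=None, rate_mbit=None):
    """Build a tc command that attaches netem to a device's root qdisc.

    netem on the root qdisc shapes outgoing traffic on that device.
    """
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
    if rate_mbit is not None:
        cmd += ["rate", f"{rate_mbit}mbit"]
    return cmd


def netem_del_cmd(dev):
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


# "eth0" is a placeholder; substitute the interface you want to shape.
print(shlex.join(netem_add_cmd("eth0", delay_ms=35)))
print(shlex.join(netem_add_cmd("eth0", rate_mbit=100)))
print(shlex.join(netem_del_cmd("eth0")))
```

    To actually apply one of these, you could pass the list to `subprocess.run(cmd, check=True)` as root, keeping in mind that only one root qdisc can be attached to a device at a time.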

    What happens if I add 35ms of latency? What does the formula say about that when using that same 830K window size?

    189 Mbps (189714285 bits/second) = 6640000 bits / 0.035 seconds. See how that number starts to change drastically just by adding latency? And here’s the OpenSpeedTest result below (Fig. 7). It’s important to remember that I used a rough average of the window size, not the max, so the numbers will be off slightly.

    Fig. 7 OpenSpeedTest results with 35ms of artificial latency added
    Fig. 8 iPerf using UDP with 35ms of artificial latency added is higher than the TCP-based OpenSpeedTest

    What happens with 70ms of latency?

    95 Mbps (94857142 bits/second) = 6640000 bits / 0.070 seconds. That number falls by half! And when you’re talking about WAN links that traverse half the country or more, 70ms is not an absurd number to expect. See the speed test results below (Fig. 9).

    Fig. 9 OpenSpeedTest results with 70ms of artificial latency added

    “But why is the download speed lower than the upload speed?” Have you ever gotten this question before? There could be multiple reasons for this, such as better SNR on the AP side resulting in higher data rates being used. Assuming the downlink and uplink of the Wi-Fi connection were balanced, if you remember from the first Wireshark I/O graph (Fig. 5), the server’s average TCP window size (the blue line) during the upload portion (MacBook sending data to the WLANPi) was significantly higher than my MacBook Air’s, to the tune of 3,140,000 bytes, which is almost 4X larger than the 830,000 bytes. Using the TCP throughput formula with that same 70ms of latency, that would give you a TCP throughput max of 359 Mbps, which is of course higher than the 286 Mbps limit that the Wi-Fi connection can actually provide. And as you can see, the upload throughput was still quite a bit lower than that, so there were probably additional factors involved that I was not able to identify.

    Fig. 10 iPerf throughput using UDP with 70ms of artificial latency added is higher than the TCP-based OpenSpeedTest

    What if we simulate a slower WAN link? It’s almost 2024 and yes, I’m still running into DIA (dedicated Internet access) and WAN links of 100 Mbps. With the WAN link being the bottleneck, it doesn’t matter how many spatial streams you have or what channel bandwidths you are using, you’re going to be limited to that 100 Mbps and you can see that in the download speed results below (Fig. 11):

    Fig. 11 Even though the latency is back to 2ms and the window sizes are unchanged, the download speed was limited to the 100Mbps we configured using the tc utility.
    Fig. 12 As a note, I did have to set tcconfig to rate limit the “incoming” direction as well since “outgoing” is the default.
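
    Pulling the numbers from these tests together, the theoretical ceiling is simply the smallest of the TCP limit, the Wi-Fi data rate, and the WAN link speed. The sketch below is a rough model using the average window sizes and data rate from my tests. The function names are my own, and the results are ceilings, not predictions of what a speed test will actually report.

```python
WIFI_DATA_RATE_BPS = 286.8e6     # 2SS, Wi-Fi 6, 20 MHz channel
CLIENT_WINDOW_BYTES = 830_000    # MacBook Air average (download)
SERVER_WINDOW_BYTES = 3_140_000  # WLANPi average (upload)


def tcp_limit_bps(window_bytes, latency_seconds):
    """TCP throughput ceiling from window size and latency."""
    return window_bytes * 8 / latency_seconds


def ceiling_bps(window_bytes, latency_seconds, wan_bps=None):
    """The lowest of the TCP limit, the Wi-Fi data rate, and any WAN cap."""
    limits = [tcp_limit_bps(window_bytes, latency_seconds),
              WIFI_DATA_RATE_BPS]
    if wan_bps is not None:
        limits.append(wan_bps)
    return min(limits)


for latency_ms in (2, 35, 70):
    latency_s = latency_ms / 1000
    down = ceiling_bps(CLIENT_WINDOW_BYTES, latency_s) / 1e6
    up = ceiling_bps(SERVER_WINDOW_BYTES, latency_s) / 1e6
    print(f"{latency_ms:>2} ms: download <= {down:.1f} Mbps, "
          f"upload <= {up:.1f} Mbps")

# The same 2 ms test behind a 100 Mbps WAN link.
wan_capped = ceiling_bps(CLIENT_WINDOW_BYTES, 0.002, wan_bps=100e6) / 1e6
print(f"2 ms with a 100 Mbps WAN link: download <= {wan_capped:.1f} Mbps")
```

    At 2ms the Wi-Fi data rate is the limit in both directions, at 35ms and 70ms the client’s window size takes over for downloads, and with the 100 Mbps WAN link nothing else matters.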

    And don’t forget to remove any configurations you might have made using tc or tcconfig (Fig. 13)!


    Fig. 13 Example of removing configurations set against the Docker container’s virtual Ethernet NIC

    Conclusion

    As you can see, there are a variety of factors involved in determining what results a speed test will have, especially when you add Wi-Fi into consideration. Without proper context and more information, it’s hard to take those results at face value because there are so many variables. A few of them are listed below:

    • What were the conditions of the RF environment at the time of the test?
    • How is the WLAN configured?
    • How many Wi-Fi clients were connected and active?
    • What data rates were in use, both at the client and AP?
    • What was the RTT (round-trip time) to that speed test server?
    • What TCP window sizes were used if the speed test was using TCP?
    • If traversing a WAN link, what is the link’s speed?

    Instead of chasing speed test results, focus on asking the customer more specific questions about the problems or issues that they are experiencing which prompted them to run the speed test to begin with. Gather as much information as possible and use whatever tools you have at your disposal to get to the bottom of what’s being reported and when possible, educate your customer, especially on why speed tests aren’t always the answer!

    As always, let me know what you think and feel free to join in on the discussion.