I've just read this thread:
http://marc.info/?t=131277868500001&r=1&w=2
Since this isn't on linux-wireless I'll chime in here. It seems that you are
trying to write an algorithm that will work for all networking and
802.11 devices. For networking it seems tough given driver
architecture and structure and the hope that all drivers will report
things in a fairly similar way. For 802.11 it was pointed out how we
have varying bandwidths and, depending on the technology used for the
connection (AP, 802.11s, IBSS), a different number of possible peers
to consider. 802.11 faced similar algorithmic complexities
with rate control, and the way Andrew and Derek resolved this was to
not assume the problem could be solved analytically but to simply test
the waters by trial and error; that gave birth to the minstrel rate control
algorithm, which Felix later rewrote for mac80211 with 802.11n support
[1]. Can the BQL algorithm make use of the same trial-and-error
mechanism and simply try different values, using EWMA [2] to pick
the best size for the queue?
[1] http://wireless.kernel.org/en/developers/Documentation/mac80211/RateControl/minstrel
[2] http://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
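To make this concrete, here is a minimal userspace sketch of the
trial-and-error loop I mean (everything here is invented for
illustration; it is not BQL or minstrel code): keep an EWMA-smoothed
score per candidate queue limit and pick the winner.

#include <stdio.h>

#define NCAND 4

struct cand {
    int    limit;   /* queue limit under test, in packets */
    double score;   /* EWMA of some goodness metric       */
};

static void ewma(double *avg, double sample, double alpha)
{
    *avg = (1.0 - alpha) * *avg + alpha * sample;
}

int main(void)
{
    struct cand c[NCAND] = { { 8, 0 }, { 16, 0 }, { 32, 0 }, { 64, 0 } };
    /* Made-up samples standing in for measured per-limit goodness. */
    double fake_samples[NCAND] = { 0.9, 1.4, 1.6, 1.1 };

    for (int round = 0; round < 10; round++)
        for (int i = 0; i < NCAND; i++)
            ewma(&c[i].score, fake_samples[i], 0.25);

    int best = 0;
    for (int i = 1; i < NCAND; i++)
        if (c[i].score > c[best].score)
            best = i;
    printf("best queue limit: %d packets\n", c[best].limit);
    return 0;
}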
Luis
> 2. Constant small drops in throughput
> =============================
Huge drops in throughput, actually.
> How to explain this? I have no clue. Two current theories:
>
> a. Dynamic Power save
> b. Offchannel operations on bgscans
> c. Bufferbloat: large hw queue size and sw retries
>
> One can rule out (a) and (b) by disabling Dynamic Power Save (iw dev
> wlan0 power_save off) and also bg scans. If it's (c) then we can work
> our way up to proving a solution with the same fixes for the first
> latency issue. But there are more subtle issues here.
Recent analysis (in part, yours) of the periodicity of bandwidth drops
points strongly to (a) and (b) being mis-implemented by many vendors
and OS makers, and shows a clear path toward further testing by
several labs.
> Bufferbloat
> folks talk about "ants" and "elephants". They call "elephants"
> frames that are just data, but "ants" are small frames that make
> the networks work --
This is an over-simplification of a complex issue, and wrong.
Elephants are very large *streams* of TCP data, the kind you get after
a few seconds of operation, and mice are short ones, the kind you
commonly get with non-pipelined HTTP and normal websites.
As noted elsewhere, with tons of reference material on Google and CiteSeer,
"TCP mice and elephants" are a well-researched and understood topic -
or at least were, back when network research still happened, which
was prior to the explosion and predominance of wireless.
The need for AQMs such as RED, Blue, etc, to manage elephants was
well demonstrated back in the early 90s. Being able to shoot/slow down
the elephants on a sane basis is also required to have sane shared
network behavior.
There has been a huge shift in TCP flows with first the advent of the
web for all things, and more recently, back again, with the rise of video.
> so consider 802.11 management frames, and TCP
> ACKs, and so forth.
Um, actually, if TCP ACKs are "ants", it's news to me, and I co-coined the term.
There is one special case, commonly implemented in cable modems, of
SYN and SYN/ACK optimization, that might pay off to some extent in wireless.
Upstream ACK prioritization does seem to help when it comes to
managing interactive flows on asymmetric (e.g. home) networks, as
demonstrated (for example) by wondershaper and other QoS mechanisms.
But anything involving TCP already fits into the well-defined mice
and elephants categories, not ants.
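For the curious, detecting the upstream ACKs that wondershaper-style
shapers promote is cheap; a rough userspace sketch (my own toy code,
assuming an IPv4 packet in hand and glibc's netinet headers):

#include <arpa/inet.h>     /* ntohs, htons        */
#include <netinet/ip.h>    /* struct iphdr        */
#include <netinet/tcp.h>   /* struct tcphdr       */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A "bare" ACK: TCP segment with no payload and no SYN/FIN/RST. */
static bool is_bare_tcp_ack(const uint8_t *pkt, size_t len)
{
    const struct iphdr *ip = (const struct iphdr *)pkt;

    if (len < sizeof(*ip) || ip->protocol != IPPROTO_TCP)
        return false;

    size_t ihl = ip->ihl * 4;
    if (len < ihl + sizeof(struct tcphdr))
        return false;

    const struct tcphdr *th = (const struct tcphdr *)(pkt + ihl);
    size_t thl = th->doff * 4;

    return ntohs(ip->tot_len) == ihl + thl &&   /* no payload */
           th->ack && !th->syn && !th->fin && !th->rst;
}

int main(void)
{
    static uint8_t pkt[40];
    struct iphdr *ip = (struct iphdr *)pkt;
    struct tcphdr *th = (struct tcphdr *)(pkt + 20);

    ip->ihl = 5; ip->protocol = IPPROTO_TCP; ip->tot_len = htons(40);
    th->doff = 5; th->ack = 1;
    printf("bare ack? %d\n", is_bare_tcp_ack(pkt, sizeof pkt));
    return 0;
}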
I'm told that most management frames are in a separately managed queue already.
> They argue we should prioritize these more and
> ensure we use whatever techniques we can to ensure we reduce latency
> for them.
ARP, DHCP, ND, RA, various of IPv6's ICMP-based protocols, most
routing protocols in many respects, and numerous other non-TCP
protocols all count as ants.
> At least on ath9k we only aggregate data frames, but that
> doesn't mean we are not aggregating other "ant" frames.
The approach that we are experimenting with now is putting
ant-like packets into the VO and VI queues as appropriate, and out
of the BE and BK queues.
It's too early to publish preliminary results, but I am encouraged by
what I've seen so far.
It's interesting to note that a lot of web traffic is actually marked
as bulk traffic (a TOS of bulk), which ends up in the wireless BK
queue rather than BE.
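A rough sketch of the sort of classifier I mean (the protocol numbers
and ports are the standard ones, but the policy and all the names here
are just this experiment, not shipping code):

#include <stdint.h>
#include <stdio.h>

/* 802.11e/WMM access categories. */
enum wmm_ac { AC_BK, AC_BE, AC_VI, AC_VO };

/* Bump "ant" traffic into higher ACs; everything else stays BE. */
static enum wmm_ac classify_ant(uint16_t ethertype, uint8_t ip_proto,
                                uint16_t udp_dport)
{
    if (ethertype == 0x0806)                    /* ARP            */
        return AC_VO;
    if (ip_proto == 58)                         /* ICMPv6: ND, RA */
        return AC_VO;
    if (ip_proto == 17 && (udp_dport == 67 ||   /* DHCP server    */
                           udp_dport == 68 ||   /* DHCP client    */
                           udp_dport == 53))    /* DNS            */
        return AC_VI;
    return AC_BE;
}

int main(void)
{
    printf("ARP -> AC %d\n", classify_ant(0x0806, 0, 0));
    printf("DNS -> AC %d\n", classify_ant(0x0800, 17, 53));
    return 0;
}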
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
On 30 August 2011 11:34, Tom Herbert <[email protected]> wrote:
> The generalization of BQL would be to set the queue limit in terms of
> a cost function implemented by the driver. The cost function would
> most likely be an estimate of time to transmit a packet. So C(P)
> could represent cost of a packet, sum(C(P) for P queued) is aggregate
> cost of queue packets, and queue limit is the maximum cost sum. For
> wired Ethernet, number of bytes in packet might be a reasonable
> function (although framing cost could be included, but I'm not sure
> that would make a material difference). For wireless, maybe the
> function could be more complex possibly taking multicast, previous
> history of transmission times, or other arbitrary characteristics of
> the packet into account...
>
> I can post a new patch with this generalization if this is interesting.
As I said before, I think this is the kind of thing the rate control
code needs to get its dirty hands into.
With 802.11 you have to care about the PHY side of things too, so your
cost suddenly would include the PER for combinations of {remote node,
antenna setup, TX rate, sub-frame length, aggregate length}, etc. Do
you choose that up front and then match a cost to it, or do you
involve the rate control code in deciding a "good enough" way of
handling what's on the queue by making rate decisions, then implement
random/weighted/etc drop of what's left? Do you do some weighted/etc
drop beforehand in the face of congestion, then pass what's left to
the rate control code, then discard the rest?
C(P) is going to be quite variable - a full frame retransmit of a 4ms
long aggregate frame is SUM(exponential backoff, grab the air,
preamble, header, 4ms, etc. for each pass.)
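To put rough numbers on that sum (placeholder timings of my own, not
real PHY math):

#include <stdio.h>

/* Toy version of the C(P) sum above: every (re)transmission pass of
 * an aggregate pays the fixed overhead again. All timings invented. */
static double cost_us(int passes, double backoff_us, double acquire_us,
                      double preamble_us, double header_us,
                      double payload_us)
{
    return passes * (backoff_us + acquire_us + preamble_us +
                     header_us + payload_us);
}

int main(void)
{
    /* A 4 ms aggregate that needs two full passes: */
    printf("C(P) ~ %.0f us\n", cost_us(2, 100, 50, 20, 30, 4000));
    return 0;
}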
Adrian
On Fri, Aug 26, 2011 at 04:27:22PM -0700, Luis R. Rodriguez wrote:
> I've just read this thread:
>
> http://marc.info/?t=131277868500001&r=1&w=2
>
> [snip]
Since Luis has stirred things up, I took the liberty of adding Tom's
BQL series to the debloat-testing tree:
git://git.infradead.org/debloat-testing.git
I'm not sure how well BQL will interact with my eBDP-inspired hack,
debloat-testing commit d2566a176589775cb3a1de4fade972aa62d551ff
("mac80211: implement eBDP algorithm to fight bufferbloat"). So if
anyone wants to try enabling a wireless driver for BQL, you might
consider reverting that patch or at least testing without it.
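For those who haven't read the commit, the flavour of an eBDP-style
limit is roughly the following (a loose userspace model built on my own
assumptions, not the actual debloat-testing patch): EWMA the observed
per-packet service time and size the queue so a full queue drains
within a target delay.

#include <stdio.h>

int main(void)
{
    double tserv = 0.004;          /* smoothed service time, seconds */
    const double alpha = 0.1;      /* EWMA weight                    */
    const double max_delay = 0.1;  /* 100 ms drain target            */
    double samples[] = { 0.002, 0.004, 0.003, 0.008, 0.005 };

    for (unsigned i = 0; i < sizeof samples / sizeof *samples; i++) {
        tserv = (1 - alpha) * tserv + alpha * samples[i];
        printf("sample %u: limit = %d packets\n",
               i, (int)(max_delay / tserv));
    }
    return 0;
}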
<shameless plug>
And, of course, don't forget to come to the Bufferbloat and Networking
track at Linux Plumbers Conference next week in Santa Rosa, CA! :-)
</shameless plug>
John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.
On Mon, Aug 29, 2011 at 4:10 PM, Dave Taht <[email protected]> wrote:
> My goal in collecting these recordings is to get TRANSCRIPTS of what
> is important and what is not, so that ultimately better documentation
> can be written from those than exist today.
>
> [snip]
Honestly I do not have > 1 hour to review random audio on some topic
when the only meat of what I want is in 5 minutes of the > 1 hour
talk; the reason I stuck with most of the talk, though, is that I am
quite tired of the FUD around bufferbloat and also tired of the random
ad hoc patches, e-mails and rants about it. My recommendation is to
edit the crap out and make a more condensed clip with the stuff
technical folks really care about. I'm not saying no one will find it
useful, I'm saying technical folks will puke all over it and likely
not want to even review more bufferbloat crap later. That's where I'm
at, at least.
Luis
On 31/08/2011, at 1:58 AM, Jim Gettys wrote:
> On 08/29/2011 11:42 PM, Adrian Chadd wrote:
>> On 30 August 2011 11:34, Tom Herbert <[email protected]> wrote:
>>
>> C(P) is going to be quite variable - a full frame retransmit of a 4ms
>> long aggregate frame is SUM(exponential backoff, grab the air,
>> preamble, header, 4ms, etc. for each pass.)
>>
> It's not clear to me that doing heroic measures to compute the cost is
> going to be worthwhile due to the rate at which the costs can change on
> wireless; just getting into the rough ballpark may be enough. But
> buffering algorithms and AQM algorithms are going to need an estimate of
> the *time* it will take to transmit data, more than # of bytes or packets.
That's not heroic measures; mac80211 needs all the code to calculate these times anyway, it's just a matter of collecting together some things we already know and calling the right function.
Andrew
On 08/29/2011 08:24 PM, Dave Taht wrote:
> On Mon, Aug 29, 2011 at 2:02 PM, Luis R. Rodriguez <[email protected]> wrote:
>> On Fri, Aug 26, 2011 at 4:27 PM, Luis R. Rodriguez <[email protected]> wrote:
>> Let me elaborate on 802.11 and bufferbloat, as so far I see only crap
>> documentation on this and also random crap ad hoc patches.
> I agree that the research into bufferbloat has been an evolving topic, and
> the existing documentation and solutions throughout the web are inaccurate
> or just plain wrong in many respects. While I've been accumulating better
> and more interesting results as research continues, we're not there yet...
>
>> Given that I
>> see effort on netdev to try to help with latency issues it's important
>> for netdev developers to be aware of what issues we do face today and
>> what stuff is being mucked with.
> Hear, Hear!
>
>> As far as I see it I break down the issues into two categories:
>>
>> * 1. High latencies on ping
>> * 2. Constant small drops in throughput
> I'll take on 2, in a separate email.
>
>> 1. High latencies on ping
>> ===================
> For starters, no, "high - and wildly varying - latencies on all sorts
> of packets".
>
> Ping is merely a diagnostic tool in this case.
>
> If you would like several gb of packet captures of all sorts of streams
> from various places and circumstances, ask. JG published a long
> series about 7 months back, more are coming.
>
> Regrettably most of the most recent traces come from irreproducible
> circumstances, a flaw we are trying to fix after 'CeroWrt' is finished.
>
>> It seems the bufferbloat folks are blaming the high latencies on our
>> obsession with creating huge queues on modern hardware, and also on
>> software retries. They assert that reducing the queue length
>> (ATH_MAX_QDEPTH on ath9k) and software retries (ATH_MAX_SW_RETRIES on
>> ath9k) helps with latencies. They have at least empirically tested
>> this with ath9k with
>> a simple patch:
The retries in wireless interact here only because they have encouraged
buffering for the retries. This is not unique to 802.11, but also
present in 3G networks (there, they fragment packets and put in lots of
buffering, hoping to get the packet fragment transmitted at some future
time; they really hate dropping a packet if only a piece got damaged).
>> https://www.bufferbloat.net/attachments/43/580-ath9k_lowlatency.patch
>>
>> The obvious issue with this approach is it assumes STA mode of
>> operation; with an AP you do not want to reduce the queue size like
>> that. In fact because of the dynamic nature of 802.11 and the
> If there is any one assumption about the bufferbloat issue that people
> keep assuming we have, it's this one.
>
> In article after article, in blog post after blog post, people keep
> 'fixing' bufferbloat by setting their queues to very low values,
> and almost miraculously start seeing their QoS start working
> (which it does), and then they gleefully publish their results
> as recommendations, and then someone from the bufferbloat
> effort has to go and comment on that piece, whenever we
> notice, to straighten them out.
>
> In no presentation, no documentation, anywhere I know of,
> have we expressed that queuing as it works today
> is the right thing.
>
> More recently, JG got fed up and wrote these...
>
> http://gettys.wordpress.com/2011/07/06/rant-warning-there-is-no-single-right-answer-for-buffering-ever/
>
> http://gettys.wordpress.com/2011/07/09/rant-warning-there-is-no-single-right-answer-for-buffering-ever-part-2/
Yes, I got really frustrated....
>
> At no time since the inception of the bufferbloat
> concept have we held up a fixed buffer size in any layer of the
> stack as even a potential solution.
Right now, we have typically 2 (large) buffers: the transmit queue and
the driver rings. Some hardware/software hides buffers in additional
places (e.g. on the OLPC XO-1, there are 4 packets in the wireless
module and 1 hidden in the driver itself). YMWV.
>
> And you just applied that preconception to us again.
>
> My take on matters is that *unmanaged* buffer sizes > 1 are a
> problem. Others set the number higher.
>
> Of late, given what tools we have, we HAVE been trying to establish
> what *good baseline* queue sizes (txqueues, driver queues, etc)
> actually are for wireless under ANY circumstance that was
> duplicate-able.
>
> For the drivers JG was using last year, that answer was: 0.
>
> Actually, less than 0 would have been good, but that
> would have involved having tachyon emitters in the
> architecture.
Zero is what I set the transmit queue to in my *experiments* ***only***
because I knew by that point the drivers underneath the transmit queue
had another 250 or so packets of buffering on the hardware I (and most
of you) have; I went and looked at quite a few Linux drivers, and
confirmed similar ring buffer sizes on Mac and Windows, both empirically
and, when possible, from driver control panel information. At the
bandwidth-delay product of my experiments, 250 packets is way more than
TCP will ever need. See:
http://gettys.wordpress.com/2010/11/29/home-router-puzzle-piece-one-fun-with-your-switch/
Most current ethernet and wireless drivers have that much in the
transmit rings today, on all operating systems that I've played with.
The hardware will typically support up to 4096 packet rings, but the
defaults in the drivers seem to be typically in the 200-300 packet range
(sometimes per queue).
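For scale, a quick bandwidth-delay product calculation (example numbers
of my own, not measurements from those traces):

#include <stdio.h>

int main(void)
{
    double bw_bps = 20e6;                 /* 20 Mbit/s path  */
    double rtt_s  = 0.100;                /* 100 ms RTT      */
    double bdp    = bw_bps / 8 * rtt_s;   /* bytes in flight */
    printf("BDP = %.0f bytes = %.0f full-size packets\n",
           bdp, bdp / 1500.0);            /* 250000 bytes, ~167 packets */
    return 0;
}

So even a generous home link keeps fewer than 250 full-size packets in
flight; rings that size or larger are pure queueing delay once TCP
fills them.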
Remember that any long-lived TCP session (an "elephant" flow) will fill
any size buffer just before the bottleneck link in a path, given time.
It will fill the buffer at the rate of one packet per ACK; in the traces
I took over cable modems you can watch the delay go up and up cleanly
(in my case, to 1.2 seconds when the buffers filled, after on the order
of 10 seconds). The same thing happens on 802.11 wireless, but it's
noisier in my traces as I don't have a handy Faraday cage ;-). An
additional problem, which was a huge surprise to everyone who studied
the traces, is that congestion avoidance is getting terminally confused.
By delaying packet drop (or ECN marking), TCP never slows down; it
actually continues to speed up (since current TCP algorithms typically
do not take notice of the RTT). The delay is so long that TCP's servo
system is no longer stable and it oscillates with a constant period. I
have no clue if this is at all related to the other periodic behaviour
people have noticed. If you think about it, the fact that the delay is
several orders of magnitude larger than the actual delay of the path
makes it less surprising than it might be.
Indeed there is no simple single right answer for buffering; it needs to
be dynamic, and ultimately we need to have AQM
even in hosts to control buffering (think about the case of two
different long-lived TCP sessions over vastly different bandwidth/delay
paths). The gotcha is we don't have an AQM algorithm known to work in
the face of the highly dynamic bandwidth variation that is wireless;
classic RED does not
have the output bandwidth as a parameter in its algorithm. This was/is
the great surprise to me, as I had always thought of AQM as a property of
internet routers, not hosts.
Add to that the fact that buffering in the transmit queue is completely
divorced from driver buffering, when the two need to be treated together
in some fashion.
What the "right" way to do that is, I don't know, though Andrew's
interview gave me some hope. And it needs to be dynamic, over (in the
802.11 case) at least 3 orders of magnitude.
This is a non-trivial, hard problem we have on our hands.
Computing the buffering in bytes is better than in packets; but since on
wireless multicast/broadcast is transmitted at a radically different
rate than other packets, I expect something based on time is really the
long term solution; and only the driver has any idea how long a packet
of a given flavour will likely take to transmit.
- Jim
On Mon, Aug 29, 2011 at 3:45 PM, Dave Taht <[email protected]> wrote:
> http://huchra.bufferbloat.net/~d/talks/felix/
> http://huchra.bufferbloat.net/~d/talks/andrew/
I had already listened to Andrew's audio, and got up to 1 hour with
your session with Felix but seriously considered that entire 1 hour a
waste of my life. I think you digressed completely and rambled on
about random crap instead of focusing on the core of the matter, I
gave up after 1 hour.
Luis
On 30/08/2011, at 9:02 AM, Luis R. Rodriguez wrote:
> On Fri, Aug 26, 2011 at 4:27 PM, Luis R. Rodriguez <[email protected]> wrote:
>> I've just read this thread:
>>
>> http://marc.info/?t=131277868500001&r=1&w=2
>>
>> [snip]
>
> Let me elaborate on 802.11 and bufferbloat, as so far I see only crap
> documentation on this and also random crap ad hoc patches. Given that I
> see effort on netdev to try to help with latency issues it's important
> for netdev developers to be aware of what issues we do face today and
> what stuff is being mucked with.
>
> As far as I see it I break down the issues into two categories:
>
> * 1. High latencies on ping
> * 2. Constant small drops in throughput
>
> 1. High latencies on ping
> ===================
>
> It seems the bufferbloat folks are blaming the high latencies on our
> obsession with creating huge queues on modern hardware, and also on
> software retries. They assert that reducing the queue length
> (ATH_MAX_QDEPTH on ath9k) and software retries (ATH_MAX_SW_RETRIES on
> ath9k) helps with latencies. They have at least empirically tested
> this with ath9k with
> a simple patch:
>
> https://www.bufferbloat.net/attachments/43/580-ath9k_lowlatency.patch
>
> The obvious issue with this approach is it assumes STA mode of
> operation; with an AP you do not want to reduce the queue size like
> that. In fact because of the dynamic nature of 802.11 and the
> different modes of operation it is a hard question to solve on what
> queue size you should have. The BQL effort seems to try to unify a
> solution but obviously did not consider 802.11's complexities. 802.11
> makes this very complicated given the PtP and PtMP support we have and
> random number of possible peers.
>
> Then -- we have Aggregation. At least AMPDU Aggregation seems to
> empirically deteriorate latency and bufferbloat guys seem to hate it.
> Of course their statements are baseless and they are ignoring a lot of
> effort that went into this. Their current efforts have been to reduce
> the segment size of aggregates and this seems to help but the same
> problem looms over this resolution -- the optimal aggregation segment
> size should be dynamic and my instincts tell me we likely need to also
> rely on a minstrel-like based algorithm for finding the optimal length.
Luis, as the author of that patch... I agree entirely that we want something dynamic. I was actually writing that assuming AP operation... and it works pretty well, but I wouldn't expect it to be anything like optimal in all situations.
It looks like the time limits (called segment_size in the code) in Minstrel-HT were just copied from Minstrel-classic, which was tuned for 11g timings. The patch reduces the time limits and retry counts (the SW retry limit number goes up because of another patch that changes the accounting for SW retries from passes through the queue to transmitter shots, but 50 shots is way less than 20 passes through the queue). Now, this is just a matter of having sensible defaults that are not crazily trying to use hundreds of transmit opportunities on one packet.
That patch also reduces the aggregation opportunities a bit; the driver has something approximating a target queue depth, ATH_MAX_QDEPTH, that by default was 123 packets, and I reduced it to 34. Now, why 34? Because at any smaller number, it's going to impact aggregation and therefore performance really badly, while at a much larger number it's going to affect latency. I do understand why aggregation is important; the MAC has a lot of dead air in it, the scarce resource is transmit opportunities rather than bandwidth per se, and we need to get as much as reasonably possible done with each transmit opportunity.
What this patch does assume is that most of the clients have fairly decent coverage; it's going to be really unfair to .11b clients and those right out on the fringes of the AP's coverage. But, with a fairly busy AP and decent coverage, it does actually work well in practice on the AP (I run it at home, and there's a fair bit of stuff on my network...). The demonstration is, I can have a few TCP streams running flat out from my laptop to something on the wired LAN, and still make a voip call from any device on the wireless... including the laptop (which runs OS X Lion and has a Broadcom radio, by the way, so Linux is not the only stack that does pretty well for latency). The TCP throughput is barely affected by that patch, but the latency is dramatically improved; what was over 100 ms and wildly varying (under load from TCP traffic) is now very stable and a little under 20 ms on the wireless hop.
So, yes, I do think some kind of dynamic queue sizing is called for. Also, I think we need a queue bundle per active peer (just to cover all the operation modes... that would be just the AP in STA mode), that is dynamically sized per traffic class per peer. Also, a decent qdisc that will do a reasonable job of getting latency-sensitive stuff into queues that are trying for latency rather than throughput.
And to the buffer bloat guys: yes, I do see how many buffers I'm proposing using here... it might be a couple or four thousand on a really busy AP. But the point is not to eliminate buffering, it is to use it sensibly to help performance rather than hurt as it so often does at present. The number of buffers in total doesn't matter, it's the depth of the queues seen by any particular packet passing through the network that is really critical... and too few is just as bad as too many, especially when it leads to wasting a scarce link resource, as too few buffers will do on an 802.11n AP.
>
> 2. Constant small drops in throughput
> =============================
>
> How to explain this? I have no clue. Two current theories:
>
> a. Dynamic Power save
> b. Offchannel operations on bgscans
> c. Bufferbloat: large hw queue size and sw retries
d. Someone has a broken rate control algorithm, and that is causing slowdowns and packet loss.
The study that Dave was referring to earlier was done on Windows XP clients, using APs running who knows what (we know the best of them was ath9k and Minstrel-classic, because it was running code based on OpenWRT 8.something, but about the others I have no idea). There's all kinds of variables there, and I wouldn't rule out any kind of weirdness.
I think Linux in general is in pretty good shape with respect to most of these issues, what with your work on power save and scanning, and with minstrel for rate control. Part of the bufferbloat effort is to raise awareness of queue-size issues... and what we're doing here is, basically, documenting the tradeoffs and considerations in 802.11 land.
I think a distilled version of these conversations needs to be written up as an informational RFC so the IETF has something to refer to.
>
> One can rule out (a) and (b) by disabling Dynamic Power Save (iw dev
> wlan0 power_save off) and also bg scans. If it's (c) then we can work
> our way up to proving a solution with the same fixes for the first
> latency issue. But there are more subtle issues here. Bufferbloat
> folks talk about "ants" and "elephants". They call "elephants"
> frames that are just data, but "ants" are small frames that make
> the networks work -- so consider 802.11 management frames, and TCP
> ACKs, and so forth. They argue we should prioritize these more and
> ensure we use whatever techniques we can to ensure we reduce latency
> for them. At least on ath9k we only aggregate data frames, but that
> doesn't mean we are not aggregating other "ant" frames. We at least
> now have in place code to not aggregate Voice Traffic -- that's good
> but we can do more. For example we can use AMSDU TX support for small
> frames. This means we'd need to prioritize AMSDU TX support, which we
> do not have support for in mac80211. I think this will help here, but
> consider queue size too -- we can likely get even better results here
> by ensuring we reduce latency further for them.
>
> Hope this helps sum up the issue for 802.11 and what we are faced with.
>
> Luis
It sure does help, thanks Luis.
Andrew
> Computing the buffering in bytes is better than in packets; but since on
> wireless multicast/broadcast is transmitted at a radically different
> rate than other packets, I expect something based on time is really the
> long term solution; and only the driver has any idea how long a packet
> of a given flavour will likely take to transmit.
The generalization of BQL would be to set the queue limit in terms of
a cost function implemented by the driver. The cost function would
most likely be an estimate of time to transmit a packet. So C(P)
could represent cost of a packet, sum(C(P) for P queued) is aggregate
cost of queue packets, and queue limit is the maximum cost sum. For
wired Ethernet, number of bytes in packet might be a reasonable
function (although framing cost could be included, but I'm not sure
that would make a material difference). For wireless, maybe the
function could be more complex possibly taking multicast, previous
history of transmission times, or other arbitrary characteristics of
the packet into account...
I can post a new patch with this generalization if this is interesting.
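To sketch the interface I have in mind (the names here are invented for
illustration, not the actual patch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt { size_t len; bool multicast; };

struct cost_queue {
    uint64_t cost_sum;                    /* sum of C(P) for queued P */
    uint64_t cost_limit;                  /* maximum allowed cost sum */
    uint64_t (*cost)(const struct pkt *); /* driver-supplied C(P)     */
};

/* Wired ethernet: cost is simply bytes. */
static uint64_t eth_cost(const struct pkt *p) { return p->len; }

static bool cq_may_enqueue(const struct cost_queue *q, const struct pkt *p)
{
    return q->cost_sum + q->cost(p) <= q->cost_limit;
}

static void cq_enqueue(struct cost_queue *q, const struct pkt *p)
{
    q->cost_sum += q->cost(p);
}

static void cq_complete(struct cost_queue *q, const struct pkt *p)
{
    q->cost_sum -= q->cost(p);            /* on TX completion */
}

int main(void)
{
    struct cost_queue q = { 0, 3000, eth_cost };
    struct pkt p = { 1500, false };

    if (cq_may_enqueue(&q, &p))
        cq_enqueue(&q, &p);
    cq_complete(&q, &p);
    return 0;
}

A wireless driver would swap in a cost function returning estimated
airtime rather than bytes.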
Tom
My goal in collecting these recordings is to get TRANSCRIPTS of what
is important and what is not, so that ultimately better documentation
can be written from those than exist today.
Since the byte queuing issue is hot this week, I decided to make more
people aware that these recordings existed in the hope that more
available background information on how wireless works could be
internalized by more people.
I did note that the talk with felix was an attempt at capturing and
summarizing some bufferbloat history and only towards the end did we
start moving forward to things we planned to try, while the talk with
andrew was an attempt at doing a more technical overview of quite a
few issues.
Multiple people have found these conversations useful. Certainly I
find them useful, often reviewing them a few weeks after the fact to
find subtleties that I missed originally, or needed further reading to
really understand or write code for. The earliest conversation I had
with felix (march?) still has some useful nuggets in it about how
aggregation actually works that I hope to mine one day for better
documentation. It also had some glaring errors in it...
(mostly MINE!)
http://mirrors.bufferbloat.net/podcasts/
I would certainly like to have a (recorded!) conversation with you to
have a constructive
discussion of the issues you find important, correct, and essential,
towards making wireless networking better.
Many people are capable of doing otherwise useful work while having a
podcast running in the background that might have a few percentage
points of useful information.
By all means, don't listen if you can't split the brain cells; as I
wrote, I will have a transcript of that latest one available and will
try to summarize and extract the best of what was discussed, after we
get some code written and some tests performed.
There are many other folk I would like to interview in this way, as well.
On Mon, Aug 29, 2011 at 3:54 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Mon, Aug 29, 2011 at 3:45 PM, Dave Taht <[email protected]> wrote:
>> http://huchra.bufferbloat.net/~d/talks/felix/
>> http://huchra.bufferbloat.net/~d/talks/andrew/
>
> I had already listened to Andrew's audio, and got up to 1 hour with
> your session with Felix but seriously considered that entire 1 hour a
> waste of my life. I think you digressed completely and rambled on
> about random crap instead of focusing on the core of the matter, I
> gave up after 1 hour.
>
> Luis
>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
On 30/08/2011, at 1:22 PM, Jim Gettys wrote:
> The gotcha is we don't have an AQM algorithm known to work in
> the face of the highly dynamic bandwidth variation that is wireless
It's worse than highly dynamic... the bandwidth may be completely undefined moment to moment, as it is dependent on both the wireless environment, which varies on timescales that can be about equal to a packet transmit time, and on the traffic mix. There's about 30 ms of correlation time at best.
> This was/is
> the great surprise to me as I had always thought of AQM as a property of
> internet routers, not hosts.
There's no distinction in the forwarding plane, every router is a host, every host is a router.
Andrew
On Mon, Aug 29, 2011 at 2:02 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Fri, Aug 26, 2011 at 4:27 PM, Luis R. Rodriguez <[email protected]> wrote:
> Let me elaborate on 802.11 and bufferbloat, as so far I see only crap
> documentation on this and also random crap ad hoc patches.
I agree that the research into bufferbloat has been an evolving topic, and
the existing documentation and solutions throughout the web are inaccurate
or just plain wrong in many respects. While I've been accumulating better
and more interesting results as research continues, we're not there yet...
> Given that I
> see effort on netdev to try to help with latency issues it's important
> for netdev developers to be aware of what issues we do face today and
> what stuff is being mucked with.
Hear, Hear!
> As far as I see it I break down the issues into two categories:
>
> * 1. High latencies on ping
> * 2. Constant small drops in throughput
I'll take on 2, in a separate email.
>
> 1. High latencies on ping
> ===================
For starters, no, "high - and wildly varying - latencies on all sorts
of packets".
Ping is merely a diagnostic tool in this case.
If you would like several gb of packet captures of all sorts of streams
from various places and circumstances, ask. JG published a long
series about 7 months back, more are coming.
Regrettably most of the most recent traces come from irreproducible
circumstances, a flaw we are trying to fix after 'CeroWrt' is finished.
> It seems the bufferbloat folks are blaming the high latencies on our
> obsession with creating huge queues on modern hardware, and also on
> software retries. They assert that reducing the queue length
> (ATH_MAX_QDEPTH on ath9k) and software retries (ATH_MAX_SW_RETRIES on
> ath9k) helps with latencies. They have at least empirically tested
> this with ath9k with
> a simple patch:
>
> https://www.bufferbloat.net/attachments/43/580-ath9k_lowlatency.patch
>
> The obvious issue with this approach is it assumes STA mode of
> operation; with an AP you do not want to reduce the queue size like
> that. In fact because of the dynamic nature of 802.11 and the
If there is any one assumption about the bufferbloat issue that people
keep assuming we have, it's this one.
In article after article, in blog post after blog post, people keep
'fixing' bufferbloat by setting their queues to very low values,
and almost miraculously start seeing their QoS start working
(which it does), and then they gleefully publish their results
as recommendations, and then someone from the bufferbloat
effort has to go and comment on that piece, whenever we
notice, to straighten them out.
In no presentation, no documentation, anywhere I know of,
have we expressed that queuing as it works today
is the right thing.
More recently, JG got fed up and wrote these...
http://gettys.wordpress.com/2011/07/06/rant-warning-there-is-no-single-right-answer-for-buffering-ever/
http://gettys.wordpress.com/2011/07/09/rant-warning-there-is-no-single-right-answer-for-buffering-ever-part-2/
At no time since the inception of the bufferbloat
concept have we held up a fixed buffer size in any layer of the
stack as even a potential solution.
And you just applied that preconception to us again.
My take on matters is that *unmanaged* buffer sizes > 1 are a
problem. Others set the number higher.
Of late, given what tools we have, we HAVE been trying to establish
what *good baseline* queue sizes (txqueues, driver queues, etc)
actually are for wireless under ANY circumstance that was
duplicate-able.
For the drivers JG was using last year, that answer was: 0.
Actually, less than 0 would have been good, but that
would have involved having tachyon emitters in the
architecture.
For the work that felix and andrew recently performed on the ath9k,
it looks to be about 37 for STA mode, but further testing is required
and is taking place... as well as instrumentation of the tcp stack, changes
to the system clock interrupt, and a horde of other things.
As for AP performance... don't know. Am testing, capturing streams,
testing contention, building up labs, and creating a distro to deploy
to the field to test everything with.
> different modes of operation it is a hard question to solve on what
> queue size you should have. The BQL effort seems to try to unify a
> solution but obviously did not consider 802.11's complexities.
I only noticed the presentation and thread a few days ago. I do happen
to like byte, rather than packet limits, as a start towards sanity, but
framing overhead is still a problem, and an effective API and
set of servos more so.
> 802.11
> makes this very complicated given the PtP and PtMP support we have and
> random number of possible peers.
>
> Then -- we have Aggregation. At least AMPDU Aggregation seems to
> empirically deteriorate latency and bufferbloat guys seem to hate it.
I think aggregation is awesome, actually. I've said so on multiple occasions.
After reading the early work done on it, back in the early 2000s, I thought it
could not be made to work, at all. I was wrong.
The problems I have with aggregation as seemingly commonly
implemented today are:
0) Attempts to make the media look 100% perfect
1) Head of line blocking
2) Assumption that all packets are equal.
3) Co-existence with previous forms of 802.11 is difficult
4) Horrible behavior with every AQM algorithm, particularly with fair queuing
> Of course their statements are baseless and they are ignoring a lot of
> effort that went into this. Their current efforts have been to reduce
> the segment size of aggregates and this seems to help but the same
I note that some of the ongoing suggestions actually will make it statistically
more likely that stuff emerging from higher levels of the stack will be even
more aggregate-able than it is currently.
Also, it would be good to have some statistics as to how well aggregation
is currently actually working under normal multiuser workloads.
A suggestion of felix's was to make that available via netlink broadcast.
> problem looms over this resolution -- the optimal aggregation segment
> size should be dynamic and my instincts tell me we likely need to also
> rely on a minstrel-like based algorithm for finding the optimal length.
We agree very much it should be dynamic.
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
On Mon, Aug 29, 2011 at 2:02 PM, Luis R. Rodriguez <[email protected]> wrote:
> Hope this helps sum up the issue for 802.11 and what we are faced with.
I should elaborate a bit more here to ensure people understand that
the "bufferbloat" effort assumes simply not retrying frames is a good
thing. This is incorrect. TCP's congestion algorithm is designed to
help with the network conditions, not the dynamic PHY conditions. The
dynamic PHY conditions are handled through a slew of different means:
* Rate control
* Adaptive Noise Immunity effort
Rate control is addressed either in firmware or by the driver.
Typically rate control algorithms use some sort of metrics to make a
best guess at what rate a frame should be transmitted at. Minstrel was the
first to say -- ahhh the hell with it, I give up and simply do trial
and error and keep using the most reliable one but keep testing
different rates as you go on. You fixate on the best one by using
EWMA.
What I was arguing earlier was that perhaps the same approach can be
taken for the latency issues, under the assumption the resolution is
queue size and software retries. In fact this same principle might be
applicable to the aggregation segment size as well.
Now, ANI is specific to hardware and does adjustments on hardware
based on some known metrics. Today we have fixed thresholds for these
but I wouldn't be surprised if taking minstrel-like guesses and doing
trial and error and EWMA based fixation would help here as well.
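For reference, the minstrel idea in miniature (grossly simplified from
the real mac80211 code; all numbers made up):

#include <stdio.h>

struct rate {
    double mbps;        /* nominal PHY rate              */
    double prob_ewma;   /* smoothed delivery probability */
};

/* Fold one sampling period's (ok/tries) result into the EWMA. */
static void rate_update(struct rate *r, int ok, int tries)
{
    double p = tries ? (double)ok / tries : 0.0;
    r->prob_ewma = 0.75 * r->prob_ewma + 0.25 * p;
}

int main(void)
{
    struct rate rates[] = { { 6.5, 0.9 }, { 65.0, 0.6 }, { 130.0, 0.2 } };
    rate_update(&rates[2], 2, 10);   /* a sampling result: 2/10 ok */

    int best = 0;                    /* pick max expected throughput */
    for (int i = 1; i < 3; i++)
        if (rates[i].mbps * rates[i].prob_ewma >
            rates[best].mbps * rates[best].prob_ewma)
            best = i;
    printf("best rate: %.1f Mbit/s\n", rates[best].mbps);
    return 0;
}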
Luis
On 30 August 2011 09:22, Jim Gettys <[email protected]> wrote:
Note: I'm knee deep in the aggregation TX/RX path at the present time
- I'm porting the atheros 802.11n TX aggregation code to FreeBSD.
> Computing the buffering in bytes is better than in packets; but since on
> wireless multicast/broadcast is transmitted at a radically different
> rate than other packets, I expect something based on time is really the
> long term solution; and only the driver has any idea how long a packet
> of a given flavour will likely take to transmit.
And the driver (hopefully!) can find out how long the packet -did-
take to transmit.
There are a bunch of different reasons for why the packet isn't
transmitting or why it can take so long. If (say) an aggregate has 10
hardware (long, short) retries at a high MCS rate, and then 10
software retries, that's up to 100 attempts at transmitting the
sub-frames in some way. It may also involve 10 attempts at an RTS
exchange. But it may also be 10 attempts at transmitting the -whole-
frame. In the case of a long aggregate (say the upper bounds of 4ms,
easily achievable when lower MCS rates are selected), this can take a
long time.
I'm occasionally seeing this in my testing, where the block-ack isn't
seen by the sender. The whole aggregate frame is thus retransmitted in
its entirety. This causes occasional bumps in the testing latency. The
obvious solution is to not form such large aggregates at lower MCS
rates, but even single events have an impact on latency.
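Back of the envelope (my own illustrative figures, with a simplified
doubling backoff between passes): ten software retries of a 4 ms
aggregate is already on the order of 100+ ms of added latency for
everything queued behind it.

#include <stdio.h>

int main(void)
{
    double total_ms = 0, backoff_ms = 0.1;   /* initial CW, made up */

    for (int attempt = 0; attempt < 10; attempt++) {
        total_ms += backoff_ms + 4.0;        /* backoff + 4 ms airtime */
        backoff_ms *= 2;                     /* exponential backoff    */
    }
    printf("worst-case delay ~ %.1f ms\n", total_ms);
    return 0;
}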
I'm not at the point yet where I can start tinkering with rate control
and queue management in this way but the idea of asking the rate
control code to manage per-node and overall airtime has crossed my
mind. Ie, the rate control code can see how successful transmissions
are to a given node (at given streams, rates, antenna configurations,
etc) and then enforce aggregate size and retransmission limits there.
Since a decision for any given node will affect the latency on all
subsequent nodes, it makes sense for the rate control code to keep a
global idea of the airtime involved as well as the (current) per-node
logic.
2c, which'll be more when I work the 11n TX A-MPDU kinks out in the
FreeBSD driver,
Adrian
I have a few other things to correct and comment on from the earlier
postings on this topic, but I'll start here, and work backwards.
On Mon, Aug 29, 2011 at 2:10 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Mon, Aug 29, 2011 at 2:02 PM, Luis R. Rodriguez <[email protected]> wrote:
>> Hope this helps sum up the issue for 802.11 and what we are faced with.
>
> I should elaborate a bit more here to ensure people understand that
> the "bufferbloat" effort assumes simply not retrying frames is a good
> thing. This is incorrect.
No, we don't assume that "simply not retrying frames is a good thing".
The particular patch to which you are referring is part of a series of
patches still under development and test and is already obsolete.
In particular, the bug we were stomping in that thread involved
excessive retries in the packet aggregation queue on the ath9k driver,
where a ping at distance was taking 1.6 seconds to get there.
http://www.bufferbloat.net/issues/216 - I note there was a lot of
confusing activity around this bug as this was the final piece of a
solution to why my mesh network in Nica went to h*ll in rain, and I
was trapped in a hotel at the time.
Far worse ping times have been observed in the field - 20+ seconds, or
more - and as most internet protocols were designed with at most a
little over the Earth-Moon distance in mind (~2 seconds of latency),
induced latency beyond a certain point is a very bad thing: it
introduces major problems/timeouts in what we now call the 'ant'
protocols - DHCP, DNS, ARP, ND, etc - and begins to seriously muck
with the servo mechanisms within TCP itself.
Please note that ping is merely a diagnostic - all sorts of packets
are exhibiting unbounded latency across most wireless standards.
Retrying wireless frames with bounded latency is a good thing.
Dropping packets to signal congestion is a good thing, also. Knowing
when to drop a packet is a very good thing. Preserving all packets, no
matter the cost, leads to RFC 970.
> TCP's congestion algorithm is designed to
> help with the network conditions, not the dynamic PHY conditions.
Agreed, although things like TCP Westwood, and the earlier Vegas,
attempt to also measure latencies more dynamically.
Also, I have always been an advocate of using "split TCP" when making
the jump from wired to wireless.
> The
> dynamic PHY conditions are handled through a slew of different means:
>
> * Rate control
> * Adaptive Noise Immunity effort
>
> Rate control is addressed either in firmware or by the driver.
> Typically rate control algorithms use some sort of metrics to make a
> best guess at what rate a frame should be transmitted at. Minstrel was the
> first to say -- ahhh the hell with it, I give up and simply do trial
> and error and keep using the most reliable one but keep testing
> different rates as you go on. You fixate on the best one by using
> EWMA.
Which I like, very much.
I note that the gargoyle traffic shaper attempts to use the number of
packets in a conntracked connection as a measure of the TCP
mice/elephant transition, to determine which traffic class the stream
should be in.
It cannot, however, detect an elephant/mouse transition, and perhaps
if there were also an EWMA in conntrack, it might help the shaping
problem somewhat.
The concepts of "TCP mice and elephants" are well established in the
literature (see google); the concept of an 'ant', however, is not. It's
a new usage we have tried to establish to raise the importance of the
lower-latency, system-critical packets on local wireless LANs.
> What I was arguing earlier was that perhaps the same approach can be
> taken for the latency issues, under the assumption the resolution is
> queue size and software retries. In fact this same principle might be
> applicable to the aggregation segment size as well.
An EWMA of the time it takes packets to make a given next-hop
destination, fed back to higher layers, would help.
Also, somehow passing up the stack that 'this (wireless-n) destination
can handle an aggregate of 32 packets or XX bytes' while 'this
destination (g or b) can't' would make
applying higher-level traffic shaping and fairness algorithms such as
SFB, RED, SFQ, and HFSC actually somewhat useful.
There is also the huge weight of wireless multicast packets on
wireless-n, which can be well over 300:1 vs normal packets at present,
to somehow account for throughout the stack. It only takes a little
multicast to mess up your whole day.
> Now, ANI is specific to hardware and does adjustments on hardware
> based on some known metrics. Today we have fixed thresholds for these
> but I wouldn't be surprised if taking minstrel-like guesses and doing
> trial and error and EWMA based fixation would help here as well.
In this 80-minute discussion of the interaction of the wireless stack,
between myself and Felix Fietkau, we attempt to summarize all the work
re bufferbloat and wireless specific to the ath9k done to date, and in
the last 10 minutes or so speculate as to further solutions, based on
and expanding on Andrew's input from the previous week.
http://huchra.bufferbloat.net/~d/talks/felix/
There will be a transcript available soon.
Felix has (as of this morning) already implemented pieces of what we discussed.
There is an earlier discussion with Andrew McGregor, as well as
transcript, here:
http://huchra.bufferbloat.net/~d/talks/andrew/
The transcriptionist struggles mightily with the acronyms and I
haven't got around to trying to fix it yet. I'd like very much to
capture andrew's descriptions of how minstrel actually works and one
day fold all this stuff together, along with how power management
actually works on wireless, etc, so more people internalize how
wireless really works.
--
Dave Täht
http://www.bufferbloat.net
On 30/08/2011, at 3:42 PM, Adrian Chadd wrote:
> On 30 August 2011 11:34, Tom Herbert <[email protected]> wrote:
>
>> The generalization of BQL would be to set the queue limit in terms of
>> a cost function implemented by the driver. The cost function would
>> most likely be an estimate of time to transmit a packet.
That's a great idea. Best that it be in nanoseconds; we may well have some seriously fast network interfaces to deal with.
>> So C(P)
>> could represent cost of a packet, sum(C(P) for P queued) is aggregate
>> cost of queue packets, and queue limit is the maximum cost sum. For
>> wired Ethernet, number of bytes in packet might be a reasonable
>> function (although framing cost could be included, but I'm not sure
>> that would make a material difference). For wireless, maybe the
>> function could be more complex possibly taking multicast, previous
>> history of transmission times, or other arbitrary characteristics of
>> the packet into account...
>>
>> I can post a new patch with this generalization if this is interesting.
>
> As I said before, I think this is the kind of thing the rate control
> code needs to get its dirty hands into.
>
> With 802.11 you have to care about the PHY side of things too, so your
> cost suddenly would include the PER for combinations of {remote node,
> antenna setup, TX rate, sub-frame length, aggregate length}, etc. Do
> you choose that up front and then match a cost to it, or do you
> involve the rate control code in deciding a "good enough" way of
> handling what's on the queue by making rate decisions, then implement
> random/weighted/etc drop of what's left? Do you do some weighted/etc
> drop beforehand in the face of congestion, then pass what's left to
> the rate control code, then discard the rest?
Since Minstrel already knows an estimate of the PER to each remote node (expressed in terms of success probability per shot, so there's a bit of math to do), and the stack knows about transmit times, implementing a way to ask the question isn't particularly hard. Other rate controls could make up their own guesstimates based on whatever factors they want to use.
However, that's going to change rapidly, so I suggest that we want some backlog grooming on a regular basis (like, after each rate control iteration) that might well reevaluate and drop or mark packets that are already in the queues.
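The "bit of math" is roughly geometric: with per-shot success
probability p, you pay for shot i only if the first i-1 failed, so
expected airtime comes to about t_shot * (1 - (1-p)^n) / p with an
n-shot cap. A sketch (timings invented):

#include <stdio.h>

static double expected_airtime(double t_shot, double p, int max_shots)
{
    double e = 0.0, still_trying = 1.0;

    for (int i = 0; i < max_shots; i++) {
        e += still_trying * t_shot;    /* pay for this shot if reached */
        still_trying *= (1.0 - p);     /* chance we still need another */
    }
    return e;
}

int main(void)
{
    printf("p=0.9: %.2f ms\n", expected_airtime(0.5, 0.9, 10));
    printf("p=0.3: %.2f ms\n", expected_airtime(0.5, 0.3, 10));
    return 0;
}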
>
> C(P) is going to be quite variable - a full frame retransmit of a 4ms
> long aggregate frame is SUM(exponential backoff, grab the air,
> preamble, header, 4ms, etc. for each pass.)
>
>
> Adrian
Indeed.
Andrew
On 08/29/2011 09:59 PM, Andrew McGregor wrote:
> On 30/08/2011, at 1:22 PM, Jim Gettys wrote:
>
>> The gotcha is we don't have an AQM algorithm known to work in
>> the face of the highly dynamic bandwidth variation that is wireless
> It's worse than highly dynamic... the bandwidth may be completely undefined moment to moment, as it is dependent on both the wireless environment, which varies on timescales that can be about equal to a packet transmit time, and on the traffic mix. There's about 30 ms of correlation time at best.
Yup. It makes ethernet look trivial. If we can handle buffers there,
everywhere else is easy by comparison.
>> This was/is
>> the great surprise to me as I had always thought of AQM as a property of
>> internet routers, not hosts.
> There's no distinction in the forwarding plane, every router is a host, every host is a router.
Exactly; but I hadn't thought this through to hosts, nor, I think, had
many other people. Naive me, having been scarred by '90s congestion,
was aware of RED, and roughly how it worked, but always thought of AQM
as something you did in routers. Realising that in principle I needed
it turned on in my laptop was not something I expected.
- Jim
On Wed, Aug 31, 2011 at 6:28 AM, Jim Gettys <[email protected]> wrote:
> On 08/30/2011 05:47 PM, Andrew McGregor wrote:
>> On 31/08/2011, at 1:58 AM, Jim Gettys wrote:
>>
>>> On 08/29/2011 11:42 PM, Adrian Chadd wrote:
>>>> On 30 August 2011 11:34, Tom Herbert <[email protected]> wrote:
>>>>
>>>> C(P) is going to be quite variable - a full frame retransmit of a 4ms
>>>> long aggregate frame is SUM(exponential backoff, grab the air,
>>>> preamble, header, 4ms, etc. for each pass.)
>>>>
>>> It's not clear to me that doing heroic measures to compute the cost is
>>> going to be worthwhile due to the rate at which the costs can change on
>>> wireless; just getting into the rough ballpark may be enough. But
>>> buffering algorithms and AQM algorithms are going to need an estimate of
>>> the *time* it will take to transmit data, more than # of bytes or packets.
>> That's not heroic measures; mac80211 needs all the code to calculate these times anyway, it's just a matter of collecting together some things we already know and calling the right function.
>>
>>
>
> Fine; if it's easy, accurate is better (presuming the costs get
> recalculated when circumstances change). We also will need the amount of
> data being transmitted; it is the rate of transmission (the rate at
> which the buffers are draining) that we'll likely need.
>
> Here's what I've gleaned from reading "RED in a Different Light", Van
> Jacobson's Mitre talk, and several conversations with Kathleen Nichols
> and Van: AQM algorithms that can handle variable-bandwidth environments
> will need to be able to know the rate at which buffers empty. It's the
> direction they are going with their experiments on a "RED light" algorithm.
>
> The fundamental insight as to why classic RED can't work in the wireless
> environment is that the instantaneous queue length has little actual
> information in it.
Can you elaborate? Mind you I know squat about AQM but am curious
about what *is* required here.
> Classic RED is tuned using the queue length as its
> basic parameter. Their belief as to algorithms that will work is that
> they need to keep track of the running queue length *minimum over time*;
> you want to keep the buffers small on a longer-term basis (so they both
> can absorb transients, which is their reason for existence, while
> keeping the latency typically low).
I think I get this.
Now, I do care about the above details of what is needed, but it
still puzzles me that no algorithm has been developed that simply
does trial and error to converge on the appropriate queue length. Do
we know the outcome of using a "bad" queue length -- is it the latency
issues? If so, can we not use latency as a reactive measure to help us
fine-tune the desired queue length? This would be PHY agnostic; a
sketch follows below. This is what I was alluding to earlier when I
mentioned using the approach Minstrel takes to solve the rate control
problem. I've lately been fixating on these sorts of algorithmic
approaches as solutions to complex problems, as inspired by Tim
Harford's elaboration on the God complex in a TED Talk:
http://www.youtube.com/watch?v=K5wCfYujRdE
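For illustration, a rough sketch of what such a minstrel-like tuner
could look like: a few candidate limits scored by an EWMA of observed
latency, with occasional sampling of the non-best candidates. The names,
the latency-only scoring and the 10% sampling rate are all invented; a
real tuner would also have to weigh throughput, since latency alone
always favours the shortest queue:

#include <stdint.h>
#include <stdlib.h>

#define NCAND 4

struct qtuner {
	uint32_t cand[NCAND];     /* candidate queue limits (packets) */
	uint32_t ewma_us[NCAND];  /* smoothed latency per candidate   */
	int cur;                  /* candidate currently in use       */
};

/* EWMA with weight 1/8, the same style of smoothing minstrel uses. */
static uint32_t ewma(uint32_t old, uint32_t sample)
{
	return old - (old >> 3) + (sample >> 3);
}

/* Feed back the latency each completed transmission experienced. */
static void qtuner_report(struct qtuner *t, uint32_t latency_us)
{
	t->ewma_us[t->cur] = ewma(t->ewma_us[t->cur], latency_us);
}

/* Pick the limit to use next: ~10% of the time sample at random,
 * otherwise run with the best-scoring candidate. */
static uint32_t qtuner_next_limit(struct qtuner *t)
{
	int i, best = 0;

	if (rand() % 10 == 0) {
		t->cur = rand() % NCAND;
		return t->cand[t->cur];
	}
	for (i = 1; i < NCAND; i++)
		if (t->ewma_us[i] < t->ewma_us[best])
			best = i;
	t->cur = best;
	return t->cand[best];
}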
> The additional major challenge we
> face that core routers do not is the highly variable traffic of mixed
> mice and elephants. What actually works only time will tell.
Thank you for elaborating on this; it helps to further dissect the
issues. Mixing them together makes the task harder to address.
As for 802.11 we'd just need to worry about management frames. Dave
and Andrew had also pointed out issues with multicast and the low
rates used for them in comparison to unicast transmissions.
> So in an environment in which the rate of transmission is highly
> variable,
Not only that; remember we have different possible topologies
depending on the type of 802.11 service used: IBSS, the typical AP-STA
environment, Mesh, P2P (AP-STA really), and now with 802.11ad we get
PBSS. What counts here is that we may have variable rates of
transmission to different peers, so the queue length will depend on all
of the expected transmissions.
> such as wireless, or even possibly modern broadband with
> PowerBoost, classic RED or similar algorithms that do not take the
> buffer drain rate into account cannot possibly hack it properly.
Understood, just curious if anyone has tried a Minstrel approach.
Let's try to document all of this here:
http://wireless.kernel.org/en/developers/bufferbloat
Luis
On Fri, Aug 26, 2011 at 4:27 PM, Luis R. Rodriguez <[email protected]> wrote:
Let me elaborate on 802.11 and bufferbloat, as so far I see only crap
documentation on this and also random crap ad-hoc patches. Given that I
see effort on netdev to try to help with latency issues, it's important
for netdev developers to be aware of what issues we do face today and
what stuff is being mucked with.
As far as I see it I break down the issues into two categories:
* 1. High latencies on ping
* 2. Constant small drops in throughput
1. High latencies on ping
===================
It seems the bufferbloat folks are blaming the high latencies on modern
hardware's obsession with huge queues, and also on software retries.
They assert that reducing the queue length (ATH_MAX_QDEPTH on ath9k) and
software retries (ATH_MAX_SW_RETRIES on ath9k) helps with latencies.
They have at least empirically tested this on ath9k with a simple patch:
https://www.bufferbloat.net/attachments/43/580-ath9k_lowlatency.patch
The obvious issue with this approach is that it assumes STA mode of
operation; with an AP you do not want to reduce the queue size like
that. In fact, because of the dynamic nature of 802.11 and the
different modes of operation, what queue size you should have is a hard
question. The BQL effort seems to try to unify a solution but obviously
did not consider 802.11's complexities. 802.11 makes this very
complicated given the PtP and PtMP support we have and an arbitrary
number of possible peers.
Then -- we have aggregation. At least AMPDU aggregation seems to
empirically deteriorate latency, and the bufferbloat guys seem to hate
it. Of course their statements are baseless and they are ignoring a lot
of effort that went into this. Their current efforts have been to reduce
the segment size of aggregates and this seems to help, but the same
problem looms over this resolution -- the optimal aggregation segment
size should be dynamic, and my instincts tell me we likely need to also
rely on a minstrel-like algorithm to find the optimal length (see the
sketch below).
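For example, a dynamic segment size could be derived from a target
airtime at the current rate, so the aggregate shrinks automatically as
rate control drops the rate. A sketch, with invented names:

#include <stdint.h>

/* Size aggregates by airtime rather than bytes. At 54000 kbps a
 * 4000 us budget allows about 27000 bytes, for instance. */
static uint32_t max_aggr_bytes(uint32_t rate_kbps, uint32_t airtime_us)
{
	/* bits = kbit/s * us / 1000; divide by a further 8 for bytes */
	return (uint32_t)(((uint64_t)rate_kbps * airtime_us) / 8000);
}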
2. Constant small drops in throughput
=============================
How to explain this? I have no clue. Two current theories:
a. Dynamic Power save
b. Offchannel operations on bgscans
c. Bufferbloat: large hw queue size and sw retries
One can rule out (a) and (b) by disabling Dynamic Power Save (iw dev
wlan0 set power_save off) and also bg scans. If it's (c) then we can
work our way up to proving a solution with the same fixes as for the
first latency issue. But there are more subtle issues here. Bufferbloat
folks talk about "ants" and "elephants". They call "elephants" the
frames that are just data, while "ants" are the small frames that make
the networks work -- so consider 802.11 management frames, and TCP
ACKs, and so forth. They argue we should prioritize these more and
use whatever techniques we can to reduce latency for them. At least on
ath9k we only aggregate data frames, but that doesn't mean we are not
aggregating other "ant" frames. We at least now have in place code to
not aggregate voice traffic -- that's good, but we can do more. For
example we can use AMSDU TX support for small frames. This means we'd
need to prioritize AMSDU TX, which we do not have support for in
mac80211. I think this will help here, but consider queue size too --
we can likely get even better results by ensuring we reduce latency
further for these frames.
Hope this helps sum up the issue for 802.11 and what we are faced with.
Luis
On 08/29/2011 11:42 PM, Adrian Chadd wrote:
> On 30 August 2011 11:34, Tom Herbert <[email protected]> wrote:
>
>> The generalization of BQL would be to set the queue limit in terms of
>> a cost function implemented by the driver. The cost function would
>> most likely be an estimate of time to transmit a packet. So C(P)
>> could represent cost of a packet, sum(C(P) for P queued) is aggregate
>> cost of queue packets, and queue limit is the maximum cost sum. For
>> wired Ethernet, number of bytes in packet might be a reasonable
>> function (although framing cost could be included, but I'm not sure
>> that would make a material difference). For wireless, maybe the
>> function could be more complex possibly taking multicast, previous
>> history of transmission times, or other arbitrary characteristics of
>> the packet into account...
>>
>> I can post a new patch with this generalization if this is interesting.
> As I said before, I think this is the kind of thing the rate control
> code needs to get its dirty hands into.
>
> With 802.11 you have to care about the PHY side of things too, so your
> cost suddenly would include the PER for combinations of {remote node,
> antenna setup, TX rate, sub-frame length, aggregate length}, etc. Do
> you choose that up front and then match a cost to it, or do you
> involve the rate control code in deciding a "good enough" way of
> handling what's on the queue by making rate decisions, then implement
> random/weighted/etc drop of what's left? Do you do some weighted/etc
> drop beforehand in the face of congestion, then pass what's left to
> the rate control code, then discard the rest?
>
> C(P) is going to be quite variable - a full frame retransmit of a 4ms
> long aggregate frame is SUM(exponential backoff, grab the air,
> preamble, header, 4ms, etc. for each pass.)
>
It's not clear to me that doing heroic measures to compute the cost is
going to be worthwhile due to the rate at which the costs can change on
wireless; just getting into the rough ballpark may be enough. But
buffering algorithms and AQM algorithms are going to need an estimate of
the *time* it will take to transmit data, more than # of bytes or packets.
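To illustrate Tom's C(P) generalization in those terms, a time-based
cost and queue budget might look roughly like this. The names and the
fixed per-frame overhead are invented, and as Adrian notes a real
802.11 version would have to fold in PER, retries and aggregation:

#include <stdint.h>

struct txq_est {
	uint64_t queued_us;  /* sum of C(P) for everything queued */
	uint64_t limit_us;   /* budget, e.g. a few ms of airtime  */
};

/* C(P): serialization time at the current rate plus fixed per-frame
 * overhead (preamble, backoff, ...). rate_kbps must be nonzero. */
static uint64_t cost_us(uint32_t bytes, uint32_t rate_kbps,
			uint32_t overhead_us)
{
	return ((uint64_t)bytes * 8 * 1000) / rate_kbps + overhead_us;
}

/* Returns nonzero if the packet still fits under the time budget. */
static int txq_may_enqueue(struct txq_est *q, uint32_t bytes,
			   uint32_t rate_kbps, uint32_t overhead_us)
{
	uint64_t c = cost_us(bytes, rate_kbps, overhead_us);

	if (q->queued_us + c > q->limit_us)
		return 0;
	q->queued_us += c;
	return 1;
}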
Ultimately, if the queue starts to build, we'll need an AQM algorithm to
control the buffer growth.
Hopefully we can start testing SFB and other possibilities in CeroWrt
soon; Kathleen Nichols and Van Jacobson have been making some progress
on an algorithm called "RED light" which is based on the observed
transfer rate as well. The eBDP algorithm in the debloat-testing tree
also helps; Van pointed us at it late last year when this came up (though
John Linville says eBDP needs rework before it can go upstream). We
didn't want to start testing SFB and other options while we were aware
of other problems in the wireless driver itself; Andrew and Felix's work
with Dave has apparently brought that problem to a decent point.
- Jim
.. Whilst knee-deep in it all, I should also mention things like
"transmission opportunity/beacon transmit time being enforced by
hardware" - ie, where the hardware stops a packet TX from occuring
because it would exceed the programmed TXOP window, or because it
would interfere with an upcoming beacon TX from the hostap.
Another 2c,
Adrian
On 09/02/2011 05:59 PM, Luis R. Rodriguez wrote:
> BTW Jim, as you may have found out after sending this reply, vger
> hates HTML e-mails so your e-mail only got to those on the CC list but
> not to the list. I'm replying below, but since I'm now using ASCII
> text the look of the e-mail will look like poo, I'll try to trim it
> somehow as I did find valuable information you provided there. I also
> make some suggestions below.
Thunderbird is giving me fits in this department. Fixed now. My
apologies to everyone who abhors HTML mail.
>
> On Thu, Sep 1, 2011 at 8:39 AM, Jim Gettys <[email protected]> wrote:
>> On 08/31/2011 04:50 PM, Luis R. Rodriguez wrote:
>>
>> On Wed, Aug 31, 2011 at 6:28 AM, Jim Gettys <[email protected]> wrote:
>>
>> On 08/30/2011 05:47 PM, Andrew McGregor wrote:
>>
>> On 31/08/2011, at 1:58 AM, Jim Gettys wrote:
>>
>> On 08/29/2011 11:42 PM, Adrian Chadd wrote:
>>
>> On 30 August 2011 11:34, Tom Herbert <[email protected]> wrote:
>>
>> C(P) is going to be quite variable - a full frame retransmit of a 4ms
>> long aggregate frame is SUM(exponential backoff, grab the air,
>> preamble, header, 4ms, etc. for each pass.)
>>
>> It's not clear to me that doing heroic measures to compute the cost is
>> going to be worthwhile due to the rate at which the costs can change on
>> wireless; just getting into the rough ballpark may be enough. But
>> buffering algorithms and AQM algorithms are going to need an estimate of
>> the *time* it will take to transmit data, more than # of bytes or packets.
>>
>> That's not heroic measures; mac80211 needs all the code to calculate these
>> times anyway, it's just a matter of collecting together some things we
>> already know and calling the right function.
>>
>>
>> Fine; if it's easy, accurate is better (presuming the costs get
>> recalculated when circumstances change). We also will need the amount of
>> data being transmitted; it is the rate of transmission (the rate at
>> which the buffers are draining) that we'll likely need.
>>
>> Here's what I've gleaned from reading "RED in a different light", Van
>> Jacobson's Mitre talk and several conversations with Kathleen Nichols
>> and Van: AQM algorithms that can handle variable bandwidth environments
>> will need to be able to know the rate at which buffers empty. It's the
>> direction they are going with their experiments on a "RED light" algorithm.
>>
>> The fundamental insight as to why classic RED can't work in the wireless
>> environment is that the instantaneous queue length has little actual
>> information in it.
>
>> Classic RED is tuned using the queue length as its
>> basic parameter. Their belief as to algorithms that will work is that
>> they need to keep track of the running queue length *minimum over time*;
>> you want to keep the buffers small on a longer term basis (so they both
>> can absorb transients, which is their reason for existence, while
>> keeping the latency typically low).
>
> More details explained by Jim:
>
>> TCP will fill any buffer just before the bottleneck in a path. And, of
>> course, other traffic transients can help fill the buffers too. This is a
>> natural consequence of it trying to figure out how fast it can run.
>>
>> Remember, TCP's job is to figure out just how fast it can go, so in its
>> congestion avoidance phase, it increases the rate at which it transmits
>> slowly (typical congestion avoidance algorithms will add one packet/ack to
>> the bottleneck buffer). This is why in my traces you can find
>> at:
>> http://gettys.wordpress.com/2010/12/06/whose-house-is-of-glasse-must-not-throw-stones-at-another/
>> you see something that sort of looks like a TCP sawtooth, but not: with
>> periodicity of order 10 seconds (given the amount of buffering in that
>> modem).
>>
>> What's happening is that as the buffer fills, the acks come back slower and
>> slower
>
> Wait, why is that though?
Because the data packets are more and more delayed through the link as
the buffer fills, delaying their arrival and therefore the acks.
Saturating one direction kills performance in the other direction; in
that case, your poor acks are stuck behind data packets. Some cable
modems do prioritise acks above bulk data, but that's just about all the
classification they currently do. Most broadband has a single queue.
So it takes time for TCP to fill the buffers; not so long in slow start,
but in the congestion avoidance phase it can take seconds to refill the
buffer, even over short paths like the 10ms path I was testing on.
>
>> (so the buffer fills somewhat more slowly) with time. But as they
>> fill, not only do you get additional latency, but the buffers become less
>> and less effective at doing what they are intended for: smoothing transients
>> of higher rate incoming packets.
>
> Is this bufferbloat or TCP not taking into consideration possible
> variations in latency?
>
>> So I ended up with 1.2 seconds of latency on my cable modem.
>
> Jeesh.
Heh. Look at the Netalyzr scatter plots in my blog posting. Lots of
the broadband edge is at multiple seconds. As these things go, 1.2
seconds isn't even unusual (even though it's already so painful as to
make web surfing downright slow). It's really, really scarily bad.
Dave Clark at MIT first realised some hardware was overbuffered about 7
years ago when he found 6 second latencies on the DSLAM he owns.
He used this as a DOS attack against his son's excessive World of
Warcraft playing :-). But until last summer, we didn't have the
Netalyzr or M-Lab data showing just how horribly common the bufferbloat
problem is in broadband. The FCC data that just came out shows the
problem too...
On cable, the last I heard no one had managed to find a cable modem made
in the last 8 or more years that was "correctly buffered" according to
the "canonical" BDP rule of thumb (which isn't really actually very
useful, as my blog rants point out).
>
>> Depending on
>> how much buffering is in the device, you get different amounts of latency
>> under load. So at the provisioned bandwidth of those tests, my cable modem
>> has at least 10x the "classic" BDP amount of buffering.
>
> And for readers BDP is:
>
> http://en.wikipedia.org/wiki/TCP_tuning#Bandwidth-delay_product_.28BDP.29
>
> The maximum possible amount of unacknowledged data in transit.
>
> Just curious -- why would they have 10x the amount of BDP??
Heh. The possible reasons are many: Because the memory was there?
Because the lessons of the 1990's congestion had been forgotten or never
learned? Because everyone thinks they can only focus on their little
engineering problem, without having to think about what happens as part
of the entire end-to end system? Because almost no engineer tests
latency under load routinely (except crazies like me who are supposed to
be working on teleconferencing and know slightly too much)? Because
Speedtest.net and pingtest.net don't run the tests at the same time?
Because a short simple test doesn't fill the buffer? Because most
people have been using Windows XP for their testing, and without TCP
window scaling, they don't have more than 64K bytes in-flight so one TCP
won't fill the buffers and they don't see the extreme behaviour that can
hurt them (except when their kids leave bittorrent running)? Because if
a manufacturer has a firmware bug, and they can paper over it by using
more buffering, and the buffer costs them zero and they want to ship
their broken firmware with the bug, what do you think companies do?
If you'd like some more why's, I can probably generate them; I spent
months last fall figuring out how we got in this mess.
Seriously, memory has gotten cheap. So sometimes the memory is
literally just along for the ride, and the engineers don't even think
about it.
You can't get small enough memory chips any more for the low-end cheap
box; you deliberately have to *not* use memory you have.
And each generation of hardware has a higher bandwidth (to use the very
good ethernet history, we went from 3, to 10 to 100, to 1gig and now 10
gigE).
The amount of buffering you actually need to keep TCP running flat out
(for a single flow) is in fact one BDP (presuming multiple flows do not
synchronise); it's a worst case. Recent research has pretty clearly
shown that in high speed core routers, radically less buffering is
actually needed. But home routers do see the *worst case*, like my
simple file copy test that core routers never see.
So we've all been tuning for the highest bandwidth the hardware can ever
operate at, ignoring the fact that it is usually operating at a much
lower bandwidth.
In the DOCSIS example, the amount of buffering I have in my modem may in
fact be only 2-3 times too much if it is operating at 100Mbps.
But then people plug in that modem and are provisioned to run at
10-20Mbps for most tiers of service.
So with a fixed single buffer size, it's N times too big. It can easily
be a factor of 10 too big, even if the modem has been "correctly" sized
at its highest possible operating bandwidth (presuming CONUS delay of
around 100ms).
The DOCSIS change is the sane mitigation until we have a working AQM
(Powerboost on cable means it too is variable bandwidth, on an instant
by instant basis): that is a change to the management MIB for cable
modems so the ISP can set the buffer size the modem is allowed to use;
so at least they'll be able to tell the modem to use buffers that are
sized proportional to the maximum rate the data can flow through the
cable modem given what the customer pays for. This means that my ISP
could tell my cable modem at 10Mbps service to use 1/10 the buffering it
might at 100Mbps.
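The arithmetic behind that mitigation is just one bandwidth-delay
product at the *provisioned* rate rather than the maximum rate; a toy
calculation, assuming the ~100ms delay used above:

#include <stdio.h>

/* One BDP of buffering at the provisioned rate. */
static unsigned long buffer_bytes(unsigned long rate_bps, unsigned rtt_ms)
{
	return rate_bps / 8 * rtt_ms / 1000;
}

int main(void)
{
	/* 100 Mbps vs the 10 Mbps most tiers actually provision. */
	printf("100 Mbps: %lu bytes\n", buffer_bytes(100000000UL, 100));
	printf(" 10 Mbps: %lu bytes\n", buffer_bytes(10000000UL, 100));
	return 0;  /* prints 1250000 and 125000: a factor of 10 apart */
}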
We should do a similar simple mitigation on ethernet at 100 and 10 Mbps,
cutting down the rings and transmit queue proportionately as a first
step. Wireless is harder, though some mitigation can be
done when we know what we're doing (and don't just blindly have our
knobs set as though this were ethernet).
But real solutions will require AQM....
>
>> Eventually, the bloated buffer does fill, (which might be a megabyte in
>> size, btw, on some cable modems), and you get packet loss.
>
> Why is that though?
>
The packets have no place to go: you are putting them in faster than
they can go out, so once the buffer is completely full, you've got a
full bag of potatoes. Depending on the code there, it may tail drop, or
sometimes you might even see some hardware that is completely braindead
and drops a pile at once.
>> But well before this should have occurred, TCP Cubic has decided the path
>> may have changed, and goes into a more aggressive search for a new operating
>> point (so you see something that starts like a TCP sawtooth, but then
>> switches back to something that looks more like slow start). This causes a
>> long period bursty oscillation; the bursts are when multiple packet drops
>> are occurring (because the TCP is hammering at the cable modem at way too
>> high a speed for a short period).
>
> Got it.
>
>> Buffer bloat has therefore managed to destroy congestion avoidance
>> entirely. Sigh... It also does a number on slow start, as the amount of
>> buffering is way more than it should be and it takes several sets of packet
>> loss before TCP even starts congestion avoidance.
>
> So an observable trait is we likely can see some packet loss during
> the shifts in throughput? Is that accurate?
The easiest observed trait (if Linux is on both ends, in which case the
TCP timestamp option will be on) is to look at the timestamps in the
packets. Without my knowing it, I had wonderful proof right there in my
packet captures, as Van told me.
Pragmatically, I did most of my work with ping, traceroute and
scp/netperf: (not even knowing of mtr or pingplotter on windows). If
you see a link in a path you've saturated (in my case, by scp or
netperf), and the latency is unexpectedly high and varying, you've
probably got bufferbloat.
But bufferbloat certainly causes some excess packet loss, and bursty
behaviour as poor TCP becomes terminally confused. It's causing DNS
lookup failures in my traces, when TCP really momentarily hammers the
cable modem. Really quite remarkable stuff.
I'm not a TCP expert (IANATE); I knew just enough to grab the packet
captures, and since I had to stare at tcptrace results working on HTTP
in the 1990's, what I saw looked broken enough to me I called in the
really heavy guns to go over my traces (e.g. Dave Clark, Dave Reed, Greg
Chesson, Van Jacobson, and others). I'm a complex systems guy.
Once aware of the problem (and knowing 3G had the problem from Dave
Reed's report the year before), and seeing my home router doing strange
things, I kept digging, looking for buffers. Late last year Vint Cerf
urged me to just start blogging about it to alert people, rather than
doing the formal publication thing. I finally finished a first draft of a
formal paper for publication last month; it's probably another 1-2
months from completion.
The problem is, as soon as you get a bunch of TCP and other flows, it
gets really hard to figure out from simple symptoms or even from
tcptrace unless you're a pretty wizardly TCP guy. I stumbled across the
"textbook" case first, and then could recognise symptoms I saw elsewhere
and dig further. We need better tools badly.
>
> For 802.11 this could be a lot of things -- some new noise,
> interference, etc, but for Ethernet I'm curious what would cause this
> assuming we are blaming the fact that a lot of buffers are queued, is
> the issue that the time it takes to flush out buffers surpasses some
> timer on the OS to actually transmit the frames?
Cause which?
If you put data into a finite size buffer faster than it can come out,
something has to give once the buffer fills... If you have infinite
memory, you can end up with infinite latency.
On wireless, you are more likely to get actual packet drop due to noise;
if you get 3 packet losses in a row, you force TCP to slow start and the
buffers may get a chance to drain a bit. So ironically, the cleaner the
network, the worse things may behave. My cable turns out to be textbook
clean, so I got a really clean trace that could be understood.
>
>> On some home routers, the buffering is even more extreme; so whenever your
>> bottleneck is in your home router (easy to do in my house as I have chimneys
>> and lots of walls) the buffers in the home router become the problem.
>>
>> So I caught my home router of that era (it was a DIR 825, IIRC) with 8
>> seconds of latency one evening.
I tried several other routers after that, with similar results.
>
> Shit.
>
I can make it much worse than 8 seconds; that's just what I documented
it doing one night in my house; you do the math:
We typically have about 1250 packets of buffering at the moment on most
Linux hardware: 1000 in the transmit queue, and 250 or so in the ring
buffers. I'll ignore the socket buffers in this discussion.
That's almost 2 megabytes.
So if you happen to have a very slow link (my chimney), or are working
on a busy conference network where your "fair share" of the bandwidth
might be 1Mbps or even much less, or a noisy network where you are having
lots of trouble getting data through, it will take a long time. At 1Mbps
you can expect it to take about 16 seconds to drain those buffers.
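(Spelling out that math, assuming roughly 1500-byte packets: 1250
packets x 1500 bytes x 8 = 15,000,000 bits, and at 1Mbps that is about
15 seconds just to drain the queue.)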
Of course, you've usually gone home in disgust long before, presuming
that this is just "802.11 brain damage", when in fact it's often
bufferbloat, which has crept into nearly everything in the internet over
the last decade, including our operating systems and drivers. It may
actually be your laptop, or your broadband, or your ISP...
And yes, there may also be 802.11 brain damage (e.g. how multicast
works), or driver bugs, or problems with aggregation, etc as well.
So not every problem with horrid latency is bufferbloat...
3G has similar problems to 802.11, it turns out, but only in an
aggregate case due to details of how that works: buffers are being
filled by everyone's traffic. 30 seconds or more has been reported: the
most I've seen personally is 6 seconds.
People time out before the packets get dropped, so the packets do get to
the destination (and the network guys do not notice packet drops, and
aren't looking at how long the packets were in flight). The user times
out, the packets that do arrive are thrown away or ignored by the
application, the load drops, and you get horrific performance.
>> The additional major challenge we
>> face that core routers do not is the highly variable traffic of mixed
>> mice and elephants. What actually works only time will tell.
>
> Just FYI for 802.11 we have some effort to help us classify management
> frames further into different access categories, typically we just
> always use AC_VO to transmit management frames, but the ongoing work
> in 802.11ae, Prioritization of Management Frames, allows us to assign
> different ACs to different types of management frames, and the AP can
> actually define these rules. People interested in this problem should
> look at that and try out the defined default QoS management frame
> profile and see if that helps. That is -- something like this might be
> good for the bufferbloat experimental work -- but get it upstream,
> don't fork ;)
>
Yeah, and we should probably classify some other packets into queues
other than the bulk-data queues: e.g. DHCP, DNS, ND, etc. Dave calls
these "ants": little packets that scurry around and do necessary things
in the network.
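A toy sketch of that kind of classification (UDP ports only; ND is
ICMPv6 and would be matched separately; the names are invented):

#include <stdint.h>

enum txq { TXQ_ANTS, TXQ_BULK };

/* Keep tiny housekeeping packets out of the bulk-data queue. */
static enum txq classify_udp(uint16_t dst_port)
{
	switch (dst_port) {
	case 53:   /* DNS */
	case 67:   /* DHCP server */
	case 68:   /* DHCP client */
		return TXQ_ANTS;
	default:
		return TXQ_BULK;
	}
}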
>> All this was well understood in the 1990's. Most of us *thought* the
>> problem was solved by AQM algorithms such as RED. I was noddingly aware of
>> AQM in the 1990's as I was working on HTTP in the IETF in that era, and AQM
>> and buffer management in core routers was a very hot topic then.
>>
>> But the buffer management problem wasn't actually solved, and that result
>> never got properly published, for the reasons I went into in:
>>
>> http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/
>>
>> The field has languished since, and the problems that variable bandwidth
>> present went unrecognised.
>
> Then I stated:
>
>> Thank you for elaborating on this, it helps to further dissect the
>> issues. When mixing them together it makes the task harder to address.
>> As for 802.11 we'd just need to worry about management frames. Dave
>> and Andrew had also pointed out issues with multicast and the low
>> rates used for them in comparison to unicast transmissions.
>
> And you said:
>
>> What matters for bufferbloat is how much payload gets moved per unit time:
>> that is what is accumulating in the OS packet buffers.
>>
>>
>> So in an environment in which the rate of transmission is highly
>> variable,
>
> I had interjected with:
>
>> Not only that, remember we have different possible topologies
>> depending on the type of 802.11 service used, IBSS, the typical AP-STA
>> environment, Mesh, P2P (AP-STA really), and now with 802.11ad we get
>> PBSS. What counts here is we may have variable rate of transmission to
>> different possible peers, as such the queue length will depend on all
>> of the expected transmissions.
Yes, and that is what AQM algorithms handle and why they are essential;
if too much starts to be queued for a given path, its probability of
getting marked goes up, so the endpoints slow down from the marking
(packet drop or ECN, if it were deployed) signalling TCP to go more
slowly, despite the fact that it's sharing the queue with traffic that
is able to move many times faster to a different path. So packets never
accumulate on one path to block other paths that happen to cross at a
particular buffer.
Classic RED works this way; it's just that it doesn't adjust its
behaviour according to the rate at which the buffer can empty; it's
tuned by the queue length, which isn't something that is really useful
when our queue length necessarily has to vary according to the
instantaneous bandwidth that is available.
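For readers who have not seen it, a stripped-down sketch of the
classic RED drop decision makes that queue-length dependence explicit:
every knob below is in units of queue occupancy, and nothing knows how
fast the buffer drains. Illustrative only; real RED also counts packets
between drops and needs careful tuning of wq and the thresholds:

#include <stdlib.h>

struct red {
	double avg;     /* EWMA of the queue length, in packets */
	double wq;      /* averaging weight, e.g. 0.002          */
	double min_th;  /* below this average: never drop        */
	double max_th;  /* above this average: always drop       */
	double max_p;   /* drop probability reached at max_th    */
};

static int red_should_drop(struct red *r, unsigned int qlen)
{
	double p;

	r->avg = (1.0 - r->wq) * r->avg + r->wq * qlen;

	if (r->avg < r->min_th)
		return 0;
	if (r->avg >= r->max_th)
		return 1;

	/* Drop probability ramps linearly between the thresholds. */
	p = r->max_p * (r->avg - r->min_th) / (r->max_th - r->min_th);
	return (double)rand() / RAND_MAX < p;
}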
I was startled when I realised we should even be running AQM in our
hosts; if you have multiple paths, you really need AQM algorithms to
keep the traffic from one path from doing bad things to other paths.
(think about sharing a queue between a local file server, and a high
bandwidth path with intercontinental delays...)
I'd always thought of AQM as something "internet routers have to do".
The first time I really saw a large 802.11 network working well under
load was at NANOG in June (500 or so attendees) and again at the IETF
last summer: they were using enterprise access points with minimal
buffering, hooked up to Cisco routers running with RED enabled, and
doing diffserv labelling to get the right packets priority. When you
have enough aggregated traffic (and ensure that everyone has really good
signal strength), you can manage to sort of make RED work well enough.
But homes are much harder: no aggregation, highly variable application
load, and chimneys and multiple walls. (Hotels are somewhat similar).
There's no reason why we can't get Linux to do as well as what I've seen
now at NANOG and the IETF (if we can find an AQM algorithm that works
given the different load and traffic mix). In a small (e.g. home)
environment we just don't have the amount of traffic flow that a big
conference does, which is what *sort of* allows RED to work there.
Note that the June NANOG was the first time I had a laptop running eBDP,
which was keeping my laptop queues from killing me under load. In the
upstream direction, you're *always* at a bottleneck link these days when
running on wireless. I could have done VOIP over that loaded conference
network, and had latencies in the 50ms range....
>
> and you had continued with:
>
>> such as wireless, or even possibly modern broadband with
>> PowerBoost, classic RED or similar algorithms that do not take the
>> buffer drain rate into account cannot possibly hack it properly.
>
> I had said:
>
>> Understood, just curious if anyone has tried a Minstrel approach.
>>
>> Lets try to document all of this here:
>>
>> http://wireless.kernel.org/en/developers/bufferbloat
>
> Then you ended with:
>
>> Please do. Or work on the material in the bufferbloat.net wiki. I'm now
>> too close to the problem to be good at explaining it any better than I try
>> to here and in my other writings. Thomas Jefferson writing is *hard*.
>>
>> Probably best is that generic descriptions go into the bufferbloat.net wiki,
>> and linux specific stuff on kernel.org; but I don't honestly care all that
>> much where it is so long as it does get written down, along with the
>> importance of Minstrel like algorithms in dealing with the variable
>> bandwidth problem (which isn't in any of the material in my blog or
>> bufferbloat.net, having not understood the problem there myself at all
>> before a month or two ago when Andrew McGregor explained to me in detail
>> what Minstrel is and does and why it needs to exist).
>
> Sure, will work on this.
>
Greatly appreciated. I think few people anywhere understand just
how clever Minstrel is at figuring out how fast to try to transmit.
Now, if we can connect it to an AQM algorithm that works.... Kathleen
Nichols was much happier when I saw her at the last IETF than in June.
There is hope to figure out an AQM algorithm that "just works", but we
won't know until we've actually tested algorithms in real situations.
And while a single algorithm to rule them all would be nice, we just
have to find one that works for the environments we face.
- Jim
On Wed, Aug 31, 2011 at 01:50:48PM -0700, Luis R. Rodriguez wrote:
> On Wed, Aug 31, 2011 at 6:28 AM, Jim Gettys <[email protected]> wrote:
> > such as wireless, or even possibly modern broadband with
> > PowerBoost, classic RED or similar algorithms that do not take the
> > buffer drain rate into account cannot possibly hack it properly.
>
> Understood, just curious if anyone has tried a Minstrel approach.
FWIW, eBDP and the related algorithms from Tianji Li's paper are
philosophically similar to minstrel. They depend on measuring recent
conditions and modifying the current queue length accordingly.
http://www.hamilton.ie/tianji_li/buffersizing.pdf
The hack I added in debloat-testing is based on my understanding
of eBDP. It timestamps the SKBs when they are handed to the driver
for Tx and then checks the timestamp when the SKB is orphaned. It is
a bit crude and is an abuse of the skb_orphan API. Also while it
accounts for the 802.11e queues separately, it doesn't account for
802.11n aggregation. Still, it seems to improve latency w/o hugely
impacting throughput in at least some environments -- YMMV!
John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.
I think there are enough interesting ideas here to keep people busy
experimenting for a while.
I'm going to refrain from chiming in further until I've got the
FreeBSD 11n TX code at the point where I can begin tinkering with
adaptive queue management in the driver and rate control layer.
Thanks for the pointers and interesting discussion,
Adrian
On 09/01/2011 10:13 AM, John W. Linville wrote:
> On Wed, Aug 31, 2011 at 01:50:48PM -0700, Luis R. Rodriguez wrote:
>> On Wed, Aug 31, 2011 at 6:28 AM, Jim Gettys <[email protected]> wrote:
>>> such as wireless, or even possibly modern broadband with
>>> PowerBoost, classic RED or similar algorithms that do not take the
>>> buffer drain rate into account cannot possibly hack it properly.
>> Understood, just curious if anyone has tried a Minstrel approach.
> FWIW, eBDP and the related algorithms from Tianji Li's paper are
> philosophically similar to minstrel. They depend on measuring recent
> conditions and modifying the current queue length accordingly.
>
> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>
> The hack I added in debloat-testing is based on my understanding
> of eBDP. It timestamps the SKBs when they are handed to the driver
> for Tx and then checks the timestamp when the SKB is orphaned. It is
> a bit crude and is an abuse of the skb_orphan API. Also while it
> accounts for the 802.11e queues separately, it doesn't account for
> 802.11n aggregation. Still, it seems to improve latency w/o hugely
> impacting throughput in at least some environments -- YMMV!
>
Certainly eBDP *sort of* works for me; I see somewhat more variation
(jitter) than I'd like, but latencies are at least somewhat controlled
and I've not seen performance (throughput) problems. This is
on an Intel chipset.
Tianji Li's work and eBDP came to our attention after a conversation I
had with Van at the beginning of this year; Van sent me a pointer to it
with the comment:
"As I said on the phone I think there's an easier, more robust way to accomplish the same
thing but they have running code and I don't."
But we've not had time to do any systematic testing on eBDP yet and the
implementation needs more work before it should go upstream.
About the time we were thinking of starting that testing, it became
clear that getting aggregation working better was pretty essential to
validating any results. I'm looking forward to trying eBDP with the
revised ath9k driver as soon as they get put into a single pool. And if
Kathleen and Van can get a "RED Light" they are happy with, that will be
fun to try as well.
- Jim
On Thu, Sep 1, 2011 at 7:13 AM, John W. Linville <[email protected]> wrote:
> On Wed, Aug 31, 2011 at 01:50:48PM -0700, Luis R. Rodriguez wrote:
>> On Wed, Aug 31, 2011 at 6:28 AM, Jim Gettys <[email protected]> wrote:
>
>> > such as wireless, or even possibly modern broadband with
>> > PowerBoost, classic RED or similar algorithms that do not take the
>> > buffer drain rate into account cannot possibly hack it properly.
>>
>> Understood, just curious if anyone has tried a Minstrel approach.
>
> FWIW, eBDP and the related algorithms from Tianji Li's paper are
> philosophically similar to minstrel.
Oh look at that, awesome!!!
> They depend on measuring recent
> conditions and modifying the current queue length accordingly.
>
> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>
> The hack I added in debloat-testing is based on my understanding
> of eBDP. It timestamps the SKBs when they are handed to the driver
> for Tx and then checks the timestamp when the SKB is orphaned. It is
> a bit crude and is an abuse of the skb_orphan API.
Neat!
> Also while it
> accounts for the 802.11e queues separately, it doesn't account for
> 802.11n aggregation.
I see..
> Still, it seems to improve latency w/o hugely
> impacting throughput in at least some environments -- YMMV!
Sweet dude. For aggregation it seems the way to go is to get some
helpers as Andrew has suggested. Andrew, can you elaborate a little on
that? If feasible, maybe we can then add it to the TODO list
page:
http://wireless.kernel.org/en/developers/todo-list
and when one of us gets to it, we get cranking on it.
Luis