2016-02-04 20:13:32

by Grumbach, Emmanuel

Subject: [RFC] iwlwifi: pcie: transmit queue auto-sizing

Like many (all?) WiFi devices, Intel WiFi devices have
transmit queues with 256 transmit descriptors each,
where each descriptor corresponds to an MPDU.
This means that when it is full, the queue contains
256 * ~1500 bytes to be transmitted (if we don't have
A-MSDUs). The purpose of those queues is to have enough
packets to be ready for transmission so that when the device
gets an opportunity to transmit (TxOP), it can take as many
packets as the spec allows and aggregate them into one
A-MPDU or even several A-MPDUs if we are using bursts.

The problem is that the packets that are in these queues are
already out of control of the Qdisc and can stay in those
queues for a fairly long time when the link condition is
not good. This leads to the well known bufferbloat problem.

This patch adds a way to tune the size of the transmit queue
so that it won't cause excessive latency. When the link
condition is good, the packets will flow smoothly and the
transmit queue will grow quickly allowing A-MPDUs and
maximal throughput. When the link is not optimal, we will
have retransmissions, lower transmit rates or signal
detection (CCA) which will cause a delay in the packet
transmission. The driver will sense this higher latency
and will reduce the size of the transmit queue.
This means that the packets that continue to arrive
will pile up in the Qdisc rather than in the device
queues. The major advantage of this approach is that
codel can now do its work.

The algorithm is really (too?) simple:
every 5 seconds, start from a short queue again.
If a packet has been in the queue for less than 10ms,
allow 10 more MPDUs in.
If a packet has been in the queue for more than 20ms,
reduce the size of the transmit queue by 10.
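
Distilled from the diff below, the per-completion adjustment can be
sketched in plain C (names and the high-mark value are illustrative;
the driver tracks min_space, the free space it requires before
stopping the queue, so *lowering* it lets the queue grow):

```c
#include <assert.h>

#define QUEUE_CAP_MAX 252 /* never demand more free space than this */
#define HIGH_MARK      16 /* illustrative stand-in for q->high_mark */

/* min_space is the free space required before stopping the queue:
 * a smaller min_space means a deeper effective transmit queue.
 * sojourn_ms is how long the just-completed packet sat queued. */
static int adjust_min_space(int min_space, int sojourn_ms)
{
	if (sojourn_ms < 10) {
		/* packets flow quickly: allow 10 more MPDUs in */
		min_space -= 10;
		if (min_space < HIGH_MARK)
			min_space = HIGH_MARK;
	} else if (sojourn_ms > 20) {
		/* high sojourn time: shrink the queue by 10 MPDUs */
		min_space += 10;
		if (min_space > QUEUE_CAP_MAX)
			min_space = QUEUE_CAP_MAX;
	}
	return min_space;
}
```

Every 5 seconds the driver also resets min_space back to 240 (a short
queue) and lets good link conditions earn the depth back.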

The implementation is really naive and way too simple:
* reading jiffies for every Tx / Tx status is not a
good idea.
* jiffies are not fine-grained enough on all platforms
* the constants chosen are really arbitrary and can't be
tuned.
* This could probably be implemented in mac80211 instead
and help other drivers.
* etc...

But already this gives nice results. I ran a very simple
experiment: I put the device in a controlled environment
and ran traffic while running a default-sized ping in the
background. In this configuration, our device quickly
raises its transmission rate to the best rate.
Then, I force the device to use the lowest rate (6Mbps).
Of course, the throughput collapses, but the ping RTT
shoots up.
Using codel helps, but the latency is still high. Codel
with this patch gives much better results:

pfifo_fast:
rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms

fq_codel + Tx queue auto-sizing:
rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms

fq_codel without Tx queue auto-sizing:
rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms

Clearly, there is more work to do to be able to merge this,
but it seems that the wireless problems mentioned in
https://lwn.net/Articles/616241/ may have a solution.

Cc: Stephen Hemminger <[email protected]>
Cc: Dave Taht <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Emmanuel Grumbach <[email protected]>
---
drivers/net/wireless/intel/iwlwifi/pcie/internal.h | 6 ++++
drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 32 ++++++++++++++++++++--
2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
index 2f95916..d83eb56 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
@@ -192,6 +192,11 @@ struct iwl_cmd_meta {
u32 flags;
};

+struct iwl_txq_auto_size {
+ int min_space;
+ unsigned long reset_ts;
+};
+
/*
* Generic queue structure
*
@@ -293,6 +298,7 @@ struct iwl_txq {
bool block;
unsigned long wd_timeout;
struct sk_buff_head overflow_q;
+ struct iwl_txq_auto_size auto_sz;
};

static inline dma_addr_t
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 837a7d5..4d1dee6 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,

spin_lock_init(&txq->lock);
__skb_queue_head_init(&txq->overflow_q);
+ txq->auto_sz.min_space = 240;
+ txq->auto_sz.reset_ts = jiffies;

/*
* Tell nic where to find circular buffer of Tx Frame Descriptors for
@@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
q->read_ptr != tfd_num;
q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
+ struct ieee80211_tx_info *info;
+ unsigned long tx_time;

if (WARN_ON_ONCE(!skb))
continue;

+ info = IEEE80211_SKB_CB(skb);
+
iwl_pcie_free_tso_page(skb);

__skb_queue_tail(skbs, skb);
@@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);

iwl_pcie_txq_free_tfd(trans, txq);
+
+ tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
+ if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
+ txq->auto_sz.min_space -= 10;
+ txq->auto_sz.min_space =
+ max(txq->auto_sz.min_space, txq->q.high_mark);
+ } else if (time_after(jiffies,
+ tx_time + msecs_to_jiffies(20))) {
+ txq->auto_sz.min_space += 10;
+ txq->auto_sz.min_space =
+ min(txq->auto_sz.min_space, 252);
+ }
}

iwl_pcie_txq_progress(txq);
@@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
struct iwl_device_cmd *dev_cmd, int txq_id)
{
struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
+ struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
struct ieee80211_hdr *hdr;
struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
struct iwl_cmd_meta *out_meta;
@@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,

spin_lock(&txq->lock);

- if (iwl_queue_space(q) < q->high_mark) {
+ info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
+ (void *)(uintptr_t)jiffies;
+
+ if (time_after(jiffies,
+ txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
+ txq->auto_sz.min_space = 240;
+ txq->auto_sz.reset_ts = jiffies;
+ }
+
+ if (iwl_queue_space(q) < txq->auto_sz.min_space) {
iwl_stop_queue(trans, txq);

/* don't put the packet on the ring, if there is no room */
if (unlikely(iwl_queue_space(q) < 3)) {
- struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
-
info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
dev_cmd;
__skb_queue_tail(&txq->overflow_q, skb);
--
2.5.0



2016-02-04 20:45:58

by Ben Greear

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
> As many (all?) WiFi devices, Intel WiFi devices have
> transmit queues which have 256 transmit descriptors
> each and each descriptor corresponds to an MPDU.
> This means that when it is full, the queue contains
> 256 * ~1500 bytes to be transmitted (if we don't have
> A-MSDUs). The purpose of those queues is to have enough
> packets to be ready for transmission so that when the device
> gets an opportunity to transmit (TxOP), it can take as many
> packets as the spec allows and aggregate them into one
> A-MPDU or even several A-MPDUs if we are using bursts.

I guess this is only really usable if you have exactly one
peer connected (ie, in station mode)?

Otherwise, you could have one slow peer and one fast one,
and then I suspect this would not work so well?

For reference, ath10k has around 1400 tx descriptors, though
in practice not all are usable, and in stock firmware, I'm guessing
the NIC will never be able to actually fill up its
and stop traffic. Instead, it just allows the stack to try to
TX, then drops the frame...

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2016-02-04 20:16:17

by Grumbach, Emmanuel

Subject: [RFC RESEND] iwlwifi: pcie: transmit queue auto-sizing

Like many (all?) WiFi devices, Intel WiFi devices have
transmit queues with 256 transmit descriptors each,
where each descriptor corresponds to an MPDU.
This means that when it is full, the queue contains
256 * ~1500 bytes to be transmitted (if we don't have
A-MSDUs). The purpose of those queues is to have enough
packets to be ready for transmission so that when the device
gets an opportunity to transmit (TxOP), it can take as many
packets as the spec allows and aggregate them into one
A-MPDU or even several A-MPDUs if we are using bursts.

The problem is that the packets that are in these queues are
already out of control of the Qdisc and can stay in those
queues for a fairly long time when the link condition is
not good. This leads to the well known bufferbloat problem.

This patch adds a way to tune the size of the transmit queue
so that it won't cause excessive latency. When the link
condition is good, the packets will flow smoothly and the
transmit queue will grow quickly allowing A-MPDUs and
maximal throughput. When the link is not optimal, we will
have retransmissions, lower transmit rates or signal
detection (CCA) which will cause a delay in the packet
transmission. The driver will sense this higher latency
and will reduce the size of the transmit queue.
This means that the packets that continue to arrive
will pile up in the Qdisc rather than in the device
queues. The major advantage of this approach is that
codel can now do its work.

The algorithm is really (too?) simple:
every 5 seconds, start from a short queue again.
If a packet has been in the queue for less than 10ms,
allow 10 more MPDUs in.
If a packet has been in the queue for more than 20ms,
reduce the size of the transmit queue by 10.

The implementation is really naive and way too simple:
* reading jiffies for every Tx / Tx status is not a
good idea.
* jiffies are not fine-grained enough on all platforms
* the constants chosen are really arbitrary and can't be
tuned.
* This could probably be implemented in mac80211 instead
and help other drivers.
* etc...

But already this gives nice results. I ran a very simple
experiment: I put the device in a controlled environment
and ran traffic while running a default-sized ping in the
background. In this configuration, our device quickly
raises its transmission rate to the best rate.
Then, I force the device to use the lowest rate (6Mbps).
Of course, the throughput collapses, but the ping RTT
shoots up.
Using codel helps, but the latency is still high. Codel
with this patch gives much better results:

pfifo_fast:
rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms

fq_codel + Tx queue auto-sizing:
rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms

fq_codel without Tx queue auto-sizing:
rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms

Clearly, there is more work to do to be able to merge this,
but it seems that the wireless problems mentioned in
https://lwn.net/Articles/616241/ may have a solution.

Cc: Stephen Hemminger <[email protected]>
Cc: Dave Taht <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Emmanuel Grumbach <[email protected]>
---
Fix Dave's email address
---
drivers/net/wireless/intel/iwlwifi/pcie/internal.h | 6 ++++
drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 32 ++++++++++++++++++++--
2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
index 2f95916..d83eb56 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
@@ -192,6 +192,11 @@ struct iwl_cmd_meta {
u32 flags;
};

+struct iwl_txq_auto_size {
+ int min_space;
+ unsigned long reset_ts;
+};
+
/*
* Generic queue structure
*
@@ -293,6 +298,7 @@ struct iwl_txq {
bool block;
unsigned long wd_timeout;
struct sk_buff_head overflow_q;
+ struct iwl_txq_auto_size auto_sz;
};

static inline dma_addr_t
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 837a7d5..4d1dee6 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,

spin_lock_init(&txq->lock);
__skb_queue_head_init(&txq->overflow_q);
+ txq->auto_sz.min_space = 240;
+ txq->auto_sz.reset_ts = jiffies;

/*
* Tell nic where to find circular buffer of Tx Frame Descriptors for
@@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
q->read_ptr != tfd_num;
q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
+ struct ieee80211_tx_info *info;
+ unsigned long tx_time;

if (WARN_ON_ONCE(!skb))
continue;

+ info = IEEE80211_SKB_CB(skb);
+
iwl_pcie_free_tso_page(skb);

__skb_queue_tail(skbs, skb);
@@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);

iwl_pcie_txq_free_tfd(trans, txq);
+
+ tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
+ if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
+ txq->auto_sz.min_space -= 10;
+ txq->auto_sz.min_space =
+ max(txq->auto_sz.min_space, txq->q.high_mark);
+ } else if (time_after(jiffies,
+ tx_time + msecs_to_jiffies(20))) {
+ txq->auto_sz.min_space += 10;
+ txq->auto_sz.min_space =
+ min(txq->auto_sz.min_space, 252);
+ }
}

iwl_pcie_txq_progress(txq);
@@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
struct iwl_device_cmd *dev_cmd, int txq_id)
{
struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
+ struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
struct ieee80211_hdr *hdr;
struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
struct iwl_cmd_meta *out_meta;
@@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,

spin_lock(&txq->lock);

- if (iwl_queue_space(q) < q->high_mark) {
+ info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
+ (void *)(uintptr_t)jiffies;
+
+ if (time_after(jiffies,
+ txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
+ txq->auto_sz.min_space = 240;
+ txq->auto_sz.reset_ts = jiffies;
+ }
+
+ if (iwl_queue_space(q) < txq->auto_sz.min_space) {
iwl_stop_queue(trans, txq);

/* don't put the packet on the ring, if there is no room */
if (unlikely(iwl_queue_space(q) < 3)) {
- struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
-
info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
dev_cmd;
__skb_queue_tail(&txq->overflow_q, skb);
--
2.5.0


2016-02-08 10:17:08

by Emmanuel Grumbach

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On Mon, Feb 8, 2016 at 12:00 PM, Michal Kazior <[email protected]> wrote:
> On 5 February 2016 at 17:47, Dave Taht <[email protected]> wrote:
>>> A bursted txop can be as big as 5-10ms. If you consider you want to
>>> queue 5-10ms worth of data for *each* station at any given time you
>>> obviously introduce a lot of lag. If you have 10 stations you might
>>> end up with service period at 10*10ms = 100ms. This gets even worse if
>>> you consider MU-MIMO because you need to do an expensive sounding
>>> procedure before transmitting. So while SU aggregation can probably
>>> still work reasonably well with shorter bursts (1-2ms) MU needs at
>>> least 3ms to get *any* gain when compared to SU (which obviously means
>>> you want more to actually make MU pay off).
>>
>> I am not sure where you get these numbers. Got a spreadsheet?
>
> Here's a nice summary on some of it:
>
> http://chimera.labs.oreilly.com/books/1234000001739/ch03.html#figure-mac-ampdu
>
> Even if your single A-MPDU can be shorter than a txop you can burst a
> few of them if my understanding is correct.
>
> The overhead associated with MU sounding is something I've been told
> only. Apparently for MU to pay off you need fairly big bursts. This
> implies that the more stations you have to service the less it makes
> sense to attempt MU if you consider latency.
>
>
>> Gradually reducing the maximum sized txop as a function of the number
>> of stations makes sense. If you have 10 stations pending delivery and
>> reduced the max txop to 1ms, you hurt bandwidth at that instant, but
>> by offering more service to more stations, in less time, they will
>> converge on a reasonable share of the bandwidth for each, faster[1].
>> And I'm sure that the person videoconferencing on a link like that
>> would appreciate getting some service inside of a 10ms interval,
>> rather than a 100ms.
>>
>> yes, there's overhead, and that's not the right number, which would
>> vary as to g,n,ac and successors.
>>
>> You will also get more opportunities to use mu-mimo with shorter
>> bursts extant and more stations being regularly serviced.
>>
>> [1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50
>
> This is my thinking as well, at least for most common use cases.
>
> If you try to optimize for throughput by introducing extra induced
> latency you might end up not being able to use aggregation in practice
> anyway because you won't be able to start up connections and ask for
> enough data - or at least that's what my intuition tells me.
>
> But, like I've mentioned, there's interest in making it possible to
> maximize for throughput (regardless of latency). This surely makes
> sense for synthetic UDP benchmarks. But does it make sense for any
> real-world application? No idea.
>
>
>>> The rule of thumb is the
>>> longer you wait the bigger capacity you can get.
>>
>> This is not strictly true as the "fountain" of packets is regulated by
>> acks on the other side of the link, and ramp up or down as a function
>> of service time and loss.
>
> Yes, if you consider real world cases, i.e. TCP, web traffic, etc.
> then you're correct.
>
>
>>> Apparently there's interest in maximizing throughput but it stands in
>>> direct opposition of keeping the latency down so I've been thinking
>>> how to satisfy both.
>>>
>>> The current approach ath10k is taking (patches in review [1][2]) is to
>>> use mac80211 software queues for per-station queuing, exposing queue
>>> state to firmware (it decides where frames should be dequeued from)
>>> and making it possible to stop/wake per-station tx subqueue with fake
>>> netdev queues. I'm starting to think this is not the right way though
>>> because it's inherently hard to control latency and there's a huge
>>> memory overhead associated with the fake netdev queues.
>>
>> What is this overhead?
>
> E.g. if you want to be able to maximize throughput for 50 MU clients
> you need to be able to queue, in theory, 50*200 (roughly) frames. This
> translates to both huge memory usage and latency *and* renders
> (fq_)codel qdisc rather.. moot.

Ok - now I understand. So yes, the conclusion below (make fq_codel
station aware) makes a lot of sense.

>
>
>> Applying things like codel tend to dramatically shorten the amount of
>> skbs extant...
>
>> modern 802.11ac capable hardware has tons more
>> memory...
>
> I don't think it does. QCA988x is able to handle "only" 1424 tx
> descriptors (IOW 1500-byte long MSDUs) in the driver-to-firmware tx
> queue (it's a flat queue). QCA99x0 is able to handle 2500 if asked
> politely.

As I said, our design is not flat, which removes the need for the
firmware to classify the packets by station in order to build
aggregates, but the downside is the number of clients you can service.

>
> This is still not enough to satisfy the insane "maximize the
> capacity/throughput" expectations though.
>
> You could actually argue it's too much from the bufferbloat problem
> point of view anyway and Emmanuel's patch proves it is beneficial to
> buffer less in driver depending on the sojourn packet time.
>

In real-life scenarios, yes. But the more I read your comments, the
more I get to the conclusion that we (Intel) simply haven't been able
to reach the throughput you have, which *requires* more latency to
reach :)
Note that you always talk about overall throughput (to all the
stations at the same time) which, again, makes a lot of sense when you
work primarily as an AP. We look at the world from a different
viewpoint.

>
>>> Also fq_codel
>>> is a less effective with this kind of setup.
>>
>> fq_codel's principal problems with working with wifi are long and
>> documented in the talk above.
>>
>>> My current thinking is that the entire problem should be solved via
>>> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
>>> limit/target/interval/quantum knobs to tune it for higher latency of
>>> aggregation-oriented Wi-Fi links where long service time (think
>>> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
>>> works in the first place, i.e. Wi-Fi gets better throughput if you
>>> deliver bursts of packets destined to the same station. Moreover this
>>> gets even more complicated with MU-MIMO where you may want to consider
>>> spatial location (which influences signal quality when grouped) of
>>> each station when you decide which set of stations you're going to
>>> aggregate to in parallel. Since drivers have a finite tx ring this it
>>> is important to deliver bursts that can actually be aggregated
>>> efficiently. This means driver would need to be able to tell qdisc
>>> about per-flow conditions to influence the RR scheme in some way
>>> (assuming a qdiscs even understands flows; do we need a unified way of
>>> talking about flows between qdiscs and drivers?).
>>
>> This is a very good summary of the problems in layering fq_codel as it
>> exists today on top of wifi as it exists today. :/ Our conclusion
>> several years ago was that as the information needed to do things more
>> right was in the mac80211 layer that we could not evolve the qdisc
>> layer to suit, and needed to move the core ideas into the mac80211
>> layer.
>>
>> Things have evolved since, but I still think we can't get enough info
>> up to the qdisc layer (locks and so on) to use it sanely.
>
> The current per-station queue implementation in mac80211 doesn't seem
> sufficient. Each of these queues use a simple flat fifo queue
> (sk_buff_head) limited by packet count only with somewhat broken
> behavior when packet limit is reached as you end up with unfairly
> populated queues a lot. They need to have codel applied on them
> somehow. You'll eventually end up re-inventing fq_codel in mac80211
> making the qdisc redundant.
>
> Moreover these queues aren't per-station only. They are
> per-station-per-tid giving 16 queues per station. This is important
> because you can't aggregate traffic going out on different tids.
>
> Even without explicit air conditions feedback from mac80211 to
> fq_codel I suspect it'd be still beneficial if fq_codel was able to
> group (sub-)flows into per-station-tid and burst them out them in
> subsequent dequeue() calls so the chance they get aggregated is
> higher.

That would allow you to avoid classifying (and buffering) in HW and
thus reduce the latency. Agreed.


>
>
> Michał

2016-02-04 21:14:14

by Ben Greear

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>
>
> On 02/04/2016 10:46 PM, Ben Greear wrote:
>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>> As many (all?) WiFi devices, Intel WiFi devices have
>>> transmit queues which have 256 transmit descriptors
>>> each and each descriptor corresponds to an MPDU.
>>> This means that when it is full, the queue contains
>>> 256 * ~1500 bytes to be transmitted (if we don't have
>>> A-MSDUs). The purpose of those queues is to have enough
>>> packets to be ready for transmission so that when the device
>>> gets an opportunity to transmit (TxOP), it can take as many
>>> packets as the spec allows and aggregate them into one
>>> A-MPDU or even several A-MPDUs if we are using bursts.
>> I guess this is only really usable if you have exactly one
>> peer connected (ie, in station mode)?
>>
>> Otherwise, you could have one slow peer and one fast one,
>> and then I suspect this would not work so well?
>
> Yes. I guess this is one (big) limitation. I guess that what would
> happen in this case is that the latency would constantly jitter. But I also
> noticed that I could reduce the transmit queue to 130 descriptors
> (instead of 256) and still reach maximal throughput because we can
> refill the queues quickly enough.
> In iwlwifi, we have plans to have one queue for each peer.
> This is under development. Not sure when it'll be ready. It also requires
> firmware change obviously.

Per-peer queues will probably be nice, especially if we can keep the
bufferbloat manageable.

>> For reference, ath10k has around 1400 tx descriptors, though
>> in practice not all are usable, and in stock firmware, I'm guessing
>> the NIC will never be able to actually fill up it's tx descriptors
>> and stop traffic. Instead, it just allows the stack to try to
>> TX, then drops the frame...
>
> 1400 descriptors, ok... but they are not organised in queues?
> (forgive my ignorance of athX drivers)

I think all the details are in the firmware, at least for now.

The firmware details are probably not something I should go into, but suffice it to say
it's complex and varies between firmware versions in non-trivial ways.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2016-02-04 22:06:13

by Dave Taht

Subject: Re: [RFC RESEND] iwlwifi: pcie: transmit queue auto-sizing

I am not on linux-wireless nor netdev presently, but...

On Thu, Feb 4, 2016 at 12:16 PM, Emmanuel Grumbach
<[email protected]> wrote:
> As many (all?) WiFi devices, Intel WiFi devices have
> transmit queues which have 256 transmit descriptors
> each and each descriptor corresponds to an MPDU.
> This means that when it is full, the queue contains
> 256 * ~1500 bytes to be transmitted (if we don't have
> A-MSDUs). The purpose of those queues is to have enough
> packets to be ready for transmission so that when the device
> gets an opportunity to transmit (TxOP), it can take as many
> packets as the spec allows and aggregate them into one
> A-MPDU or even several A-MPDUs if we are using bursts.
>
> The problem is that the packets that are in these queues are
> already out of control of the Qdisc and can stay in those
> queues for a fairly long time when the link condition is
> not good. This leads to the well known bufferbloat problem.
>
> This patch adds a way to tune the size of the transmit queue
> so that it won't cause excessive latency. When the link
> condition is good, the packets will flow smoothly and the
> transmit queue will grow quickly allowing A-MPDUs and
> maximal throughput. When the link is not optimal, we will
> have retransmissions, lower transmit rates or signal
> detection (CCA) which will cause a delay in the packet
> transmission. The driver will sense this higher latency
> and will reduce the size of the transmit queue.
> This means that the packets that continue to arrive
> will pile up in the Qdisc rather than in the device
> queues. The major advantage of this approach is that
> codel can now do its work.
>
> The algorithm is really (too?) simple:
> every 5 seconds, starts from a short queue again.
> If a packet has been in the queue for less than 10ms,
> allow 10 more MPDUs in.
> If a packet has been in the queue for more than 20ms,
> reduce by 10 the size of the transmit queue.
>
> The implementation is really naive and way too simple:
> * reading jiffies for every Tx / Tx status is not a
> good idead.
> * jiffies are not fine-grained enough on all platforms
> * the constants chosen are really arbitrary and can't be
> tuned.
> * This may be implemented in mac80211 probably and help
> other drivers.
> * etc...
>
> But already this gives nice results. I ran a very simple
> experiment: I put the device in a controlled environment
> and ran traffic while running default sized ping in the
> background. In this configuration, our device quickly
> raises its transmission rate to the best rate.
> Then, I force the device to use the lowest rate (6Mbps).
> Of course, the throughput collapses, but the ping RTT
> shoots up.
> Using codel helps, but the latency is still high. Codel
> with this patch gives much better results:
>
> pfifo_fast:
> rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms
>
> fq_codel + Tx queue auto-sizing:
> rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms
>
> fq_codel without Tx queue auto-sizing:
> rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms

This is a dramatic improvement. But I'm not sure what you are
measuring. Is this the 6mbit test? What happens when you send traffic
the other way (more pure acks, rather than big packets?)

I try to encourage folk to use flent whenever possible, for pretty
graphs and long term measurements, so you can simultaneously measure
both throughput and latency.

flent.org's .14 release just shipped.

> Clearly, there is more work to do to be able to merge this,
> but it seems that the wireless problems mentioned in
> https://lwn.net/Articles/616241/ may have a solution.

I gave talks on the problems that wifi has with bufferbloat at the
ieee 802.11 wg meeting a while back, and more recently one was filmed
at battlemesh.

https://www.youtube.com/watch?v=Rb-UnHDw02o

I have spent my time since trying to raise sufficient resources
(testbeds and test tools), orgs, people and money to tackle these
problems at more depth. We made a bit of progress recently which I can
talk about offline...

In that talk I suggested that overall we move towards timestamping
everything, that (at least in the case of the ath9k and mt72) we tie
together aggregation with a byte-based estimator similar to how BQL
works, and I hoped that eventually we'd be able to basically - at
low rates - keep no more than one aggregate in the hardware, one in the
driver queue, and one being assembled. The pending aggregate would be
sent to the hardware on the completion interrupt for the previous
aggregate, which would fire off the size estimator and start
aggregating the one being assembled.
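
A toy sketch of such a BQL-like byte estimator (all names are
hypothetical, not from any driver): after each aggregate completes,
scale the byte budget so the next aggregate costs roughly a target
amount of airtime at the throughput just observed.

```c
#include <assert.h>

/* Hypothetical sketch only; a real version would need locking,
 * overflow guards and smoothing across completions. */
struct agg_estimator {
	unsigned int limit_bytes; /* byte budget for the next aggregate */
};

static void agg_tx_complete(struct agg_estimator *e,
			    unsigned int bytes_sent,
			    unsigned int airtime_us,
			    unsigned int target_us)
{
	if (!airtime_us)
		return;
	/* observed bytes/us, scaled to the airtime we want to spend */
	e->limit_bytes = bytes_sent * target_us / airtime_us;
	if (e->limit_bytes < 1500) /* always admit at least one MTU */
		e->limit_bytes = 1500;
}
```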

A hook to do that is in use on the mt72 chipset that felix is working
on... but nowhere else so far as I know (as yet).

the iwl does its own aggregation (I think?)... but estimates can
still be made...

There are WAY more details of course - per station queuing, a separate
multicast queue, only some in that talk!, but my hope was that under
good conditions we'd get wireless-n down below 12ms driver overhead,
even at 6mbit, before something like fq_codel could kick in (under
good conditions! Plenty of other potential latency sources beside
excessive queuing in wifi!). My ideal world would be to hold it at
under 1250us at higher rates....

Periodically sampling seems like a reasonable approach under lab
conditions but it would be nicer to have feedback from the firmware -
"I transmitted the last tx as an X byte aggregate, at MCS1, I had to
retransmit a few packets once, it took me 6ms to acquire the media, I
heard 3 other stations transmitting, etc.".

The above info we know we can get from a few chipsets, but not enough
was known about the iwl last I looked. And one reason why fq_codel -
unassisted - is not quite the right thing on top of this is that
multicast can take a really long time...

Regardless, I'd highly love to see/use this patch myself in a variety
of real world conditions and see what happens. And incremental
progress is the only way forward. Thx for cheering me up.

>
> Cc: Stephen Hemminger <[email protected]>
> Cc: Dave Taht <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Signed-off-by: Emmanuel Grumbach <[email protected]>
> ---
> Fix Dave's email address
> ---
> drivers/net/wireless/intel/iwlwifi/pcie/internal.h | 6 ++++
> drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 32 ++++++++++++++++++++--
> 2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
> index 2f95916..d83eb56 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
> @@ -192,6 +192,11 @@ struct iwl_cmd_meta {
> u32 flags;
> };
>
> +struct iwl_txq_auto_size {
> + int min_space;
> + unsigned long reset_ts;
> +};
> +
> /*
> * Generic queue structure
> *
> @@ -293,6 +298,7 @@ struct iwl_txq {
> bool block;
> unsigned long wd_timeout;
> struct sk_buff_head overflow_q;
> + struct iwl_txq_auto_size auto_sz;
> };
>
> static inline dma_addr_t
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
> index 837a7d5..4d1dee6 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
> @@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,
>
> spin_lock_init(&txq->lock);
> __skb_queue_head_init(&txq->overflow_q);
> + txq->auto_sz.min_space = 240;
> + txq->auto_sz.reset_ts = jiffies;
>
> /*
> * Tell nic where to find circular buffer of Tx Frame Descriptors for
> @@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
> q->read_ptr != tfd_num;
> q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
> struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
> + struct ieee80211_tx_info *info;
> + unsigned long tx_time;
>
> if (WARN_ON_ONCE(!skb))
> continue;
>
> + info = IEEE80211_SKB_CB(skb);
> +
> iwl_pcie_free_tso_page(skb);
>
> __skb_queue_tail(skbs, skb);
> @@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
> iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);
>
> iwl_pcie_txq_free_tfd(trans, txq);
> +
> + tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
> + if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
> + txq->auto_sz.min_space -= 10;
> + txq->auto_sz.min_space =
> + max(txq->auto_sz.min_space, txq->q.high_mark);
> + } else if (time_after(jiffies,
> + tx_time + msecs_to_jiffies(20))) {
> + txq->auto_sz.min_space += 10;
> + txq->auto_sz.min_space =
> + min(txq->auto_sz.min_space, 252);
> + }
> }
>
> iwl_pcie_txq_progress(txq);
> @@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
> struct iwl_device_cmd *dev_cmd, int txq_id)
> {
> struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
> + struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
> struct ieee80211_hdr *hdr;
> struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
> struct iwl_cmd_meta *out_meta;
> @@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
>
> spin_lock(&txq->lock);
>
> - if (iwl_queue_space(q) < q->high_mark) {
> + info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
> + (void *)(uintptr_t)jiffies;
> +
> + if (time_after(jiffies,
> + txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
> + txq->auto_sz.min_space = 240;
> + txq->auto_sz.reset_ts = jiffies;
> + }
> +
> + if (iwl_queue_space(q) < txq->auto_sz.min_space) {
> iwl_stop_queue(trans, txq);
>
> /* don't put the packet on the ring, if there is no room */
> if (unlikely(iwl_queue_space(q) < 3)) {
> - struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
> -
> info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
> dev_cmd;
> __skb_queue_tail(&txq->overflow_q, skb);
> --
> 2.5.0
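Restated outside the diff, the patch's sampling logic amounts to the
following sketch. The constants (initial min_space 240, +/-10 steps,
10/20 ms sojourn thresholds, 5 s reset, cap of 252) come from the
patch; plain millisecond timestamps stand in for jiffies, and
high_mark is a parameter here rather than a queue field:

```c
#include <assert.h>

struct auto_size {
	int min_space;          /* free descriptors required to keep the queue open */
	unsigned long reset_ts; /* last time we restarted from a short queue */
};

static void auto_size_init(struct auto_size *a, unsigned long now)
{
	a->min_space = 240;
	a->reset_ts = now;
}

/* Enqueue path: every 5 seconds, start from a short queue again. */
static void auto_size_enqueue(struct auto_size *a, unsigned long now)
{
	if (now > a->reset_ts + 5000) {
		a->min_space = 240;
		a->reset_ts = now;
	}
}

/* Reclaim path, per completed frame: a lower min_space means a larger
 * allowed queue.  Short sojourn -> allow 10 more MPDUs in; long
 * sojourn -> reduce the transmit queue by 10. */
static void auto_size_reclaim(struct auto_size *a, unsigned long tx_time,
			      unsigned long now, int high_mark)
{
	if (now < tx_time + 10) {
		a->min_space -= 10;
		if (a->min_space < high_mark)
			a->min_space = high_mark;
	} else if (now > tx_time + 20) {
		a->min_space += 10;
		if (a->min_space > 252)
			a->min_space = 252;
	}
}
```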

=
Dave Täht
Let's go make home routers and wifi faster! With better software!
https://www.gofundme.com/savewifi

2016-02-08 10:04:41

by Emmanuel Grumbach

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On Fri, Feb 5, 2016 at 10:44 AM, Michal Kazior <[email protected]> wrote:
> On 4 February 2016 at 22:14, Ben Greear <[email protected]> wrote:
>> On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>>> On 02/04/2016 10:46 PM, Ben Greear wrote:
>>>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>>>>
>>>>> As many (all?) WiFi devices, Intel WiFi devices have
>>>>> transmit queues which have 256 transmit descriptors
>>>>> each and each descriptor corresponds to an MPDU.
>>>>> This means that when it is full, the queue contains
>>>>> 256 * ~1500 bytes to be transmitted (if we don't have
>>>>> A-MSDUs). The purpose of those queues is to have enough
>>>>> packets to be ready for transmission so that when the device
>>>>> gets an opportunity to transmit (TxOP), it can take as many
>>>>> packets as the spec allows and aggregate them into one
>>>>> A-MPDU or even several A-MPDUs if we are using bursts.
>>>>
>>>> I guess this is only really usable if you have exactly one
>>>> peer connected (ie, in station mode)?
>>>>
>>>> Otherwise, you could have one slow peer and one fast one,
>>>> and then I suspect this would not work so well?
>>>
>>>
>>> Yes. I guess this is one (big) limitation. I guess that what would happen
>>> in this case is that the latency would constantly jitter. But I also
>
> Hmm.. You'd probably need to track per-station packet sojourn time as
> well and make it possible to stop/wake queues per station.

Clearly here comes the difference between the devices you work on, and
the devices I work on. Intel devices are more client oriented. Our AP
mode doesn't handle many clients etc..

>
>
>>> noticed that I could reduce the transmit queue to 130 descriptors
>>> (instead of 256) and still reach maximal throughput because we can
>>> refill the queues quickly enough.
>>> In iwlwifi, we have plans to have one queue for each peer.
>>> This is under development. Not sure when it'll be ready. It also requires
>>> firmware change obviously.
>>
>> Per-peer queues will probably be nice, especially if we can keep the
>> buffer bloat manageable.
>
> Per-station queues sound tricky if you consider bufferbloat.

iwlwifi's A-MPDU model is different from athXk's I guess. In iwlwifi
(the Intel devices really since it is mostly firmware) the firmware
will try to use a TxOP whenever there is data in the queue. Then, once
we have a chance to transmit, we will look at what we have in the
queue and send the biggest aggregates we can (and bursts if allowed).
But we will not defer packets' transmission to get bigger aggregates.
Testing shows that under ideal conditions, we can have enough packets
in the queue to build big aggregates without creating artificial
latency.

>
> To maximize use of airtime (i.e. txop) you need to send big
> aggregates. Since aggregates are per station-tid to maximize
> multi-station performance (in AP mode) you'll need to queue a lot of
> frames, per each station, depending on the chosen tx rate.

Sure.

>
> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with service period at 10*10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off). The rule of thumb is the
> longer you wait the bigger capacity you can get.

I am not an expert about MU-MIMO, but I'll believe you :)
We can chat about this in a few days :)

Queueing frames under good conditions is fine in the sense that it is
a "good queue" (hey Stephen): you need those queues to maximize
throughput because of the bursty nature of WiFi, and the queue "moves"
quickly since you have high throughput, so the sojourn time in your
queue is relatively small. But when the link conditions get less good
you need to reduce the queue length because it doesn't really help you
anyway. This is what my patch aims at fixing.
All this is true when you have a small number of stations...
I understand from your comment that even in ideal conditions you still
need to create a lot of latency to gain throughput (TPT). In that case
there isn't much we can do without impacting either TPT or latency;
there is a real tradeoff.
I guess that again you are facing a classic AP problem that a station
or an AP with a small number of concurrent associated clients will
likely not have.

All this encourages me in my belief that I should do something in
iwlwifi for iwlwifi rather than at mac80211's level, since there seem
to be very different problems / use cases. But maybe this code can
still suit all those use cases and we'd just (...) have to give
different parameters to the "algorithm"?

>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition of keeping the latency down so I've been thinking
> how to satisfy both.

In your case, then yes. I guess I should limit my patch for queues
that serve vif of type BSS for now. It'll still help us for 99% of
iwlwifi's usages.

>
> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues. Also fq_codel
> is a less effective with this kind of setup.

I'd love to hear more about the reasons for fq_codel's efficiency
problem here (just for education).
I saw these patches. This makes the firmware much more actively
involved in the scheduling. As I said, in iwlwifi we plan to have one
hardware queue per-sta-per-tid so that the firmware would be able to
choose what station it wants to send frames to. The main purpose is to
improve uAPSD response time as an AP, but it will also make it easier
to choose the right frames to add to a MU-MIMO transmission.

>
> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring, it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means driver would need to be able to tell qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdiscs even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

Interesting. Not sure that 100-200ms is acceptable for all the use
cases though. Bursts can be achieved by TSO maybe? I did that to get
A-MSDU. But... that won't help for bridged traffic which is your
primary use case.

>
>
> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html
>
>
>>>> For reference, ath10k has around 1400 tx descriptors, though
>>>> in practice not all are usable, and in stock firmware, I'm guessing
>>>> the NIC will never be able to actually fill up its tx descriptors
>>>> and stop traffic. Instead, it just allows the stack to try to
>>>> TX, then drops the frame...
>>>
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. Firmware classifies them internally to per-AC HW queues.
>
>
> Michał
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2016-02-07 12:39:22

by Grumbach, Emmanuel

Subject: Re: [RFC RESEND] iwlwifi: pcie: transmit queue auto-sizing



On 02/05/2016 12:06 AM, Dave Taht wrote:
> I am not on linux-wireless nor netdev presently, but...
>
> On Thu, Feb 4, 2016 at 12:16 PM, Emmanuel Grumbach
> <[email protected]> wrote:
>> As many (all?) WiFi devices, Intel WiFi devices have
>> transmit queues which have 256 transmit descriptors
>> each and each descriptor corresponds to an MPDU.
>> This means that when it is full, the queue contains
>> 256 * ~1500 bytes to be transmitted (if we don't have
>> A-MSDUs). The purpose of those queues is to have enough
>> packets to be ready for transmission so that when the device
>> gets an opportunity to transmit (TxOP), it can take as many
>> packets as the spec allows and aggregate them into one
>> A-MPDU or even several A-MPDUs if we are using bursts.
>>
>> The problem is that the packets that are in these queues are
>> already out of control of the Qdisc and can stay in those
>> queues for a fairly long time when the link condition is
>> not good. This leads to the well known bufferbloat problem.
>>
>> This patch adds a way to tune the size of the transmit queue
>> so that it won't cause excessive latency. When the link
>> condition is good, the packets will flow smoothly and the
>> transmit queue will grow quickly allowing A-MPDUs and
>> maximal throughput. When the link is not optimal, we will
>> have retransmissions, lower transmit rates or signal
>> detection (CCA) which will cause a delay in the packet
>> transmission. The driver will sense this higher latency
>> and will reduce the size of the transmit queue.
>> This means that the packets that continue to arrive
>> will pile up in the Qdisc rather than in the device
>> queues. The major advantage of this approach is that
>> codel can now do its work.
>>
>> The algorithm is really (too?) simple:
>> every 5 seconds, starts from a short queue again.
>> If a packet has been in the queue for less than 10ms,
>> allow 10 more MPDUs in.
>> If a packet has been in the queue for more than 20ms,
>> reduce by 10 the size of the transmit queue.
>>
>> The implementation is really naive and way too simple:
>> * reading jiffies for every Tx / Tx status is not a
>> good idea.
>> * jiffies are not fine-grained enough on all platforms
>> * the constants chosen are really arbitrary and can't be
>> tuned.
>> * This may be implemented in mac80211 probably and help
>> other drivers.
>> * etc...
>>
>> But already this gives nice results. I ran a very simple
>> experiment: I put the device in a controlled environment
>> and ran traffic while running default sized ping in the
>> background. In this configuration, our device quickly
>> raises its transmission rate to the best rate.
>> Then, I force the device to use the lowest rate (6Mbps).
>> Of course, the throughput collapses, but the ping RTT
>> shoots up.
>> Using codel helps, but the latency is still high. Codel
>> with this patch gives much better results:
>>
>> pfifo_fast:
>> rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms
>>
>> fq_codel + Tx queue auto-sizing:
>> rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms
>>
>> fq_codel without Tx queue auto-sizing:
>> rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms
> This is a dramatic improvement. But I'm not sure what you are
> measuring. Is this the 6mbit test? What happens when you send traffic
> the other way (more pure acks, rather than big packets?)

Not tested. I only tested the part that I thought was most interesting:
lots of TCP traffic (Chariot) with a very low rate and ping in the
background.

>
> I try to encourage folk to use flent whenever possible, for pretty
> graphs and long term measurements, so you can simultaneously measure
> both throughput and latency.
>
> flent.org's .14 release just shipped.


Ok - I hope I'll get some time to give it a try.

>> Clearly, there is more work to do to be able to merge this,
>> but it seems that the wireless problems mentioned in
>> https://lwn.net/Articles/616241/ may have a solution.
> I gave talks on the problems that wifi had with bufferbloat at the
> ieee 802.11 wg meeting a while back, and more recently it was filmed
> at battlemesh.
>
> https://www.youtube.com/watch?v=Rb-UnHDw02o
>
> I have spent my time since trying to raise sufficient resources
> (testbeds and test tools), orgs, people and money to tackle these
> problems at more depth. We made a bit of progress recently which I can
> talk about offline...
>
> In that talk I suggested that overall we move towards timestamping
> everything, that (at least in the case of the ath9k and mt72) we tie
> together aggregation with a byte based estimator similar to how BQL
> works, and I hoped that eventually - we'd be able to basically - at
> low rates, keep no more than one aggregate in the hardware, one in the
> driver queue, and one being assembled. The pending aggregate would be
> sent to the hardware on the completion interrupt for the previous
> aggregate, which would fire off the size estimator and start
> aggregating the one being assembled.
>
> A hook to do that is in use on the mt72 chipset that felix is working
> on... but nowhere else so far as I know (as yet).
>
> the iwl does its own aggregation (I think(?))... but estimates can
> still be made...
>
> There are WAY more details of course - per station queuing, a separate
> multicast queue, only some in that talk!, but my hope was that under
> good conditions we'd get wireless-n down below 12ms driver overhead,
> even at 6mbit, before something like fq_codel could kick in (under
> good conditions! Plenty of other potential latency sources beside
> excessive queuing in wifi!). My ideal world would be to hold it at
> under 1250us at higher rates....
>
> Periodically sampling seems like a reasonable approach under lab
> conditions but it would be nicer to have feedback from the firmware -
> "I transmitted the last tx as an X byte aggregate, at MCS1, I had to
> retransmit a few packets once, it took me 6ms to acquire the media, I
> heard 3 other stations transmitting, etc.".
>
> The above info we know we can get from a few chipsets, but not enough
> was known about the iwl last I looked. And one reason why fq_codel -
> unassisted - is not quite the right thing on top of this is that
> multicast can take a really long time...
>
> Regardless, I'd highly love to see/use this patch myself in a variety
> of real world conditions and see what happens. And incremental
> progress is the only way forward. Thx for cheering me up.
>
>> Cc: Stephen Hemminger <[email protected]>
>> Cc: Dave Taht <[email protected]>
>> Cc: Jonathan Corbet <[email protected]>
>> Signed-off-by: Emmanuel Grumbach <[email protected]>
>> ---
>> Fix Dave's email address
>> ---
>> drivers/net/wireless/intel/iwlwifi/pcie/internal.h | 6 ++++
>> drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 32 ++++++++++++++++++++--
>> 2 files changed, 35 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
>> index 2f95916..d83eb56 100644
>> --- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
>> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
>> @@ -192,6 +192,11 @@ struct iwl_cmd_meta {
>> u32 flags;
>> };
>>
>> +struct iwl_txq_auto_size {
>> + int min_space;
>> + unsigned long reset_ts;
>> +};
>> +
>> /*
>> * Generic queue structure
>> *
>> @@ -293,6 +298,7 @@ struct iwl_txq {
>> bool block;
>> unsigned long wd_timeout;
>> struct sk_buff_head overflow_q;
>> + struct iwl_txq_auto_size auto_sz;
>> };
>>
>> static inline dma_addr_t
>> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
>> index 837a7d5..4d1dee6 100644
>> --- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
>> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
>> @@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,
>>
>> spin_lock_init(&txq->lock);
>> __skb_queue_head_init(&txq->overflow_q);
>> + txq->auto_sz.min_space = 240;
>> + txq->auto_sz.reset_ts = jiffies;
>>
>> /*
>> * Tell nic where to find circular buffer of Tx Frame Descriptors for
>> @@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
>> q->read_ptr != tfd_num;
>> q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
>> struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
>> + struct ieee80211_tx_info *info;
>> + unsigned long tx_time;
>>
>> if (WARN_ON_ONCE(!skb))
>> continue;
>>
>> + info = IEEE80211_SKB_CB(skb);
>> +
>> iwl_pcie_free_tso_page(skb);
>>
>> __skb_queue_tail(skbs, skb);
>> @@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
>> iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);
>>
>> iwl_pcie_txq_free_tfd(trans, txq);
>> +
>> + tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
>> + if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
>> + txq->auto_sz.min_space -= 10;
>> + txq->auto_sz.min_space =
>> + max(txq->auto_sz.min_space, txq->q.high_mark);
>> + } else if (time_after(jiffies,
>> + tx_time + msecs_to_jiffies(20))) {
>> + txq->auto_sz.min_space += 10;
>> + txq->auto_sz.min_space =
>> + min(txq->auto_sz.min_space, 252);
>> + }
>> }
>>
>> iwl_pcie_txq_progress(txq);
>> @@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
>> struct iwl_device_cmd *dev_cmd, int txq_id)
>> {
>> struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
>> + struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>> struct ieee80211_hdr *hdr;
>> struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
>> struct iwl_cmd_meta *out_meta;
>> @@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
>>
>> spin_lock(&txq->lock);
>>
>> - if (iwl_queue_space(q) < q->high_mark) {
>> + info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
>> + (void *)(uintptr_t)jiffies;
>> +
>> + if (time_after(jiffies,
>> + txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
>> + txq->auto_sz.min_space = 240;
>> + txq->auto_sz.reset_ts = jiffies;
>> + }
>> +
>> + if (iwl_queue_space(q) < txq->auto_sz.min_space) {
>> iwl_stop_queue(trans, txq);
>>
>> /* don't put the packet on the ring, if there is no room */
>> if (unlikely(iwl_queue_space(q) < 3)) {
>> - struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>> -
>> info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
>> dev_cmd;
>> __skb_queue_tail(&txq->overflow_q, skb);
>> --
>> 2.5.0
> =
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> https://www.gofundme.com/savewifi
>


2016-02-08 10:00:20

by Michal Kazior

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On 5 February 2016 at 17:47, Dave Taht <[email protected]> wrote:
>> A bursted txop can be as big as 5-10ms. If you consider you want to
>> queue 5-10ms worth of data for *each* station at any given time you
>> obviously introduce a lot of lag. If you have 10 stations you might
>> end up with service period at 10*10ms = 100ms. This gets even worse if
>> you consider MU-MIMO because you need to do an expensive sounding
>> procedure before transmitting. So while SU aggregation can probably
>> still work reasonably well with shorter bursts (1-2ms) MU needs at
>> least 3ms to get *any* gain when compared to SU (which obviously means
>> you want more to actually make MU pay off).
>
> I am not sure where you get these numbers. Got a spreadsheet?

Here's a nice summary on some of it:

http://chimera.labs.oreilly.com/books/1234000001739/ch03.html#figure-mac-ampdu

Even if your single A-MPDU can be shorter than a txop you can burst a
few of them if my understanding is correct.

The overhead associated with MU sounding is something I've been told
only. Apparently for MU to pay off you need fairly big bursts. This
implies that the more stations you have to service the less it makes
sense to attempt MU if you consider latency.


> Gradually reducing the maximum sized txop as a function of the number
> of stations makes sense. If you have 10 stations pending delivery and
> reduced the max txop to 1ms, you hurt bandwidth at that instant, but
> by offering more service to more stations, in less time, they will
> converge on a reasonable share of the bandwidth for each, faster[1].
> And I'm sure that the person videoconferencing on a link like that
> would appreciate getting some service inside of a 10ms interval,
> rather than a 100ms.
>
> yes, there's overhead, and that's not the right number, which would
> vary as to g,n,ac and successors.
>
> You will also get more opportunities to use mu-mimo with shorter
> bursts extant and more stations being regularly serviced.
>
> [1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

This is my thinking as well, at least for most common use cases.

If you try to optimize for throughput by introducing extra induced
latency you might end up not being able to use aggregation in practice
anyway because you won't be able to start up connections and ask for
enough data - or at least that's what my intuition tells me.

But, like I've mentioned, there's interest in making it possible to
maximize for throughput (regardless of latency). This surely makes
sense for synthetic UDP benchmarks. But does it make sense for any
real-world application? No idea.


>> The rule of thumb is the
>> longer you wait the bigger capacity you can get.
>
> This is not strictly true as the "fountain" of packets is regulated by
> acks on the other side of the link, and ramp up or down as a function
> of service time and loss.

Yes, if you consider real world cases, i.e. TCP, web traffic, etc.
then you're correct.


>> Apparently there's interest in maximizing throughput but it stands in
>> direct opposition of keeping the latency down so I've been thinking
>> how to satisfy both.
>>
>> The current approach ath10k is taking (patches in review [1][2]) is to
>> use mac80211 software queues for per-station queuing, exposing queue
>> state to firmware (it decides where frames should be dequeued from)
>> and making it possible to stop/wake per-station tx subqueue with fake
>> netdev queues. I'm starting to think this is not the right way though
>> because it's inherently hard to control latency and there's a huge
>> memory overhead associated with the fake netdev queues.
>
> What is this overhead?

E.g. if you want to be able to maximize throughput for 50 MU clients
you need to be able to queue, in theory, 50*200 (roughly) frames. This
translates to both huge memory usage and latency *and* renders the
(fq_)codel qdisc rather moot.


> Applying things like codel tend to dramatically shorten the amount of
> skbs extant...

> modern 802.11ac capable hardware has tons more
> memory...

I don't think it does. QCA988x is able to handle "only" 1424 tx
descriptors (IOW 1500-byte long MSDUs) in the driver-to-firmware tx
queue (it's a flat queue). QCA99x0 is able to handle 2500 if asked
politely.

This is still not enough to satisfy the insane "maximize the
capacity/throughput" expectations though.

You could actually argue it's too much from the bufferbloat problem
point of view anyway and Emmanuel's patch proves it is beneficial to
buffer less in driver depending on the sojourn packet time.
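For a sense of scale, a flat ring that size holds seconds of standing
queue at the lowest rates. A quick sketch of the arithmetic
(illustrative back-of-the-envelope only, ignoring aggregation and
per-frame overheads):

```c
#include <assert.h>

/* Time to drain a ring of 'descs' full-size MSDUs at 'rate_bps'. */
static unsigned long long drain_ms(unsigned long long descs,
				   unsigned long long msdu_bytes,
				   unsigned long long rate_bps)
{
	return descs * msdu_bytes * 8 * 1000 / rate_bps;
}
```

1424 descriptors of 1500-byte MSDUs come to roughly 2 MB: close to
three seconds of queue at 6 Mbit/s, but well under 100 ms at a
300 Mbit/s aggregate rate, which is why a fixed ring size cannot be
right at both ends.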


>> Also fq_codel
>> is a less effective with this kind of setup.
>
> fq_codel's principal problems with working with wifi are long and
> documented in the talk above.
>
>> My current thinking is that the entire problem should be solved via
>> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
>> limit/target/interval/quantum knobs to tune it for higher latency of
>> aggregation-oriented Wi-Fi links where long service time (think
>> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
>> works in the first place, i.e. Wi-Fi gets better throughput if you
>> deliver bursts of packets destined to the same station. Moreover this
>> gets even more complicated with MU-MIMO where you may want to consider
>> spatial location (which influences signal quality when grouped) of
>> each station when you decide which set of stations you're going to
>> aggregate to in parallel. Since drivers have a finite tx ring, it
>> is important to deliver bursts that can actually be aggregated
>> efficiently. This means driver would need to be able to tell qdisc
>> about per-flow conditions to influence the RR scheme in some way
>> (assuming a qdiscs even understands flows; do we need a unified way of
>> talking about flows between qdiscs and drivers?).
>
> This is a very good summary of the problems in layering fq_codel as it
> exists today on top of wifi as it exists today. :/ Our conclusion
> several years ago was that as the information needed to do things more
> right was in the mac80211 layer that we could not evolve the qdisc
> layer to suit, and needed to move the core ideas into the mac80211
> layer.
>
> Things have evolved since, but I still think we can't get enough info
> up to the qdisc layer (locks and so on) to use it sanely.

The current per-station queue implementation in mac80211 doesn't seem
sufficient. Each of these queues uses a simple flat FIFO
(sk_buff_head) limited by packet count only, with somewhat broken
behavior when the packet limit is reached, as you often end up with
unfairly populated queues. They need to have codel applied to them
somehow. You'll eventually end up re-inventing fq_codel in mac80211,
making the qdisc redundant.

Moreover these queues aren't per-station only. They are
per-station-per-tid giving 16 queues per station. This is important
because you can't aggregate traffic going out on different tids.

Even without explicit air conditions feedback from mac80211 to
fq_codel I suspect it'd still be beneficial if fq_codel was able to
group (sub-)flows into per-station-tid buckets and burst them out in
subsequent dequeue() calls so the chance they get aggregated is
higher.
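That grouping could look something like the toy sketch below:
round-robin over per-station-tid buckets, but serve several packets
from the same bucket per dequeue round so consecutive frames stand a
chance of being aggregated. All names are made up and there is no
codel state, dropping or byte accounting here:

```c
#include <assert.h>
#include <string.h>

#define MAX_STA	4
#define MAX_TID	16
#define BUCKETS	(MAX_STA * MAX_TID)
#define DEPTH	32

struct bucket {
	int pkts[DEPTH];	/* opaque packet ids */
	int head, tail;
};

struct sched {
	struct bucket q[BUCKETS];
	int next;		/* round-robin cursor over buckets */
};

static void enqueue(struct sched *s, int sta, int tid, int pkt)
{
	struct bucket *b = &s->q[sta * MAX_TID + tid];

	b->pkts[b->tail++ % DEPTH] = pkt;
}

/* Dequeue up to 'burst' packets, all from one station-tid bucket. */
static int dequeue_burst(struct sched *s, int burst, int *out)
{
	int i, n = 0;

	for (i = 0; i < BUCKETS && n == 0; i++) {
		struct bucket *b = &s->q[s->next];

		s->next = (s->next + 1) % BUCKETS;
		while (b->head != b->tail && n < burst)
			out[n++] = b->pkts[b->head++ % DEPTH];
	}
	return n;
}
```

A real integration would need per-flow codel state inside each bucket
and some feedback from the driver about which buckets can currently be
aggregated efficiently.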


Michał

2016-02-04 20:16:40

by Grumbach, Emmanuel

Subject: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

As many (all?) WiFi devices, Intel WiFi devices have
transmit queues which have 256 transmit descriptors
each and each descriptor corresponds to an MPDU.
This means that when it is full, the queue contains
256 * ~1500 bytes to be transmitted (if we don't have
A-MSDUs). The purpose of those queues is to have enough
packets to be ready for transmission so that when the device
gets an opportunity to transmit (TxOP), it can take as many
packets as the spec allows and aggregate them into one
A-MPDU or even several A-MPDUs if we are using bursts.

The problem is that the packets that are in these queues are
already out of control of the Qdisc and can stay in those
queues for a fairly long time when the link condition is
not good. This leads to the well known bufferbloat problem.

This patch adds a way to tune the size of the transmit queue
so that it won't cause excessive latency. When the link
condition is good, the packets will flow smoothly and the
transmit queue will grow quickly allowing A-MPDUs and
maximal throughput. When the link is not optimal, we will
have retransmissions, lower transmit rates or signal
detection (CCA) which will cause a delay in the packet
transmission. The driver will sense this higher latency
and will reduce the size of the transmit queue.
This means that the packets that continue to arrive
will pile up in the Qdisc rather than in the device
queues. The major advantage of this approach is that
codel can now do its work.

The algorithm is really (too?) simple:
every 5 seconds, start from a short queue again.
If a packet has been in the queue for less than 10ms,
allow 10 more MPDUs in.
If a packet has been in the queue for more than 20ms,
reduce the size of the transmit queue by 10.

The implementation is really naive and way too simple:
* reading jiffies for every Tx / Tx status is not a
good idea.
* jiffies are not fine-grained enough on all platforms
* the constants chosen are really arbitrary and can't be
tuned.
* This could probably be implemented in mac80211 and help
other drivers.
* etc...

But already this gives nice results. I ran a very simple
experiment: I put the device in a controlled environment
and ran traffic while running default sized ping in the
background. In this configuration, our device quickly
raises its transmission rate to the best rate.
Then I force the device to use the lowest rate (6Mbps).
Of course the throughput collapses, and the ping RTT
shoots up.
Using codel helps, but the latency is still high. Codel
with this patch gives much better results:

pfifo_fast:
rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms

fq_codel + Tx queue auto-sizing:
rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms

fq_codel without Tx queue auto-sizing:
rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms

Clearly, there is more work to do to be able to merge this,
but it seems that the wireless problems mentioned in
https://lwn.net/Articles/616241/ may have a solution.

Cc: Stephen Hemminger <[email protected]>
Cc: Dave Taht <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Emmanuel Grumbach <[email protected]>
---
Fix Dave's email address
---
drivers/net/wireless/intel/iwlwifi/pcie/internal.h | 6 ++++
drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 32 ++++++++++++++++++++--
2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
index 2f95916..d83eb56 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
@@ -192,6 +192,11 @@ struct iwl_cmd_meta {
u32 flags;
};

+struct iwl_txq_auto_size {
+ int min_space;
+ unsigned long reset_ts;
+};
+
/*
* Generic queue structure
*
@@ -293,6 +298,7 @@ struct iwl_txq {
bool block;
unsigned long wd_timeout;
struct sk_buff_head overflow_q;
+ struct iwl_txq_auto_size auto_sz;
};

static inline dma_addr_t
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 837a7d5..4d1dee6 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,

spin_lock_init(&txq->lock);
__skb_queue_head_init(&txq->overflow_q);
+ txq->auto_sz.min_space = 240;
+ txq->auto_sz.reset_ts = jiffies;

/*
* Tell nic where to find circular buffer of Tx Frame Descriptors for
@@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
q->read_ptr != tfd_num;
q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
+ struct ieee80211_tx_info *info;
+ unsigned long tx_time;

if (WARN_ON_ONCE(!skb))
continue;

+ info = IEEE80211_SKB_CB(skb);
+
iwl_pcie_free_tso_page(skb);

__skb_queue_tail(skbs, skb);
@@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);

iwl_pcie_txq_free_tfd(trans, txq);
+
+ tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
+ if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
+ txq->auto_sz.min_space -= 10;
+ txq->auto_sz.min_space =
+ max(txq->auto_sz.min_space, txq->q.high_mark);
+ } else if (time_after(jiffies,
+ tx_time + msecs_to_jiffies(20))) {
+ txq->auto_sz.min_space += 10;
+ txq->auto_sz.min_space =
+ min(txq->auto_sz.min_space, 252);
+ }
}

iwl_pcie_txq_progress(txq);
@@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
struct iwl_device_cmd *dev_cmd, int txq_id)
{
struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
+ struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
struct ieee80211_hdr *hdr;
struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
struct iwl_cmd_meta *out_meta;
@@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,

spin_lock(&txq->lock);

- if (iwl_queue_space(q) < q->high_mark) {
+ info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
+ (void *)(uintptr_t)jiffies;
+
+ if (time_after(jiffies,
+ txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
+ txq->auto_sz.min_space = 240;
+ txq->auto_sz.reset_ts = jiffies;
+ }
+
+ if (iwl_queue_space(q) < txq->auto_sz.min_space) {
iwl_stop_queue(trans, txq);

/* don't put the packet on the ring, if there is no room */
if (unlikely(iwl_queue_space(q) < 3)) {
- struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
-
info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
dev_cmd;
__skb_queue_tail(&txq->overflow_q, skb);
--
2.5.0


2016-02-05 16:48:33

by Ben Greear

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On 02/05/2016 12:44 AM, Michal Kazior wrote:

> Per-station queues sound tricky if you consider bufferbloat.
>
> To maximize use of airtime (i.e. txop) you need to send big
> aggregates. Since aggregates are per station-tid to maximize
> multi-station performance (in AP mode) you'll need to queue a lot of
> frames, per each station, depending on the chosen tx rate.
>
> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with a service period of 10 * 10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off). The rule of thumb is the
> longer you wait the bigger capacity you can get.
>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition to keeping the latency down, so I've been thinking
> how to satisfy both.

I really think this should be tunable. For instance, someone making an AP
that is mostly for letting lots of users stream movies would care a lot more
about throughput than someone making an AP that is mainly for browsing the
web and doing more latency-sensitive activities.


> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues. Also fq_codel
> is less effective with this kind of setup.
>
> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring, it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means the driver would need to be able to tell the qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdisc even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

I wonder if it would work better if we removed most of the
tid handling and aggregation logic in the firmware. Maybe just have the mgt Q and best
effort (and skip VO/VI). Let the OS tell (or suggest to) the firmware when aggregation
starts and stops. That might at least cut the number of queues in half,
saving memory and latency up and down the stack.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2016-02-05 16:47:34

by Dave Taht

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with a service period of 10 * 10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off).

I am not sure where you get these numbers. Got a spreadsheet?

Gradually reducing the maximum txop size as a function of the number
of stations makes sense. If you have 10 stations pending delivery and
reduce the max txop to 1ms, you hurt bandwidth at that instant, but
by offering more service to more stations in less time, they will
converge on a reasonable share of the bandwidth each, faster [1].
And I'm sure that the person videoconferencing on a link like that
would appreciate getting some service inside of a 10ms interval,
rather than a 100ms.

Yes, there's overhead, and that's not the right number; it would
vary across g, n, ac and their successors.

You will also get more opportunities to use mu-mimo with shorter
bursts extant and more stations being regularly serviced.

[1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

> The rule of thumb is the
> longer you wait the bigger capacity you can get.

This is not strictly true, as the "fountain" of packets is regulated by
acks on the other side of the link, and ramps up or down as a function
of service time and loss.

>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition to keeping the latency down, so I've been thinking
> how to satisfy both.
>
> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues.

What is this overhead?

Applying things like codel tends to dramatically reduce the number of
skbs extant... modern 802.11ac-capable hardware has tons more
memory...

> Also fq_codel
> is less effective with this kind of setup.

fq_codel's principal problems with working with wifi are long and
documented in the talk above.

> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring, it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means the driver would need to be able to tell the qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdisc even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

This is a very good summary of the problems in layering fq_codel as it
exists today on top of wifi as it exists today. :/ Our conclusion
several years ago was that as the information needed to do things more
right was in the mac80211 layer that we could not evolve the qdisc
layer to suit, and needed to move the core ideas into the mac80211
layer.

Things have evolved since, but I still think we can't get enough info
up to the qdisc layer (locks and so on) to use it sanely.

>
> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html

I will review!

>
>>>> For reference, ath10k has around 1400 tx descriptors, though
>>>> in practice not all are usable, and in stock firmware, I'm guessing
>>>> the NIC will never be able to actually fill up its tx descriptors
>>>> and stop traffic. Instead, it just allows the stack to try to
>>>> TX, then drops the frame...
>>>
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. Firmware classifies them internally to per-AC HW queues.
>
>
> Michał

2016-02-05 08:44:45

by Michal Kazior

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

On 4 February 2016 at 22:14, Ben Greear <[email protected]> wrote:
> On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>> On 02/04/2016 10:46 PM, Ben Greear wrote:
>>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>>>
>>>> As many (all?) WiFi devices, Intel WiFi devices have
>>>> transmit queues which have 256 transmit descriptors
>>>> each and each descriptor corresponds to an MPDU.
>>>> This means that when it is full, the queue contains
>>>> 256 * ~1500 bytes to be transmitted (if we don't have
>>>> A-MSDUs). The purpose of those queues is to have enough
>>>> packets to be ready for transmission so that when the device
>>>> gets an opportunity to transmit (TxOP), it can take as many
>>>> packets as the spec allows and aggregate them into one
>>>> A-MPDU or even several A-MPDUs if we are using bursts.
>>>
>>> I guess this is only really usable if you have exactly one
>>> peer connected (ie, in station mode)?
>>>
>>> Otherwise, you could have one slow peer and one fast one,
>>> and then I suspect this would not work so well?
>>
>>
>> Yes, I guess this is one (big) limitation. I guess what would happen
>> in this case is that the latency would constantly jitter. But I also

Hmm.. You'd probably need to track per-station packet sojourn time as
well and make it possible to stop/wake queues per station.


>> noticed that I could reduce the transmit queue to 130 descriptors
>> (instead of 256) and still reach maximal throughput because we can
>> refill the queues quickly enough.
>> In iwlwifi, we have plans to have one queue for each peer.
>> This is under development. Not sure when it'll be ready. It also requires
>> firmware change obviously.
>
> Per-peer queues will probably be nice, especially if we can keep the
> buffer bloat manageable.

Per-station queues sound tricky if you consider bufferbloat.

To maximize use of airtime (i.e. txop) you need to send big
aggregates. Since aggregates are per station-tid to maximize
multi-station performance (in AP mode) you'll need to queue a lot of
frames for each station, depending on the chosen tx rate.

A bursted txop can be as big as 5-10ms. If you consider you want to
queue 5-10ms worth of data for *each* station at any given time you
obviously introduce a lot of lag. If you have 10 stations you might
end up with a service period of 10 * 10ms = 100ms. This gets even worse if
you consider MU-MIMO because you need to do an expensive sounding
procedure before transmitting. So while SU aggregation can probably
still work reasonably well with shorter bursts (1-2ms) MU needs at
least 3ms to get *any* gain when compared to SU (which obviously means
you want more to actually make MU pay off). The rule of thumb is the
longer you wait the bigger capacity you can get.

Apparently there's interest in maximizing throughput but it stands in
direct opposition to keeping the latency down, so I've been thinking
how to satisfy both.

The current approach ath10k is taking (patches in review [1][2]) is to
use mac80211 software queues for per-station queuing, exposing queue
state to firmware (it decides where frames should be dequeued from)
and making it possible to stop/wake per-station tx subqueue with fake
netdev queues. I'm starting to think this is not the right way though
because it's inherently hard to control latency and there's a huge
memory overhead associated with the fake netdev queues. Also fq_codel
is less effective with this kind of setup.

My current thinking is that the entire problem should be solved via
(per-AC) qdiscs, e.g. fq_codel. I guess one could use
limit/target/interval/quantum knobs to tune it for higher latency of
aggregation-oriented Wi-Fi links where long service time (think
100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
works in the first place, i.e. Wi-Fi gets better throughput if you
deliver bursts of packets destined to the same station. Moreover this
gets even more complicated with MU-MIMO where you may want to consider
spatial location (which influences signal quality when grouped) of
each station when you decide which set of stations you're going to
aggregate to in parallel. Since drivers have a finite tx ring, it
is important to deliver bursts that can actually be aggregated
efficiently. This means the driver would need to be able to tell the qdisc
about per-flow conditions to influence the RR scheme in some way
(assuming a qdisc even understands flows; do we need a unified way of
talking about flows between qdiscs and drivers?).


[1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
[2]: https://www.spinics.net/lists/linux-wireless/msg146512.html


>>> For reference, ath10k has around 1400 tx descriptors, though
>>> in practice not all are usable, and in stock firmware, I'm guessing
>>> the NIC will never be able to actually fill up its tx descriptors
>>> and stop traffic. Instead, it just allows the stack to try to
>>> TX, then drops the frame...
>>
>>
>> 1400 descriptors, ok... but they are not organised in queues?
>> (forgive my ignorance of athX drivers)
>
>
> I think all the details are in the firmware, at least for now.

Yeah. Basically ath10k has a flat set of tx descriptors which are
AC-agnostic. Firmware classifies them internally to per-AC HW queues.


Michał

2016-02-04 20:56:05

by Grumbach, Emmanuel

Subject: Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing



On 02/04/2016 10:46 PM, Ben Greear wrote:
> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>> As many (all?) WiFi devices, Intel WiFi devices have
>> transmit queues which have 256 transmit descriptors
>> each and each descriptor corresponds to an MPDU.
>> This means that when it is full, the queue contains
>> 256 * ~1500 bytes to be transmitted (if we don't have
>> A-MSDUs). The purpose of those queues is to have enough
>> packets to be ready for transmission so that when the device
>> gets an opportunity to transmit (TxOP), it can take as many
>> packets as the spec allows and aggregate them into one
>> A-MPDU or even several A-MPDUs if we are using bursts.
> I guess this is only really usable if you have exactly one
> peer connected (ie, in station mode)?
>
> Otherwise, you could have one slow peer and one fast one,
> and then I suspect this would not work so well?

Yes, I guess this is one (big) limitation. I guess what would happen
in this case is that the latency would constantly jitter. But I also
noticed that I could reduce the transmit queue to 130 descriptors
(instead of 256) and still reach maximal throughput because we can
refill the queues quickly enough.
In iwlwifi, we have plans to have one queue for each peer.
This is under development; not sure when it'll be ready. It also requires
a firmware change, obviously.

> For reference, ath10k has around 1400 tx descriptors, though
> in practice not all are usable, and in stock firmware, I'm guessing
> the NIC will never be able to actually fill up its tx descriptors
> and stop traffic. Instead, it just allows the stack to try to
> TX, then drops the frame...

1400 descriptors, ok... but they are not organised in queues?
(forgive my ignorance of athX drivers)

>
> Thanks,
> Ben
>