2021-05-28 06:07:32

by Yunsheng Lin

Subject: [PATCH net-next 0/3] Some optimization for lockless qdisc

Patch 1: remove unnecessary seqcount operation.
Patch 2: implement TCQ_F_CAN_BYPASS.
Patch 3: remove qdisc->empty.

Performance data for pktgen in queue_xmit mode + dummy netdev
with pfifo_fast:

threads  unpatched   patched    delta
   1     2.60Mpps    3.21Mpps    +23%
   2     3.84Mpps    5.56Mpps    +44%
   4     5.52Mpps    5.58Mpps     +1%
   8     2.77Mpps    2.76Mpps   -0.3%
  16     2.24Mpps    2.23Mpps   -0.4%

Performance for IP forwarding testing: 1.05Mpps increases to
1.16Mpps, about a 10% improvement.

V1: Drop the RFC tag; add nolock_qdisc_is_empty() and do the qdisc
    empty check without the protection of qdisc->seqlock to avoid
    doing an unnecessary spin_trylock() in the contended case.
RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate a non-empty
    qdisc, and add patches 1 and 3.

Yunsheng Lin (3):
net: sched: avoid unnecessary seqcount operation for lockless qdisc
net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
net: sched: remove qdisc->empty for lockless qdisc

include/net/sch_generic.h | 31 ++++++++++++++++++-------------
net/core/dev.c | 26 ++++++++++++++++++++++++--
net/sched/sch_generic.c | 23 ++++++++++++++++-------
3 files changed, 58 insertions(+), 22 deletions(-)

--
2.7.4


2021-05-28 08:50:07

by Yunsheng Lin

Subject: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flags set, but queue discipline bypass does not work for lockless
qdisc because the skb is always enqueued to the qdisc even when the
qdisc is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly to
the driver for an empty lockless qdisc, which avoids the enqueue
and dequeue operations.

qdisc->empty is not reliable for indicating an empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2cedd ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicates lock contention, suggesting that it is better not to do
the qdisc bypass in order to avoid packet out-of-order problems.

To make the MISSED state a reliable indication of an empty qdisc,
we need to ensure that testing and clearing of the MISSED state are
done within the protection of qdisc->seqlock; only setting the
MISSED state can be done without that protection. A MISSED state
test is also done outside the protection of qdisc->seqlock to
avoid doing an unnecessary spin_trylock() in the contended case.
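
Condensed from the __dev_xmit_skb() hunk in the diff below, the
resulting fast-path check pattern is:

        if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
            qdisc_run_begin(q)) {
                /* retest under the protection of q->seqlock */
                if (unlikely(!nolock_qdisc_is_empty(q))) {
                        /* fall back to the normal enqueue + dequeue path */
                } else {
                        /* bypass: hand the skb straight to the driver */
                }
                qdisc_run_end(q);
        }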

The following cases need special handling:
1. The MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), but __qdisc_run() might not be able
   to dequeue all skbs in one round and then calls
   __netif_schedule(), which might result in a non-empty qdisc
   without MISSED set. To avoid this, the MISSED state is set for
   the lockless qdisc and __netif_schedule() is called at the end
   of qdisc_run_end().

2. The MISSED state also needs to be set for the lockless qdisc
   instead of calling __netif_schedule() directly when requeuing
   an skb, for a similar reason.

3. For the netdev queue stopped case, the MISSED state needs
   clearing while the netdev queue is stopped, otherwise there may
   be unnecessary __netif_schedule() calls. So a new DRAINING
   state is added to indicate this case, which also indicates a
   non-empty qdisc.

4. There are already netif_xmit_frozen_or_stopped() checks in
   dequeue_skb() and sch_direct_xmit(), both within the protection
   of qdisc->seqlock, but the same check in __dev_xmit_skb() is
   done without that protection, which might make the empty
   indication of a lockless qdisc unreliable. So remove the check
   in __dev_xmit_skb(); the checks within the protection of
   qdisc->seqlock seem enough to avoid the cpu consumption problem
   for the netdev queue stopped case.

Signed-off-by: Yunsheng Lin <[email protected]>
---
V1: Add nolock_qdisc_is_empty() and do the qdisc empty check
    without the protection of qdisc->seqlock to avoid doing an
    unnecessary spin_trylock() in the contended case.
RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate a non-empty
    qdisc.
---
include/net/sch_generic.h | 16 +++++++++++++---
net/core/dev.c | 26 ++++++++++++++++++++++++--
net/sched/sch_generic.c | 20 ++++++++++++++++----
3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3ed6bcc..177f240 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -37,8 +37,15 @@ enum qdisc_state_t {
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
__QDISC_STATE_MISSED,
+ __QDISC_STATE_DRAINING,
};

+#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
+#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
+
+#define QDISC_STATE_NON_EMPTY (QDISC_STATE_MISSED | \
+ QDISC_STATE_DRAINING)
+
struct qdisc_size_table {
struct rcu_head rcu;
struct list_head list;
@@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
}

+static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
+{
+ return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
+}
+
static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
{
return q->flags & TCQ_F_CPUSTATS;
@@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
spin_unlock(&qdisc->seqlock);

if (unlikely(test_bit(__QDISC_STATE_MISSED,
- &qdisc->state))) {
- clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+ &qdisc->state)))
__netif_schedule(qdisc);
- }
} else {
write_seqcount_end(&qdisc->running);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 50531a2..e4cc926 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
qdisc_calculate_pkt_len(skb, q);

if (q->flags & TCQ_F_NOLOCK) {
+ if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
+ qdisc_run_begin(q)) {
+ /* Retest nolock_qdisc_is_empty() within the protection
+ * of q->seqlock to ensure qdisc is indeed empty.
+ */
+ if (unlikely(!nolock_qdisc_is_empty(q))) {
+ rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+ __qdisc_run(q);
+ qdisc_run_end(q);
+
+ goto no_lock_out;
+ }
+
+ qdisc_bstats_cpu_update(q, skb);
+ if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
+ !nolock_qdisc_is_empty(q))
+ __qdisc_run(q);
+
+ qdisc_run_end(q);
+ return NET_XMIT_SUCCESS;
+ }
+
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
- if (likely(!netif_xmit_frozen_or_stopped(txq)))
- qdisc_run(q);
+ qdisc_run(q);

+no_lock_out:
if (unlikely(to_free))
kfree_skb_list(to_free);
return rc;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fc8b56b..83d7f5f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
*/
if (!netif_xmit_frozen_or_stopped(txq))
set_bit(__QDISC_STATE_MISSED, &q->state);
+ else
+ set_bit(__QDISC_STATE_DRAINING, &q->state);
}

/* Main transmission queue. */
@@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)

skb = next;
}
- if (lock)
+
+ if (lock) {
spin_unlock(lock);
- __netif_schedule(q);
+ set_bit(__QDISC_STATE_MISSED, &q->state);
+ } else {
+ __netif_schedule(q);
+ }
}

static void try_bulk_dequeue_skb(struct Qdisc *q,
@@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
while (qdisc_restart(q, &packets)) {
quota -= packets;
if (quota <= 0) {
- __netif_schedule(q);
+ if (q->flags & TCQ_F_NOLOCK)
+ set_bit(__QDISC_STATE_MISSED, &q->state);
+ else
+ __netif_schedule(q);
+
break;
}
}
@@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
if (likely(skb)) {
qdisc_update_stats_at_dequeue(qdisc, skb);
} else if (need_retry &&
- test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+ READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
/* Delay clearing the STATE_MISSED here to reduce
* the overhead of the second spin_trylock() in
* qdisc_run_begin() and __netif_schedule() calling
* in qdisc_run_end().
*/
clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+ clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);

/* Make sure dequeuing happens after clearing
* STATE_MISSED.
@@ -1204,6 +1215,7 @@ static void dev_reset_queue(struct net_device *dev,
spin_unlock_bh(qdisc_lock(qdisc));
if (nolock) {
clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+ clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
spin_unlock_bh(&qdisc->seqlock);
}
}
--
2.7.4

2021-05-29 01:01:32

by Jakub Kicinski

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flags set, but queue discipline bypass does not work for lockless
> qdisc because the skb is always enqueued to the qdisc even when the
> qdisc is empty, see __dev_xmit_skb().
>
> This patch calls sch_direct_xmit() to transmit the skb directly to
> the driver for an empty lockless qdisc, which avoids the enqueue
> and dequeue operations.
>
> qdisc->empty is not reliable for indicating an empty qdisc because
> there is a time window between enqueuing and setting qdisc->empty.
> So we use the MISSED state added in commit a90c57f2cedd ("net:
> sched: fix packet stuck problem for lockless qdisc"), which
> indicates lock contention, suggesting that it is better not to do
> the qdisc bypass in order to avoid packet out-of-order problems.
>
> To make the MISSED state a reliable indication of an empty qdisc,
> we need to ensure that testing and clearing of the MISSED state are
> done within the protection of qdisc->seqlock; only setting the
> MISSED state can be done without that protection. A MISSED state
> test is also done outside the protection of qdisc->seqlock to
> avoid doing an unnecessary spin_trylock() in the contended case.
>
> The following cases need special handling:
> 1. The MISSED state is cleared before another round of dequeuing
>    in pfifo_fast_dequeue(), but __qdisc_run() might not be able
>    to dequeue all skbs in one round and then calls
>    __netif_schedule(), which might result in a non-empty qdisc
>    without MISSED set. To avoid this, the MISSED state is set for
>    the lockless qdisc and __netif_schedule() is called at the end
>    of qdisc_run_end().
>
> 2. The MISSED state also needs to be set for the lockless qdisc
>    instead of calling __netif_schedule() directly when requeuing
>    an skb, for a similar reason.
>
> 3. For the netdev queue stopped case, the MISSED state needs
>    clearing while the netdev queue is stopped, otherwise there may
>    be unnecessary __netif_schedule() calls. So a new DRAINING
>    state is added to indicate this case, which also indicates a
>    non-empty qdisc.
>
> 4. There are already netif_xmit_frozen_or_stopped() checks in
>    dequeue_skb() and sch_direct_xmit(), both within the protection
>    of qdisc->seqlock, but the same check in __dev_xmit_skb() is
>    done without that protection, which might make the empty
>    indication of a lockless qdisc unreliable. So remove the check
>    in __dev_xmit_skb(); the checks within the protection of
>    qdisc->seqlock seem enough to avoid the cpu consumption problem
>    for the netdev queue stopped case.
>
> Signed-off-by: Yunsheng Lin <[email protected]>

> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3ed6bcc..177f240 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -37,8 +37,15 @@ enum qdisc_state_t {
> __QDISC_STATE_SCHED,
> __QDISC_STATE_DEACTIVATED,
> __QDISC_STATE_MISSED,
> + __QDISC_STATE_DRAINING,
> };
>
> +#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
> +#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
> +
> +#define QDISC_STATE_NON_EMPTY (QDISC_STATE_MISSED | \
> + QDISC_STATE_DRAINING)
> +
> struct qdisc_size_table {
> struct rcu_head rcu;
> struct list_head list;
> @@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
> return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
> }
>
> +static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
> +{
> + return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
> +}
> +
> static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
> {
> return q->flags & TCQ_F_CPUSTATS;
> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
> spin_unlock(&qdisc->seqlock);
>
> if (unlikely(test_bit(__QDISC_STATE_MISSED,
> - &qdisc->state))) {
> - clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> + &qdisc->state)))

Why remove the clear_bit()? Wasn't that added to avoid infinite
re-schedules?

> __netif_schedule(qdisc);
> - }
> } else {
> write_seqcount_end(&qdisc->running);
> }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 50531a2..e4cc926 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> qdisc_calculate_pkt_len(skb, q);
>
> if (q->flags & TCQ_F_NOLOCK) {
> + if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> + qdisc_run_begin(q)) {
> + /* Retest nolock_qdisc_is_empty() within the protection
> + * of q->seqlock to ensure qdisc is indeed empty.
> + */
> + if (unlikely(!nolock_qdisc_is_empty(q))) {

This is just for the DRAINING case right?

MISSED can be set at any moment, testing MISSED seems confusing.

Is it really worth the extra code?

> + rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> + __qdisc_run(q);
> + qdisc_run_end(q);
> +
> + goto no_lock_out;
> + }
> +
> + qdisc_bstats_cpu_update(q, skb);
> + if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> + !nolock_qdisc_is_empty(q))
> + __qdisc_run(q);
> +
> + qdisc_run_end(q);
> + return NET_XMIT_SUCCESS;
> + }
> +
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> - if (likely(!netif_xmit_frozen_or_stopped(txq)))
> - qdisc_run(q);
> + qdisc_run(q);
>
> +no_lock_out:
> if (unlikely(to_free))
> kfree_skb_list(to_free);
> return rc;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index fc8b56b..83d7f5f 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
> */
> if (!netif_xmit_frozen_or_stopped(txq))
> set_bit(__QDISC_STATE_MISSED, &q->state);
> + else
> + set_bit(__QDISC_STATE_DRAINING, &q->state);
> }
>
> /* Main transmission queue. */
> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>
> skb = next;
> }
> - if (lock)
> +
> + if (lock) {
> spin_unlock(lock);
> - __netif_schedule(q);
> + set_bit(__QDISC_STATE_MISSED, &q->state);
> + } else {
> + __netif_schedule(q);

Could we reorder qdisc_run_begin() with clear_bit(SCHED)
in net_tx_action() and add SCHED to the NON_EMPTY mask?

> + }
> }
>
> static void try_bulk_dequeue_skb(struct Qdisc *q,
> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
> while (qdisc_restart(q, &packets)) {
> quota -= packets;
> if (quota <= 0) {
> - __netif_schedule(q);
> + if (q->flags & TCQ_F_NOLOCK)
> + set_bit(__QDISC_STATE_MISSED, &q->state);
> + else
> + __netif_schedule(q);
> +
> break;
> }
> }
> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> if (likely(skb)) {
> qdisc_update_stats_at_dequeue(qdisc, skb);
> } else if (need_retry &&
> - test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> + READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {

Do we really need to retry based on DRAINING being set?
Or is it just a convenient way of coding things up?

> /* Delay clearing the STATE_MISSED here to reduce
> * the overhead of the second spin_trylock() in
> * qdisc_run_begin() and __netif_schedule() calling
> * in qdisc_run_end().
> */
> clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> + clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);


2021-05-29 01:47:36

by Yunsheng Lin

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/5/29 9:00, Jakub Kicinski wrote:
> On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
>> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
>> flags set, but queue discipline bypass does not work for lockless
>> qdisc because the skb is always enqueued to the qdisc even when the
>> qdisc is empty, see __dev_xmit_skb().
>>
>> This patch calls sch_direct_xmit() to transmit the skb directly to
>> the driver for an empty lockless qdisc, which avoids the enqueue
>> and dequeue operations.
>>
>> qdisc->empty is not reliable for indicating an empty qdisc because
>> there is a time window between enqueuing and setting qdisc->empty.
>> So we use the MISSED state added in commit a90c57f2cedd ("net:
>> sched: fix packet stuck problem for lockless qdisc"), which
>> indicates lock contention, suggesting that it is better not to do
>> the qdisc bypass in order to avoid packet out-of-order problems.
>>
>> To make the MISSED state a reliable indication of an empty qdisc,
>> we need to ensure that testing and clearing of the MISSED state are
>> done within the protection of qdisc->seqlock; only setting the
>> MISSED state can be done without that protection. A MISSED state
>> test is also done outside the protection of qdisc->seqlock to
>> avoid doing an unnecessary spin_trylock() in the contended case.
>>
>> The following cases need special handling:
>> 1. The MISSED state is cleared before another round of dequeuing
>>    in pfifo_fast_dequeue(), but __qdisc_run() might not be able
>>    to dequeue all skbs in one round and then calls
>>    __netif_schedule(), which might result in a non-empty qdisc
>>    without MISSED set. To avoid this, the MISSED state is set for
>>    the lockless qdisc and __netif_schedule() is called at the end
>>    of qdisc_run_end().
>>
>> 2. The MISSED state also needs to be set for the lockless qdisc
>>    instead of calling __netif_schedule() directly when requeuing
>>    an skb, for a similar reason.
>>
>> 3. For the netdev queue stopped case, the MISSED state needs
>>    clearing while the netdev queue is stopped, otherwise there may
>>    be unnecessary __netif_schedule() calls. So a new DRAINING
>>    state is added to indicate this case, which also indicates a
>>    non-empty qdisc.
>>
>> 4. There are already netif_xmit_frozen_or_stopped() checks in
>>    dequeue_skb() and sch_direct_xmit(), both within the protection
>>    of qdisc->seqlock, but the same check in __dev_xmit_skb() is
>>    done without that protection, which might make the empty
>>    indication of a lockless qdisc unreliable. So remove the check
>>    in __dev_xmit_skb(); the checks within the protection of
>>    qdisc->seqlock seem enough to avoid the cpu consumption problem
>>    for the netdev queue stopped case.
>>
>> Signed-off-by: Yunsheng Lin <[email protected]>
>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 3ed6bcc..177f240 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -37,8 +37,15 @@ enum qdisc_state_t {
>> __QDISC_STATE_SCHED,
>> __QDISC_STATE_DEACTIVATED,
>> __QDISC_STATE_MISSED,
>> + __QDISC_STATE_DRAINING,
>> };
>>
>> +#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
>> +#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
>> +
>> +#define QDISC_STATE_NON_EMPTY (QDISC_STATE_MISSED | \
>> + QDISC_STATE_DRAINING)
>> +
>> struct qdisc_size_table {
>> struct rcu_head rcu;
>> struct list_head list;
>> @@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
>> return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
>> }
>>
>> +static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
>> +{
>> + return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
>> +}
>> +
>> static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
>> {
>> return q->flags & TCQ_F_CPUSTATS;
>> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>> spin_unlock(&qdisc->seqlock);
>>
>> if (unlikely(test_bit(__QDISC_STATE_MISSED,
>> - &qdisc->state))) {
>> - clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
>> + &qdisc->state)))
>
> Why remove the clear_bit()? Wasn't that added to avoid infinite
> re-schedules?

As we also clear MISSED for the stopped and deactivated cases, the
above clear_bit() is not strictly needed; it was added because of
the 0.02Mpps improvement in the 16-thread pktgen test case.

As we need to ensure that clearing is within the protection of
q->seqlock, and the above clear_bit() is not within that
protection, remove the clear_bit() above.

>
>> __netif_schedule(qdisc);
>> - }
>> } else {
>> write_seqcount_end(&qdisc->running);
>> }
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 50531a2..e4cc926 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> qdisc_calculate_pkt_len(skb, q);
>>
>> if (q->flags & TCQ_F_NOLOCK) {
>> + if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>> + qdisc_run_begin(q)) {
>> + /* Retest nolock_qdisc_is_empty() within the protection
>> + * of q->seqlock to ensure qdisc is indeed empty.
>> + */
>> + if (unlikely(!nolock_qdisc_is_empty(q))) {
>
> This is just for the DRAINING case right?
>
> MISSED can be set at any moment, testing MISSED seems confusing.

MISSED is only set when there is lock contention, which means it
is better not to do the qdisc bypass, to avoid out-of-order packet
problems. Another good thing is that we can also do batch dequeuing
and transmitting of packets when there is lock contention.

>
> Is it really worth the extra code?

Currently DRAINING is only set for the netdev queue stopped case.
We could use DRAINING alone to indicate a non-empty qdisc, but then
we would need to set DRAINING everywhere MISSED is set; that is
why I use both DRAINING and MISSED to indicate a non-empty qdisc.

>
>> + rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> + __qdisc_run(q);
>> + qdisc_run_end(q);
>> +
>> + goto no_lock_out;
>> + }
>> +
>> + qdisc_bstats_cpu_update(q, skb);
>> + if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
>> + !nolock_qdisc_is_empty(q))
>> + __qdisc_run(q);
>> +
>> + qdisc_run_end(q);
>> + return NET_XMIT_SUCCESS;
>> + }
>> +
>> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> - if (likely(!netif_xmit_frozen_or_stopped(txq)))
>> - qdisc_run(q);
>> + qdisc_run(q);
>>
>> +no_lock_out:
>> if (unlikely(to_free))
>> kfree_skb_list(to_free);
>> return rc;
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index fc8b56b..83d7f5f 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>> */
>> if (!netif_xmit_frozen_or_stopped(txq))
>> set_bit(__QDISC_STATE_MISSED, &q->state);
>> + else
>> + set_bit(__QDISC_STATE_DRAINING, &q->state);
>> }
>>
>> /* Main transmission queue. */
>> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>>
>> skb = next;
>> }
>> - if (lock)
>> +
>> + if (lock) {
>> spin_unlock(lock);
>> - __netif_schedule(q);
>> + set_bit(__QDISC_STATE_MISSED, &q->state);
>> + } else {
>> + __netif_schedule(q);
>
> Could we reorder qdisc_run_begin() with clear_bit(SCHED)
> in net_tx_action() and add SCHED to the NON_EMPTY mask?

Did you mean clearing SCHED after q->seqlock is taken?

The problem is that SCHED is also used to indicate whether a
qdisc is in sd->output_queue or not, and the qdisc_run_begin()
called by net_tx_action() cannot guarantee that it will take
q->seqlock (we are using trylock for the lockless qdisc).
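
For reference, the relevant net_tx_action() logic is roughly the
below (a paraphrased sketch of net/core/dev.c from this era, not
the exact code):

        while (head) {
                struct Qdisc *q = head;
                spinlock_t *root_lock = NULL;

                head = head->next_sched;

                if (!(q->flags & TCQ_F_NOLOCK)) {
                        root_lock = qdisc_lock(q);
                        spin_lock(root_lock);
                }

                /* SCHED doubles as the "q is on sd->output_queue"
                 * flag, so it is cleared as soon as q is taken off
                 * the list...
                 */
                clear_bit(__QDISC_STATE_SCHED, &q->state);

                /* ...but for a lockless qdisc this boils down to a
                 * spin_trylock(&q->seqlock), which may fail, so
                 * clearing SCHED cannot simply be deferred until
                 * the lock is held.
                 */
                qdisc_run(q);

                if (root_lock)
                        spin_unlock(root_lock);
        }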

>
>> + }
>> }
>>
>> static void try_bulk_dequeue_skb(struct Qdisc *q,
>> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
>> while (qdisc_restart(q, &packets)) {
>> quota -= packets;
>> if (quota <= 0) {
>> - __netif_schedule(q);
>> + if (q->flags & TCQ_F_NOLOCK)
>> + set_bit(__QDISC_STATE_MISSED, &q->state);
>> + else
>> + __netif_schedule(q);
>> +
>> break;
>> }
>> }
>> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>> if (likely(skb)) {
>> qdisc_update_stats_at_dequeue(qdisc, skb);
>> } else if (need_retry &&
>> - test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
>> + READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
>
> Do we really need to retry based on DRAINING being set?
> Or is it just a convenient way of coding things up?

Yes, it is just a convenient way of coding things up.
Only MISSED needs retrying.

>
>> /* Delay clearing the STATE_MISSED here to reduce
>> * the overhead of the second spin_trylock() in
>> * qdisc_run_begin() and __netif_schedule() calling
>> * in qdisc_run_end().
>> */
>> clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
>> + clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);

2021-05-29 04:34:23

by Jakub Kicinski

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
> >> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >> qdisc_calculate_pkt_len(skb, q);
> >>
> >> if (q->flags & TCQ_F_NOLOCK) {
> >> + if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> >> + qdisc_run_begin(q)) {
> >> + /* Retest nolock_qdisc_is_empty() within the protection
> >> + * of q->seqlock to ensure qdisc is indeed empty.
> >> + */
> >> + if (unlikely(!nolock_qdisc_is_empty(q))) {
> >
> > This is just for the DRAINING case right?
> >
> > MISSED can be set at any moment, testing MISSED seems confusing.
>
> MISSED is only set when there is lock contention, which means it
> is better not to do the qdisc bypass, to avoid out-of-order packet
> problems,

Avoid as in make less likely? Nothing guarantees other thread is not
interrupted after ->enqueue and before qdisc_run_begin().

TBH I'm not sure what out-of-order situation you're referring to,
there is no ordering guarantee between separate threads trying to
transmit AFAIU.

IOW this check is not required for correctness, right?

> Another good thing is that we can also do batch dequeuing and
> transmitting of packets when there is lock contention.

No doubt, but did you see the flag get set significantly often here
to warrant the double-checking?

> > Is it really worth the extra code?
>
> Currently DRAINING is only set for the netdev queue stopped case.
> We could use DRAINING alone to indicate a non-empty qdisc, but then
> we would need to set DRAINING everywhere MISSED is set; that is
> why I use both DRAINING and MISSED to indicate a non-empty qdisc.
>
> >
> >> + rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> + __qdisc_run(q);
> >> + qdisc_run_end(q);
> >> +
> >> + goto no_lock_out;
> >> + }
> >> +
> >> + qdisc_bstats_cpu_update(q, skb);
> >> + if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> >> + !nolock_qdisc_is_empty(q))
> >> + __qdisc_run(q);
> >> +
> >> + qdisc_run_end(q);
> >> + return NET_XMIT_SUCCESS;
> >> + }
> >> +
> >> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> - if (likely(!netif_xmit_frozen_or_stopped(txq)))
> >> - qdisc_run(q);
> >> + qdisc_run(q);
> >>
> >> +no_lock_out:
> >> if (unlikely(to_free))
> >> kfree_skb_list(to_free);
> >> return rc;

> >> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >>
> >> skb = next;
> >> }
> >> - if (lock)
> >> +
> >> + if (lock) {
> >> spin_unlock(lock);
> >> - __netif_schedule(q);
> >> + set_bit(__QDISC_STATE_MISSED, &q->state);
> >> + } else {
> >> + __netif_schedule(q);
> >
> > Could we reorder qdisc_run_begin() with clear_bit(SCHED)
> > in net_tx_action() and add SCHED to the NON_EMPTY mask?
>
> Did you mean clearing SCHED after q->seqlock is taken?
>
> The problem is that SCHED is also used to indicate whether a
> qdisc is in sd->output_queue or not, and the qdisc_run_begin()
> called by net_tx_action() cannot guarantee that it will take
> q->seqlock (we are using trylock for the lockless qdisc).

Ah, right. We'd need to do some more flag juggling in net_tx_action()
to get it right.

> >> + }
> >> }
> >>
> >> static void try_bulk_dequeue_skb(struct Qdisc *q,
> >> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
> >> while (qdisc_restart(q, &packets)) {
> >> quota -= packets;
> >> if (quota <= 0) {
> >> - __netif_schedule(q);
> >> + if (q->flags & TCQ_F_NOLOCK)
> >> + set_bit(__QDISC_STATE_MISSED, &q->state);
> >> + else
> >> + __netif_schedule(q);
> >> +
> >> break;
> >> }
> >> }
> >> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> >> if (likely(skb)) {
> >> qdisc_update_stats_at_dequeue(qdisc, skb);
> >> } else if (need_retry &&
> >> - test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> >> + READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
> >
> > Do we really need to retry based on DRAINING being set?
> > Or is it just a convenient way of coding things up?
>
> Yes, it is just a convenient way of coding things up.
> Only MISSED needs retrying.

2021-05-29 07:06:03

by Yunsheng Lin

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/5/29 12:32, Jakub Kicinski wrote:
> On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
>>>> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>>> qdisc_calculate_pkt_len(skb, q);
>>>>
>>>> if (q->flags & TCQ_F_NOLOCK) {
>>>> + if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>>> + qdisc_run_begin(q)) {
>>>> + /* Retest nolock_qdisc_is_empty() within the protection
>>>> + * of q->seqlock to ensure qdisc is indeed empty.
>>>> + */
>>>> + if (unlikely(!nolock_qdisc_is_empty(q))) {
>>>
>>> This is just for the DRAINING case right?
>>>
>>> MISSED can be set at any moment, testing MISSED seems confusing.
>>
>> MISSED is only set when there is lock contention, which means it
>> is better not to do the qdisc bypass, to avoid out-of-order packet
>> problems,
>
> Avoid as in make less likely? Nothing guarantees other thread is not
> interrupted after ->enqueue and before qdisc_run_begin().
>
> TBH I'm not sure what out-of-order situation you're referring to,
> there is no ordering guarantee between separate threads trying to
> transmit AFAIU.
A thread needs to do the bypass check before enqueuing, which means
either MISSED is already set or the trylock fails for the bypass
transmit (which sets MISSED after the first trylock). So MISSED will
always be set before a thread does an enqueue, and we ensure MISSED
is only cleared under the protection of q->seqlock; after clearing
MISSED, we do another round of dequeuing within the protection of
q->seqlock.

So if a thread has taken q->seqlock and MISSED is not set yet, it
is allowed to send the packet directly without going through the
qdisc enqueue and dequeue process.
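
For reference, qdisc_run_begin() for the lockless qdisc as of commit
a90c57f2cedd is roughly the below (a paraphrased sketch, not the
exact code):

        static inline bool qdisc_run_begin(struct Qdisc *qdisc)
        {
                if (qdisc->flags & TCQ_F_NOLOCK) {
                        /* first trylock */
                        if (spin_trylock(&qdisc->seqlock))
                                goto nolock_empty;

                        /* another thread already set MISSED; the lock
                         * holder will see it, so give up here
                         */
                        if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
                                return false;

                        /* set MISSED before the second trylock, so the
                         * lock holder cannot miss our request
                         */
                        set_bit(__QDISC_STATE_MISSED, &qdisc->state);

                        /* second trylock, in case the lock holder
                         * released q->seqlock before seeing MISSED
                         */
                        if (!spin_trylock(&qdisc->seqlock))
                                return false;

        nolock_empty:
                        /* qdisc->empty still exists at this point; it
                         * is removed by patch 3 of this series
                         */
                        WRITE_ONCE(qdisc->empty, false);
                } else if (qdisc_is_running(qdisc)) {
                        return false;
                }
                ...
                return true;
        }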


>
> IOW this check is not required for correctness, right?

If a thread has taken q->seqlock and MISSED is not set, it means no
other thread has set MISSED after the first trylock and before the
second trylock, which means the enqueuing is not done yet.
So I assume this check is required for correctness, if I understand
your question correctly.

>
>> Another good thing is that we can also do batch dequeuing and
>> transmitting of packets when there is lock contention.
>
> No doubt, but did you see the flag get set significantly often here
> to warrant the double-checking?

No, that is just my guess:)

>
>>> Is it really worth the extra code?
>>
>> Currently DRAINING is only set for the netdev queue stopped case.
>> We could use DRAINING alone to indicate a non-empty qdisc, but then
>> we would need to set DRAINING everywhere MISSED is set; that is
>> why I use both DRAINING and MISSED to indicate a non-empty qdisc.


2021-05-29 18:54:07

by Jakub Kicinski

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Sat, 29 May 2021 15:03:09 +0800 Yunsheng Lin wrote:
> On 2021/5/29 12:32, Jakub Kicinski wrote:
> > On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
> >> MISSED is only set when there is lock contention, which means it
> >> is better not to do the qdisc bypass, to avoid out-of-order packet
> >> problems,
> >
> > Avoid as in make less likely? Nothing guarantees other thread is not
> > interrupted after ->enqueue and before qdisc_run_begin().
> >
> > TBH I'm not sure what out-of-order situation you're referring to,
> > there is no ordering guarantee between separate threads trying to
> > transmit AFAIU.
> A thread needs to do the bypass check before enqueuing, which means
> either MISSED is already set or the trylock fails for the bypass
> transmit (which sets MISSED after the first trylock). So MISSED will
> always be set before a thread does an enqueue, and we ensure MISSED
> is only cleared under the protection of q->seqlock; after clearing
> MISSED, we do another round of dequeuing within the protection of
> q->seqlock.

The fact that MISSED is only cleared under q->seqlock does not matter,
because setting it and ->enqueue() are not under any lock. If the thread
gets interrupted between:

if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
qdisc_run_begin(q)) {

and ->enqueue() we can't guarantee that something else won't come in,
take q->seqlock and clear MISSED.

thread1                 thread2                 thread3
# holds seqlock
                        qdisc_run_begin(q)
                        set(MISSED)
pfifo_fast_dequeue
clear(MISSED)
# recheck the queue
qdisc_run_end()
                        ->enqueue()
                                                q->flags & TCQ_F_CAN_BYPASS..
                                                qdisc_run_begin() # true
                                                sch_direct_xmit()
                        qdisc_run_begin()
                        set(MISSED)

Or am I missing something?

Re-checking nolock_qdisc_is_empty() may or may not help.
But it doesn't really matter because there is no ordering
requirement between thread2 and thread3 here.

> So if a thread has taken q->seqlock and MISSED is not set yet, it
> is allowed to send the packet directly without going through the
> qdisc enqueue and dequeue process.
>
> > IOW this check is not required for correctness, right?
>
> If a thread has taken q->seqlock and MISSED is not set, it means no
> other thread has set MISSED after the first trylock and before the
> second trylock, which means the enqueuing is not done yet.
> So I assume this check is required for correctness, if I understand
> your question correctly.
>
> >> Another good thing is that we can also do batch dequeuing and
> >> transmitting of packets when there is lock contention.
> >
> > No doubt, but did you see the flag get set significantly often here
> > to warrant the double-checking?
>
> No, that is just my guess:)

2021-05-30 03:24:04

by Yunsheng Lin

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/5/30 2:49, Jakub Kicinski wrote:
> On Sat, 29 May 2021 15:03:09 +0800 Yunsheng Lin wrote:
>> On 2021/5/29 12:32, Jakub Kicinski wrote:
>>> On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
>>>> MISSED is only set when there is lock contention, which means it
>>>> is better not to do the qdisc bypass, to avoid out-of-order packet
>>>> problems,
>>>
>>> Avoid as in make less likely? Nothing guarantees other thread is not
>>> interrupted after ->enqueue and before qdisc_run_begin().
>>>
>>> TBH I'm not sure what out-of-order situation you're referring to,
>>> there is no ordering guarantee between separate threads trying to
>>> transmit AFAIU.
>> A thread needs to do the bypass check before enqueuing, which means
>> either MISSED is already set or the trylock fails for the bypass
>> transmit (which sets MISSED after the first trylock). So MISSED will
>> always be set before a thread does an enqueue, and we ensure MISSED
>> is only cleared under the protection of q->seqlock; after clearing
>> MISSED, we do another round of dequeuing within the protection of
>> q->seqlock.
>
> The fact that MISSED is only cleared under q->seqlock does not matter,
> because setting it and ->enqueue() are not under any lock. If the thread
> gets interrupted between:
>
> if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> qdisc_run_begin(q)) {
>
> and ->enqueue() we can't guarantee that something else won't come in,
> take q->seqlock and clear MISSED.
>
> thread1                 thread2                 thread3
> # holds seqlock
>                         qdisc_run_begin(q)
>                         set(MISSED)
> pfifo_fast_dequeue
> clear(MISSED)
> # recheck the queue
> qdisc_run_end()
>                         ->enqueue()
>                                                 q->flags & TCQ_F_CAN_BYPASS..
>                                                 qdisc_run_begin() # true
>                                                 sch_direct_xmit()
>                         qdisc_run_begin()
>                         set(MISSED)
>
> Or am I missing something?
>
> Re-checking nolock_qdisc_is_empty() may or may not help.
> But it doesn't really matter because there is no ordering
> requirement between thread2 and thread3 here.

I was more focused on explaining that using MISSED is as reliable
as the sch_may_need_requeuing() check in RFCv3 [1] for indicating
an empty qdisc, and forgot to mention the data race described in
RFCv3, which is much like the one described above:

"There is a data race as below:

CPU1                         CPU2
qdisc_run_begin(q)           .
.                            q->enqueue()
sch_may_need_requeuing()     .
return true                  .
.                            .
.                            .
q->enqueue()                 .

When the above happens, the skb enqueued by CPU1 is dequeued after
the skb enqueued by CPU2 because sch_may_need_requeuing() returned
true. Without qdisc bypass, CPU1 has a better chance to queue its
skb sooner than CPU2.

This patch does not take care of the above data race, because I
view it as similar to the below:

Even if CPU1 and CPU2 write skbs at the same time to two sockets
both heading to the same qdisc, there is no guarantee which skb
will hit the qdisc first, because a lot of factors like
interrupts/softirqs/cache misses/scheduling affect that."

Does the above make sense? Or any idea to avoid it?

1. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

>
>> So if a thread has taken q->seqlock and MISSED is not set yet, it
>> is allowed to send the packet directly without going through the
>> qdisc enqueue and dequeue process.
>>
>>> IOW this check is not required for correctness, right?
>>
>> If a thread has taken q->seqlock and MISSED is not set, it means no
>> other thread has set MISSED after the first trylock and before the
>> second trylock, which means the enqueuing is not done yet.
>> So I assume this check is required for correctness, if I understand
>> your question correctly.
>>
>>>> Another good thing is that we can also do batch dequeuing and
>>>> transmitting of packets when there is lock contention.
>>>
>>> No doubt, but did you see the flag get set significantly often here
>>> to warrant the double-checking?
>>
>> No, that is just my guess:)
>
>

2021-05-30 20:39:25

by Jakub Kicinski

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
> On 2021/5/30 2:49, Jakub Kicinski wrote:
> > The fact that MISSED is only cleared under q->seqlock does not matter,
> > because setting it and ->enqueue() are not under any lock. If the thread
> > gets interrupted between:
> >
> > if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> > qdisc_run_begin(q)) {
> >
> > and ->enqueue() we can't guarantee that something else won't come in,
> > take q->seqlock and clear MISSED.
> >
> > thread1                 thread2                 thread3
> > # holds seqlock
> >                         qdisc_run_begin(q)
> >                         set(MISSED)
> > pfifo_fast_dequeue
> > clear(MISSED)
> > # recheck the queue
> > qdisc_run_end()
> >                         ->enqueue()
> >                                                 q->flags & TCQ_F_CAN_BYPASS..
> >                                                 qdisc_run_begin() # true
> >                                                 sch_direct_xmit()
> >                         qdisc_run_begin()
> >                         set(MISSED)
> >
> > Or am I missing something?
> >
> > Re-checking nolock_qdisc_is_empty() may or may not help.
> > But it doesn't really matter because there is no ordering
> > requirement between thread2 and thread3 here.
>
> I was more focused on explaining that using MISSED is as reliable
> as the sch_may_need_requeuing() check in RFCv3 [1] for indicating
> an empty qdisc, and forgot to mention the data race described in
> RFCv3, which is much like the one described above:
>
> "There is a data race as below:
>
> CPU1                         CPU2
> qdisc_run_begin(q)           .
> .                            q->enqueue()
> sch_may_need_requeuing()     .
> return true                  .
> .                            .
> .                            .
> q->enqueue()                 .
>
> When the above happens, the skb enqueued by CPU1 is dequeued after
> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
> true. Without qdisc bypass, CPU1 has a better chance to queue its
> skb sooner than CPU2.
>
> This patch does not take care of the above data race, because I
> view it as similar to the below:
>
> Even if CPU1 and CPU2 write skbs at the same time to two sockets
> both heading to the same qdisc, there is no guarantee which skb
> will hit the qdisc first, because a lot of factors like
> interrupts/softirqs/cache misses/scheduling affect that."
>
> Does the above make sense? Or any idea to avoid it?
>
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

We agree on this one.

Could you draw a sequence diagram of different CPUs (like the one
above) for the case where removing re-checking nolock_qdisc_is_empty()
under q->seqlock leads to incorrect behavior?

If there is no such case would you be willing to repeat the benchmark
with and without this test?

Sorry for dragging the review out..

2021-05-31 00:45:52

by Yunsheng Lin

Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/5/31 4:21, Jakub Kicinski wrote:
> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
>> On 2021/5/30 2:49, Jakub Kicinski wrote:
>>> The fact that MISSED is only cleared under q->seqlock does not matter,
>>> because setting it and ->enqueue() are not under any lock. If the thread
>>> gets interrupted between:
>>>
>>> if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>> qdisc_run_begin(q)) {
>>>
>>> and ->enqueue() we can't guarantee that something else won't come in,
>>> take q->seqlock and clear MISSED.
>>>
>>> thread1                 thread2                 thread3
>>> # holds seqlock
>>>                         qdisc_run_begin(q)
>>>                         set(MISSED)
>>> pfifo_fast_dequeue
>>> clear(MISSED)
>>> # recheck the queue
>>> qdisc_run_end()
>>>                         ->enqueue()
>>>                                                 q->flags & TCQ_F_CAN_BYPASS..
>>>                                                 qdisc_run_begin() # true
>>>                                                 sch_direct_xmit()
>>>                         qdisc_run_begin()
>>>                         set(MISSED)
>>>
>>> Or am I missing something?
>>>
>>> Re-checking nolock_qdisc_is_empty() may or may not help.
>>> But it doesn't really matter because there is no ordering
>>> requirement between thread2 and thread3 here.
>>
>> I was more focused on explaining that using MISSED is as reliable
>> as the sch_may_need_requeuing() check in RFCv3 [1] for indicating
>> an empty qdisc, and forgot to mention the data race described in
>> RFCv3, which is much like the one described above:
>>
>> "There is a data race as below:
>>
>> CPU1                         CPU2
>> qdisc_run_begin(q)           .
>> .                            q->enqueue()
>> sch_may_need_requeuing()     .
>> return true                  .
>> .                            .
>> .                            .
>> q->enqueue()                 .
>>
>> When the above happens, the skb enqueued by CPU1 is dequeued after
>> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
>> true. Without qdisc bypass, CPU1 has a better chance to queue its
>> skb sooner than CPU2.
>>
>> This patch does not take care of the above data race, because I
>> view it as similar to the below:
>>
>> Even if CPU1 and CPU2 write skbs at the same time to two sockets
>> both heading to the same qdisc, there is no guarantee which skb
>> will hit the qdisc first, because a lot of factors like
>> interrupts/softirqs/cache misses/scheduling affect that."
>>
>> Does the above make sense? Or any idea to avoid it?
>>
>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>
> We agree on this one.
>
> Could you draw a sequence diagram of different CPUs (like the one
> above) for the case where removing re-checking nolock_qdisc_is_empty()
> under q->seqlock leads to incorrect behavior?

When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
may have:


CPU1                             CPU2
qdisc_run_begin(q)               .
.                                enqueue skb1
dequeue skb1 and clear MISSED    .
.                                nolock_qdisc_is_empty() return true
requeue skb                      .
q->enqueue()                     .
set MISSED                       .
.                                .
qdisc_run_end(q)                 .
.                                qdisc_run_begin(q)
.                                transmit skb2 directly
.                                transmit the requeued skb1

The problem here is that skb1 and skb2 are from the same CPU, which
means they are likely from the same flow, so we need to avoid this,
right?

>
> If there is no such case would you be willing to repeat the benchmark
> with and without this test?
>
> Sorry for dragging the review out..

2021-05-31 01:12:43

by Yunsheng Lin

Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/5/31 8:40, Yunsheng Lin wrote:
> On 2021/5/31 4:21, Jakub Kicinski wrote:
>> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
>>> On 2021/5/30 2:49, Jakub Kicinski wrote:
>>>> The fact that MISSED is only cleared under q->seqlock does not matter,
>>>> because setting it and ->enqueue() are not under any lock. If the thread
>>>> gets interrupted between:
>>>>
>>>> if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>>> qdisc_run_begin(q)) {
>>>>
>>>> and ->enqueue() we can't guarantee that something else won't come in,
>>>> take q->seqlock and clear MISSED.
>>>>
>>>> thread1                 thread2                 thread3
>>>> # holds seqlock
>>>>                         qdisc_run_begin(q)
>>>>                         set(MISSED)
>>>> pfifo_fast_dequeue
>>>> clear(MISSED)
>>>> # recheck the queue
>>>> qdisc_run_end()
>>>>                         ->enqueue()
>>>>                                                 q->flags & TCQ_F_CAN_BYPASS..
>>>>                                                 qdisc_run_begin() # true
>>>>                                                 sch_direct_xmit()
>>>>                         qdisc_run_begin()
>>>>                         set(MISSED)
>>>>
>>>> Or am I missing something?
>>>>
>>>> Re-checking nolock_qdisc_is_empty() may or may not help.
>>>> But it doesn't really matter because there is no ordering
>>>> requirement between thread2 and thread3 here.
>>>
>>> I was more focused on explaining that using MISSED is as reliable
>>> as the sch_may_need_requeuing() check in RFCv3 [1] for indicating
>>> an empty qdisc, and forgot to mention the data race described in
>>> RFCv3, which is much like the one described above:
>>>
>>> "There is a data race as below:
>>>
>>> CPU1                         CPU2
>>> qdisc_run_begin(q)           .
>>> .                            q->enqueue()
>>> sch_may_need_requeuing()     .
>>> return true                  .
>>> .                            .
>>> .                            .
>>> q->enqueue()                 .
>>>
>>> When the above happens, the skb enqueued by CPU1 is dequeued after
>>> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
>>> true. Without qdisc bypass, CPU1 has a better chance to queue its
>>> skb sooner than CPU2.
>>>
>>> This patch does not take care of the above data race, because I
>>> view it as similar to the below:
>>>
>>> Even if CPU1 and CPU2 write skbs at the same time to two sockets
>>> both heading to the same qdisc, there is no guarantee which skb
>>> will hit the qdisc first, because a lot of factors like
>>> interrupts/softirqs/cache misses/scheduling affect that."
>>>
>>> Does the above make sense? Or any idea to avoid it?
>>>
>>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>>
>> We agree on this one.
>>
>> Could you draw a sequence diagram of different CPUs (like the one
>> above) for the case where removing re-checking nolock_qdisc_is_empty()
>> under q->seqlock leads to incorrect behavior?
>
> When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
> may have:
>
>
> CPU1                             CPU2
> qdisc_run_begin(q)               .
> .                                enqueue skb1
> dequeue skb1 and clear MISSED    .
> .                                nolock_qdisc_is_empty() return true
> requeue skb                      .
> q->enqueue()                     .
> set MISSED                       .
> .                                .
> qdisc_run_end(q)                 .
> .                                qdisc_run_begin(q)
> .                                transmit skb2 directly
> .                                transmit the requeued skb1
>
> The problem here is that skb1 and skb2 are from the same CPU, which
> means they are likely from the same flow, so we need to avoid this,
> right?


CPU1                                     CPU2
qdisc_run_begin(q)                       .
.                                        enqueue skb1
dequeue skb1                             .
.                                        .
netdevice stopped and MISSED is clear    .
.                                        nolock_qdisc_is_empty() return true
requeue skb                              .
.                                        .
.                                        .
.                                        .
qdisc_run_end(q)                         .
.                                        qdisc_run_begin(q)
.                                        transmit skb2 directly
.                                        transmit the requeued skb1

The above sequence diagram seems more correct; it is basically about
how to avoid a directly transmitted packet bypassing a requeued packet.

>
>>
>> If there is no such case would you be willing to repeat the benchmark
>> with and without this test?
>>
>> Sorry for dragging the review out..
>>

2021-05-31 12:42:21

by Yunsheng Lin

Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/5/31 9:10, Yunsheng Lin wrote:
> On 2021/5/31 8:40, Yunsheng Lin wrote:
>> On 2021/5/31 4:21, Jakub Kicinski wrote:
>>> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
>>>> On 2021/5/30 2:49, Jakub Kicinski wrote:
>>>>> The fact that MISSED is only cleared under q->seqlock does not matter,
>>>>> because setting it and ->enqueue() are not under any lock. If the thread
>>>>> gets interrupted between:
>>>>>
>>>>> if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>>>> qdisc_run_begin(q)) {
>>>>>
>>>>> and ->enqueue() we can't guarantee that something else won't come in,
>>>>> take q->seqlock and clear MISSED.
>>>>>
>>>>> thread1                 thread2                 thread3
>>>>> # holds seqlock
>>>>>                         qdisc_run_begin(q)
>>>>>                         set(MISSED)
>>>>> pfifo_fast_dequeue
>>>>> clear(MISSED)
>>>>> # recheck the queue
>>>>> qdisc_run_end()
>>>>>                         ->enqueue()
>>>>>                                                 q->flags & TCQ_F_CAN_BYPASS..
>>>>>                                                 qdisc_run_begin() # true
>>>>>                                                 sch_direct_xmit()
>>>>>                         qdisc_run_begin()
>>>>>                         set(MISSED)
>>>>>
>>>>> Or am I missing something?
>>>>>
>>>>> Re-checking nolock_qdisc_is_empty() may or may not help.
>>>>> But it doesn't really matter because there is no ordering
>>>>> requirement between thread2 and thread3 here.
>>>>
>>>> I was more focused on explaining that using MISSED is as reliable
>>>> as the sch_may_need_requeuing() check in RFCv3 [1] for indicating
>>>> an empty qdisc, and forgot to mention the data race described in
>>>> RFCv3, which is much like the one described above:
>>>>
>>>> "There is a data race as below:
>>>>
>>>> CPU1                         CPU2
>>>> qdisc_run_begin(q)           .
>>>> .                            q->enqueue()
>>>> sch_may_need_requeuing()     .
>>>> return true                  .
>>>> .                            .
>>>> .                            .
>>>> q->enqueue()                 .
>>>>
>>>> When the above happens, the skb enqueued by CPU1 is dequeued after
>>>> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
>>>> true. Without qdisc bypass, CPU1 has a better chance to queue its
>>>> skb sooner than CPU2.
>>>>
>>>> This patch does not take care of the above data race, because I
>>>> view it as similar to the below:
>>>>
>>>> Even if CPU1 and CPU2 write skbs at the same time to two sockets
>>>> both heading to the same qdisc, there is no guarantee which skb
>>>> will hit the qdisc first, because a lot of factors like
>>>> interrupts/softirqs/cache misses/scheduling affect that."
>>>>
>>>> Does the above make sense? Or any idea to avoid it?
>>>>
>>>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>>>
>>> We agree on this one.
>>>
>>> Could you draw a sequence diagram of different CPUs (like the one
>>> above) for the case where removing re-checking nolock_qdisc_is_empty()
>>> under q->seqlock leads to incorrect behavior?
>>
>> When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
>> may have:
>>
>>
>> CPU1                             CPU2
>> qdisc_run_begin(q)               .
>> .                                enqueue skb1
>> dequeue skb1 and clear MISSED    .
>> .                                nolock_qdisc_is_empty() return true
>> requeue skb                      .
>> q->enqueue()                     .
>> set MISSED                       .
>> .                                .
>> qdisc_run_end(q)                 .
>> .                                qdisc_run_begin(q)
>> .                                transmit skb2 directly
>> .                                transmit the requeued skb1
>>
>> The problem here is that skb1 and skb2 are from the same CPU, which
>> means they are likely from the same flow, so we need to avoid this,
>> right?
>
>
> CPU1                                     CPU2
> qdisc_run_begin(q)                       .
> .                                        enqueue skb1
> dequeue skb1                             .
> .                                        .
> netdevice stopped and MISSED is clear    .
> .                                        nolock_qdisc_is_empty() return true
> requeue skb                              .
> .                                        .
> .                                        .
> .                                        .
> qdisc_run_end(q)                         .
> .                                        qdisc_run_begin(q)
> .                                        transmit skb2 directly
> .                                        transmit the requeued skb1
>
> The above sequence diagram seems more correct; it is basically about
> how to avoid a directly transmitted packet bypassing a requeued packet.
>
>>
>>>
>>> If there is no such case would you be willing to repeat the benchmark
>>> with and without this test?

I did some interesting testing to show how adjusting a small amount
of code can cause a noticeable performance change.

1. I used the below patch to remove the nolock_qdisc_is_empty() test
under q->seqlock.

@@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
if (q->flags & TCQ_F_NOLOCK) {
if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
qdisc_run_begin(q)) {
- /* Retest nolock_qdisc_is_empty() within the protection
- * of q->seqlock to ensure qdisc is indeed empty.
- */
- if (unlikely(!nolock_qdisc_is_empty(q))) {
- rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
- __qdisc_run(q);
- qdisc_run_end(q);
-
- goto no_lock_out;
- }
-
qdisc_bstats_cpu_update(q, skb);
if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
!nolock_qdisc_is_empty(q))
@@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
qdisc_run(q);

-no_lock_out:
if (unlikely(to_free))
kfree_skb_list(to_free);
return rc;

which has the below performance improvement:

threads  v1          v1 + above patch   delta
   1     3.21Mpps    3.20Mpps           -0.3%
   2     5.56Mpps    5.94Mpps           +4.9%
   4     5.58Mpps    5.60Mpps           +0.3%
   8     2.76Mpps    2.77Mpps           +0.3%
  16     2.23Mpps    2.23Mpps           +0.0%

v1 = this patchset.


2. After the above testing, it seemed worthwhile to remove the
nolock_qdisc_is_empty() test under q->seqlock, so I used the below
patch to make sure nolock_qdisc_is_empty() always returns false for
the netdev queue stopped case.

--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
static void qdisc_maybe_clear_missed(struct Qdisc *q,
const struct netdev_queue *txq)
{
+ set_bit(__QDISC_STATE_DRAINING, &q->state);
+
+ /* Make sure DRAINING is set before clearing MISSED
+ * to make sure nolock_qdisc_is_empty() always return
+ * false for avoiding transmitting a packet directly
+ * bypassing the requeued packet.
+ */
+ smp_mb__after_atomic();
+
clear_bit(__QDISC_STATE_MISSED, &q->state);

/* Make sure the below netif_xmit_frozen_or_stopped()
@@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
*/
if (!netif_xmit_frozen_or_stopped(txq))
set_bit(__QDISC_STATE_MISSED, &q->state);
- else
- set_bit(__QDISC_STATE_DRAINING, &q->state);
}

which has the below performance data:

threads  v1          v1 + above two patches   delta
   1     3.21Mpps    3.20Mpps                 -0.3%
   2     5.56Mpps    5.94Mpps                 +4.9%
   4     5.58Mpps    5.02Mpps                  -10%
   8     2.76Mpps    2.77Mpps                 +0.3%
  16     2.23Mpps    2.23Mpps                 +0.0%

So the adjustment in qdisc_maybe_clear_missed() seems to have
caused about a 10% performance degradation in the 4-thread case.

And the cpu topdown perf data suggested that icache misses and
bad speculation are the main factors in that performance difference.

I tried to control the above factors by removing the inline marking
of functions and adding likely/unlikely annotations to the
netif_xmit_frozen_or_stopped() checks in sch_generic.c.

After removing the inline marking and adding the likely/unlikely
annotations in sch_generic.c, we got a noticeable performance
improvement for the 1/2-thread cases (and some improvement for the
ip forwarding test too), but not for the 4-thread case.
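
The annotation change was along these lines (an illustrative sketch,
as the exact hunks were not posted; the check in
qdisc_maybe_clear_missed() is one of the call sites):

-	if (!netif_xmit_frozen_or_stopped(txq))
+	if (likely(!netif_xmit_frozen_or_stopped(txq)))
 		set_bit(__QDISC_STATE_MISSED, &q->state);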

So it seems we need to ignore the performance degradation for the
4-thread case? Or any ideas?

>>>
>>> Sorry for dragging the review out..
>>>

2021-06-01 04:56:25

by Jakub Kicinski

Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
> On 2021/5/31 9:10, Yunsheng Lin wrote:
> > On 2021/5/31 8:40, Yunsheng Lin wrote:
> >> On 2021/5/31 4:21, Jakub Kicinski wrote:
> [...]
> [...]
> [...]
> [...]
> [...]
> [...]
> [...]
> >>
> >> When nolock_qdisc_is_empty() is not re-checking under q->seqlock, we
> >> may have:
> >>
> >>
> >> CPU1 CPU2
> >> qdisc_run_begin(q) .
> >> . enqueue skb1
>> dequeue skb1 and clear MISSED .
> >> . nolock_qdisc_is_empty() return true
> >> requeue skb .
> >> q->enqueue() .
> >> set MISSED .
> >> . .
> >> qdisc_run_end(q) .
> >> . qdisc_run_begin(q)
> >> . transmit skb2 directly
> >> . transmit the requeued skb1
> >>
> >> The problem here is that skb1 and skb2 are from the same CPU, which
> >> means they are likely from the same flow, so we need to avoid this,
> >> right?
> >
> >
> > CPU1 CPU2
> > qdisc_run_begin(q) .
> > . enqueue skb1
> > dequeue skb1 .
> > . .
> > netdevice stopped and MISSED is clear .
> > . nolock_qdisc_is_empty() return true
> > requeue skb .
> > . .
> > . .
> > . .
> > qdisc_run_end(q) .
> > . qdisc_run_begin(q)
> > . transmit skb2 directly
> > . transmit the requeued skb1
> >
> > The above sequence diagram seems more correct, it is basically about how to
> > avoid transmitting a packet directly bypassing the requeued packet.

I see, thanks! That explains the need. Perhaps we can rephrase the
comment? Maybe:

+ /* Retest nolock_qdisc_is_empty() within the protection
+ * of q->seqlock to protect from racing with requeuing.
+ */

> I did some interesting testing to show how adjusting a small amount
> of code can cause some notable performance degradation.
>
> 1. I used the below patch to remove the nolock_qdisc_is_empty() testing
> under q->seqlock.
>
> @@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> if (q->flags & TCQ_F_NOLOCK) {
> if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> qdisc_run_begin(q)) {
> - /* Retest nolock_qdisc_is_empty() within the protection
> - * of q->seqlock to ensure qdisc is indeed empty.
> - */
> - if (unlikely(!nolock_qdisc_is_empty(q))) {
> - rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> - __qdisc_run(q);
> - qdisc_run_end(q);
> -
> - goto no_lock_out;
> - }
> -
> qdisc_bstats_cpu_update(q, skb);
> if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> !nolock_qdisc_is_empty(q))
> @@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> qdisc_run(q);
>
> -no_lock_out:
> if (unlikely(to_free))
> kfree_skb_list(to_free);
> return rc;
>
> which has the below performance improvement:
>
> threads v1 v1 + above patch delta
> 1 3.21Mpps 3.20Mpps -0.3%
> 2 5.56Mpps 5.94Mpps +4.9%
> 4 5.58Mpps 5.60Mpps +0.3%
> 8 2.76Mpps 2.77Mpps +0.3%
> 16 2.23Mpps 2.23Mpps +0.0%
>
> v1 = this patchset.
>
>
> 2. After the above testing, it seems worthwhile to remove the
> nolock_qdisc_is_empty() testing under q->seqlock, so I used the below
> patch to make sure nolock_qdisc_is_empty() always returns false for
> the netdev queue stopped case.
>
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
> static void qdisc_maybe_clear_missed(struct Qdisc *q,
> const struct netdev_queue *txq)
> {
> + set_bit(__QDISC_STATE_DRAINING, &q->state);
> +
> + /* Make sure DRAINING is set before clearing MISSED
> + * to make sure nolock_qdisc_is_empty() always returns
> + * false, avoiding transmitting a packet directly
> + * bypassing the requeued packet.
> + */
> + smp_mb__after_atomic();
> +
> clear_bit(__QDISC_STATE_MISSED, &q->state);
>
> /* Make sure the below netif_xmit_frozen_or_stopped()
> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
> */
> if (!netif_xmit_frozen_or_stopped(txq))
> set_bit(__QDISC_STATE_MISSED, &q->state);
> - else
> - set_bit(__QDISC_STATE_DRAINING, &q->state);
> }

But this would not be enough, because we may also clear MISSED
in pfifo_fast_dequeue()?
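
For context, the clearing in question is the retry path in
pfifo_fast_dequeue() from commit a90c57f2cedd ("net: sched: fix
packet stuck problem for lockless qdisc"), roughly:

	} else if (need_retry &&
		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
		/* Delay clearing the STATE_MISSED here to reduce
		 * the overhead of the second spin_trylock() in
		 * qdisc_run_begin() and __netif_schedule() calling
		 * in qdisc_run_end().
		 */
		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);

		/* Make sure dequeuing happens after clearing
		 * STATE_MISSED.
		 */
		smp_mb__after_atomic();

		need_retry = false;

		goto retry;
	}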

> which has the below performance data:
>
> threads v1 v1 + above two patches delta
> 1 3.21Mpps 3.20Mpps -0.3%
> 2 5.56Mpps 5.94Mpps +4.9%
> 4 5.58Mpps 5.02Mpps -10%
> 8 2.76Mpps 2.77Mpps +0.3%
> 16 2.23Mpps 2.23Mpps +0.0%
>
> So the adjustment in qdisc_maybe_clear_missed() seems to have
> caused about 10% performance degradation for the 4-thread case.
>
> And the CPU topdown perf data suggested that icache misses and
> bad speculation are the main factors behind the performance
> difference.
>
> I tried to control the above factors by removing the inline marks
> and adding likely/unlikely tags for the netif_xmit_frozen_or_stopped()
> checks in sch_generic.c.
>
> And after removing the inline marks for functions in sch_generic.c
> and adding likely/unlikely tags for the netif_xmit_frozen_or_stopped()
> checks in sch_generic.c, we got notable performance improvement
> for the 1- and 2-thread cases (some performance improvement for the
> IP forwarding test too), but not for the 4-thread case.
>
> So it seems we need to accept the performance degradation for the
> 4-thread case? Or any ideas?

No ideas, are the threads pinned to CPUs in some particular way?

2021-06-01 08:20:19

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/6/1 12:51, Jakub Kicinski wrote:
> On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
>> On 2021/5/31 9:10, Yunsheng Lin wrote:
>>> On 2021/5/31 8:40, Yunsheng Lin wrote:
>>>> On 2021/5/31 4:21, Jakub Kicinski wrote:
>> [...]
>>>
>>> CPU1 CPU2
>>> qdisc_run_begin(q) .
>>> . enqueue skb1
>>> dequeue skb1 .
>>> . .
>>> netdevice stopped and MISSED is clear .
>>> . nolock_qdisc_is_empty() return true
>>> requeue skb .
>>> . .
>>> . .
>>> . .
>>> qdisc_run_end(q) .
>>> . qdisc_run_begin(q)
>>> . transmit skb2 directly
>>> . transmit the requeued skb1
>>>
>>> The above sequence diagram seems more correct, it is basically about how to
>>> avoid transmitting a packet directly bypassing the requeued packet.
>
> I see, thanks! That explains the need. Perhaps we can rephrase the
> comment? Maybe:
>
> + /* Retest nolock_qdisc_is_empty() within the protection
> + * of q->seqlock to protect from racing with requeuing.
> + */

Yes, if we still decide to preserve the nolock_qdisc_is_empty() rechecking
under q->seqlock.

>
>> I did some interesting testing to show how adjusting a small amount
>> of code can cause some notable performance degradation.
>>
>> 1. I used the below patch to remove the nolock_qdisc_is_empty() testing
>> under q->seqlock.
>>
>> @@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> if (q->flags & TCQ_F_NOLOCK) {
>> if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>> qdisc_run_begin(q)) {
>> - /* Retest nolock_qdisc_is_empty() within the protection
>> - * of q->seqlock to ensure qdisc is indeed empty.
>> - */
>> - if (unlikely(!nolock_qdisc_is_empty(q))) {
>> - rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> - __qdisc_run(q);
>> - qdisc_run_end(q);
>> -
>> - goto no_lock_out;
>> - }
>> -
>> qdisc_bstats_cpu_update(q, skb);
>> if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
>> !nolock_qdisc_is_empty(q))
>> @@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> qdisc_run(q);
>>
>> -no_lock_out:
>> if (unlikely(to_free))
>> kfree_skb_list(to_free);
>> return rc;
>>
>> which has the below performance improvement:
>>
>> threads v1 v1 + above patch delta
>> 1 3.21Mpps 3.20Mpps -0.3%
>> 2 5.56Mpps 5.94Mpps +4.9%
>> 4 5.58Mpps 5.60Mpps +0.3%
>> 8 2.76Mpps 2.77Mpps +0.3%
>> 16 2.23Mpps 2.23Mpps +0.0%
>>
>> v1 = this patchset.
>>
>>
>> 2. After the above testing, it seems worthwhile to remove the
>> nolock_qdisc_is_empty() testing under q->seqlock, so I used the below
>> patch to make sure nolock_qdisc_is_empty() always returns false for
>> the netdev queue stopped case.
>>
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>> static void qdisc_maybe_clear_missed(struct Qdisc *q,
>> const struct netdev_queue *txq)
>> {
>> + set_bit(__QDISC_STATE_DRAINING, &q->state);
>> +
>> + /* Make sure DRAINING is set before clearing MISSED
>> + * to make sure nolock_qdisc_is_empty() always returns
>> + * false, avoiding transmitting a packet directly
>> + * bypassing the requeued packet.
>> + */
>> + smp_mb__after_atomic();
>> +
>> clear_bit(__QDISC_STATE_MISSED, &q->state);
>>
>> /* Make sure the below netif_xmit_frozen_or_stopped()
>> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>> */
>> if (!netif_xmit_frozen_or_stopped(txq))
>> set_bit(__QDISC_STATE_MISSED, &q->state);
>> - else
>> - set_bit(__QDISC_STATE_DRAINING, &q->state);
>> }
>
> But this would not be enough, because we may also clear MISSED
> in pfifo_fast_dequeue()?

For the MISSED clearing in pfifo_fast_dequeue(), it looks like
the data race described in RFC v3 too?

CPU1                  CPU2             CPU3
qdisc_run_begin(q)    .                .
.                     MISSED is set    .
MISSED is cleared     .                .
q->dequeue()          .                .
.                     enqueue skb1     check MISSED # true
qdisc_run_end(q)      .                .
.                     .                qdisc_run_begin(q) # true
.                     MISSED is set    send skb2 directly


>
>> which has the below performance data:
>>
>> threads v1 v1 + above two patches delta
>> 1 3.21Mpps 3.20Mpps -0.3%
>> 2 5.56Mpps 5.94Mpps +4.9%
>> 4 5.58Mpps 5.02Mpps -10%
>> 8 2.76Mpps 2.77Mpps +0.3%
>> 16 2.23Mpps 2.23Mpps +0.0%
>>
>> So the adjustment in qdisc_maybe_clear_missed() seems to have
>> caused about 10% performance degradation for the 4-thread case.
>>
>> And the CPU topdown perf data suggested that icache misses and
>> bad speculation are the main factors behind the performance
>> difference.
>>
>> I tried to control the above factors by removing the inline marks
>> and adding likely/unlikely tags for the netif_xmit_frozen_or_stopped()
>> checks in sch_generic.c.
>>
>> And after removing the inline marks for functions in sch_generic.c
>> and adding likely/unlikely tags for the netif_xmit_frozen_or_stopped()
>> checks in sch_generic.c, we got notable performance improvement
>> for the 1- and 2-thread cases (some performance improvement for the
>> IP forwarding test too), but not for the 4-thread case.
>>
>> So it seems we need to accept the performance degradation for the
>> 4-thread case? Or any ideas?
>
> No ideas, are the threads pinned to CPUs in some particular way?

pktgen already seems to run a thread for each CPU, so I do not
need to do the pinning myself; for the 4-thread case, it runs on
CPUs 0~3.

It seems more related to the specific CPU implementation.


2021-06-01 20:50:37

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Tue, 1 Jun 2021 16:18:54 +0800 Yunsheng Lin wrote:
> > I see, thanks! That explains the need. Perhaps we can rephrase the
> > comment? Maybe:
> >
> > + /* Retest nolock_qdisc_is_empty() within the protection
> > + * of q->seqlock to protect from racing with requeuing.
> > + */
>
> Yes, if we still decide to preserve the nolock_qdisc_is_empty() rechecking
> under q->seqlock.

Sounds good.

> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
> >> static void qdisc_maybe_clear_missed(struct Qdisc *q,
> >> const struct netdev_queue *txq)
> >> {
> >> + set_bit(__QDISC_STATE_DRAINING, &q->state);
> >> +
> >> + /* Make sure DRAINING is set before clearing MISSED
> >> + * to make sure nolock_qdisc_is_empty() always returns
> >> + * false, avoiding transmitting a packet directly
> >> + * bypassing the requeued packet.
> >> + */
> >> + smp_mb__after_atomic();
> >> +
> >> clear_bit(__QDISC_STATE_MISSED, &q->state);
> >>
> >> /* Make sure the below netif_xmit_frozen_or_stopped()
> >> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
> >> */
> >> if (!netif_xmit_frozen_or_stopped(txq))
> >> set_bit(__QDISC_STATE_MISSED, &q->state);
> >> - else
> >> - set_bit(__QDISC_STATE_DRAINING, &q->state);
> >> }
> >
> > But this would not be enough, because we may also clear MISSED
> > in pfifo_fast_dequeue()?
>
> For the MISSED clearing in pfifo_fast_dequeue(), it looks like
> the data race described in RFC v3 too?
>
> CPU1 CPU2 CPU3
> qdisc_run_begin(q) . .
> . MISSED is set .
> MISSED is cleared . .
> q->dequeue() . .
> . enqueue skb1 check MISSED # true
> qdisc_run_end(q) . .
> . . qdisc_run_begin(q) # true
> . MISSED is set send skb2 directly

Not sure what you mean.

2021-06-02 03:43:16

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On 2021/6/2 4:48, Jakub Kicinski wrote:
> On Tue, 1 Jun 2021 16:18:54 +0800 Yunsheng Lin wrote:
>>> I see, thanks! That explains the need. Perhaps we can rephrase the
>>> comment? Maybe:
>>>
>>> + /* Retest nolock_qdisc_is_empty() within the protection
>>> + * of q->seqlock to protect from racing with requeuing.
>>> + */
>>
>> Yes, if we still decide to preserve the nolock_qdisc_is_empty() rechecking
>> under q->seqlock.
>
> Sounds good.
>
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>>>> static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>>> const struct netdev_queue *txq)
>>>> {
>>>> + set_bit(__QDISC_STATE_DRAINING, &q->state);
>>>> +
>>>> + /* Make sure DRAINING is set before clearing MISSED
>>>> + * to make sure nolock_qdisc_is_empty() always returns
>>>> + * false, avoiding transmitting a packet directly
>>>> + * bypassing the requeued packet.
>>>> + */
>>>> + smp_mb__after_atomic();
>>>> +
>>>> clear_bit(__QDISC_STATE_MISSED, &q->state);
>>>>
>>>> /* Make sure the below netif_xmit_frozen_or_stopped()
>>>> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>>> */
>>>> if (!netif_xmit_frozen_or_stopped(txq))
>>>> set_bit(__QDISC_STATE_MISSED, &q->state);
>>>> - else
>>>> - set_bit(__QDISC_STATE_DRAINING, &q->state);
>>>> }
>>>
>>> But this would not be enough, because we may also clear MISSED
>>> in pfifo_fast_dequeue()?
>>
>> For the MISSED clearing in pfifo_fast_dequeue(), it looks like
>> the data race described in RFC v3 too?
>>
>> CPU1 CPU2 CPU3
>> qdisc_run_begin(q) . .
>> . MISSED is set .
>> MISSED is cleared . .
>> q->dequeue() . .
>> . enqueue skb1 check MISSED # true
>> qdisc_run_end(q) . .
>> . . qdisc_run_begin(q) # true
>> . MISSED is set send skb2 directly
>
> Not sure what you mean.

CPU1                  CPU2             CPU3
qdisc_run_begin(q)    .                .
.                     MISSED is set    .
MISSED is cleared     .                .
another dequeuing     .                .
.                     .                .
.                     enqueue skb1     nolock_qdisc_is_empty() # true
qdisc_run_end(q)      .                .
.                     .                qdisc_run_begin(q) # true
.                     .                send skb2 directly
.                     MISSED is set    .

As the qdisc is indeed empty at the point when MISSED is cleared
and another dequeue is retried by CPU1, and the MISSED setting is
not under q->seqlock, retesting MISSED under q->seqlock does not
seem to make any difference? And it seems like the case that does
not need handling, as we agreed previously?



2021-06-02 16:32:08

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

On Wed, 2 Jun 2021 09:21:01 +0800 Yunsheng Lin wrote:
> >> For the MISSED clearing in pfifo_fast_dequeue(), it looks like
> >> the data race described in RFC v3 too?
> >>
> >> CPU1 CPU2 CPU3
> >> qdisc_run_begin(q) . .
> >> . MISSED is set .
> >> MISSED is cleared . .
> >> q->dequeue() . .
> >> . enqueue skb1 check MISSED # true
> >> qdisc_run_end(q) . .
> >> . . qdisc_run_begin(q) # true
> >> . MISSED is set send skb2 directly
> >
> > Not sure what you mean.
>
> CPU1 CPU2 CPU3
> qdisc_run_begin(q) . .
> . MISSED is set .
> MISSED is cleared . .
> another dequeuing . .
> . . .
> . enqueue skb1 nolock_qdisc_is_empty() # true
> qdisc_run_end(q) . .
> . . qdisc_run_begin(q) # true
> . . send skb2 directly
> . MISSED is set .
>
> As the qdisc is indeed empty at the point when MISSED is cleared
> and another dequeue is retried by CPU1, and the MISSED setting is
> not under q->seqlock, retesting MISSED under q->seqlock does not
> seem to make any difference? And it seems like the case that does
> not need handling, as we agreed previously?

Right, this case doesn't need the re-check under the lock, but as
you pointed out, the re-queuing case requires the re-check.
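
For reference, keeping the re-check with the rephrased comment would
make the bypass path of __dev_xmit_skb() look roughly like the below
(a sketch combining the hunk quoted above with the suggested comment):

	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {
		/* Retest nolock_qdisc_is_empty() within the protection
		 * of q->seqlock to protect from racing with requeuing.
		 */
		if (unlikely(!nolock_qdisc_is_empty(q))) {
			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
			__qdisc_run(q);
			qdisc_run_end(q);

			goto no_lock_out;
		}

		qdisc_bstats_cpu_update(q, skb);
		if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
		    !nolock_qdisc_is_empty(q))
			__qdisc_run(q);

		qdisc_run_end(q);
		return NET_XMIT_SUCCESS;
	}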