2020-09-01 01:03:23

by Yunsheng Lin

[permalink] [raw]
Subject: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

Currently there is concurrent reset and enqueue operation for the
same lockless qdisc when there is no lock to synchronize the
q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
qdisc_deactivate() called by dev_deactivate_queue(), which may cause
out-of-bounds access for priv->ring[] in hns3 driver if user has
requested a smaller queue num when __dev_xmit_skb() still enqueue a
skb with a larger queue_mapping after the corresponding qdisc is
reset, and call hns3_nic_net_xmit() with that skb later.

Avoid the above concurrent op by calling synchronize_rcu_tasks()
after assigning new qdisc to dev_queue->qdisc and before calling
qdisc_deactivate() to make sure skb with larger queue_mapping
enqueued to old qdisc will always be reset when qdisc_deactivate()
is called.

Signed-off-by: Yunsheng Lin <[email protected]>
---
net/sched/sch_generic.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d..6e42237 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1160,8 +1160,13 @@ static void dev_deactivate_queue(struct net_device *dev,

qdisc = rtnl_dereference(dev_queue->qdisc);
if (qdisc) {
- qdisc_deactivate(qdisc);
rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
+
+ /* Make sure lockless qdisc enqueuing is done with the
+ * old qdisc in __dev_xmit_skb().
+ */
+ synchronize_rcu_tasks();
+ qdisc_deactivate(qdisc);
}
}

--
2.8.1


2020-09-01 06:52:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc



On 8/31/20 5:55 PM, Yunsheng Lin wrote:
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.
>
> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> after assigning new qdisc to dev_queue->qdisc and before calling
> qdisc_deactivate() to make sure skb with larger queue_mapping
> enqueued to old qdisc will always be reset when qdisc_deactivate()
> is called.
>

We request Fixes: tag for fixes in networking land.

> Signed-off-by: Yunsheng Lin <[email protected]>
> ---
> net/sched/sch_generic.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 265a61d..6e42237 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1160,8 +1160,13 @@ static void dev_deactivate_queue(struct net_device *dev,
>
> qdisc = rtnl_dereference(dev_queue->qdisc);
> if (qdisc) {
> - qdisc_deactivate(qdisc);
> rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> +
> + /* Make sure lockless qdisc enqueuing is done with the
> + * old qdisc in __dev_xmit_skb().
> + */
> + synchronize_rcu_tasks();

This seems quite wrong, there is not a single use of synchronize_rcu_tasks() in net/,
we probably do not want this.

I bet that synchronize_net() is appropriate, if not please explain/comment why we want this instead.

Adding one synchronize_net() per TX queue is a killer for devices with 128 or 256 TX queues.

I would rather find a way of not calling qdisc_reset() from qdisc_deactivate().

This lockless pfifo_fast is a mess really.


2020-09-01 07:28:57

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/1 14:48, Eric Dumazet wrote:
>
>
> On 8/31/20 5:55 PM, Yunsheng Lin wrote:
>> Currently there is concurrent reset and enqueue operation for the
>> same lockless qdisc when there is no lock to synchronize the
>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
>> out-of-bounds access for priv->ring[] in hns3 driver if user has
>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
>> skb with a larger queue_mapping after the corresponding qdisc is
>> reset, and call hns3_nic_net_xmit() with that skb later.
>>
>> Avoid the above concurrent op by calling synchronize_rcu_tasks()
>> after assigning new qdisc to dev_queue->qdisc and before calling
>> qdisc_deactivate() to make sure skb with larger queue_mapping
>> enqueued to old qdisc will always be reset when qdisc_deactivate()
>> is called.
>>
>
> We request Fixes: tag for fixes in networking land.

ok.

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")

>
>> Signed-off-by: Yunsheng Lin <[email protected]>
>> ---
>> net/sched/sch_generic.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 265a61d..6e42237 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1160,8 +1160,13 @@ static void dev_deactivate_queue(struct net_device *dev,
>>
>> qdisc = rtnl_dereference(dev_queue->qdisc);
>> if (qdisc) {
>> - qdisc_deactivate(qdisc);
>> rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
>> +
>> + /* Make sure lockless qdisc enqueuing is done with the
>> + * old qdisc in __dev_xmit_skb().
>> + */
>> + synchronize_rcu_tasks();
>
> This seems quite wrong, there is not a single use of synchronize_rcu_tasks() in net/,
> we probably do not want this.
>
> I bet that synchronize_net() is appropriate, if not please explain/comment why we want this instead.

Using synchronize_net() seems more appropriate here, thanks.

>
> Adding one synchronize_net() per TX queue is a killer for devices with 128 or 256 TX queues.
>
> I would rather find a way of not calling qdisc_reset() from qdisc_deactivate().

Without calling qdisc_reset(), it seems there will always be skb left in the old qdisc.
Is above acceptable?

How about below steps to avoid the concurrent op:
1. assign new qdisc to all queue' qdisc(which is noop_qdisc).
2. call synchronize_net().
3. calling qdisc_reset() with all queue' qdisc_sleeping.

And the synchronize_net() in dev_deactivate_many() can be reused to
ensure old qdisc is not touched any more when calling qdisc_reset().


>
> This lockless pfifo_fast is a mess really.
>
>
> .
>

2020-09-01 18:28:14

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.

Can you be more specific here? Which call path requests a smaller
tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
we already have a synchronize_net() there.

>
> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> after assigning new qdisc to dev_queue->qdisc and before calling
> qdisc_deactivate() to make sure skb with larger queue_mapping
> enqueued to old qdisc will always be reset when qdisc_deactivate()
> is called.

Like Eric said, it is not nice to call such a blocking function when
we have a large number of TX queues. Possibly we just need to
add a synchronize_net() as in netif_set_real_num_tx_queues(),
if it is missing.

Thanks.

2020-09-01 18:37:48

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

From: Yunsheng Lin <[email protected]>
Date: Tue, 1 Sep 2020 15:27:44 +0800

> On 2020/9/1 14:48, Eric Dumazet wrote:
>> We request Fixes: tag for fixes in networking land.
>
> ok.
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")

You should repost the patch with the Fixes: tag in order to add it, you
can't just mention it in this thread.

And the patch has to change anyways as you were also given other
feedback from Eric to address as well.

Thank you.

2020-09-02 01:45:23

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/2 2:34, David Miller wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Tue, 1 Sep 2020 15:27:44 +0800
>
>> On 2020/9/1 14:48, Eric Dumazet wrote:
>>> We request Fixes: tag for fixes in networking land.
>>
>> ok.
>>
>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>
> You should repost the patch with the Fixes: tag in order to add it, you
> can't just mention it in this thread.
>
> And the patch has to change anyways as you were also given other
> feedback from Eric to address as well.

Yes, thanks for the feedback:)

>
> Thank you.
> .
>

2020-09-02 01:53:55

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/2 2:24, Cong Wang wrote:
> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
>>
>> Currently there is concurrent reset and enqueue operation for the
>> same lockless qdisc when there is no lock to synchronize the
>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
>> out-of-bounds access for priv->ring[] in hns3 driver if user has
>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
>> skb with a larger queue_mapping after the corresponding qdisc is
>> reset, and call hns3_nic_net_xmit() with that skb later.
>
> Can you be more specific here? Which call path requests a smaller
> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> we already have a synchronize_net() there.

When the netdevice is in active state, the synchronize_net() seems to
do the correct work, as below:

CPU 0: CPU1:
__dev_queue_xmit() netif_set_real_num_tx_queues()
rcu_read_lock_bh();
netdev_core_pick_tx(dev, skb, sb_dev);
.
. dev->real_num_tx_queues = txq;
. .
. .
. synchronize_net();
. .
q->enqueue() .
. .
rcu_read_unlock_bh() .
qdisc_reset_all_tx_gt


but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
too.

The problem we hit is as below:
In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
to deactive the netdevice when user requested a smaller queue num, and
txq->qdisc is already changed to noop_qdisc when calling
netif_set_real_num_tx_queues(), so the synchronize_net() in the function
netif_set_real_num_tx_queues() does not help here.

>
>>
>> Avoid the above concurrent op by calling synchronize_rcu_tasks()
>> after assigning new qdisc to dev_queue->qdisc and before calling
>> qdisc_deactivate() to make sure skb with larger queue_mapping
>> enqueued to old qdisc will always be reset when qdisc_deactivate()
>> is called.
>
> Like Eric said, it is not nice to call such a blocking function when
> we have a large number of TX queues. Possibly we just need to
> add a synchronize_net() as in netif_set_real_num_tx_queues(),
> if it is missing.

As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
to work when netdevice is in active state, but does not work when in
deactive.

And we do not want skb left in the old qdisc when netdevice is deactived,
right?

As reply to Eric, maybe the existing synchronize_net() in dev_deactivate_many()
can be reused to order the qdisc assignment and qdisc reset?

>
> Thanks.
> .
>

2020-09-02 04:42:14

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
>
> On 2020/9/2 2:24, Cong Wang wrote:
> > On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >
> > Can you be more specific here? Which call path requests a smaller
> > tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > we already have a synchronize_net() there.
>
> When the netdevice is in active state, the synchronize_net() seems to
> do the correct work, as below:
>
> CPU 0: CPU1:
> __dev_queue_xmit() netif_set_real_num_tx_queues()
> rcu_read_lock_bh();
> netdev_core_pick_tx(dev, skb, sb_dev);
> .
> . dev->real_num_tx_queues = txq;
> . .
> . .
> . synchronize_net();
> . .
> q->enqueue() .
> . .
> rcu_read_unlock_bh() .
> qdisc_reset_all_tx_gt
>
>

Right.


> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> too.
>
> The problem we hit is as below:
> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> to deactive the netdevice when user requested a smaller queue num, and
> txq->qdisc is already changed to noop_qdisc when calling
> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> netif_set_real_num_tx_queues() does not help here.

How could qdisc still be running after deactivating the device?


>
> >
> >>
> >> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> >> after assigning new qdisc to dev_queue->qdisc and before calling
> >> qdisc_deactivate() to make sure skb with larger queue_mapping
> >> enqueued to old qdisc will always be reset when qdisc_deactivate()
> >> is called.
> >
> > Like Eric said, it is not nice to call such a blocking function when
> > we have a large number of TX queues. Possibly we just need to
> > add a synchronize_net() as in netif_set_real_num_tx_queues(),
> > if it is missing.
>
> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
> to work when netdevice is in active state, but does not work when in
> deactive.

Please explain why deactivated device still has qdisc running?

At least before commit 379349e9bc3b4, we always test deactivate
bit before enqueueing. Are you complaining about that commit?
That commit is indeed suspicious, at least it does not precisely revert
commit ba27b4cdaaa66561aaedb21 as it claims.


>
> And we do not want skb left in the old qdisc when netdevice is deactived,
> right?

Yes, and more importantly, qdisc should not be running after deactivation.

Thanks.

2020-09-02 06:38:31

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/2 12:41, Cong Wang wrote:
> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
>>
>> On 2020/9/2 2:24, Cong Wang wrote:
>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
>>>>
>>>> Currently there is concurrent reset and enqueue operation for the
>>>> same lockless qdisc when there is no lock to synchronize the
>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
>>>> skb with a larger queue_mapping after the corresponding qdisc is
>>>> reset, and call hns3_nic_net_xmit() with that skb later.
>>>
>>> Can you be more specific here? Which call path requests a smaller
>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
>>> we already have a synchronize_net() there.
>>
>> When the netdevice is in active state, the synchronize_net() seems to
>> do the correct work, as below:
>>
>> CPU 0: CPU1:
>> __dev_queue_xmit() netif_set_real_num_tx_queues()
>> rcu_read_lock_bh();
>> netdev_core_pick_tx(dev, skb, sb_dev);
>> .
>> . dev->real_num_tx_queues = txq;
>> . .
>> . .
>> . synchronize_net();
>> . .
>> q->enqueue() .
>> . .
>> rcu_read_unlock_bh() .
>> qdisc_reset_all_tx_gt
>>
>>
>
> Right.
>
>
>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
>> too.
>>
>> The problem we hit is as below:
>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
>> to deactive the netdevice when user requested a smaller queue num, and
>> txq->qdisc is already changed to noop_qdisc when calling
>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
>> netif_set_real_num_tx_queues() does not help here.
>
> How could qdisc still be running after deactivating the device?

qdisc could be running during the device deactivating process.

The main process of changing channel number is as below:

1. dev_deactivate()
2. hns3 handware related setup
3. netif_set_real_num_tx_queues()
4. netif_tx_wake_all_queues()
5. dev_activate()

During step 1, qdisc could be running while qdisc is resetting, so
there could be skb left in the old qdisc(which will be restored back to
txq->qdisc during dev_activate()), as below:

CPU 0: CPU1:
__dev_queue_xmit(): dev_deactivate_many():
rcu_read_lock_bh(); qdisc_deactivate(qdisc);
q = rcu_dereference_bh(txq->qdisc); .
netdev_core_pick_tx(dev, skb, sb_dev); .
.
. rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
. .
. .
. .
. .
q->enqueue() .
. .
rcu_read_unlock_bh() .

And During step 3, txq->qdisc is pointing to noop_qdisc, so the qdisc_reset()
only reset the noop_qdisc, but not the actual qdisc, which is stored in
txq->qdisc_sleeping, so the actual qdisc may still have skb.

When hns3_link_status_change() call step 4 and 5, it will restore all queue's
qdisc using txq->qdisc_sleeping and schedule all queue with net_tx_action().
The skb enqueued in step 1 may be dequeued and run, which cause the problem.

>
>
>>
>>>
>>>>
>>>> Avoid the above concurrent op by calling synchronize_rcu_tasks()
>>>> after assigning new qdisc to dev_queue->qdisc and before calling
>>>> qdisc_deactivate() to make sure skb with larger queue_mapping
>>>> enqueued to old qdisc will always be reset when qdisc_deactivate()
>>>> is called.
>>>
>>> Like Eric said, it is not nice to call such a blocking function when
>>> we have a large number of TX queues. Possibly we just need to
>>> add a synchronize_net() as in netif_set_real_num_tx_queues(),
>>> if it is missing.
>>
>> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
>> to work when netdevice is in active state, but does not work when in
>> deactive.
>
> Please explain why deactivated device still has qdisc running?
>
> At least before commit 379349e9bc3b4, we always test deactivate
> bit before enqueueing. Are you complaining about that commit?
> That commit is indeed suspicious, at least it does not precisely revert
> commit ba27b4cdaaa66561aaedb21 as it claims.

I am not familiar with TCQ_F_CAN_BYPASS.
From my understanding, the problem is that there is no order between
qdisc enqueuing and qdisc reset.


>
>
>>
>> And we do not want skb left in the old qdisc when netdevice is deactived,
>> right?
>
> Yes, and more importantly, qdisc should not be running after deactivation.
>
> Thanks.
> .
>

2020-09-02 07:34:02

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc



On 9/1/20 11:34 PM, Yunsheng Lin wrote:

>
> I am not familiar with TCQ_F_CAN_BYPASS.
> From my understanding, the problem is that there is no order between
> qdisc enqueuing and qdisc reset.

Thw qdisc_reset() should be done after rcu grace period, when there is guarantee no enqueue is in progress.

qdisc_destroy() already has a qdisc_reset() call, I am not sure why qdisc_deactivate() is also calling qdisc_reset()


2020-09-02 08:16:07

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/2 15:32, Eric Dumazet wrote:
>
>
> On 9/1/20 11:34 PM, Yunsheng Lin wrote:
>
>>
>> I am not familiar with TCQ_F_CAN_BYPASS.
>> From my understanding, the problem is that there is no order between
>> qdisc enqueuing and qdisc reset.
>
> Thw qdisc_reset() should be done after rcu grace period, when there is guarantee no enqueue is in progress.
>
> qdisc_destroy() already has a qdisc_reset() call, I am not sure why qdisc_deactivate() is also calling qdisc_reset()

That is a good point.
Do we allow skb left in qdisc when the qdisc is deactivated state?
And qdisc_destroy() is not always called after qdisc_deactivate() is called.
If we allow skb left in qdisc when the qdisc is deactivated state, then it is
huge change of semantics for qdisc_deactivate(), and I am not sure how many
cases will affected by this change.


>
>
>

2020-09-02 09:22:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc



On 9/2/20 1:14 AM, Yunsheng Lin wrote:
> On 2020/9/2 15:32, Eric Dumazet wrote:
>>
>>
>> On 9/1/20 11:34 PM, Yunsheng Lin wrote:
>>
>>>
>>> I am not familiar with TCQ_F_CAN_BYPASS.
>>> From my understanding, the problem is that there is no order between
>>> qdisc enqueuing and qdisc reset.
>>
>> Thw qdisc_reset() should be done after rcu grace period, when there is guarantee no enqueue is in progress.
>>
>> qdisc_destroy() already has a qdisc_reset() call, I am not sure why qdisc_deactivate() is also calling qdisc_reset()
>
> That is a good point.
> Do we allow skb left in qdisc when the qdisc is deactivated state?
> And qdisc_destroy() is not always called after qdisc_deactivate() is called.
> If we allow skb left in qdisc when the qdisc is deactivated state, then it is
> huge change of semantics for qdisc_deactivate(), and I am not sure how many
> cases will affected by this change.

All I am saying is that the qdisc_reset() in qdisc_deactivate() seems
at the wrong place.

This certainly can be deferred _after_ the rcu grace period we already have in dev_deactivate_many()

Something like this nicer patch.

net/sched/sch_generic.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)


diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d011dfaa7ec0f8fb8aaede920784f690c9..0eaa99e4f8de643724c0942ee1a2e9516eed65c0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);

static void qdisc_deactivate(struct Qdisc *qdisc)
{
- bool nolock = qdisc->flags & TCQ_F_NOLOCK;
-
if (qdisc->flags & TCQ_F_BUILTIN)
return;
- if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
- return;
-
- if (nolock)
- spin_lock_bh(&qdisc->seqlock);
- spin_lock_bh(qdisc_lock(qdisc));

set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
-
- qdisc_reset(qdisc);
-
- spin_unlock_bh(qdisc_lock(qdisc));
- if (nolock)
- spin_unlock_bh(&qdisc->seqlock);
}

static void dev_deactivate_queue(struct net_device *dev,
@@ -1184,6 +1170,9 @@ static bool some_qdisc_is_busy(struct net_device *dev)
val = (qdisc_is_running(q) ||
test_bit(__QDISC_STATE_SCHED, &q->state));

+ if (!val)
+ qdisc_reset(q);
+
spin_unlock_bh(root_lock);

if (val)
@@ -1213,10 +1202,7 @@ void dev_deactivate_many(struct list_head *head)
dev_watchdog_down(dev);
}

- /* Wait for outstanding qdisc-less dev_queue_xmit calls.
- * This is avoided if all devices are in dismantle phase :
- * Caller will call synchronize_net() for us
- */
+ /* Wait for outstanding dev_queue_xmit calls. */
synchronize_net();

/* Wait for outstanding qdisc_run calls. */


2020-09-03 00:37:40

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
>
> On 2020/9/2 12:41, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
> >>
> >> On 2020/9/2 2:24, Cong Wang wrote:
> >>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>
> >>> Can you be more specific here? Which call path requests a smaller
> >>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>> we already have a synchronize_net() there.
> >>
> >> When the netdevice is in active state, the synchronize_net() seems to
> >> do the correct work, as below:
> >>
> >> CPU 0: CPU1:
> >> __dev_queue_xmit() netif_set_real_num_tx_queues()
> >> rcu_read_lock_bh();
> >> netdev_core_pick_tx(dev, skb, sb_dev);
> >> .
> >> . dev->real_num_tx_queues = txq;
> >> . .
> >> . .
> >> . synchronize_net();
> >> . .
> >> q->enqueue() .
> >> . .
> >> rcu_read_unlock_bh() .
> >> qdisc_reset_all_tx_gt
> >>
> >>
> >
> > Right.
> >
> >
> >> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >> too.
> >>
> >> The problem we hit is as below:
> >> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >> to deactive the netdevice when user requested a smaller queue num, and
> >> txq->qdisc is already changed to noop_qdisc when calling
> >> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >> netif_set_real_num_tx_queues() does not help here.
> >
> > How could qdisc still be running after deactivating the device?
>
> qdisc could be running during the device deactivating process.
>
> The main process of changing channel number is as below:
>
> 1. dev_deactivate()
> 2. hns3 handware related setup
> 3. netif_set_real_num_tx_queues()
> 4. netif_tx_wake_all_queues()
> 5. dev_activate()
>
> During step 1, qdisc could be running while qdisc is resetting, so
> there could be skb left in the old qdisc(which will be restored back to
> txq->qdisc during dev_activate()), as below:
>
> CPU 0: CPU1:
> __dev_queue_xmit(): dev_deactivate_many():
> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
> q = rcu_dereference_bh(txq->qdisc); .
> netdev_core_pick_tx(dev, skb, sb_dev); .
> .
> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> . .
> . .
> . .
> . .
> q->enqueue() .


Well, like I said, if the deactivated bit were tested before ->enqueue(),
there would be no packet queued after qdisc_deactivate().


> . .
> rcu_read_unlock_bh() .
>
> And During step 3, txq->qdisc is pointing to noop_qdisc, so the qdisc_reset()
> only reset the noop_qdisc, but not the actual qdisc, which is stored in
> txq->qdisc_sleeping, so the actual qdisc may still have skb.
>
> When hns3_link_status_change() call step 4 and 5, it will restore all queue's
> qdisc using txq->qdisc_sleeping and schedule all queue with net_tx_action().
> The skb enqueued in step 1 may be dequeued and run, which cause the problem.
>
> >
> >
> >>
> >>>
> >>>>
> >>>> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> >>>> after assigning new qdisc to dev_queue->qdisc and before calling
> >>>> qdisc_deactivate() to make sure skb with larger queue_mapping
> >>>> enqueued to old qdisc will always be reset when qdisc_deactivate()
> >>>> is called.
> >>>
> >>> Like Eric said, it is not nice to call such a blocking function when
> >>> we have a large number of TX queues. Possibly we just need to
> >>> add a synchronize_net() as in netif_set_real_num_tx_queues(),
> >>> if it is missing.
> >>
> >> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
> >> to work when netdevice is in active state, but does not work when in
> >> deactive.
> >
> > Please explain why deactivated device still has qdisc running?
> >
> > At least before commit 379349e9bc3b4, we always test deactivate
> > bit before enqueueing. Are you complaining about that commit?
> > That commit is indeed suspicious, at least it does not precisely revert
> > commit ba27b4cdaaa66561aaedb21 as it claims.
>
> I am not familiar with TCQ_F_CAN_BYPASS.
> From my understanding, the problem is that there is no order between
> qdisc enqueuing and qdisc reset.

It has almost nothing to do with BYPASS, my point here is all about
__QDISC_STATE_DEACTIVATED bit, clearly commit 379349e9bc3b4
removed this bit test for lockless qdisc, whether intentionally or accidentally.
That bit test existed prior to BYPASS test, see commit ba27b4cdaaa665.

Thanks.

2020-09-03 01:23:23

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/3 8:35, Cong Wang wrote:
> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
>>
>> On 2020/9/2 12:41, Cong Wang wrote:
>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
>>>>
>>>> On 2020/9/2 2:24, Cong Wang wrote:
>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
>>>>>>
>>>>>> Currently there is concurrent reset and enqueue operation for the
>>>>>> same lockless qdisc when there is no lock to synchronize the
>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
>>>>>
>>>>> Can you be more specific here? Which call path requests a smaller
>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
>>>>> we already have a synchronize_net() there.
>>>>
>>>> When the netdevice is in active state, the synchronize_net() seems to
>>>> do the correct work, as below:
>>>>
>>>> CPU 0: CPU1:
>>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
>>>> rcu_read_lock_bh();
>>>> netdev_core_pick_tx(dev, skb, sb_dev);
>>>> .
>>>> . dev->real_num_tx_queues = txq;
>>>> . .
>>>> . .
>>>> . synchronize_net();
>>>> . .
>>>> q->enqueue() .
>>>> . .
>>>> rcu_read_unlock_bh() .
>>>> qdisc_reset_all_tx_gt
>>>>
>>>>
>>>
>>> Right.
>>>
>>>
>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
>>>> too.
>>>>
>>>> The problem we hit is as below:
>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
>>>> to deactive the netdevice when user requested a smaller queue num, and
>>>> txq->qdisc is already changed to noop_qdisc when calling
>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
>>>> netif_set_real_num_tx_queues() does not help here.
>>>
>>> How could qdisc still be running after deactivating the device?
>>
>> qdisc could be running during the device deactivating process.
>>
>> The main process of changing channel number is as below:
>>
>> 1. dev_deactivate()
>> 2. hns3 handware related setup
>> 3. netif_set_real_num_tx_queues()
>> 4. netif_tx_wake_all_queues()
>> 5. dev_activate()
>>
>> During step 1, qdisc could be running while qdisc is resetting, so
>> there could be skb left in the old qdisc(which will be restored back to
>> txq->qdisc during dev_activate()), as below:
>>
>> CPU 0: CPU1:
>> __dev_queue_xmit(): dev_deactivate_many():
>> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
>> q = rcu_dereference_bh(txq->qdisc); .
>> netdev_core_pick_tx(dev, skb, sb_dev); .
>> .
>> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
>> . .
>> . .
>> . .
>> . .
>> q->enqueue() .
>
>
> Well, like I said, if the deactivated bit were tested before ->enqueue(),
> there would be no packet queued after qdisc_deactivate().

Only if the deactivated bit testing is also protected by qdisc->seqlock?
otherwise there is still window between setting and testing the deactivated bit.

>
>
>> . .
>> rcu_read_unlock_bh() .
>>
>> And During step 3, txq->qdisc is pointing to noop_qdisc, so the qdisc_reset()
>> only reset the noop_qdisc, but not the actual qdisc, which is stored in
>> txq->qdisc_sleeping, so the actual qdisc may still have skb.
>>
>> When hns3_link_status_change() call step 4 and 5, it will restore all queue's
>> qdisc using txq->qdisc_sleeping and schedule all queue with net_tx_action().
>> The skb enqueued in step 1 may be dequeued and run, which cause the problem.
>>
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Avoid the above concurrent op by calling synchronize_rcu_tasks()
>>>>>> after assigning new qdisc to dev_queue->qdisc and before calling
>>>>>> qdisc_deactivate() to make sure skb with larger queue_mapping
>>>>>> enqueued to old qdisc will always be reset when qdisc_deactivate()
>>>>>> is called.
>>>>>
>>>>> Like Eric said, it is not nice to call such a blocking function when
>>>>> we have a large number of TX queues. Possibly we just need to
>>>>> add a synchronize_net() as in netif_set_real_num_tx_queues(),
>>>>> if it is missing.
>>>>
>>>> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
>>>> to work when netdevice is in active state, but does not work when in
>>>> deactive.
>>>
>>> Please explain why deactivated device still has qdisc running?
>>>
>>> At least before commit 379349e9bc3b4, we always test deactivate
>>> bit before enqueueing. Are you complaining about that commit?
>>> That commit is indeed suspicious, at least it does not precisely revert
>>> commit ba27b4cdaaa66561aaedb21 as it claims.
>>
>> I am not familiar with TCQ_F_CAN_BYPASS.
>> From my understanding, the problem is that there is no order between
>> qdisc enqueuing and qdisc reset.
>
> It has almost nothing to do with BYPASS, my point here is all about
> __QDISC_STATE_DEACTIVATED bit, clearly commit 379349e9bc3b4
> removed this bit test for lockless qdisc, whether intentionally or accidentally.
> That bit test existed prior to BYPASS test, see commit ba27b4cdaaa665.

It seems there is no qdisc->seqlock or qdisc->seqlock to protect the
deactivated bit testing in commit ba27b4cdaaa665 too?

>
> Thanks.
> .
>

2020-09-03 01:30:18

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/2 17:20, Eric Dumazet wrote:
>
>
> On 9/2/20 1:14 AM, Yunsheng Lin wrote:
>> On 2020/9/2 15:32, Eric Dumazet wrote:
>>>
>>>
>>> On 9/1/20 11:34 PM, Yunsheng Lin wrote:
>>>
>>>>
>>>> I am not familiar with TCQ_F_CAN_BYPASS.
>>>> From my understanding, the problem is that there is no order between
>>>> qdisc enqueuing and qdisc reset.
>>>
>>> Thw qdisc_reset() should be done after rcu grace period, when there is guarantee no enqueue is in progress.
>>>
>>> qdisc_destroy() already has a qdisc_reset() call, I am not sure why qdisc_deactivate() is also calling qdisc_reset()
>>
>> That is a good point.
>> Do we allow skb left in qdisc when the qdisc is deactivated state?
>> And qdisc_destroy() is not always called after qdisc_deactivate() is called.
>> If we allow skb left in qdisc when the qdisc is deactivated state, then it is
>> huge change of semantics for qdisc_deactivate(), and I am not sure how many
>> cases will affected by this change.
>
> All I am saying is that the qdisc_reset() in qdisc_deactivate() seems
> at the wrong place.
>
> This certainly can be deferred _after_ the rcu grace period we already have in dev_deactivate_many()
>
> Something like this nicer patch.

Thanks for clarifying and the nicer patch.

Reusing the existing synchronize_net() is a good idea.

Some minor comment inline.

>
> net/sched/sch_generic.c | 22 ++++------------------
> 1 file changed, 4 insertions(+), 18 deletions(-)
>
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 265a61d011dfaa7ec0f8fb8aaede920784f690c9..0eaa99e4f8de643724c0942ee1a2e9516eed65c0 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
>
> static void qdisc_deactivate(struct Qdisc *qdisc)
> {
> - bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> -
> if (qdisc->flags & TCQ_F_BUILTIN)
> return;
> - if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> - return;
> -
> - if (nolock)
> - spin_lock_bh(&qdisc->seqlock);
> - spin_lock_bh(qdisc_lock(qdisc));
>
> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> -
> - qdisc_reset(qdisc);
> -
> - spin_unlock_bh(qdisc_lock(qdisc));
> - if (nolock)
> - spin_unlock_bh(&qdisc->seqlock);
> }
>
> static void dev_deactivate_queue(struct net_device *dev,
> @@ -1184,6 +1170,9 @@ static bool some_qdisc_is_busy(struct net_device *dev)
> val = (qdisc_is_running(q) ||
> test_bit(__QDISC_STATE_SCHED, &q->state));
>
> + if (!val)
> + qdisc_reset(q);

It seems semantics for some_qdisc_is_busy() is changed, which does not only do
the checking, but also do the reseting?

Also, qdisc_reset() could be called multi times for the same qdisc if some_qdisc_is_busy()
return true multi times?

> +
> spin_unlock_bh(root_lock);
>
> if (val)
> @@ -1213,10 +1202,7 @@ void dev_deactivate_many(struct list_head *head)
> dev_watchdog_down(dev);
> }
>
> - /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> - * This is avoided if all devices are in dismantle phase :
> - * Caller will call synchronize_net() for us
> - */
> + /* Wait for outstanding dev_queue_xmit calls. */
> synchronize_net();
>
> /* Wait for outstanding qdisc_run calls. */
>
>
> .
>

2020-09-03 01:52:13

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <[email protected]> wrote:
>
> On 2020/9/3 8:35, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
> >>
> >> On 2020/9/2 12:41, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
> >>>>
> >>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
> >>>>>>
> >>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>
> >>>>> Can you be more specific here? Which call path requests a smaller
> >>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>> we already have a synchronize_net() there.
> >>>>
> >>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>> do the correct work, as below:
> >>>>
> >>>> CPU 0: CPU1:
> >>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
> >>>> rcu_read_lock_bh();
> >>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>> .
> >>>> . dev->real_num_tx_queues = txq;
> >>>> . .
> >>>> . .
> >>>> . synchronize_net();
> >>>> . .
> >>>> q->enqueue() .
> >>>> . .
> >>>> rcu_read_unlock_bh() .
> >>>> qdisc_reset_all_tx_gt
> >>>>
> >>>>
> >>>
> >>> Right.
> >>>
> >>>
> >>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >>>> too.
> >>>>
> >>>> The problem we hit is as below:
> >>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >>>> netif_set_real_num_tx_queues() does not help here.
> >>>
> >>> How could qdisc still be running after deactivating the device?
> >>
> >> qdisc could be running during the device deactivating process.
> >>
> >> The main process of changing channel number is as below:
> >>
> >> 1. dev_deactivate()
> >> 2. hns3 handware related setup
> >> 3. netif_set_real_num_tx_queues()
> >> 4. netif_tx_wake_all_queues()
> >> 5. dev_activate()
> >>
> >> During step 1, qdisc could be running while qdisc is resetting, so
> >> there could be skb left in the old qdisc(which will be restored back to
> >> txq->qdisc during dev_activate()), as below:
> >>
> >> CPU 0: CPU1:
> >> __dev_queue_xmit(): dev_deactivate_many():
> >> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
> >> q = rcu_dereference_bh(txq->qdisc); .
> >> netdev_core_pick_tx(dev, skb, sb_dev); .
> >> .
> >> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> >> . .
> >> . .
> >> . .
> >> . .
> >> q->enqueue() .
> >
> >
> > Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > there would be no packet queued after qdisc_deactivate().
>
> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> otherwise there is still window between setting and testing the deactivated bit.

Can you be more specific here? Why testing or setting a bit is not atomic?

AFAIU, qdisc->seqlock is an optimization to replace
__QDISC_STATE_RUNNING, which has nothing to do with deactivate bit.

Thanks.

2020-09-03 02:25:34

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/3 9:48, Cong Wang wrote:
> On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <[email protected]> wrote:
>>
>> On 2020/9/3 8:35, Cong Wang wrote:
>>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
>>>>
>>>> On 2020/9/2 12:41, Cong Wang wrote:
>>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
>>>>>>
>>>>>> On 2020/9/2 2:24, Cong Wang wrote:
>>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Currently there is concurrent reset and enqueue operation for the
>>>>>>>> same lockless qdisc when there is no lock to synchronize the
>>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
>>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
>>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
>>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
>>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
>>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
>>>>>>>
>>>>>>> Can you be more specific here? Which call path requests a smaller
>>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
>>>>>>> we already have a synchronize_net() there.
>>>>>>
>>>>>> When the netdevice is in active state, the synchronize_net() seems to
>>>>>> do the correct work, as below:
>>>>>>
>>>>>> CPU 0: CPU1:
>>>>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
>>>>>> rcu_read_lock_bh();
>>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
>>>>>> .
>>>>>> . dev->real_num_tx_queues = txq;
>>>>>> . .
>>>>>> . .
>>>>>> . synchronize_net();
>>>>>> . .
>>>>>> q->enqueue() .
>>>>>> . .
>>>>>> rcu_read_unlock_bh() .
>>>>>> qdisc_reset_all_tx_gt
>>>>>>
>>>>>>
>>>>>
>>>>> Right.
>>>>>
>>>>>
>>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
>>>>>> too.
>>>>>>
>>>>>> The problem we hit is as below:
>>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
>>>>>> to deactive the netdevice when user requested a smaller queue num, and
>>>>>> txq->qdisc is already changed to noop_qdisc when calling
>>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
>>>>>> netif_set_real_num_tx_queues() does not help here.
>>>>>
>>>>> How could qdisc still be running after deactivating the device?
>>>>
>>>> qdisc could be running during the device deactivating process.
>>>>
>>>> The main process of changing channel number is as below:
>>>>
>>>> 1. dev_deactivate()
>>>> 2. hns3 handware related setup
>>>> 3. netif_set_real_num_tx_queues()
>>>> 4. netif_tx_wake_all_queues()
>>>> 5. dev_activate()
>>>>
>>>> During step 1, qdisc could be running while qdisc is resetting, so
>>>> there could be skb left in the old qdisc(which will be restored back to
>>>> txq->qdisc during dev_activate()), as below:
>>>>
>>>> CPU 0: CPU1:
>>>> __dev_queue_xmit(): dev_deactivate_many():
>>>> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
>>>> q = rcu_dereference_bh(txq->qdisc); .
>>>> netdev_core_pick_tx(dev, skb, sb_dev); .
>>>> .
>>>> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
>>>> . .
>>>> . .
>>>> . .
>>>> . .
>>>> q->enqueue() .
>>>
>>>
>>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
>>> there would be no packet queued after qdisc_deactivate().
>>
>> Only if the deactivated bit testing is also protected by qdisc->seqlock?
>> otherwise there is still window between setting and testing the deactivated bit.
>
> Can you be more specific here? Why testing or setting a bit is not atomic?

testing a bit or setting a bit separately is atomic.
But testing a bit and setting a bit is not atomic, right?

cpu0: cpu1:
testing A bit
setting A bit .
. .
. qdisc enqueuing
qdisc reset

>
> AFAIU, qdisc->seqlock is an optimization to replace
> __QDISC_STATE_RUNNING, which has nothing to do with deactivate bit.
>
> Thanks.
> .
>

2020-09-03 02:55:02

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <[email protected]> wrote:
>
> On 2020/9/3 9:48, Cong Wang wrote:
> > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <[email protected]> wrote:
> >>
> >> On 2020/9/3 8:35, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
> >>>>
> >>>> On 2020/9/2 12:41, Cong Wang wrote:
> >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
> >>>>>>
> >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>>>
> >>>>>>> Can you be more specific here? Which call path requests a smaller
> >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>>>> we already have a synchronize_net() there.
> >>>>>>
> >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>>>> do the correct work, as below:
> >>>>>>
> >>>>>> CPU 0: CPU1:
> >>>>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
> >>>>>> rcu_read_lock_bh();
> >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>>>> .
> >>>>>> . dev->real_num_tx_queues = txq;
> >>>>>> . .
> >>>>>> . .
> >>>>>> . synchronize_net();
> >>>>>> . .
> >>>>>> q->enqueue() .
> >>>>>> . .
> >>>>>> rcu_read_unlock_bh() .
> >>>>>> qdisc_reset_all_tx_gt
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>>
> >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >>>>>> too.
> >>>>>>
> >>>>>> The problem we hit is as below:
> >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >>>>>> netif_set_real_num_tx_queues() does not help here.
> >>>>>
> >>>>> How could qdisc still be running after deactivating the device?
> >>>>
> >>>> qdisc could be running during the device deactivating process.
> >>>>
> >>>> The main process of changing channel number is as below:
> >>>>
> >>>> 1. dev_deactivate()
> >>>> 2. hns3 handware related setup
> >>>> 3. netif_set_real_num_tx_queues()
> >>>> 4. netif_tx_wake_all_queues()
> >>>> 5. dev_activate()
> >>>>
> >>>> During step 1, qdisc could be running while qdisc is resetting, so
> >>>> there could be skb left in the old qdisc(which will be restored back to
> >>>> txq->qdisc during dev_activate()), as below:
> >>>>
> >>>> CPU 0: CPU1:
> >>>> __dev_queue_xmit(): dev_deactivate_many():
> >>>> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
> >>>> q = rcu_dereference_bh(txq->qdisc); .
> >>>> netdev_core_pick_tx(dev, skb, sb_dev); .
> >>>> .
> >>>> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> >>>> . .
> >>>> . .
> >>>> . .
> >>>> . .
> >>>> q->enqueue() .
> >>>
> >>>
> >>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
> >>> there would be no packet queued after qdisc_deactivate().
> >>
> >> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> >> otherwise there is still window between setting and testing the deactivated bit.
> >
> > Can you be more specific here? Why testing or setting a bit is not atomic?
>
> testing a bit or setting a bit separately is atomic.
> But testing a bit and setting a bit is not atomic, right?
>
> cpu0: cpu1:
> testing A bit
> setting A bit .
> . .
> . qdisc enqueuing
> qdisc reset
>

Well, this was not a problem until commit d518d2ed8640c1cbbb.
Prior to that commit, qdsic can still be scheduled even with this
race condition, that is, the packet just enqueued after resetting can
still be dequeued with qdisc_run().

It is amazing to see how messy the lockless qdisc is now.

Thanks.

2020-09-03 07:26:25

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc



On 9/2/20 6:14 PM, Yunsheng Lin wrote:

>
> It seems semantics for some_qdisc_is_busy() is changed, which does not only do
> the checking, but also do the reseting?

Yes, obviously, we would have to rename to a better name.

>
> Also, qdisc_reset() could be called multi times for the same qdisc if some_qdisc_is_busy()
> return true multi times?

This should not matter, qdisc_reset() can be called multiple times,
as we also call it from qdisc_destroy() anyway.

2020-09-04 01:31:56

by John Fastabend

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

Cong Wang wrote:
> On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <[email protected]> wrote:
> >
> > On 2020/9/3 9:48, Cong Wang wrote:
> > > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <[email protected]> wrote:
> > >>
> > >> On 2020/9/3 8:35, Cong Wang wrote:
> > >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
> > >>>>
> > >>>> On 2020/9/2 12:41, Cong Wang wrote:
> > >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
> > >>>>>>
> > >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> > >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> > >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> > >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> > >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> > >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> > >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> > >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> > >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> > >>>>>>>
> > >>>>>>> Can you be more specific here? Which call path requests a smaller
> > >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > >>>>>>> we already have a synchronize_net() there.
> > >>>>>>
> > >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> > >>>>>> do the correct work, as below:
> > >>>>>>
> > >>>>>> CPU 0: CPU1:
> > >>>>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
> > >>>>>> rcu_read_lock_bh();
> > >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> > >>>>>> .
> > >>>>>> . dev->real_num_tx_queues = txq;
> > >>>>>> . .
> > >>>>>> . .
> > >>>>>> . synchronize_net();
> > >>>>>> . .
> > >>>>>> q->enqueue() .
> > >>>>>> . .
> > >>>>>> rcu_read_unlock_bh() .
> > >>>>>> qdisc_reset_all_tx_gt
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> Right.
> > >>>>>
> > >>>>>
> > >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> > >>>>>> too.
> > >>>>>>
> > >>>>>> The problem we hit is as below:
> > >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> > >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> > >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> > >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> > >>>>>> netif_set_real_num_tx_queues() does not help here.
> > >>>>>
> > >>>>> How could qdisc still be running after deactivating the device?
> > >>>>
> > >>>> qdisc could be running during the device deactivating process.
> > >>>>
> > >>>> The main process of changing channel number is as below:
> > >>>>
> > >>>> 1. dev_deactivate()
> > >>>> 2. hns3 handware related setup
> > >>>> 3. netif_set_real_num_tx_queues()
> > >>>> 4. netif_tx_wake_all_queues()
> > >>>> 5. dev_activate()
> > >>>>
> > >>>> During step 1, qdisc could be running while qdisc is resetting, so
> > >>>> there could be skb left in the old qdisc(which will be restored back to
> > >>>> txq->qdisc during dev_activate()), as below:
> > >>>>
> > >>>> CPU 0: CPU1:
> > >>>> __dev_queue_xmit(): dev_deactivate_many():
> > >>>> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
> > >>>> q = rcu_dereference_bh(txq->qdisc); .
> > >>>> netdev_core_pick_tx(dev, skb, sb_dev); .
> > >>>> .
> > >>>> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> > >>>> . .
> > >>>> . .
> > >>>> . .
> > >>>> . .
> > >>>> q->enqueue() .
> > >>>
> > >>>
> > >>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > >>> there would be no packet queued after qdisc_deactivate().

Trying to unwind this through git history :/

Original code had a test_bit in dev_xmit_skb(),

if (q->flags & TCQ_F_NOLOCK) {
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
__qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
} else {
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
__qdisc_run(q);
}

if (unlikely(to_free))
kfree_skb_list(to_free);
return rc;
}

So we would never enqueue something on top of a deactivated qdisc. And to ensure
we don't have any in-flight enqueues we have to swap qdiscs, wait a grace
period, and then reset the qdisc. That _should_ be OK.

But, I'm still not entirely sure how you got here. In the drivers I did I always
stop the queue before messing with these things with netif_tx_stop_queue(). Then
we really should not get these packets into the driver.

In sch_direct_xmit():

if (likely(skb)) {
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (!netif_xmit_frozen_or_stopped(txq))
skb = dev_hard_start_xmit(skb, dev, txq, &ret);

HARD_TX_UNLOCK(dev, txq);
} else {
if (root_lock)
spin_lock(root_lock);
return true;
}

Maybe I missed something? Does your driver use the netif_tx_stop/start APIs?

> > >>
> > >> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> > >> otherwise there is still window between setting and testing the deactivated bit.
> > >
> > > Can you be more specific here? Why testing or setting a bit is not atomic?
> >
> > testing a bit or setting a bit separately is atomic.
> > But testing a bit and setting a bit is not atomic, right?
> >
> > cpu0: cpu1:
> > testing A bit
> > setting A bit .
> > . .
> > . qdisc enqueuing
> > qdisc reset
> >
>
> Well, this was not a problem until commit d518d2ed8640c1cbbb.
> Prior to that commit, qdsic can still be scheduled even with this
> race condition, that is, the packet just enqueued after resetting can
> still be dequeued with qdisc_run().
>
> It is amazing to see how messy the lockless qdisc is now.
>
> Thanks.


2020-09-04 08:11:16

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/4 9:30, John Fastabend wrote:
> Cong Wang wrote:
>> On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <[email protected]> wrote:
>>>
>>> On 2020/9/3 9:48, Cong Wang wrote:
>>>> On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <[email protected]> wrote:
>>>>>
>>>>> On 2020/9/3 8:35, Cong Wang wrote:
>>>>>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2020/9/2 12:41, Cong Wang wrote:
>>>>>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 2020/9/2 2:24, Cong Wang wrote:
>>>>>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Currently there is concurrent reset and enqueue operation for the
>>>>>>>>>>> same lockless qdisc when there is no lock to synchronize the
>>>>>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
>>>>>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
>>>>>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
>>>>>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
>>>>>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
>>>>>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
>>>>>>>>>>
>>>>>>>>>> Can you be more specific here? Which call path requests a smaller
>>>>>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
>>>>>>>>>> we already have a synchronize_net() there.
>>>>>>>>>
>>>>>>>>> When the netdevice is in active state, the synchronize_net() seems to
>>>>>>>>> do the correct work, as below:
>>>>>>>>>
>>>>>>>>> CPU 0: CPU1:
>>>>>>>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
>>>>>>>>> rcu_read_lock_bh();
>>>>>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
>>>>>>>>> .
>>>>>>>>> . dev->real_num_tx_queues = txq;
>>>>>>>>> . .
>>>>>>>>> . .
>>>>>>>>> . synchronize_net();
>>>>>>>>> . .
>>>>>>>>> q->enqueue() .
>>>>>>>>> . .
>>>>>>>>> rcu_read_unlock_bh() .
>>>>>>>>> qdisc_reset_all_tx_gt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Right.
>>>>>>>>
>>>>>>>>
>>>>>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
>>>>>>>>> too.
>>>>>>>>>
>>>>>>>>> The problem we hit is as below:
>>>>>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
>>>>>>>>> to deactive the netdevice when user requested a smaller queue num, and
>>>>>>>>> txq->qdisc is already changed to noop_qdisc when calling
>>>>>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
>>>>>>>>> netif_set_real_num_tx_queues() does not help here.
>>>>>>>>
>>>>>>>> How could qdisc still be running after deactivating the device?
>>>>>>>
>>>>>>> qdisc could be running during the device deactivating process.
>>>>>>>
>>>>>>> The main process of changing channel number is as below:
>>>>>>>
>>>>>>> 1. dev_deactivate()
>>>>>>> 2. hns3 handware related setup
>>>>>>> 3. netif_set_real_num_tx_queues()
>>>>>>> 4. netif_tx_wake_all_queues()
>>>>>>> 5. dev_activate()
>>>>>>>
>>>>>>> During step 1, qdisc could be running while qdisc is resetting, so
>>>>>>> there could be skb left in the old qdisc(which will be restored back to
>>>>>>> txq->qdisc during dev_activate()), as below:
>>>>>>>
>>>>>>> CPU 0: CPU1:
>>>>>>> __dev_queue_xmit(): dev_deactivate_many():
>>>>>>> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
>>>>>>> q = rcu_dereference_bh(txq->qdisc); .
>>>>>>> netdev_core_pick_tx(dev, skb, sb_dev); .
>>>>>>> .
>>>>>>> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
>>>>>>> . .
>>>>>>> . .
>>>>>>> . .
>>>>>>> . .
>>>>>>> q->enqueue() .
>>>>>>
>>>>>>
>>>>>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
>>>>>> there would be no packet queued after qdisc_deactivate().
>
> Trying to unwind this through git history :/
>
> Original code had a test_bit in dev_xmit_skb(),
>
> if (q->flags & TCQ_F_NOLOCK) {
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> } else {
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> __qdisc_run(q);
> }
>
> if (unlikely(to_free))
> kfree_skb_list(to_free);
> return rc;
> }
>
> So we would never enqueue something on top of a deactivated qdisc. And to ensure
> we don't have any in-flight enqueues we have to swap qdiscs, wait a grace
> period, and then reset the qdisc. That _should_ be OK.

Actually, the deactivated is not really needed to be checked after swapping qdiscs
and waiting a grace period, because __dev_xmit_skb() only see the noop_qdisc now,
so qdisc_reset() in safe from q->enqueue()?

>
> But, I'm still not entirely sure how you got here. In the drivers I did I always
> stop the queue before messing with these things with netif_tx_stop_queue(). Then
> we really should not get these packets into the driver.
>
> In sch_direct_xmit():
>
> if (likely(skb)) {
> HARD_TX_LOCK(dev, txq, smp_processor_id());
> if (!netif_xmit_frozen_or_stopped(txq))
> skb = dev_hard_start_xmit(skb, dev, txq, &ret);
>
> HARD_TX_UNLOCK(dev, txq);
> } else {
> if (root_lock)
> spin_lock(root_lock);
> return true;
> }
>
> Maybe I missed something? Does your driver use the netif_tx_stop/start APIs?

The hns3 driver uses netif_tx_stop_all_queues() in hns3_nic_net_stop() before
calling netif_carrier_off()

and call netif_tx_wake_all_queues() and netif_carrier_on() when link is detected
in hns3_link_status_change()

The main process of changing channel number is as below:
1. dev_deactivate()
2. hns3 handware related setup
3. netif_set_real_num_tx_queues()
4. netif_tx_wake_all_queues()
5. dev_activate()

During step 1, qdisc could be running while qdisc is resetting, so
there could be skb left in the old qdisc(which will be restored back to
txq->qdisc during dev_activate()), and the skb left in the old qdisc does not
get called with hns3's xmit function in step 1, but get called with hns3's xmit
function after step 5, because the old qdisc will be restored back to dev_queue->qdisc
after step 5, which still has the skb left.

Hope I did not misunderstand your question.

>
>>>>>
>>>>> Only if the deactivated bit testing is also protected by qdisc->seqlock?
>>>>> otherwise there is still window between setting and testing the deactivated bit.
>>>>
>>>> Can you be more specific here? Why testing or setting a bit is not atomic?
>>>
>>> testing a bit or setting a bit separately is atomic.
>>> But testing a bit and setting a bit is not atomic, right?
>>>
>>> cpu0: cpu1:
>>> testing A bit
>>> setting A bit .
>>> . .
>>> . qdisc enqueuing
>>> qdisc reset
>>>
>>
>> Well, this was not a problem until commit d518d2ed8640c1cbbb.
>> Prior to that commit, qdsic can still be scheduled even with this
>> race condition, that is, the packet just enqueued after resetting can
>> still be dequeued with qdisc_run().
>>
>> It is amazing to see how messy the lockless qdisc is now.
>>
>> Thanks.
>
>
> .
>

2020-09-04 08:12:01

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

On 2020/9/3 15:24, Eric Dumazet wrote:
>
>
> On 9/2/20 6:14 PM, Yunsheng Lin wrote:
>
>>
>> It seems semantics for some_qdisc_is_busy() is changed, which does not only do
>> the checking, but also do the reseting?
>
> Yes, obviously, we would have to rename to a better name.
>
>>
>> Also, qdisc_reset() could be called multi times for the same qdisc if some_qdisc_is_busy()
>> return true multi times?
>
> This should not matter, qdisc_reset() can be called multiple times,
> as we also call it from qdisc_destroy() anyway.

How about the below patch, which does not need to change the semantics
for some_qdisc_is_busy() and avoid calling qdisc_reset() multi times?


diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d..ce9031c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1131,24 +1131,7 @@ EXPORT_SYMBOL(dev_activate);

static void qdisc_deactivate(struct Qdisc *qdisc)
{
- bool nolock = qdisc->flags & TCQ_F_NOLOCK;
-
- if (qdisc->flags & TCQ_F_BUILTIN)
- return;
- if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
- return;
-
- if (nolock)
- spin_lock_bh(&qdisc->seqlock);
- spin_lock_bh(qdisc_lock(qdisc));
-
set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
-
- qdisc_reset(qdisc);
-
- spin_unlock_bh(qdisc_lock(qdisc));
- if (nolock)
- spin_unlock_bh(&qdisc->seqlock);
}

static void dev_deactivate_queue(struct net_device *dev,
@@ -1165,6 +1148,33 @@ static void dev_deactivate_queue(struct net_device *dev,
}
}

+static void dev_reset_qdisc(struct net_device *dev)
+{
+ unsigned int i;
+
+ for (i = 0; i < dev->num_tx_queues; i++) {
+ struct netdev_queue *dev_queue;
+ struct Qdisc *q;
+ bool nolock;
+
+ dev_queue = netdev_get_tx_queue(dev, i);
+ q = dev_queue->qdisc_sleeping;
+ nolock = q->flags & TCQ_F_NOLOCK;
+
+ if (nolock)
+ spin_lock_bh(&q->seqlock);
+
+ spin_lock_bh(qdisc_lock(q));
+
+ qdisc_reset(q);
+
+ spin_unlock_bh(qdisc_lock(q));
+
+ if (nolock)
+ spin_unlock_bh(&q->seqlock);
+ }
+}
+
static bool some_qdisc_is_busy(struct net_device *dev)
{
unsigned int i;
@@ -1219,6 +1229,9 @@ void dev_deactivate_many(struct list_head *head)
*/
synchronize_net();

+ list_for_each_entry(dev, head, close_list)
+ dev_reset_qdisc(dev);
+
/* Wait for outstanding qdisc_run calls. */
list_for_each_entry(dev, head, close_list) {
while (some_qdisc_is_busy(dev)) {



>
>

2020-09-06 09:07:58

by Chen, Rong A

[permalink] [raw]
Subject: [net] 6fd0d0dede: hwsim.ap_ht40_5ghz_switch.fail

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 6fd0d0deded94645d8cb96f93c26ad55cd92f6a5 ("[PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc")
url: https://github.com/0day-ci/linux/commits/Yunsheng-Lin/net-sch_generic-aviod-concurrent-reset-and-enqueue-op-for-lockless-qdisc/20200901-085941
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git 10eb4667946068e34f0cde1485f030aa68c89275

in testcase: hwsim
version: hwsim-x86_64-6eb6cf0-1_20200619
with following parameters:

group: hwsim-03
ucode: 0x21



on test machine: 8 threads Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz with 16G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):




If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>



2020-09-05 21:47:27 ./run-tests.py ap_ht40_5ghz_switch
DEV: wlan0: 02:00:00:00:00:00
DEV: wlan1: 02:00:00:00:01:00
DEV: wlan2: 02:00:00:00:02:00
APDEV: wlan3
APDEV: wlan4
START ap_ht40_5ghz_switch 1/1
Test: HT40 co-ex scan on 5 GHz switching pri/sec channel
Starting AP wlan4
Starting AP wlan3
Connect STA wlan0 to AP
wlan0: Country code not reset back to 00: is US
wlan0: Country code cleared back to 00
FAIL ap_ht40_5ghz_switch 3.614883 2020-09-05 21:47:31.016291
passed 0 test case(s)
skipped 0 test case(s)
failed tests: ap_ht40_5ghz_switch



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml



Thanks,
Rong Chen


Attachments:
(No filename) (1.66 kB)
config-5.9.0-rc1-00658-g6fd0d0deded94 (172.97 kB)
job-script (5.68 kB)
kmsg.xz (118.41 kB)
hwsim (46.44 kB)
job.yaml (4.64 kB)
reproduce (4.50 kB)
Download all attachments