2023-04-12 01:12:05

by Ding Hui

[permalink] [raw]
Subject: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

There is a use-after-free scenario that is:

When netif_running() is false, user set mac address or vlan tag to VF,
the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
and efx_net_open(), since netif_running() is false, the port will not
start and keep port_enabled false, but selftest_worker is scheduled
in efx_net_open().

If we remove the device before selftest_worker run, the efx is freed,
then we will get a UAF in run_timer_softirq() like this:

[ 1178.907941] ==================================================================
[ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
[ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
[ 1178.907950]
[ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
[ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
[ 1178.907955] Call Trace:
[ 1178.907956] <IRQ>
[ 1178.907960] dump_stack+0x71/0xab
[ 1178.907963] print_address_description+0x6b/0x290
[ 1178.907965] ? run_timer_softirq+0xdea/0xe90
[ 1178.907967] kasan_report+0x14a/0x2b0
[ 1178.907968] run_timer_softirq+0xdea/0xe90
[ 1178.907971] ? init_timer_key+0x170/0x170
[ 1178.907973] ? hrtimer_cancel+0x20/0x20
[ 1178.907976] ? sched_clock+0x5/0x10
[ 1178.907978] ? sched_clock_cpu+0x18/0x170
[ 1178.907981] __do_softirq+0x1c8/0x5fa
[ 1178.907985] irq_exit+0x213/0x240
[ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
[ 1178.907989] apic_timer_interrupt+0xf/0x20
[ 1178.907990] </IRQ>
[ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370

I am thinking about several ways to fix the issue:

[1] In this RFC, I cancel the selftest_worker unconditionally in
efx_pci_remove().

[2] Add a test condition, only invoke efx_selftest_async_start() when
efx->port_enabled is true in efx_net_open().

[3] Move invoking efx_selftest_async_start() from efx_net_open() to
efx_start_all() or efx_start_port(), that matching cancel action in
efx_stop_port().

[4] However, I also notice that in efx_ef10_set_mac_address(), the
efx_net_open() depends on original port_enabled, but others are not,
if we change all efx_net_open() depends on old state like
efx_ef10_set_mac_address() does, the UAF can also be fixed in theory.

But I'm not sure which is better, is there any suggestions? Thanks.

Signed-off-by: Ding Hui <[email protected]>
---
drivers/net/ethernet/sfc/efx.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 884d8d168862..dd0b2363eed1 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -876,6 +876,8 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
efx->state = STATE_UNINIT;
rtnl_unlock();

+ efx_selftest_async_cancel(efx);
+
if (efx->type->sriov_fini)
efx->type->sriov_fini(efx);

--
2.17.1


2023-04-12 22:35:13

by Jacob Keller

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work



On 4/11/2023 5:50 PM, Ding Hui wrote:
> There is a use-after-free scenario that is:
>
> When netif_running() is false, user set mac address or vlan tag to VF,
> the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
> and efx_net_open(), since netif_running() is false, the port will not
> start and keep port_enabled false, but selftest_worker is scheduled
> in efx_net_open().
>
> If we remove the device before selftest_worker run, the efx is freed,
> then we will get a UAF in run_timer_softirq() like this:
>
> [ 1178.907941] ==================================================================
> [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
> [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
> [ 1178.907950]
> [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
> [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
> [ 1178.907955] Call Trace:
> [ 1178.907956] <IRQ>
> [ 1178.907960] dump_stack+0x71/0xab
> [ 1178.907963] print_address_description+0x6b/0x290
> [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
> [ 1178.907967] kasan_report+0x14a/0x2b0
> [ 1178.907968] run_timer_softirq+0xdea/0xe90
> [ 1178.907971] ? init_timer_key+0x170/0x170
> [ 1178.907973] ? hrtimer_cancel+0x20/0x20
> [ 1178.907976] ? sched_clock+0x5/0x10
> [ 1178.907978] ? sched_clock_cpu+0x18/0x170
> [ 1178.907981] __do_softirq+0x1c8/0x5fa
> [ 1178.907985] irq_exit+0x213/0x240
> [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
> [ 1178.907989] apic_timer_interrupt+0xf/0x20
> [ 1178.907990] </IRQ>
> [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
>
> I am thinking about several ways to fix the issue:
>
> [1] In this RFC, I cancel the selftest_worker unconditionally in
> efx_pci_remove().
>
> [2] Add a test condition, only invoke efx_selftest_async_start() when
> efx->port_enabled is true in efx_net_open().
>
> [3] Move invoking efx_selftest_async_start() from efx_net_open() to
> efx_start_all() or efx_start_port(), that matching cancel action in
> efx_stop_port().
>
> [4] However, I also notice that in efx_ef10_set_mac_address(), the
> efx_net_open() depends on original port_enabled, but others are not,
> if we change all efx_net_open() depends on old state like
> efx_ef10_set_mac_address() does, the UAF can also be fixed in theory.
>
> But I'm not sure which is better, is there any suggestions? Thanks.
>

I think this fix makes the most sense to me.

> Signed-off-by: Ding Hui <[email protected]>
> ---

net patches need a Fixes tag indicating what commit this fixes. This
being RFC is likely why that was left off?

> drivers/net/ethernet/sfc/efx.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 884d8d168862..dd0b2363eed1 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -876,6 +876,8 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
> efx->state = STATE_UNINIT;
> rtnl_unlock();
>
> + efx_selftest_async_cancel(efx);
> +
> if (efx->type->sriov_fini)
> efx->type->sriov_fini(efx);
>

2023-04-13 01:19:19

by Ding Hui

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On 2023/4/13 6:34, Jacob Keller wrote:
>
>
> On 4/11/2023 5:50 PM, Ding Hui wrote:
>> There is a use-after-free scenario that is:
>>
>> When netif_running() is false, user set mac address or vlan tag to VF,
>> the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
>> and efx_net_open(), since netif_running() is false, the port will not
>> start and keep port_enabled false, but selftest_worker is scheduled
>> in efx_net_open().
>>
>> If we remove the device before selftest_worker run, the efx is freed,
>> then we will get a UAF in run_timer_softirq() like this:
>>
>> [ 1178.907941] ==================================================================
>> [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
>> [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
>> [ 1178.907950]
>> [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
>> [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
>> [ 1178.907955] Call Trace:
>> [ 1178.907956] <IRQ>
>> [ 1178.907960] dump_stack+0x71/0xab
>> [ 1178.907963] print_address_description+0x6b/0x290
>> [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
>> [ 1178.907967] kasan_report+0x14a/0x2b0
>> [ 1178.907968] run_timer_softirq+0xdea/0xe90
>> [ 1178.907971] ? init_timer_key+0x170/0x170
>> [ 1178.907973] ? hrtimer_cancel+0x20/0x20
>> [ 1178.907976] ? sched_clock+0x5/0x10
>> [ 1178.907978] ? sched_clock_cpu+0x18/0x170
>> [ 1178.907981] __do_softirq+0x1c8/0x5fa
>> [ 1178.907985] irq_exit+0x213/0x240
>> [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
>> [ 1178.907989] apic_timer_interrupt+0xf/0x20
>> [ 1178.907990] </IRQ>
>> [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
>>
>> I am thinking about several ways to fix the issue:
>>
>> [1] In this RFC, I cancel the selftest_worker unconditionally in
>> efx_pci_remove().
>>
>> [2] Add a test condition, only invoke efx_selftest_async_start() when
>> efx->port_enabled is true in efx_net_open().
>>
>> [3] Move invoking efx_selftest_async_start() from efx_net_open() to
>> efx_start_all() or efx_start_port(), that matching cancel action in
>> efx_stop_port().
>>
>> [4] However, I also notice that in efx_ef10_set_mac_address(), the
>> efx_net_open() depends on original port_enabled, but others are not,
>> if we change all efx_net_open() depends on old state like
>> efx_ef10_set_mac_address() does, the UAF can also be fixed in theory.
>>
>> But I'm not sure which is better, is there any suggestions? Thanks.
>>
>
> I think this fix makes the most sense to me.
>
>> Signed-off-by: Ding Hui <[email protected]>
>> ---
>
> net patches need a Fixes tag indicating what commit this fixes. This
> being RFC is likely why that was left off?
>

Thanks.

The commit dd40781e3a4e ("sfc: Run event/IRQ self-test asynchronously
when interface is brought up") add efx_selftest_async_start() into
efx_net_open(), it was okay then since efx_net_open() was only invoked
by the callback. Base on the original purpose of this commit, the
way [2][3] makes sense.

The commit e340be923012 ("sfc: add ndo_set_vf_mac() function for EF10")
first add efx_ef10_sriov_set_vf_mac(), it invoke efx_net_open(), then
this UAF scenario started.

I'll remove RFC and add Fixes in v2.

Fixes: e340be923012 ("sfc: add ndo_set_vf_mac() function for EF10")

>> drivers/net/ethernet/sfc/efx.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
>> index 884d8d168862..dd0b2363eed1 100644
>> --- a/drivers/net/ethernet/sfc/efx.c
>> +++ b/drivers/net/ethernet/sfc/efx.c
>> @@ -876,6 +876,8 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
>> efx->state = STATE_UNINIT;
>> rtnl_unlock();
>>
>> + efx_selftest_async_cancel(efx);
>> +
>> if (efx->type->sriov_fini)
>> efx->type->sriov_fini(efx);
>>
>

--
Thanks,
- Ding Hui

2023-04-13 07:42:20

by Martin Habets

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On Wed, Apr 12, 2023 at 08:50:13AM +0800, Ding Hui wrote:
> There is a use-after-free scenario that is:
>
> When netif_running() is false, user set mac address or vlan tag to VF,
> the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
> and efx_net_open(), since netif_running() is false, the port will not
> start and keep port_enabled false, but selftest_worker is scheduled
> in efx_net_open().
>
> If we remove the device before selftest_worker run, the efx is freed,
> then we will get a UAF in run_timer_softirq() like this:
>
> [ 1178.907941] ==================================================================
> [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
> [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
> [ 1178.907950]
> [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
> [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
> [ 1178.907955] Call Trace:
> [ 1178.907956] <IRQ>
> [ 1178.907960] dump_stack+0x71/0xab
> [ 1178.907963] print_address_description+0x6b/0x290
> [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
> [ 1178.907967] kasan_report+0x14a/0x2b0
> [ 1178.907968] run_timer_softirq+0xdea/0xe90
> [ 1178.907971] ? init_timer_key+0x170/0x170
> [ 1178.907973] ? hrtimer_cancel+0x20/0x20
> [ 1178.907976] ? sched_clock+0x5/0x10
> [ 1178.907978] ? sched_clock_cpu+0x18/0x170
> [ 1178.907981] __do_softirq+0x1c8/0x5fa
> [ 1178.907985] irq_exit+0x213/0x240
> [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
> [ 1178.907989] apic_timer_interrupt+0xf/0x20
> [ 1178.907990] </IRQ>
> [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
>
> I am thinking about several ways to fix the issue:
>
> [1] In this RFC, I cancel the selftest_worker unconditionally in
> efx_pci_remove().
>
> [2] Add a test condition, only invoke efx_selftest_async_start() when
> efx->port_enabled is true in efx_net_open().
>
> [3] Move invoking efx_selftest_async_start() from efx_net_open() to
> efx_start_all() or efx_start_port(), that matching cancel action in
> efx_stop_port().

I think moving this to efx_start_port() is best, as you say to match
the cancel in efx_stop_port().

Thanks,
Martin

>
> [4] However, I also notice that in efx_ef10_set_mac_address(), the
> efx_net_open() depends on original port_enabled, but others are not,
> if we change all efx_net_open() depends on old state like
> efx_ef10_set_mac_address() does, the UAF can also be fixed in theory.
>
> But I'm not sure which is better, is there any suggestions? Thanks.
>
> Signed-off-by: Ding Hui <[email protected]>
> ---
> drivers/net/ethernet/sfc/efx.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 884d8d168862..dd0b2363eed1 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -876,6 +876,8 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
> efx->state = STATE_UNINIT;
> rtnl_unlock();
>
> + efx_selftest_async_cancel(efx);
> +
> if (efx->type->sriov_fini)
> efx->type->sriov_fini(efx);
>
> --
> 2.17.1

2023-04-13 07:54:15

by Martin Habets

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On Wed, Apr 12, 2023 at 03:34:51PM -0700, Jacob Keller wrote:
>
>
> On 4/11/2023 5:50 PM, Ding Hui wrote:
> > There is a use-after-free scenario that is:
> >
> > When netif_running() is false, user set mac address or vlan tag to VF,
> > the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
> > and efx_net_open(), since netif_running() is false, the port will not
> > start and keep port_enabled false, but selftest_worker is scheduled
> > in efx_net_open().
> >
> > If we remove the device before selftest_worker run, the efx is freed,
> > then we will get a UAF in run_timer_softirq() like this:
> >
> > [ 1178.907941] ==================================================================
> > [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
> > [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
> > [ 1178.907950]
> > [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
> > [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
> > [ 1178.907955] Call Trace:
> > [ 1178.907956] <IRQ>
> > [ 1178.907960] dump_stack+0x71/0xab
> > [ 1178.907963] print_address_description+0x6b/0x290
> > [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
> > [ 1178.907967] kasan_report+0x14a/0x2b0
> > [ 1178.907968] run_timer_softirq+0xdea/0xe90
> > [ 1178.907971] ? init_timer_key+0x170/0x170
> > [ 1178.907973] ? hrtimer_cancel+0x20/0x20
> > [ 1178.907976] ? sched_clock+0x5/0x10
> > [ 1178.907978] ? sched_clock_cpu+0x18/0x170
> > [ 1178.907981] __do_softirq+0x1c8/0x5fa
> > [ 1178.907985] irq_exit+0x213/0x240
> > [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
> > [ 1178.907989] apic_timer_interrupt+0xf/0x20
> > [ 1178.907990] </IRQ>
> > [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
> >
> > I am thinking about several ways to fix the issue:
> >
> > [1] In this RFC, I cancel the selftest_worker unconditionally in
> > efx_pci_remove().
> >
> > [2] Add a test condition, only invoke efx_selftest_async_start() when
> > efx->port_enabled is true in efx_net_open().
> >
> > [3] Move invoking efx_selftest_async_start() from efx_net_open() to
> > efx_start_all() or efx_start_port(), that matching cancel action in
> > efx_stop_port().
> >
> > [4] However, I also notice that in efx_ef10_set_mac_address(), the
> > efx_net_open() depends on original port_enabled, but others are not,
> > if we change all efx_net_open() depends on old state like
> > efx_ef10_set_mac_address() does, the UAF can also be fixed in theory.
> >
> > But I'm not sure which is better, is there any suggestions? Thanks.
> >
>
> I think this fix makes the most sense to me.

For me this is too late. It leaves a gap where the selftest timer is still running
but the NIC has already stopped sending events. So we could still get a
failure "channel %d timed out waiting for event queue" from the selftest.

Martin

>
> > Signed-off-by: Ding Hui <[email protected]>
> > ---
>
> net patches need a Fixes tag indicating what commit this fixes. This
> being RFC is likely why that was left off?
>
> > drivers/net/ethernet/sfc/efx.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> > index 884d8d168862..dd0b2363eed1 100644
> > --- a/drivers/net/ethernet/sfc/efx.c
> > +++ b/drivers/net/ethernet/sfc/efx.c
> > @@ -876,6 +876,8 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
> > efx->state = STATE_UNINIT;
> > rtnl_unlock();
> >
> > + efx_selftest_async_cancel(efx);
> > +
> > if (efx->type->sriov_fini)
> > efx->type->sriov_fini(efx);
> >

2023-04-13 08:15:20

by Ding Hui

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On 2023/4/13 15:52, Martin Habets wrote:
> On Wed, Apr 12, 2023 at 03:34:51PM -0700, Jacob Keller wrote:
>>
>>
>> On 4/11/2023 5:50 PM, Ding Hui wrote:
>>> There is a use-after-free scenario that is:
>>>
>>> When netif_running() is false, user set mac address or vlan tag to VF,
>>> the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
>>> and efx_net_open(), since netif_running() is false, the port will not
>>> start and keep port_enabled false, but selftest_worker is scheduled
>>> in efx_net_open().
>>>
>>> If we remove the device before selftest_worker run, the efx is freed,
>>> then we will get a UAF in run_timer_softirq() like this:
>>>
>>> [ 1178.907941] ==================================================================
>>> [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
>>> [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
>>> [ 1178.907950]
>>> [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
>>> [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
>>> [ 1178.907955] Call Trace:
>>> [ 1178.907956] <IRQ>
>>> [ 1178.907960] dump_stack+0x71/0xab
>>> [ 1178.907963] print_address_description+0x6b/0x290
>>> [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
>>> [ 1178.907967] kasan_report+0x14a/0x2b0
>>> [ 1178.907968] run_timer_softirq+0xdea/0xe90
>>> [ 1178.907971] ? init_timer_key+0x170/0x170
>>> [ 1178.907973] ? hrtimer_cancel+0x20/0x20
>>> [ 1178.907976] ? sched_clock+0x5/0x10
>>> [ 1178.907978] ? sched_clock_cpu+0x18/0x170
>>> [ 1178.907981] __do_softirq+0x1c8/0x5fa
>>> [ 1178.907985] irq_exit+0x213/0x240
>>> [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
>>> [ 1178.907989] apic_timer_interrupt+0xf/0x20
>>> [ 1178.907990] </IRQ>
>>> [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
>>>
>>> I am thinking about several ways to fix the issue:
>>>
>>> [1] In this RFC, I cancel the selftest_worker unconditionally in
>>> efx_pci_remove().
>>>
>>> [2] Add a test condition, only invoke efx_selftest_async_start() when
>>> efx->port_enabled is true in efx_net_open().
>>>
>>> [3] Move invoking efx_selftest_async_start() from efx_net_open() to
>>> efx_start_all() or efx_start_port(), that matching cancel action in
>>> efx_stop_port().
>>>
>>> [4] However, I also notice that in efx_ef10_set_mac_address(), the
>>> efx_net_open() depends on original port_enabled, but others are not,
>>> if we change all efx_net_open() depends on old state like
>>> efx_ef10_set_mac_address() does, the UAF can also be fixed in theory.
>>>
>>> But I'm not sure which is better, is there any suggestions? Thanks.
>>>
>>
>> I think this fix makes the most sense to me.
>
> For me this is too late. It leaves a gap where the selftest timer is still running
> but the NIC has already stopped sending events. So we could still get a
> failure "channel %d timed out waiting for event queue" from the selftest.
>

Yes, assuming not consider removing, scheduled selftest_work if NIC not
brought up actually, we will also get this failure log.

--
Thanks,
- Ding Hui




2023-04-13 08:37:28

by Ding Hui

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On 2023/4/13 15:37, Martin Habets wrote:
> On Wed, Apr 12, 2023 at 08:50:13AM +0800, Ding Hui wrote:
>> There is a use-after-free scenario that is:
>>
>> When netif_running() is false, user set mac address or vlan tag to VF,
>> the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
>> and efx_net_open(), since netif_running() is false, the port will not
>> start and keep port_enabled false, but selftest_worker is scheduled
>> in efx_net_open().
>>
>> If we remove the device before selftest_worker run, the efx is freed,
>> then we will get a UAF in run_timer_softirq() like this:
>>
>> [ 1178.907941] ==================================================================
>> [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
>> [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
>> [ 1178.907950]
>> [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
>> [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
>> [ 1178.907955] Call Trace:
>> [ 1178.907956] <IRQ>
>> [ 1178.907960] dump_stack+0x71/0xab
>> [ 1178.907963] print_address_description+0x6b/0x290
>> [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
>> [ 1178.907967] kasan_report+0x14a/0x2b0
>> [ 1178.907968] run_timer_softirq+0xdea/0xe90
>> [ 1178.907971] ? init_timer_key+0x170/0x170
>> [ 1178.907973] ? hrtimer_cancel+0x20/0x20
>> [ 1178.907976] ? sched_clock+0x5/0x10
>> [ 1178.907978] ? sched_clock_cpu+0x18/0x170
>> [ 1178.907981] __do_softirq+0x1c8/0x5fa
>> [ 1178.907985] irq_exit+0x213/0x240
>> [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
>> [ 1178.907989] apic_timer_interrupt+0xf/0x20
>> [ 1178.907990] </IRQ>
>> [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
>>
>> I am thinking about several ways to fix the issue:
>>
>> [1] In this RFC, I cancel the selftest_worker unconditionally in
>> efx_pci_remove().
>>
>> [2] Add a test condition, only invoke efx_selftest_async_start() when
>> efx->port_enabled is true in efx_net_open().
>>
>> [3] Move invoking efx_selftest_async_start() from efx_net_open() to
>> efx_start_all() or efx_start_port(), that matching cancel action in
>> efx_stop_port().
>
> I think moving this to efx_start_port() is best, as you say to match
> the cancel in efx_stop_port().
>

If moving to efx_start_port(), should we worry about that IRQ_TIMEOUT
is still enough?

I'm not sure if there is a long time waiting from starting of schedule
selftest_work to the ending of efx_net_open().

--
Thanks,
- Ding Hui

2023-04-14 09:54:21

by Martin Habets

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On Thu, Apr 13, 2023 at 04:35:08PM +0800, Ding Hui wrote:
> On 2023/4/13 15:37, Martin Habets wrote:
> > On Wed, Apr 12, 2023 at 08:50:13AM +0800, Ding Hui wrote:
> > > There is a use-after-free scenario that is:
> > >
> > > When netif_running() is false, user set mac address or vlan tag to VF,
> > > the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
> > > and efx_net_open(), since netif_running() is false, the port will not
> > > start and keep port_enabled false, but selftest_worker is scheduled
> > > in efx_net_open().
> > >
> > > If we remove the device before selftest_worker run, the efx is freed,
> > > then we will get a UAF in run_timer_softirq() like this:
> > >
> > > [ 1178.907941] ==================================================================
> > > [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
> > > [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
> > > [ 1178.907950]
> > > [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
> > > [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
> > > [ 1178.907955] Call Trace:
> > > [ 1178.907956] <IRQ>
> > > [ 1178.907960] dump_stack+0x71/0xab
> > > [ 1178.907963] print_address_description+0x6b/0x290
> > > [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
> > > [ 1178.907967] kasan_report+0x14a/0x2b0
> > > [ 1178.907968] run_timer_softirq+0xdea/0xe90
> > > [ 1178.907971] ? init_timer_key+0x170/0x170
> > > [ 1178.907973] ? hrtimer_cancel+0x20/0x20
> > > [ 1178.907976] ? sched_clock+0x5/0x10
> > > [ 1178.907978] ? sched_clock_cpu+0x18/0x170
> > > [ 1178.907981] __do_softirq+0x1c8/0x5fa
> > > [ 1178.907985] irq_exit+0x213/0x240
> > > [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
> > > [ 1178.907989] apic_timer_interrupt+0xf/0x20
> > > [ 1178.907990] </IRQ>
> > > [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
> > >
> > > I am thinking about several ways to fix the issue:
> > >
> > > [1] In this RFC, I cancel the selftest_worker unconditionally in
> > > efx_pci_remove().
> > >
> > > [2] Add a test condition, only invoke efx_selftest_async_start() when
> > > efx->port_enabled is true in efx_net_open().
> > >
> > > [3] Move invoking efx_selftest_async_start() from efx_net_open() to
> > > efx_start_all() or efx_start_port(), that matching cancel action in
> > > efx_stop_port().
> >
> > I think moving this to efx_start_port() is best, as you say to match
> > the cancel in efx_stop_port().
> >
>
> If moving to efx_start_port(), should we worry about that IRQ_TIMEOUT
> is still enough?

1 second is a long time for a machine running code, so it does not worry me.

> I'm not sure if there is a long time waiting from starting of schedule
> selftest_work to the ending of efx_net_open().

I see your point. Looking at efx_start_all() there is the call to
efx_start_datapath() after the call to efx_net_open(), which takes a
relatively long time (well under 200ms though).
Logically it would be better to move efx_selftest_async_start() after this
call. What do you think?

The point here is that efx_start_all() calls efx_start_port() early, and
efx_stop_all() also calls efx_stop_port() early. The calling sequence is
correct but they are not the strict inverse of each other.

Martin

>
> --
> Thanks,
> - Ding Hui

2023-04-14 11:06:16

by Ding Hui

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On 2023/4/14 17:44, Martin Habets wrote:
> On Thu, Apr 13, 2023 at 04:35:08PM +0800, Ding Hui wrote:
>> On 2023/4/13 15:37, Martin Habets wrote:
>>> On Wed, Apr 12, 2023 at 08:50:13AM +0800, Ding Hui wrote:
>>>> There is a use-after-free scenario that is:
>>>>
>>>> When netif_running() is false, user set mac address or vlan tag to VF,
>>>> the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
>>>> and efx_net_open(), since netif_running() is false, the port will not
>>>> start and keep port_enabled false, but selftest_worker is scheduled
>>>> in efx_net_open().
>>>>
>>>> If we remove the device before selftest_worker run, the efx is freed,
>>>> then we will get a UAF in run_timer_softirq() like this:
>>>>
>>>> [ 1178.907941] ==================================================================
>>>> [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
>>>> [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
>>>> [ 1178.907950]
>>>> [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
>>>> [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
>>>> [ 1178.907955] Call Trace:
>>>> [ 1178.907956] <IRQ>
>>>> [ 1178.907960] dump_stack+0x71/0xab
>>>> [ 1178.907963] print_address_description+0x6b/0x290
>>>> [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
>>>> [ 1178.907967] kasan_report+0x14a/0x2b0
>>>> [ 1178.907968] run_timer_softirq+0xdea/0xe90
>>>> [ 1178.907971] ? init_timer_key+0x170/0x170
>>>> [ 1178.907973] ? hrtimer_cancel+0x20/0x20
>>>> [ 1178.907976] ? sched_clock+0x5/0x10
>>>> [ 1178.907978] ? sched_clock_cpu+0x18/0x170
>>>> [ 1178.907981] __do_softirq+0x1c8/0x5fa
>>>> [ 1178.907985] irq_exit+0x213/0x240
>>>> [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
>>>> [ 1178.907989] apic_timer_interrupt+0xf/0x20
>>>> [ 1178.907990] </IRQ>
>>>> [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
>>>>
>>>> I am thinking about several ways to fix the issue:
>>>>
>>>> [1] In this RFC, I cancel the selftest_worker unconditionally in
>>>> efx_pci_remove().
>>>>
>>>> [2] Add a test condition, only invoke efx_selftest_async_start() when
>>>> efx->port_enabled is true in efx_net_open().
>>>>
>>>> [3] Move invoking efx_selftest_async_start() from efx_net_open() to
>>>> efx_start_all() or efx_start_port(), that matching cancel action in
>>>> efx_stop_port().
>>>
>>> I think moving this to efx_start_port() is best, as you say to match
>>> the cancel in efx_stop_port().
>>>
>>
>> If moving to efx_start_port(), should we worry about that IRQ_TIMEOUT
>> is still enough?
>
> 1 second is a long time for a machine running code, so it does not worry me.
>
>> I'm not sure if there is a long time waiting from starting of schedule
>> selftest_work to the ending of efx_net_open().
>
> I see your point. Looking at efx_start_all() there is the call to
> efx_start_datapath() after the call to efx_net_open(), which takes a
^^^^^^^^^^^^
Do you mean efx_start_port()?

> relatively long time (well under 200ms though).
> Logically it would be better to move efx_selftest_async_start() after this
> call. What do you think?

Agree with you.

> The point here is that efx_start_all() calls efx_start_port() early, and
> efx_stop_all() also calls efx_stop_port() early. The calling sequence is
> correct but they are not the strict inverse of each other.
>

Yeah, that is what I noticed monitor_work does.
Then I'll move efx_selftest_async_start() into efx_start_all(), follows
the monitor_work.

--
Thanks,
- Ding Hui

2023-04-14 12:41:19

by Martin Habets

[permalink] [raw]
Subject: Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

On Fri, Apr 14, 2023 at 07:03:41PM +0800, Ding Hui wrote:
> On 2023/4/14 17:44, Martin Habets wrote:
> > On Thu, Apr 13, 2023 at 04:35:08PM +0800, Ding Hui wrote:
> > > On 2023/4/13 15:37, Martin Habets wrote:
> > > > On Wed, Apr 12, 2023 at 08:50:13AM +0800, Ding Hui wrote:
> > > > > There is a use-after-free scenario that is:
> > > > >
> > > > > When netif_running() is false, user set mac address or vlan tag to VF,
> > > > > the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
> > > > > and efx_net_open(), since netif_running() is false, the port will not
> > > > > start and keep port_enabled false, but selftest_worker is scheduled
> > > > > in efx_net_open().
> > > > >
> > > > > If we remove the device before selftest_worker run, the efx is freed,
> > > > > then we will get a UAF in run_timer_softirq() like this:
> > > > >
> > > > > [ 1178.907941] ==================================================================
> > > > > [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
> > > > > [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
> > > > > [ 1178.907950]
> > > > > [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
> > > > > [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
> > > > > [ 1178.907955] Call Trace:
> > > > > [ 1178.907956] <IRQ>
> > > > > [ 1178.907960] dump_stack+0x71/0xab
> > > > > [ 1178.907963] print_address_description+0x6b/0x290
> > > > > [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
> > > > > [ 1178.907967] kasan_report+0x14a/0x2b0
> > > > > [ 1178.907968] run_timer_softirq+0xdea/0xe90
> > > > > [ 1178.907971] ? init_timer_key+0x170/0x170
> > > > > [ 1178.907973] ? hrtimer_cancel+0x20/0x20
> > > > > [ 1178.907976] ? sched_clock+0x5/0x10
> > > > > [ 1178.907978] ? sched_clock_cpu+0x18/0x170
> > > > > [ 1178.907981] __do_softirq+0x1c8/0x5fa
> > > > > [ 1178.907985] irq_exit+0x213/0x240
> > > > > [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
> > > > > [ 1178.907989] apic_timer_interrupt+0xf/0x20
> > > > > [ 1178.907990] </IRQ>
> > > > > [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
> > > > >
> > > > > I am thinking about several ways to fix the issue:
> > > > >
> > > > > [1] In this RFC, I cancel the selftest_worker unconditionally in
> > > > > efx_pci_remove().
> > > > >
> > > > > [2] Add a test condition, only invoke efx_selftest_async_start() when
> > > > > efx->port_enabled is true in efx_net_open().
> > > > >
> > > > > [3] Move invoking efx_selftest_async_start() from efx_net_open() to
> > > > > efx_start_all() or efx_start_port(), that matching cancel action in
> > > > > efx_stop_port().
> > > >
> > > > I think moving this to efx_start_port() is best, as you say to match
> > > > the cancel in efx_stop_port().
> > > >
> > >
> > > If moving to efx_start_port(), should we worry about that IRQ_TIMEOUT
> > > is still enough?
> >
> > 1 second is a long time for a machine running code, so it does not worry me.
> >
> > > I'm not sure if there is a long time waiting from starting of schedule
> > > selftest_work to the ending of efx_net_open().
> >
> > I see your point. Looking at efx_start_all() there is the call to
> > efx_start_datapath() after the call to efx_net_open(), which takes a
> ^^^^^^^^^^^^
> Do you mean efx_start_port()?

Woops, yes that what I meant.

> > relatively long time (well under 200ms though).
> > Logically it would be better to move efx_selftest_async_start() after this
> > call. What do you think?
>
> Agree with you.
>
> > The point here is that efx_start_all() calls efx_start_port() early, and
> > efx_stop_all() also calls efx_stop_port() early. The calling sequence is
> > correct but they are not the strict inverse of each other.
> >
>
> Yeah, that is what I noticed monitor_work does.
> Then I'll move efx_selftest_async_start() into efx_start_all(), follows
> the monitor_work.

Sounds good.

Thanks,
Martin

> --
> Thanks,
> - Ding Hui