2023-07-26 18:56:49

by Ira Weiny

[permalink] [raw]
Subject: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race

The following debug object splat was observed in testing.

[ 14.061937] ------------[ cut here ]------------
[ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
[ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
...
[ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
[ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
...
[ 14.116231] Call Trace:
[ 14.117652] <TASK>
[ 14.118958] ? debug_print_object+0x7d/0xb0
[ 14.120782] ? __warn+0x7d/0x130
[ 14.122399] ? debug_print_object+0x7d/0xb0
[ 14.123746] ? report_bug+0x18d/0x1c0
[ 14.125025] ? handle_bug+0x3c/0x80
[ 14.126506] ? exc_invalid_op+0x13/0x60
[ 14.127796] ? asm_exc_invalid_op+0x16/0x20
[ 14.129380] ? debug_print_object+0x7d/0xb0
[ 14.130688] ? debug_print_object+0x7d/0xb0
[ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10
[ 14.133597] debug_object_free.part.0+0x11b/0x150
[ 14.134940] doe_statemachine_work+0x45e/0x510
[ 14.136348] process_one_work+0x1d4/0x3c0
...
[ 14.161484] </TASK>
[ 14.162434] ---[ end trace 0000000000000000 ]---

This occurs because destroy_work_on_stack() was called after signaling
the completion in the calling thread. This creates a race between
destroy_work_on_stack() and the task->work struct going of scope in the
pci_doe().

Signal the work complete after destroying the work struct. This is safe
because signal_task_complete() is the final thing the work item does and
the workqueue code is careful not to access the work struct after.

Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
Cc: Lukas Wunner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
drivers/pci/doe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 1b97a5ab71a9..e3aab5edaf70 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas
static void signal_task_complete(struct pci_doe_task *task, int rv)
{
task->rv = rv;
- task->complete(task);
destroy_work_on_stack(&task->work);
+ task->complete(task);
}

static void signal_task_abort(struct pci_doe_task *task, int rv)

---
base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
change-id: 20230726-doe-fix-f57943f9ea82

Best regards,
--
Ira Weiny <[email protected]>



2023-07-27 08:04:32

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race

On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> The following debug object splat was observed in testing.
[...]
> This occurs because destroy_work_on_stack() was called after signaling
> the completion in the calling thread. This creates a race between
> destroy_work_on_stack() and the task->work struct going of scope in the
> pci_doe().
>
> Signal the work complete after destroying the work struct. This is safe
> because signal_task_complete() is the final thing the work item does and
> the workqueue code is careful not to access the work struct after.
>
> Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> Cc: Lukas Wunner <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Lukas Wunner <[email protected]>

Thanks for catching this. The offending commit abf04be0e707 was applied
by Dan. Not sure if that means he's going to apply this fix as well?
Would require an ack from Bjorn in that case. Or Bjorn applies it.

Thanks,

Lukas

2023-07-27 17:24:59

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race

On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> The following debug object splat was observed in testing.
>
> [ 14.061937] ------------[ cut here ]------------
> [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
> [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
> ...
> [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
> [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
> ...
> [ 14.116231] Call Trace:
> [ 14.117652] <TASK>
> [ 14.118958] ? debug_print_object+0x7d/0xb0
> [ 14.120782] ? __warn+0x7d/0x130
> [ 14.122399] ? debug_print_object+0x7d/0xb0
> [ 14.123746] ? report_bug+0x18d/0x1c0
> [ 14.125025] ? handle_bug+0x3c/0x80
> [ 14.126506] ? exc_invalid_op+0x13/0x60
> [ 14.127796] ? asm_exc_invalid_op+0x16/0x20
> [ 14.129380] ? debug_print_object+0x7d/0xb0
> [ 14.130688] ? debug_print_object+0x7d/0xb0
> [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10
> [ 14.133597] debug_object_free.part.0+0x11b/0x150
> [ 14.134940] doe_statemachine_work+0x45e/0x510
> [ 14.136348] process_one_work+0x1d4/0x3c0
> ...
> [ 14.161484] </TASK>
> [ 14.162434] ---[ end trace 0000000000000000 ]---
>
> This occurs because destroy_work_on_stack() was called after signaling
> the completion in the calling thread. This creates a race between
> destroy_work_on_stack() and the task->work struct going of scope in the
> pci_doe().
>
> Signal the work complete after destroying the work struct. This is safe
> because signal_task_complete() is the final thing the work item does and
> the workqueue code is careful not to access the work struct after.
>
> Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> Cc: Lukas Wunner <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Acked-by: Bjorn Helgaas <[email protected]>

Dan, let me know if you'd rather have me take this.

> ---
> drivers/pci/doe.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 1b97a5ab71a9..e3aab5edaf70 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas
> static void signal_task_complete(struct pci_doe_task *task, int rv)
> {
> task->rv = rv;
> - task->complete(task);
> destroy_work_on_stack(&task->work);
> + task->complete(task);
> }
>
> static void signal_task_abort(struct pci_doe_task *task, int rv)
>
> ---
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> change-id: 20230726-doe-fix-f57943f9ea82
>
> Best regards,
> --
> Ira Weiny <[email protected]>
>

2023-07-27 20:44:38

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race

On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> The following debug object splat was observed in testing.
>
> [ 14.061937] ------------[ cut here ]------------
> [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
> [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
> ...
> [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
> [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
> ...
> [ 14.116231] Call Trace:
> [ 14.117652] <TASK>
> [ 14.118958] ? debug_print_object+0x7d/0xb0
> [ 14.120782] ? __warn+0x7d/0x130
> [ 14.122399] ? debug_print_object+0x7d/0xb0
> [ 14.123746] ? report_bug+0x18d/0x1c0
> [ 14.125025] ? handle_bug+0x3c/0x80
> [ 14.126506] ? exc_invalid_op+0x13/0x60
> [ 14.127796] ? asm_exc_invalid_op+0x16/0x20
> [ 14.129380] ? debug_print_object+0x7d/0xb0
> [ 14.130688] ? debug_print_object+0x7d/0xb0
> [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10
> [ 14.133597] debug_object_free.part.0+0x11b/0x150
> [ 14.134940] doe_statemachine_work+0x45e/0x510
> [ 14.136348] process_one_work+0x1d4/0x3c0
> ...
> [ 14.161484] </TASK>
> [ 14.162434] ---[ end trace 0000000000000000 ]---
>
> This occurs because destroy_work_on_stack() was called after signaling
> the completion in the calling thread. This creates a race between
> destroy_work_on_stack() and the task->work struct going of scope in the
> pci_doe().
>
> Signal the work complete after destroying the work struct. This is safe
> because signal_task_complete() is the final thing the work item does and
> the workqueue code is careful not to access the work struct after.
>
> Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> Cc: Lukas Wunner <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Thanks, applied to pci/misc with Lukas' reviewed-by and Dan's ack for
v6.6. I edited out the timestamps and some of the call trace from the
splat because they didn't seem relevant.

> ---
> drivers/pci/doe.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 1b97a5ab71a9..e3aab5edaf70 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas
> static void signal_task_complete(struct pci_doe_task *task, int rv)
> {
> task->rv = rv;
> - task->complete(task);
> destroy_work_on_stack(&task->work);
> + task->complete(task);
> }
>
> static void signal_task_abort(struct pci_doe_task *task, int rv)
>
> ---
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> change-id: 20230726-doe-fix-f57943f9ea82
>
> Best regards,
> --
> Ira Weiny <[email protected]>
>

2023-07-27 21:06:06

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race

Bjorn Helgaas wrote:
> On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> > The following debug object splat was observed in testing.
> >
> > [ 14.061937] ------------[ cut here ]------------
> > [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
> > [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
> > ...
> > [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
> > [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
> > ...
> > [ 14.116231] Call Trace:
> > [ 14.117652] <TASK>
> > [ 14.118958] ? debug_print_object+0x7d/0xb0
> > [ 14.120782] ? __warn+0x7d/0x130
> > [ 14.122399] ? debug_print_object+0x7d/0xb0
> > [ 14.123746] ? report_bug+0x18d/0x1c0
> > [ 14.125025] ? handle_bug+0x3c/0x80
> > [ 14.126506] ? exc_invalid_op+0x13/0x60
> > [ 14.127796] ? asm_exc_invalid_op+0x16/0x20
> > [ 14.129380] ? debug_print_object+0x7d/0xb0
> > [ 14.130688] ? debug_print_object+0x7d/0xb0
> > [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10
> > [ 14.133597] debug_object_free.part.0+0x11b/0x150
> > [ 14.134940] doe_statemachine_work+0x45e/0x510
> > [ 14.136348] process_one_work+0x1d4/0x3c0
> > ...
> > [ 14.161484] </TASK>
> > [ 14.162434] ---[ end trace 0000000000000000 ]---
> >
> > This occurs because destroy_work_on_stack() was called after signaling
> > the completion in the calling thread. This creates a race between
> > destroy_work_on_stack() and the task->work struct going of scope in the
> > pci_doe().
> >
> > Signal the work complete after destroying the work struct. This is safe
> > because signal_task_complete() is the final thing the work item does and
> > the workqueue code is careful not to access the work struct after.
> >
> > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> > Cc: Lukas Wunner <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> Acked-by: Bjorn Helgaas <[email protected]>
> Dan, let me know if you'd rather have me take this.

I have nothing else in the DOE area at this time, so I would be happy if
you took it. It is self contained to drivers/pci/.

Much appreciated.

Acked-by: Dan Williams <[email protected]>

2023-07-27 22:19:14

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race

Bjorn Helgaas wrote:
> On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> > The following debug object splat was observed in testing.
> >
> > [ 14.061937] ------------[ cut here ]------------
> > [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
> > [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
> > ...
> > [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
> > [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
> > ...
> > [ 14.116231] Call Trace:
> > [ 14.117652] <TASK>
> > [ 14.118958] ? debug_print_object+0x7d/0xb0
> > [ 14.120782] ? __warn+0x7d/0x130
> > [ 14.122399] ? debug_print_object+0x7d/0xb0
> > [ 14.123746] ? report_bug+0x18d/0x1c0
> > [ 14.125025] ? handle_bug+0x3c/0x80
> > [ 14.126506] ? exc_invalid_op+0x13/0x60
> > [ 14.127796] ? asm_exc_invalid_op+0x16/0x20
> > [ 14.129380] ? debug_print_object+0x7d/0xb0
> > [ 14.130688] ? debug_print_object+0x7d/0xb0
> > [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10
> > [ 14.133597] debug_object_free.part.0+0x11b/0x150
> > [ 14.134940] doe_statemachine_work+0x45e/0x510
> > [ 14.136348] process_one_work+0x1d4/0x3c0
> > ...
> > [ 14.161484] </TASK>
> > [ 14.162434] ---[ end trace 0000000000000000 ]---
> >
> > This occurs because destroy_work_on_stack() was called after signaling
> > the completion in the calling thread. This creates a race between
> > destroy_work_on_stack() and the task->work struct going of scope in the
> > pci_doe().
> >
> > Signal the work complete after destroying the work struct. This is safe
> > because signal_task_complete() is the final thing the work item does and
> > the workqueue code is careful not to access the work struct after.
> >
> > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> > Cc: Lukas Wunner <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> Thanks, applied to pci/misc with Lukas' reviewed-by and Dan's ack for
> v6.6. I edited out the timestamps and some of the call trace from the
> splat because they didn't seem relevant.
>

Thanks. I'll make sure to remove the timestamps in future.
Ira