2021-11-12 16:54:19

by Ioanna Alifieraki

Subject: [PATCH] ipmi: Move remove_work to dedicated workqueue

Currently, when an ipmi_user is removed, the removal is deferred as a work
item on the system's workqueue. Although this guarantees the free operation
will occur in non-atomic context, it can race with removal of the
ipmi_msghandler module (see [1]). If a remove_work is scheduled and the
ipmi_msghandler module is removed shortly afterwards, the module can be
removed first; when the work then executes, the system crashes with:
BUG: unable to handle page fault for address: ffffffffc05c3450
PF: supervisor instruction fetch in kernel mode
PF: error_code(0x0010) - not-present page
because the module's pages are gone. In cleanup_ipmi() there is no easy
way to detect whether any such works are still pending so that they could
be flushed before removing the module. This patch creates a dedicated
workqueue and schedules the remove_work works on it. When the module is
removed, the workqueue is flushed to avoid the race.

[1] https://bugs.launchpad.net/bugs/1950666

Cc: [email protected]
Fixes: 3b9a907223d7 (ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier)
Signed-off-by: Ioanna Alifieraki <[email protected]>
---
drivers/char/ipmi/ipmi_msghandler.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index deed355422f4..9e0ad2ccd3e0 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -191,6 +191,8 @@ struct ipmi_user {
struct work_struct remove_work;
};

+struct workqueue_struct *remove_work_wq;
+
static struct ipmi_user *acquire_ipmi_user(struct ipmi_user *user, int *index)
__acquires(user->release_barrier)
{
@@ -1297,7 +1299,7 @@ static void free_user(struct kref *ref)
struct ipmi_user *user = container_of(ref, struct ipmi_user, refcount);

/* SRCU cleanup must happen in task context. */
- schedule_work(&user->remove_work);
+ queue_work(remove_work_wq, &user->remove_work);
}

static void _ipmi_destroy_user(struct ipmi_user *user)
@@ -5383,6 +5385,8 @@ static int ipmi_init_msghandler(void)

atomic_notifier_chain_register(&panic_notifier_list, &panic_block);

+ remove_work_wq = create_singlethread_workqueue("ipmi-msghandler-remove-wq");
+
initialized = true;

out:
@@ -5408,6 +5412,9 @@ static void __exit cleanup_ipmi(void)
int count;

if (initialized) {
+ flush_workqueue(remove_work_wq);
+ destroy_workqueue(remove_work_wq);
+
atomic_notifier_chain_unregister(&panic_notifier_list,
&panic_block);

--
2.17.1



2021-11-12 17:05:15

by Christophe JAILLET

Subject: Re: [PATCH] ipmi: Move remove_work to dedicated workqueue

On 12/11/2021 17:54, Ioanna Alifieraki wrote:
> Currently, when an ipmi_user is removed, the removal is deferred as a work
> item on the system's workqueue. Although this guarantees the free operation
> will occur in non-atomic context, it can race with removal of the
> ipmi_msghandler module (see [1]). If a remove_work is scheduled and the
> ipmi_msghandler module is removed shortly afterwards, the module can be
> removed first; when the work then executes, the system crashes with:
> BUG: unable to handle page fault for address: ffffffffc05c3450
> PF: supervisor instruction fetch in kernel mode
> PF: error_code(0x0010) - not-present page
> because the module's pages are gone. In cleanup_ipmi() there is no easy
> way to detect whether any such works are still pending so that they could
> be flushed before removing the module. This patch creates a dedicated
> workqueue and schedules the remove_work works on it. When the module is
> removed, the workqueue is flushed to avoid the race.
>
> [1] https://bugs.launchpad.net/bugs/1950666
>
> Cc: [email protected]
> Fixes: 3b9a907223d7 (ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier)
> Signed-off-by: Ioanna Alifieraki <[email protected]>
> ---
> drivers/char/ipmi/ipmi_msghandler.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> index deed355422f4..9e0ad2ccd3e0 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -191,6 +191,8 @@ struct ipmi_user {
> struct work_struct remove_work;
> };
>
> +struct workqueue_struct *remove_work_wq;
> +
> static struct ipmi_user *acquire_ipmi_user(struct ipmi_user *user, int *index)
> __acquires(user->release_barrier)
> {
> @@ -1297,7 +1299,7 @@ static void free_user(struct kref *ref)
> struct ipmi_user *user = container_of(ref, struct ipmi_user, refcount);
>
> /* SRCU cleanup must happen in task context. */
> - schedule_work(&user->remove_work);
> + queue_work(remove_work_wq, &user->remove_work);
> }
>
> static void _ipmi_destroy_user(struct ipmi_user *user)
> @@ -5383,6 +5385,8 @@ static int ipmi_init_msghandler(void)
>
> atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
>
> + remove_work_wq = create_singlethread_workqueue("ipmi-msghandler-remove-wq");
> +
> initialized = true;
>
> out:
> @@ -5408,6 +5412,9 @@ static void __exit cleanup_ipmi(void)
> int count;
>
> if (initialized) {
> + flush_workqueue(remove_work_wq);
> + destroy_workqueue(remove_work_wq);
> +

Hi,

There is no need to call 'flush_workqueue()' before 'destroy_workqueue()':
'destroy_workqueue()' already drains the queue before destroying it, so the
explicit flush is redundant.
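
I.e., in cleanup_ipmi() the pair of calls could collapse into a single one
(untested sketch, keeping the rest of the patch as posted):

	/*
	 * destroy_workqueue() drains any remove_work still pending on
	 * the queue before freeing it, so no separate flush is needed.
	 */
	destroy_workqueue(remove_work_wq);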

Just my 2c.

CJ

> atomic_notifier_chain_unregister(&panic_notifier_list,
> &panic_block);
>
>


2021-11-12 17:09:12

by Corey Minyard

Subject: Re: [PATCH] ipmi: Move remove_work to dedicated workqueue

On Fri, Nov 12, 2021 at 06:54:13PM +0200, Ioanna Alifieraki wrote:
> Currently, when an ipmi_user is removed, the removal is deferred as a work
> item on the system's workqueue. Although this guarantees the free operation
> will occur in non-atomic context, it can race with removal of the
> ipmi_msghandler module (see [1]). If a remove_work is scheduled and the
> ipmi_msghandler module is removed shortly afterwards, the module can be
> removed first; when the work then executes, the system crashes with:
> BUG: unable to handle page fault for address: ffffffffc05c3450
> PF: supervisor instruction fetch in kernel mode
> PF: error_code(0x0010) - not-present page
> because the module's pages are gone. In cleanup_ipmi() there is no easy
> way to detect whether any such works are still pending so that they could
> be flushed before removing the module. This patch creates a dedicated
> workqueue and schedules the remove_work works on it. When the module is
> removed, the workqueue is flushed to avoid the race.

Yeah, this is an issue. One comment below...

>
> [1] https://bugs.launchpad.net/bugs/1950666
>
> Cc: [email protected]
> Fixes: 3b9a907223d7 (ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier)
> Signed-off-by: Ioanna Alifieraki <[email protected]>
> ---
> drivers/char/ipmi/ipmi_msghandler.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> index deed355422f4..9e0ad2ccd3e0 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -191,6 +191,8 @@ struct ipmi_user {
> struct work_struct remove_work;
> };
>
> +struct workqueue_struct *remove_work_wq;
> +
> static struct ipmi_user *acquire_ipmi_user(struct ipmi_user *user, int *index)
> __acquires(user->release_barrier)
> {
> @@ -1297,7 +1299,7 @@ static void free_user(struct kref *ref)
> struct ipmi_user *user = container_of(ref, struct ipmi_user, refcount);
>
> /* SRCU cleanup must happen in task context. */
> - schedule_work(&user->remove_work);
> + queue_work(remove_work_wq, &user->remove_work);
> }
>
> static void _ipmi_destroy_user(struct ipmi_user *user)
> @@ -5383,6 +5385,8 @@ static int ipmi_init_msghandler(void)
>
> atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
>
> + remove_work_wq = create_singlethread_workqueue("ipmi-msghandler-remove-wq");
> +

Shouldn't you check the return value here?
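
E.g. something along these lines (untested sketch; it assumes the
function's 'rv' return variable, and 'out_wq' is a hypothetical new label
that would have to undo the timer/panic-notifier setup done above before
returning the error):

	remove_work_wq = create_singlethread_workqueue("ipmi-msghandler-remove-wq");
	if (!remove_work_wq) {
		/* Fail module init rather than let free_user() queue
		 * work onto a NULL workqueue later.
		 */
		pr_err("unable to create ipmi-msghandler-remove-wq workqueue\n");
		rv = -ENOMEM;
		goto out_wq;
	}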

-corey

> initialized = true;
>
> out:
> @@ -5408,6 +5412,9 @@ static void __exit cleanup_ipmi(void)
> int count;
>
> if (initialized) {
> + flush_workqueue(remove_work_wq);
> + destroy_workqueue(remove_work_wq);
> +
> atomic_notifier_chain_unregister(&panic_notifier_list,
> &panic_block);
>
> --
> 2.17.1
>

2021-11-12 17:19:14

by Ioanna Alifieraki

Subject: Re: [PATCH] ipmi: Move remove_work to dedicated workqueue

On Fri, Nov 12, 2021 at 7:09 PM Corey Minyard <[email protected]> wrote:
>
> On Fri, Nov 12, 2021 at 06:54:13PM +0200, Ioanna Alifieraki wrote:
> > Currently, when an ipmi_user is removed, the removal is deferred as a work
> > item on the system's workqueue. Although this guarantees the free operation
> > will occur in non-atomic context, it can race with removal of the
> > ipmi_msghandler module (see [1]). If a remove_work is scheduled and the
> > ipmi_msghandler module is removed shortly afterwards, the module can be
> > removed first; when the work then executes, the system crashes with:
> > BUG: unable to handle page fault for address: ffffffffc05c3450
> > PF: supervisor instruction fetch in kernel mode
> > PF: error_code(0x0010) - not-present page
> > because the module's pages are gone. In cleanup_ipmi() there is no easy
> > way to detect whether any such works are still pending so that they could
> > be flushed before removing the module. This patch creates a dedicated
> > workqueue and schedules the remove_work works on it. When the module is
> > removed, the workqueue is flushed to avoid the race.
>
> Yeah, this is an issue. One comment below...
>
> >
> > [1] https://bugs.launchpad.net/bugs/1950666
> >
> > Cc: [email protected]
> > Fixes: 3b9a907223d7 (ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier)
> > Signed-off-by: Ioanna Alifieraki <[email protected]>
> > ---
> > drivers/char/ipmi/ipmi_msghandler.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> > index deed355422f4..9e0ad2ccd3e0 100644
> > --- a/drivers/char/ipmi/ipmi_msghandler.c
> > +++ b/drivers/char/ipmi/ipmi_msghandler.c
> > @@ -191,6 +191,8 @@ struct ipmi_user {
> > struct work_struct remove_work;
> > };
> >
> > +struct workqueue_struct *remove_work_wq;
> > +
> > static struct ipmi_user *acquire_ipmi_user(struct ipmi_user *user, int *index)
> > __acquires(user->release_barrier)
> > {
> > @@ -1297,7 +1299,7 @@ static void free_user(struct kref *ref)
> > struct ipmi_user *user = container_of(ref, struct ipmi_user, refcount);
> >
> > /* SRCU cleanup must happen in task context. */
> > - schedule_work(&user->remove_work);
> > + queue_work(remove_work_wq, &user->remove_work);
> > }
> >
> > static void _ipmi_destroy_user(struct ipmi_user *user)
> > @@ -5383,6 +5385,8 @@ static int ipmi_init_msghandler(void)
> >
> > atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
> >
> > + remove_work_wq = create_singlethread_workqueue("ipmi-msghandler-remove-wq");
> > +
>
> Shouldn't you check the return value here?
>

Yes, you're right, my bad.
I'll incorporate Christophe's feedback too and send a v2 next week.
Thanks all for the feedback!

> -corey
>
> > initialized = true;
> >
> > out:
> > @@ -5408,6 +5412,9 @@ static void __exit cleanup_ipmi(void)
> > int count;
> >
> > if (initialized) {
> > + flush_workqueue(remove_work_wq);
> > + destroy_workqueue(remove_work_wq);
> > +
> > atomic_notifier_chain_unregister(&panic_notifier_list,
> > &panic_block);
> >
> > --
> > 2.17.1
> >