Subject: [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on
PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
lock on PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in
one step.
printk_deferred_enter() has to be invoked in non-migrate-able context to
ensure that deferred printing is enabled and disabled on the same CPU.
This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
local_irq_disable();
printk_deferred_enter();
write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the
lockdep annotation happens after the sequence count is incremented.
This is exceptional and subject to change.

It was also pointed out that PREEMPT_RT can be affected by the printk
problem since its write_seqlock_irqsave() does not really disable
interrupts. This isn't the case because PREEMPT_RT's printk
implementation differs from the mainline implementation in two important
aspects:
- Printing happens in dedicated threads and not during the
invocation of printk().
- In emergency cases where synchronous printing is used, a different
driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.

Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
mm/page_alloc.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b7..99b7e7d09c5c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
unsigned long flags;

/*
- * Explicitly disable this CPU's interrupts before taking seqlock
- * to prevent any IRQ handler from calling into the page allocator
- * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+ * The zonelist_update_seq must be acquired with irqsave because the
+ * reader can be invoked from IRQ with GFP_ATOMIC.
*/
- local_irq_save(flags);
+ write_seqlock_irqsave(&zonelist_update_seq, flags);
/*
* Explicitly disable this CPU's synchronous printk() before taking
* seqlock to prevent any printk() from trying to hold port->lock, for
@@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
* calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
*/
printk_deferred_enter();
- write_seqlock(&zonelist_update_seq);

#ifdef CONFIG_NUMA
memset(node_load, 0, sizeof(node_load));
@@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
#endif
}

- write_sequnlock(&zonelist_update_seq);
printk_deferred_exit();
- local_irq_restore(flags);
+ write_sequnlock_irqrestore(&zonelist_update_seq, flags);
}

static noinline void __init
--
2.40.1
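
For readers less familiar with seqlocks, the livelock named in the
removed comment can be modelled in userspace. This is only a sketch under
stated assumptions: struct model_seq and the model_* helpers are
hypothetical names, not kernel API, and the reader loop is bounded so the
stuck case can be observed instead of hanging. It shows a reader that
runs on the writer's CPU while the sequence count is odd, which is the
situation irqsave rules out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a sequence counter, not seqlock_t. */
struct model_seq {
	unsigned int sequence;	/* odd while a writer is in its section */
};

void model_write_begin(struct model_seq *s) { s->sequence++; }
void model_write_end(struct model_seq *s)   { s->sequence++; }

/*
 * Model of a reader waiting for an even count. A real reader would loop
 * without a bound; max_spins exists only so the model can report failure.
 */
bool model_read_begin(const struct model_seq *s, unsigned int max_spins,
		      unsigned int *seq)
{
	for (unsigned int i = 0; i < max_spins; i++) {
		unsigned int v = s->sequence;

		if (!(v & 1)) {
			*seq = v;
			return true;
		}
	}
	return false;
}

/* A reader must retry if the count changed since model_read_begin(). */
bool model_read_retry(const struct model_seq *s, unsigned int seq)
{
	return s->sequence != seq;
}

/*
 * Simulate a reader that interrupts the writer on the same CPU. The
 * writer cannot finish until the reader returns, so the count stays odd
 * for as long as the reader spins.
 */
bool model_reader_in_irq_makes_progress(bool writer_active)
{
	struct model_seq s = { 0 };
	unsigned int seq;
	bool ok;

	if (writer_active)
		model_write_begin(&s);
	ok = model_read_begin(&s, 1000, &seq);
	if (writer_active)
		model_write_end(&s);
	return ok;
}
```

In the model the reader only makes progress when no writer on the same
CPU is mid-update, which is why the write side has to keep interrupts
away for the whole write section.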



2023-06-23 18:32:40

by Michal Hocko

Subject: Re: [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

On Fri 23-06-23 19:12:32, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
>
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> There was discussion on the first submission that the order should be:
> local_irq_disable();
> printk_deferred_enter();
> write_seqlock();
>
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
>
> It was also pointed out that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in dedicated threads and not during the
> invocation of printk().
> - In emergency cases where synchronous printing is used, a different
> driver is used which does not use tty_port::lock.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

Thanks for extending the changelog. This is much clearer IMO.

One nit below which I haven't noticed before. Anyway
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/page_alloc.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
> unsigned long flags;
>
> /*
> - * Explicitly disable this CPU's interrupts before taking seqlock
> - * to prevent any IRQ handler from calling into the page allocator
> - * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> + * The zonelist_update_seq must be acquired with irqsave because the
> + * reader can be invoked from IRQ with GFP_ATOMIC.
> */
> - local_irq_save(flags);
> + write_seqlock_irqsave(&zonelist_update_seq, flags);
> /*
> * Explicitly disable this CPU's synchronous printk() before taking
> * seqlock to prevent any printk() from trying to hold port->lock, for

This is not the case anymore because the lock ordering has flipped. I
would just extend the comment above with something like:

* Also disable synchronous printk() to prevent any printk() from trying
* to hold port->lock, for tty_insert_flip_string_and_push_buffer() on
* other CPU might be calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with
* port->lock held.

> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
> * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> */
> printk_deferred_enter();
> - write_seqlock(&zonelist_update_seq);
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
> #endif
> }
>
> - write_sequnlock(&zonelist_update_seq);
> printk_deferred_exit();
> - local_irq_restore(flags);
> + write_sequnlock_irqrestore(&zonelist_update_seq, flags);
> }
>
> static noinline void __init
> --
> 2.40.1

--
Michal Hocko
SUSE Labs

Subject: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on
PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
lock on PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in
one step.
printk_deferred_enter() has to be invoked in non-migrate-able context to
ensure that deferred printing is enabled and disabled on the same CPU.
This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
local_irq_disable();
printk_deferred_enter();
write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the
lockdep annotation happens after the sequence count is incremented.
This is exceptional and subject to change.

It was also pointed out that PREEMPT_RT can be affected by the printk
problem since its write_seqlock_irqsave() does not really disable
interrupts. This isn't the case because PREEMPT_RT's printk
implementation differs from the mainline implementation in two important
aspects:
- Printing happens in dedicated threads and not during the
invocation of printk().
- In emergency cases where synchronous printing is used, a different
driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.

Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
v2…v3:
- Update comment as per Michal's suggestion.

v1…v2:
- Improve commit description

mm/page_alloc.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b7..440e9af67b48d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5808,19 +5808,17 @@ static void __build_all_zonelists(void *data)
unsigned long flags;

/*
- * Explicitly disable this CPU's interrupts before taking seqlock
- * to prevent any IRQ handler from calling into the page allocator
- * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+ * The zonelist_update_seq must be acquired with irqsave because the
+ * reader can be invoked from IRQ with GFP_ATOMIC.
*/
- local_irq_save(flags);
+ write_seqlock_irqsave(&zonelist_update_seq, flags);
/*
- * Explicitly disable this CPU's synchronous printk() before taking
- * seqlock to prevent any printk() from trying to hold port->lock, for
+ * Also disable synchronous printk() to prevent any printk() from
+ * trying to hold port->lock, for
* tty_insert_flip_string_and_push_buffer() on other CPU might be
* calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
*/
printk_deferred_enter();
- write_seqlock(&zonelist_update_seq);

#ifdef CONFIG_NUMA
memset(node_load, 0, sizeof(node_load));
@@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
#endif
}

- write_sequnlock(&zonelist_update_seq);
printk_deferred_exit();
- local_irq_restore(flags);
+ write_sequnlock_irqrestore(&zonelist_update_seq, flags);
}

static noinline void __init
--
2.40.1
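
The changelog's requirement that printk_deferred_enter() run in a
non-migratable context can be illustrated with a small userspace model.
This is a sketch under the assumption that deferral is tracked by a
per-CPU nesting counter; MODEL_NR_CPUS, model_deferred_depth and the
model_* functions are hypothetical names for illustration, not the
kernel's implementation.

```c
#include <assert.h>

#define MODEL_NR_CPUS 2

/* Hypothetical stand-in for a per-CPU "printing is deferred" counter. */
static int model_deferred_depth[MODEL_NR_CPUS];

void model_deferred_enter(int cpu) { model_deferred_depth[cpu]++; }
void model_deferred_exit(int cpu)  { model_deferred_depth[cpu]--; }

/* Return 1 if every CPU's counter is back at zero, 0 otherwise. */
int model_counters_balanced(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (model_deferred_depth[cpu] != 0)
			return 0;
	}
	return 1;
}

/*
 * Run one enter/exit pair. exit_cpu differs from enter_cpu only if the
 * task migrated in between, which holding the lock with irqsave prevents.
 */
int model_run_section(int enter_cpu, int exit_cpu)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_deferred_depth[cpu] = 0;

	model_deferred_enter(enter_cpu);
	model_deferred_exit(exit_cpu);
	return model_counters_balanced();
}
```

With enter and exit on the same CPU the counters return to zero. If the
task were to migrate in between, one CPU would stay marked as deferred
and another would be decremented without a matching increment.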


2023-06-26 08:11:43

by David Hildenbrand

Subject: Re: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

On 23.06.23 22:15, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
>
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> There was discussion on the first submission that the order should be:
> local_irq_disable();
> printk_deferred_enter();
> write_seqlock();
>
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
>
> It was also pointed out that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in dedicated threads and not during the
> invocation of printk().
> - In emergency cases where synchronous printing is used, a different
> driver is used which does not use tty_port::lock.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> v2…v3:
> - Update comment as per Michal's suggestion.
>
> v1…v2:
> - Improve commit description
>
> mm/page_alloc.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..440e9af67b48d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,19 +5808,17 @@ static void __build_all_zonelists(void *data)
> unsigned long flags;
>
> /*
> - * Explicitly disable this CPU's interrupts before taking seqlock
> - * to prevent any IRQ handler from calling into the page allocator
> - * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> + * The zonelist_update_seq must be acquired with irqsave because the
> + * reader can be invoked from IRQ with GFP_ATOMIC.
> */
> - local_irq_save(flags);
> + write_seqlock_irqsave(&zonelist_update_seq, flags);
> /*
> - * Explicitly disable this CPU's synchronous printk() before taking
> - * seqlock to prevent any printk() from trying to hold port->lock, for
> + * Also disable synchronous printk() to prevent any printk() from
> + * trying to hold port->lock, for
> * tty_insert_flip_string_and_push_buffer() on other CPU might be
> * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> */
> printk_deferred_enter();
> - write_seqlock(&zonelist_update_seq);
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
> #endif
> }
>
> - write_sequnlock(&zonelist_update_seq);
> printk_deferred_exit();
> - local_irq_restore(flags);
> + write_sequnlock_irqrestore(&zonelist_update_seq, flags);
> }
>
> static noinline void __init

Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2023-06-26 13:22:46

by Mel Gorman

Subject: Re: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

On Fri, Jun 23, 2023 at 10:15:17PM +0200, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
>
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> There was discussion on the first submission that the order should be:
> local_irq_disable();
> printk_deferred_enter();
> write_seqlock();
>
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
>
> It was also pointed out that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in dedicated threads and not during the
> invocation of printk().
> - In emergency cases where synchronous printing is used, a different
> driver is used which does not use tty_port::lock.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> Acked-by: Michal Hocko <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2023-06-28 14:20:11

by Michal Hocko

Subject: Re: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

Andrew, it seems that we have a consensus on the MM side of things that
this is good enough to go. I am not sure about patch 1, that is more for
the lockdep people, but I think that this patch is good enough on its own.
Can we get this patch merged into the mm tree and see whether any of
Tetsuo's concerns pop up?

On Fri 23-06-23 22:15:17, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
>
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> There was discussion on the first submission that the order should be:
> local_irq_disable();
> printk_deferred_enter();
> write_seqlock();
>
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
>
> It was also pointed out that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in dedicated threads and not during the
> invocation of printk().
> - In emergency cases where synchronous printing is used, a different
> driver is used which does not use tty_port::lock.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> v2…v3:
> - Update comment as per Michal's suggestion.
>
> v1…v2:
> - Improve commit description
>
> mm/page_alloc.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..440e9af67b48d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,19 +5808,17 @@ static void __build_all_zonelists(void *data)
> unsigned long flags;
>
> /*
> - * Explicitly disable this CPU's interrupts before taking seqlock
> - * to prevent any IRQ handler from calling into the page allocator
> - * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> + * The zonelist_update_seq must be acquired with irqsave because the
> + * reader can be invoked from IRQ with GFP_ATOMIC.
> */
> - local_irq_save(flags);
> + write_seqlock_irqsave(&zonelist_update_seq, flags);
> /*
> - * Explicitly disable this CPU's synchronous printk() before taking
> - * seqlock to prevent any printk() from trying to hold port->lock, for
> + * Also disable synchronous printk() to prevent any printk() from
> + * trying to hold port->lock, for
> * tty_insert_flip_string_and_push_buffer() on other CPU might be
> * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> */
> printk_deferred_enter();
> - write_seqlock(&zonelist_update_seq);
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
> #endif
> }
>
> - write_sequnlock(&zonelist_update_seq);
> printk_deferred_exit();
> - local_irq_restore(flags);
> + write_sequnlock_irqrestore(&zonelist_update_seq, flags);
> }
>
> static noinline void __init
> --
> 2.40.1

--
Michal Hocko
SUSE Labs