For drivers/modules running inside a
memalloc_{noio,nofs}_{save,restore} region, if a work-queue is
created, we make sure that work executed on that work-queue inherits
the same flag(s): any work queued later on a work-queue that was
created while current's flags had PF_MEMALLOC_{NOIO,NOFS} set will
run with the same flags set.
We do this in order to enable drivers to be used as a network block
I/O device, so that XFS or other file-systems can run on top of a
raw block device which uses said drivers as the network transport
layer.
Under intense memory pressure, memory reclaim kicks in. Assume the
file-system reclaims memory and ends up in the raw block device,
which calls into said drivers. If regular GFP_KERNEL allocations in
the drivers now require reclaim in order to be fulfilled, we end up
in a circular dependency.
We break this circular dependency by:
1. Forcing all allocations in the drivers to use GFP_NOIO, by means
   of a parenthetic use of memalloc_noio_{save,restore} on all
   relevant entry points (see the sketch below).
2. Making sure work-queues inherit current->flags
   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
   work-queue runs with the same flag(s) set. That is what this
   commit contributes.
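As a minimal sketch of the parenthetic pattern in point 1, assuming
a made-up driver entry point drv_xmit() (illustrative only, not
taken from an actual driver):

#include <linux/sched/mm.h>
#include <linux/slab.h>

int drv_xmit(size_t len)
{
	unsigned int noio_flags;
	void *buf;
	int ret = 0;

	/* Everything below runs with PF_MEMALLOC_NOIO set. */
	noio_flags = memalloc_noio_save();

	/* This GFP_KERNEL allocation is implicitly limited to GFP_NOIO. */
	buf = kmalloc(len, GFP_KERNEL);
	if (!buf)
		ret = -ENOMEM;
	else
		kfree(buf);	/* the actual transmit would go here */

	memalloc_noio_restore(noio_flags);
	return ret;
}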
Signed-off-by: Håkon Bugge <[email protected]>
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 17 +++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@ enum wq_flags {
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
+ __WQ_NOIO = 1 << 19, /* internal: execute work with NOIO */
+ __WQ_NOFS = 1 << 20, /* internal: execute work with NOFS */
/* BH wq only allows the following flags */
__WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..a1d166a7c0f85 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@
#include <linux/uaccess.h>
#include <linux/sched/isolation.h>
#include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
#include <linux/nmi.h>
#include <linux/kvm_para.h>
#include <linux/delay.h>
@@ -3172,6 +3173,10 @@ __acquires(&pool->lock)
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
bool bh_draining = pool->flags & POOL_BH_DRAINING;
+ bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+ bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+ unsigned long noio_flags;
+ unsigned long nofs_flags;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
lockdep_copy_map(&lockdep_map, &work->lockdep_map);
#endif
+ /* Set inherited alloc flags */
+ if (use_noio_allocs)
+ noio_flags = memalloc_noio_save();
+ if (use_nofs_allocs)
+ nofs_flags = memalloc_nofs_save();
+
/* ensure we're on the correct CPU */
WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
/* must be the last step, see the function comment */
pwq_dec_nr_in_flight(pwq, work_data);
+
+ /* Restore alloc flags */
+ if (use_nofs_allocs)
+ memalloc_nofs_restore(nofs_flags);
+ if (use_noio_allocs)
+ memalloc_noio_restore(noio_flags);
}
/**
--
2.39.3
Hello,
On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
> __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
> __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
> + __WQ_NOIO = 1 << 19, /* internal: execute work with NOIO */
> + __WQ_NOFS = 1 << 20, /* internal: execute work with NOFS */
I don't quite understand how this is supposed to be used. The flags are
marked internal but nothing actually sets them. Looking at later patches, I
don't see any usages either. What am I missing?
> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>
> lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> + /* Set inherited alloc flags */
> + if (use_noio_allocs)
> + noio_flags = memalloc_noio_save();
> + if (use_nofs_allocs)
> + nofs_flags = memalloc_nofs_save();
> +
> /* ensure we're on the correct CPU */
> WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>
> /* must be the last step, see the function comment */
> pwq_dec_nr_in_flight(pwq, work_data);
> +
> + /* Restore alloc flags */
> + if (use_nofs_allocs)
> + memalloc_nofs_restore(nofs_flags);
> + if (use_noio_allocs)
> + memalloc_noio_restore(noio_flags);
Also, this looks like something that the work function can do on entry and
before exit, no?
Thanks.
--
tejun
> Hello,
>
> On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> > index 158784dd189ab..09ecc692ffcae 100644
> > --- a/include/linux/workqueue.h
> > +++ b/include/linux/workqueue.h
> > @@ -398,6 +398,8 @@ enum wq_flags {
> > __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
> > __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
> > __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
> > + __WQ_NOIO = 1 << 19, /* internal: execute work with NOIO */
> > + __WQ_NOFS = 1 << 20, /* internal: execute work with NOFS */
>
> I don't quite understand how this is supposed to be used. The flags are
> marked internal but nothing actually sets them. Looking at later patches, I
> don't see any usages either. What am I missing?
Hi Tejun,
Thank you for looking into this so quickly. You are quite right. During re-basing, I missed a hunk in alloc_workqueue(), where I sample current->flags for the PF_MEMALLOC_{NOIO,NOFS} bits and set the corresponding bits in wq->flags. Fixed in v2.
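For illustration, the idea of the missing hunk is roughly the
following (a sketch of the intent, not the actual v2 code):

	/* In alloc_workqueue(): inherit the caller's memalloc scope
	 * into the new workqueue's flags. */
	if (current->flags & PF_MEMALLOC_NOIO)
		flags |= __WQ_NOIO;
	if (current->flags & PF_MEMALLOC_NOFS)
		flags |= __WQ_NOFS;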
> > @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
> >
> > lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> > #endif
> > + /* Set inherited alloc flags */
> > + if (use_noio_allocs)
> > + noio_flags = memalloc_noio_save();
> > + if (use_nofs_allocs)
> > + nofs_flags = memalloc_nofs_save();
> > +
> > /* ensure we're on the correct CPU */
> > WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> > raw_smp_processor_id() != pool->cpu);
> > @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
> >
> > /* must be the last step, see the function comment */
> > pwq_dec_nr_in_flight(pwq, work_data);
> > +
> > + /* Restore alloc flags */
> > + if (use_nofs_allocs)
> > + memalloc_nofs_restore(nofs_flags);
> > + if (use_noio_allocs)
> > + memalloc_noio_restore(noio_flags);
>
> Also, this looks like something that the work function can do on entry and
> before exit, no?
It _can_ be done in the work functions, but that would be code sprawl. In RDS alone, we have the following worker functions:
rds_ib_odp_mr_worker()
rds_ib_mr_pool_flush_worker()
rds_ib_odp_mr_worker()
rds_tcp_accept_worker()
rds_connect_worker()
rds_send_worker()
rds_recv_worker()
rds_shutdown_worker()
Adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly prefer to have this in one place.
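To illustrate the sprawl: every one of those workers would need
boilerplate along the following lines (a sketch; the real body of
rds_send_worker() is elided):

void rds_send_worker(struct work_struct *work)
{
	unsigned int noio_flags = memalloc_noio_save();

	/* ... the actual send path ... */

	memalloc_noio_restore(noio_flags);
}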
Thxs, Håkon
Hello,
On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> > Also, this looks like something that the work function can do on entry and
> > before exit, no?
>
> It _can_ be done in the work functions, but that will be a code sprawl.
> Only in RDS, we have the following worker functions:
>
> rds_ib_odp_mr_worker()
> rds_ib_mr_pool_flush_worker()
> rds_ib_odp_mr_worker()
> rds_tcp_accept_worker()
> rds_connect_worker()
> rds_send_worker()
> rds_recv_worker()
> rds_shutdown_worker()
>
> adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
> prefer to have it in one place.
I haven't seen the code yet, so can't tell for sure, but if you're
automatically inheriting these flags from the scheduling site, I don't think
that's gonna work. Note that getting a different, more permissive,
allocation context is one of the reasons why one might want to use
workqueues, so it'd have to be explicit whether it's done in the workqueue
or in its users.
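Something explicit at workqueue creation, along these lines, where
WQ_NOIO is a hypothetical public flag used only to illustrate the
point:

	/* The creator states the allocation context explicitly,
	 * instead of inheriting it silently from whoever happens to
	 * call alloc_workqueue(). WQ_NOIO is hypothetical. */
	wq = alloc_workqueue("drv_wq", WQ_MEM_RECLAIM | WQ_NOIO, 0);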
Thanks.
--
tejun
Hi Tejun,
> On 14 May 2024, at 18:49, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
>
> I haven't seen the code yet, so can't tell for sure, but if you're
> automatically inheriting these flags from the scheduling site, I don't think
> that's gonna work. Note that getting a different, more permissive,
> allocation context is one of the reasons why one might want to use
> workqueues, so it'd have to be explicit whether it's done in the workqueue
> or in its users.
I wanted to hold off sending the v2 in case more comments came in, but I have sent it now.
Thxs, Håkon