For drivers/modules running inside a
memalloc_{noio,nofs}_{save,restore} region, make sure that any
work-queue created there inherits the same flag(s), so that work
executed on that work-queue runs under the same allocation
restrictions. Concretely, any work queued later on work-queues created
during module initialization, while current's flags have
PF_MEMALLOC_{NOIO,NOFS} set, will inherit the same flags.

We do this to enable the drivers to serve as the network transport of
a network block I/O device, which in turn allows XFS or other
file-systems to run on top of a raw block device that uses said
drivers as its network transport layer.
Under intense memory pressure, memory reclaim kicks in. Assume the
file-system, in order to reclaim memory, writes back to the raw block
device, which in turn calls into said drivers. If regular GFP_KERNEL
allocations in the drivers must themselves wait for reclaim to be
fulfilled, we end up in a circular dependency.
We break this circular dependency by:

1. Forcing all allocations in the drivers to use GFP_NOIO, by means of
   a parenthetic use of memalloc_noio_{save,restore} on all relevant
   entry points (see the sketch below).

2. Making sure work-queues inherit current->flags
   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
   work-queue inherits the same flag(s). That is what this commit
   contributes.
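For illustration, a minimal sketch of such an entry point; the driver
struct and function names are made up, while memalloc_noio_{save,restore}
and kmalloc() are the real APIs from <linux/sched/mm.h> and
<linux/slab.h>:

	static int drv_init_buffers(struct drv_dev *dev)
	{
		unsigned int noio_flags;
		int ret = 0;

		noio_flags = memalloc_noio_save();
		/* GFP_KERNEL is implicitly degraded to GFP_NOIO here */
		dev->rx_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
		if (!dev->rx_buf)
			ret = -ENOMEM;
		memalloc_noio_restore(noio_flags);

		return ret;
	}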
Signed-off-by: Håkon Bugge <[email protected]>
---
v1 -> v2:
* Added missing hunk in alloc_workqueue()
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 21 +++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@ enum wq_flags {
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
+ __WQ_NOIO = 1 << 19, /* internal: execute work with NOIO */
+ __WQ_NOFS = 1 << 20, /* internal: execute work with NOFS */
/* BH wq only allows the following flags */
__WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..8eb7562372ce2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@
#include <linux/uaccess.h>
#include <linux/sched/isolation.h>
#include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
#include <linux/nmi.h>
#include <linux/kvm_para.h>
#include <linux/delay.h>
@@ -3172,6 +3173,10 @@ __acquires(&pool->lock)
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
bool bh_draining = pool->flags & POOL_BH_DRAINING;
+ bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+ bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+ unsigned long noio_flags;
+ unsigned long nofs_flags;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
lockdep_copy_map(&lockdep_map, &work->lockdep_map);
#endif
+ /* Set inherited alloc flags */
+ if (use_noio_allocs)
+ noio_flags = memalloc_noio_save();
+ if (use_nofs_allocs)
+ nofs_flags = memalloc_nofs_save();
+
/* ensure we're on the correct CPU */
WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
/* must be the last step, see the function comment */
pwq_dec_nr_in_flight(pwq, work_data);
+
+ /* Restore alloc flags */
+ if (use_nofs_allocs)
+ memalloc_nofs_restore(nofs_flags);
+ if (use_noio_allocs)
+ memalloc_noio_restore(noio_flags);
}
/**
@@ -5583,6 +5600,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
/* init wq */
wq->flags = flags;
+ if (current->flags & PF_MEMALLOC_NOIO)
+ wq->flags |= __WQ_NOIO;
+ if (current->flags & PF_MEMALLOC_NOFS)
+ wq->flags |= __WQ_NOFS;
wq->max_active = max_active;
wq->min_active = min(max_active, WQ_DFL_MIN_ACTIVE);
wq->saved_max_active = wq->max_active;
--
2.45.0
> @@ -5583,6 +5600,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
>
> /* init wq */
> wq->flags = flags;
> + if (current->flags & PF_MEMALLOC_NOIO)
> + wq->flags |= __WQ_NOIO;
> + if (current->flags & PF_MEMALLOC_NOFS)
> + wq->flags |= __WQ_NOFS;
So, yeah, please don't do this. What if a NOIO caller wants to schedule a
work item so that it can use GFP_KERNEL allocations? I don't mind a
convenience feature in workqueue for this, but this doesn't seem like the
right way. Also, memalloc_noio_save() and memalloc_nofs_save() are
convenience wrappers around memalloc_flags_save(), so it'd probably be
better to deal with gfp flags directly rather than singling out these two
flags.
Thanks.
--
tejun
> On 15 May 2024, at 18:54, Tejun Heo <[email protected]> wrote:
>
>> @@ -5583,6 +5600,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
>>
>> /* init wq */
>> wq->flags = flags;
>> + if (current->flags & PF_MEMALLOC_NOIO)
>> + wq->flags |= __WQ_NOIO;
>> + if (current->flags & PF_MEMALLOC_NOFS)
>> + wq->flags |= __WQ_NOFS;
>
> So, yeah, please don't do this. What if a NOIO caller wants to schedule a
> work item so that it can use GFP_KERNEL allocations?
If one work function wants to use GFP_KERNEL and another wants to use GFP_NOIO, rather than queuing both on the same workqueue, one could create two workqueues: create one surrounded by memalloc_noio_{save,restore}, and another surrounded by memalloc_flags_save() + current->flags &= ~PF_MEMALLOC_NOIO and memalloc_flags_restore().
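Something like this, assuming the v2 semantics of this patch (the
workqueue names and the drv pointer are made up):

	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	/* inherits __WQ_NOIO: work items run with PF_MEMALLOC_NOIO set */
	drv->noio_wq = alloc_workqueue("drv_noio", 0, 0);
	memalloc_noio_restore(noio_flags);

	/* created with PF_MEMALLOC_NOIO clear: work items keep GFP_KERNEL */
	drv->kernel_wq = alloc_workqueue("drv_kernel", 0, 0);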
If you mean a single work function that performs a combination of GFP_KERNEL and GFP_NOIO allocations, that sounds a little peculiar to me, but if needed, it can be open-coded. But wouldn't that be the same case as a WQ created with WQ_MEM_RECLAIM?
> I don't mind a
> convenience feature to workqueue for this but this doesn't seem like the
> right way. Also, memalloc_noio_save() and memalloc_nofs_save() are
> convenience wrappers around memalloc_flags_save(), so it'd probably be
> better to deal with gfp flags directly rather than singling out these two
> flags.
Actually, based on https://lore.kernel.org/linux-fsdevel/[email protected], I am inclined to skip GFP_NOFS, also because the use-case for this series does not need GFP_NOFS.
When you say "deal with gfp flags directly", do you mean during WQ creation or when queuing work on one? I am OK with adding the other per-process memory allocation flags, but that doesn't solve your initial issue ("if a NOIO caller wants to schedule a work item so that it can use GFP_KERNEL").
Thxs, Håkon
Hello,
On Thu, May 16, 2024 at 03:27:15PM +0000, Haakon Bugge wrote:
> > So, yeah, please don't do this. What if a NOIO caller wants to schedule a
> > work item so that it can use GFP_KERNEL allocations?
>
> If one work function wants to use GFP_KERNEL and another wants to use
> GFP_NOIO, rather than queuing both on the same workqueue, one could create
> two workqueues: create one surrounded by memalloc_noio_{save,restore}, and
> another surrounded by memalloc_flags_save() + current->flags &=
> ~PF_MEMALLOC_NOIO and memalloc_flags_restore().
This is too subtle, and the default behavior doesn't seem great either - in
most cases the code path which sets up workqueues runs in GFP_KERNEL
context, as init paths usually do, so this wouldn't make things work
automatically anyway. In addition, the memory allocations for the
workqueues themselves now have to be subject to the same GFP restrictions,
even though alloc_workqueue() is called from GFP_KERNEL context. It just
doesn't seem well thought out.
> When you say "deal with gfp flags directly", do you mean during WQ
> creation or when queuing work on one? I am OK with adding the other
> per-process memory allocation flags, but that doesn't solve your initial
> issue ("if a NOIO caller wants to schedule a work item so that it can use
> GFP_KERNEL").
It being a purely convenience feature, I don't think there's a hard
requirement on where this should go, although I don't know where you'd
carry this information if you tied it to each work item. And, please don't
single out specific GFP flags. Please make the feature generic so that
users who may need different GFP masking can use it too. The underlying
GFP feature is already like that. There's no reason to restrict it from
the workqueue side.
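Roughly, one possible shape (an untested sketch; the pf_memalloc field and
its plumbing are hypothetical, the rest is the existing API):

	/* alloc_workqueue(): snapshot all PF_MEMALLOC* bits, not just two */
	wq->pf_memalloc = current->flags & (PF_MEMALLOC | PF_MEMALLOC_NOIO |
					    PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN);

	/* process_one_work(): apply and restore around the callback */
	unsigned int pf_flags = memalloc_flags_save(pwq->wq->pf_memalloc);
	worker->current_func(work);
	memalloc_flags_restore(pf_flags);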
Thanks.
--
tejun
Hi,
> On 16 May 2024, at 18:29, Tejun Heo <[email protected]> wrote:
>
>
>> When you say "deal with gfp flags directly", do you mean during WQ
>> creation or when queuing work on one? I am OK with adding the other
>> per-process memory allocation flags, but that doesn't solve your initial
>> issue ("if a NOIO caller wants to schedule a work item so that it can use
>> GFP_KERNEL").
>
> It being a purely convenience feature, I don't think there's a hard
> requirement on where this should go, although I don't know where you'd
> carry this information if you tied it to each work item. And, please
> don't single out specific GFP flags. Please make the feature generic so
> that users who may need different GFP masking can use it too. The
> underlying GFP feature is already like that. There's no reason to
> restrict it from the workqueue side.
I am preparing a v3 which handles all PF_MEMALLOC* flags. The plan is to send it out tomorrow.
Thxs, Håkon