LinuxLists.cc - [PATCH RFC] lib: Make radix_tree_node

2013-07-16 17:06:42

Subject: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

With users of radix_tree_preload() run from interrupt (CFQ is one such
possible user), the following race can happen:

radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];

And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.

Fix the problem by disabling interrupts when working with rtp variable.
In-interrupt user can still deplete our preloaded nodes but at least we
won't corrupt radix trees.

Signed-off-by: Jan Kara <[email protected]>
---
lib/radix-tree.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

There are some questions regarding this patch:
Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
could certainly do this in older kernels but that particular call site where I
saw the bug hit isn't there anymore so I'm not sure this can really happen with
recent kernels.

Also it is actually harmful to do preloading if you are in interrupt context
anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
that we would need to tweak radix_tree_node_alloc() to somehow recognize
whether the caller wants it to use preloaded nodes or not and that callers
would have to get it right (although maybe some magic in radix_tree_preload()
could handle that).

Opinions?

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..6f1045d 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -209,18 +209,26 @@ radix_tree_node_alloc(struct radix_tree_root *root)

if (!(gfp_mask & __GFP_WAIT)) {
struct radix_tree_preload *rtp;
+ unsigned long flags;

/*
* Provided the caller has preloaded here, we will always
* succeed in getting a node here (and never reach
- * kmem_cache_alloc)
+ * kmem_cache_alloc)... unless we race with interrupt also
+ * consuming preloaded nodes.
*/
rtp = &__get_cpu_var(radix_tree_preloads);
+ /*
+ * Disable interrupts to make sure radix_tree_node_alloc()
+ * called from interrupt cannot return the same node as we do.
+ */
+ local_irq_save(flags);
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
rtp->nodes[rtp->nr - 1] = NULL;
rtp->nr--;
}
+ local_irq_restore(flags);
}
if (ret == NULL)
ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
@@ -269,6 +277,7 @@ int radix_tree_preload(gfp_t gfp_mask)
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ unsigned long flags;

preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
@@ -278,11 +287,15 @@ int radix_tree_preload(gfp_t gfp_mask)
if (node == NULL)
goto out;
preempt_disable();
+ local_irq_save(flags);
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
rtp->nodes[rtp->nr++] = node;
- else
+ local_irq_restore(flags);
+ } else {
+ local_irq_restore(flags);
kmem_cache_free(radix_tree_node_cachep, node);
+ }
}
ret = 0;
out:
--
1.8.1.4

2013-07-17 20:14:58

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Tue, Jul 16 2013, Jan Kara wrote:
> With users of radix_tree_preload() run from interrupt (CFQ is one such
> possible user), the following race can happen:
>
> radix_tree_preload()
> ...
> radix_tree_insert()
> radix_tree_node_alloc()
> if (rtp->nr) {
> ret = rtp->nodes[rtp->nr - 1];
> <interrupt>
> ...
> radix_tree_preload()
> ...
> radix_tree_insert()
> radix_tree_node_alloc()
> if (rtp->nr) {
> ret = rtp->nodes[rtp->nr - 1];
>
> And we give out one radix tree node twice. That clearly results in radix
> tree corruption with different results (usually OOPS) depending on which
> two users of radix tree race.
>
> Fix the problem by disabling interrupts when working with rtp variable.
> In-interrupt user can still deplete our preloaded nodes but at least we
> won't corrupt radix trees.

Looks good to me, great catch Jan.

--
Jens Axboe

2013-07-17 23:12:04

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:

> With users of radix_tree_preload() run from interrupt (CFQ is one such
> possible user), the following race can happen:
>
> radix_tree_preload()
> ...
> radix_tree_insert()
> radix_tree_node_alloc()
> if (rtp->nr) {
> ret = rtp->nodes[rtp->nr - 1];
> <interrupt>
> ...
> radix_tree_preload()
> ...
> radix_tree_insert()
> radix_tree_node_alloc()
> if (rtp->nr) {
> ret = rtp->nodes[rtp->nr - 1];
>
> And we give out one radix tree node twice. That clearly results in radix
> tree corruption with different results (usually OOPS) depending on which
> two users of radix tree race.
>
> Fix the problem by disabling interrupts when working with rtp variable.
> In-interrupt user can still deplete our preloaded nodes but at least we
> won't corrupt radix trees.
>
> ...
>
> There are some questions regarding this patch:
> Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
> could certainly do this in older kernels but that particular call site where I
> saw the bug hit isn't there anymore so I'm not sure this can really happen with
> recent kernels.

Well, it was never anticipated that interrupt-time code would run
radix_tree_preload(). The whole point in the preloading was to be able
to perform GFP_KERNEL allocations before entering the spinlocked region
which needs to allocate memory.

Doing all that from within an interrupt is daft, because the interrupt code
can't use GFP_KERNEL anyway.

> Also it is actually harmful to do preloading if you are in interrupt context
> anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
> that we would need to tweak radix_tree_node_alloc() to somehow recognize
> whether the caller wants it to use preloaded nodes or not and that callers
> would have to get it right (although maybe some magic in radix_tree_preload()
> could handle that).
>
> Opinions?

BUG_ON(in_interrupt()) :)

2013-07-17 23:16:08

by David Daney

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On 07/17/2013 04:12 PM, Andrew Morton wrote:
> On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
>
>> With users of radix_tree_preload() run from interrupt (CFQ is one such
>> possible user), the following race can happen:
>>
>> radix_tree_preload()
>> ...
>> radix_tree_insert()
>> radix_tree_node_alloc()
>> if (rtp->nr) {
>> ret = rtp->nodes[rtp->nr - 1];
>> <interrupt>
>> ...
>> radix_tree_preload()
>> ...
>> radix_tree_insert()
>> radix_tree_node_alloc()
>> if (rtp->nr) {
>> ret = rtp->nodes[rtp->nr - 1];
>>
>> And we give out one radix tree node twice. That clearly results in radix
>> tree corruption with different results (usually OOPS) depending on which
>> two users of radix tree race.
>>
>> Fix the problem by disabling interrupts when working with rtp variable.
>> In-interrupt user can still deplete our preloaded nodes but at least we
>> won't corrupt radix trees.
>>
>> ...
>>
>> There are some questions regarding this patch:
>> Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
>> could certainly do this in older kernels but that particular call site where I
>> saw the bug hit isn't there anymore so I'm not sure this can really happen with
>> recent kernels.
>
> Well, it was never anticipated that interrupt-time code would run
> radix_tree_preload(). The whole point in the preloading was to be able
> to perform GFP_KERNEL allocations before entering the spinlocked region
> which needs to allocate memory.
>
> Doing all that from within an interrupt is daft, because the interrupt code
> can't use GFP_KERNEL anyway.
>
>> Also it is actually harmful to do preloading if you are in interrupt context
>> anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
>> that we would need to tweak radix_tree_node_alloc() to somehow recognize
>> whether the caller wants it to use preloaded nodes or not and that callers
>> would have to get it right (although maybe some magic in radix_tree_preload()
>> could handle that).
>>
>> Opinions?
>
> BUG_ON(in_interrupt()) :)

Is is really that severe? How about...

WARN_ON() instead?

David Daney

2013-07-18 13:09:36

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Wed 17-07-13 16:12:00, Andrew Morton wrote:
> On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
>
> > With users of radix_tree_preload() run from interrupt (CFQ is one such
> > possible user), the following race can happen:
> >
> > radix_tree_preload()
> > ...
> > radix_tree_insert()
> > radix_tree_node_alloc()
> > if (rtp->nr) {
> > ret = rtp->nodes[rtp->nr - 1];
> > <interrupt>
> > ...
> > radix_tree_preload()
> > ...
> > radix_tree_insert()
> > radix_tree_node_alloc()
> > if (rtp->nr) {
> > ret = rtp->nodes[rtp->nr - 1];
> >
> > And we give out one radix tree node twice. That clearly results in radix
> > tree corruption with different results (usually OOPS) depending on which
> > two users of radix tree race.
> >
> > Fix the problem by disabling interrupts when working with rtp variable.
> > In-interrupt user can still deplete our preloaded nodes but at least we
> > won't corrupt radix trees.
> >
> > ...
> >
> > There are some questions regarding this patch:
> > Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
> > could certainly do this in older kernels but that particular call site where I
> > saw the bug hit isn't there anymore so I'm not sure this can really happen with
> > recent kernels.
>
> Well, it was never anticipated that interrupt-time code would run
> radix_tree_preload(). The whole point in the preloading was to be able
> to perform GFP_KERNEL allocations before entering the spinlocked region
> which needs to allocate memory.
>
> Doing all that from within an interrupt is daft, because the interrupt code
> can't use GFP_KERNEL anyway.
Fully agreed here.

> > Also it is actually harmful to do preloading if you are in interrupt context
> > anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
> > that we would need to tweak radix_tree_node_alloc() to somehow recognize
> > whether the caller wants it to use preloaded nodes or not and that callers
> > would have to get it right (although maybe some magic in radix_tree_preload()
> > could handle that).
> >
> > Opinions?
>
> BUG_ON(in_interrupt()) :)
Or maybe WARN_ON()... But it's not so easy :) Currently radix tree code
assumes that if gfp_mask doesn't have __GFP_WAIT set caller has performed
radix_tree_preload(). Clearly this will stop working for in-interrupt users
of radix tree. So how do we propagate the information from the caller of
radix_tree_insert() down to radix_tree_node_alloc() whether the preload has
been performed or not? Will we rely on in_interrupt() or use some special
gfp_mask bit?

Secondly, CFQ has this unpleasant property that some functions are
sometimes called from interrupt context and sometimes not. So these
functions would have to check in what context they are called and either
perform preload or not. That's doable but it's going to be a bit ugly and
has to match the check in radix_tree_node_alloc() whether preload should be
used or not. So leaving the checking to the users of radix tree looks
fragile. So maybe we could just silently exit from radix_tree_preload()
when we are in_interrupt()?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-18 21:25:53

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On 07/17/2013 05:12 PM, Andrew Morton wrote:
> On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
>
>> With users of radix_tree_preload() run from interrupt (CFQ is one such
>> possible user), the following race can happen:
>>
>> radix_tree_preload()
>> ...
>> radix_tree_insert()
>> radix_tree_node_alloc()
>> if (rtp->nr) {
>> ret = rtp->nodes[rtp->nr - 1];
>> <interrupt>
>> ...
>> radix_tree_preload()
>> ...
>> radix_tree_insert()
>> radix_tree_node_alloc()
>> if (rtp->nr) {
>> ret = rtp->nodes[rtp->nr - 1];
>>
>> And we give out one radix tree node twice. That clearly results in radix
>> tree corruption with different results (usually OOPS) depending on which
>> two users of radix tree race.
>>
>> Fix the problem by disabling interrupts when working with rtp variable.
>> In-interrupt user can still deplete our preloaded nodes but at least we
>> won't corrupt radix trees.
>>
>> ...
>>
>> There are some questions regarding this patch:
>> Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
>> could certainly do this in older kernels but that particular call site where I
>> saw the bug hit isn't there anymore so I'm not sure this can really happen with
>> recent kernels.
>
> Well, it was never anticipated that interrupt-time code would run
> radix_tree_preload(). The whole point in the preloading was to be able
> to perform GFP_KERNEL allocations before entering the spinlocked region
> which needs to allocate memory.
>
> Doing all that from within an interrupt is daft, because the interrupt code
> can't use GFP_KERNEL anyway.
>
>> Also it is actually harmful to do preloading if you are in interrupt context
>> anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
>> that we would need to tweak radix_tree_node_alloc() to somehow recognize
>> whether the caller wants it to use preloaded nodes or not and that callers
>> would have to get it right (although maybe some magic in radix_tree_preload()
>> could handle that).
>>
>> Opinions?
>
> BUG_ON(in_interrupt()) :)

Good point Andrew, it'd be better to "document" the restriction (since
the use is non-sensical). It's actually not CFQ code that does this,
it's the io context management.

Excuse the crappy mailer, but something ala:

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 9c4bb82..bcb9b17 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -366,7 +366,7 @@ struct io_cq *ioc_create_icq(struct io_context *ioc,
struct
if (!icq)
return NULL;

- if (radix_tree_preload(gfp_mask) < 0) {
+ if ((gfp_mask & __GFP_WAIT) && radix_tree_preload(gfp_mask) < 0) {
kmem_cache_free(et->icq_cache, icq);
return NULL;
}
@@ -394,7 +394,10 @@ struct io_cq *ioc_create_icq(struct io_context
*ioc, struct

spin_unlock(&ioc->lock);
spin_unlock_irq(q->queue_lock);
- radix_tree_preload_end();
+
+ if (gfp_mask & __GFP_WAIT)
+ radix_tree_preload_end();
+
return icq;
}

--
Jens Axboe

2013-07-18 21:30:38

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On 07/18/2013 07:09 AM, Jan Kara wrote:
> On Wed 17-07-13 16:12:00, Andrew Morton wrote:
>> On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
>>
>>> With users of radix_tree_preload() run from interrupt (CFQ is one such
>>> possible user), the following race can happen:
>>>
>>> radix_tree_preload()
>>> ...
>>> radix_tree_insert()
>>> radix_tree_node_alloc()
>>> if (rtp->nr) {
>>> ret = rtp->nodes[rtp->nr - 1];
>>> <interrupt>
>>> ...
>>> radix_tree_preload()
>>> ...
>>> radix_tree_insert()
>>> radix_tree_node_alloc()
>>> if (rtp->nr) {
>>> ret = rtp->nodes[rtp->nr - 1];
>>>
>>> And we give out one radix tree node twice. That clearly results in radix
>>> tree corruption with different results (usually OOPS) depending on which
>>> two users of radix tree race.
>>>
>>> Fix the problem by disabling interrupts when working with rtp variable.
>>> In-interrupt user can still deplete our preloaded nodes but at least we
>>> won't corrupt radix trees.
>>>
>>> ...
>>>
>>> There are some questions regarding this patch:
>>> Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
>>> could certainly do this in older kernels but that particular call site where I
>>> saw the bug hit isn't there anymore so I'm not sure this can really happen with
>>> recent kernels.
>>
>> Well, it was never anticipated that interrupt-time code would run
>> radix_tree_preload(). The whole point in the preloading was to be able
>> to perform GFP_KERNEL allocations before entering the spinlocked region
>> which needs to allocate memory.
>>
>> Doing all that from within an interrupt is daft, because the interrupt code
>> can't use GFP_KERNEL anyway.
> Fully agreed here.
>
>>> Also it is actually harmful to do preloading if you are in interrupt context
>>> anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
>>> that we would need to tweak radix_tree_node_alloc() to somehow recognize
>>> whether the caller wants it to use preloaded nodes or not and that callers
>>> would have to get it right (although maybe some magic in radix_tree_preload()
>>> could handle that).
>>>
>>> Opinions?
>>
>> BUG_ON(in_interrupt()) :)
> Or maybe WARN_ON()... But it's not so easy :) Currently radix tree code
> assumes that if gfp_mask doesn't have __GFP_WAIT set caller has performed
> radix_tree_preload(). Clearly this will stop working for in-interrupt users
> of radix tree. So how do we propagate the information from the caller of
> radix_tree_insert() down to radix_tree_node_alloc() whether the preload has
> been performed or not? Will we rely on in_interrupt() or use some special
> gfp_mask bit?

Should have read the full thread... in_interrupt() is ugly to base
decisions on, imho. I'd say just use __GFP_WAIT to signal this.

> Secondly, CFQ has this unpleasant property that some functions are
> sometimes called from interrupt context and sometimes not. So these
> functions would have to check in what context they are called and either
> perform preload or not. That's doable but it's going to be a bit ugly and
> has to match the check in radix_tree_node_alloc() whether preload should be
> used or not. So leaving the checking to the users of radix tree looks
> fragile. So maybe we could just silently exit from radix_tree_preload()
> when we are in_interrupt()?

Which CFQ functions are these? Generally we get callbacks from the
drivers on both queue and complete times that can be done at various
contexts, so it's not something that is easily solvable. I'm assuming
you are referring to the blk-ioc.c functions here, though?

--
Jens Axboe

2013-07-18 21:37:18

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Thu, 18 Jul 2013 15:09:32 +0200 Jan Kara <[email protected]> wrote:

> On Wed 17-07-13 16:12:00, Andrew Morton wrote:
> > On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
> >
> > BUG_ON(in_interrupt()) :)
> Or maybe WARN_ON()... But it's not so easy :) Currently radix tree code
> assumes that if gfp_mask doesn't have __GFP_WAIT set caller has performed
> radix_tree_preload(). Clearly this will stop working for in-interrupt users
> of radix tree. So how do we propagate the information from the caller of
> radix_tree_insert() down to radix_tree_node_alloc() whether the preload has
> been performed or not? Will we rely on in_interrupt() or use some special
> gfp_mask bit?

Well, it won't stop working. The interrupt-time
radix_tree_node_alloc() call will try to grab a node from the cpu-local
magazine and if that failed, will call kmem_cache_alloc(). Presumably
the caller has passed in GFP_ATOMIC, so the kmem_cache_alloc() will use
page reserves, which seems appropriate.

This will mean that the interrupt-time node allocation will sometimes
steal a preloaded node from process-context code. In the absolutely
worst case, the process-context code will then need to try
kmem_cache_alloc(), which will probably succeed anyway.

It's not perfect - we'd prefer that process-context node allocations
not get stolen in this fashion. That's easily fixed with

--- a/lib/radix-tree.c~a
+++ a/lib/radix-tree.c
@@ -207,7 +207,10 @@ radix_tree_node_alloc(struct radix_tree_
struct radix_tree_node *ret = NULL;
gfp_t gfp_mask = root_gfp_mask(root);

- if (!(gfp_mask & __GFP_WAIT)) {
+ /*
+ * Lengthy comment goes here
+ */
+ if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
struct radix_tree_preload *rtp;

/*

But I don't know if it's worth it.

> Secondly, CFQ has this unpleasant property that some functions are
> sometimes called from interrupt context and sometimes not. So these
> functions would have to check in what context they are called and either
> perform preload or not. That's doable but it's going to be a bit ugly and
> has to match the check in radix_tree_node_alloc() whether preload should be
> used or not. So leaving the checking to the users of radix tree looks
> fragile.

mm... The CFQ code should be passing around a gfp_t anyway - GFP_NOIO
or GFP_ATOMIC, depending on the calling context. So don't call
radix_tree_preload() if it's GFP_ATOMIC.

> So maybe we could just silently exit from radix_tree_preload()
> when we are in_interrupt()?

Or that.

2013-07-22 15:21:43

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Thu 18-07-13 15:30:27, Jens Axboe wrote:
> On 07/18/2013 07:09 AM, Jan Kara wrote:
> > On Wed 17-07-13 16:12:00, Andrew Morton wrote:
> >> On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
> >>
> >>> With users of radix_tree_preload() run from interrupt (CFQ is one such
> >>> possible user), the following race can happen:
> >>>
> >>> radix_tree_preload()
> >>> ...
> >>> radix_tree_insert()
> >>> radix_tree_node_alloc()
> >>> if (rtp->nr) {
> >>> ret = rtp->nodes[rtp->nr - 1];
> >>> <interrupt>
> >>> ...
> >>> radix_tree_preload()
> >>> ...
> >>> radix_tree_insert()
> >>> radix_tree_node_alloc()
> >>> if (rtp->nr) {
> >>> ret = rtp->nodes[rtp->nr - 1];
> >>>
> >>> And we give out one radix tree node twice. That clearly results in radix
> >>> tree corruption with different results (usually OOPS) depending on which
> >>> two users of radix tree race.
> >>>
> >>> Fix the problem by disabling interrupts when working with rtp variable.
> >>> In-interrupt user can still deplete our preloaded nodes but at least we
> >>> won't corrupt radix trees.
> >>>
> >>> ...
> >>>
> >>> There are some questions regarding this patch:
> >>> Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
> >>> could certainly do this in older kernels but that particular call site where I
> >>> saw the bug hit isn't there anymore so I'm not sure this can really happen with
> >>> recent kernels.
> >>
> >> Well, it was never anticipated that interrupt-time code would run
> >> radix_tree_preload(). The whole point in the preloading was to be able
> >> to perform GFP_KERNEL allocations before entering the spinlocked region
> >> which needs to allocate memory.
> >>
> >> Doing all that from within an interrupt is daft, because the interrupt code
> >> can't use GFP_KERNEL anyway.
> > Fully agreed here.
> >
> >>> Also it is actually harmful to do preloading if you are in interrupt context
> >>> anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
> >>> that we would need to tweak radix_tree_node_alloc() to somehow recognize
> >>> whether the caller wants it to use preloaded nodes or not and that callers
> >>> would have to get it right (although maybe some magic in radix_tree_preload()
> >>> could handle that).
> >>>
> >>> Opinions?
> >>
> >> BUG_ON(in_interrupt()) :)
> > Or maybe WARN_ON()... But it's not so easy :) Currently radix tree code
> > assumes that if gfp_mask doesn't have __GFP_WAIT set caller has performed
> > radix_tree_preload(). Clearly this will stop working for in-interrupt users
> > of radix tree. So how do we propagate the information from the caller of
> > radix_tree_insert() down to radix_tree_node_alloc() whether the preload has
> > been performed or not? Will we rely on in_interrupt() or use some special
> > gfp_mask bit?
>
> Should have read the full thread... in_interrupt() is ugly to base
> decisions on, imho. I'd say just use __GFP_WAIT to signal this.
Yeah, probably that would be nicer and doable in radix_tree_preload().
But in radix_tree_node_alloc() we get gfp_mask as set for the radix tree
and thus it's always going to be GFP_ATOMIC in case of ioc radix tree. So
there we have to have something different if we don't want in interrupt
users to use preallocated nodes. As Andrew suggests maybe that's not
necessary but then my original patch is what you end up with.

> > Secondly, CFQ has this unpleasant property that some functions are
> > sometimes called from interrupt context and sometimes not. So these
> > functions would have to check in what context they are called and either
> > perform preload or not. That's doable but it's going to be a bit ugly and
> > has to match the check in radix_tree_node_alloc() whether preload should be
> > used or not. So leaving the checking to the users of radix tree looks
> > fragile. So maybe we could just silently exit from radix_tree_preload()
> > when we are in_interrupt()?
>
> Which CFQ functions are these? Generally we get callbacks from the
> drivers on both queue and complete times that can be done at various
> contexts, so it's not something that is easily solvable. I'm assuming
> you are referring to the blk-ioc.c functions here, though?
Yes, that's exactly what I'm referring to.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-22 15:38:13

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Mon, Jul 22 2013, Jan Kara wrote:
> On Thu 18-07-13 15:30:27, Jens Axboe wrote:
> > On 07/18/2013 07:09 AM, Jan Kara wrote:
> > > On Wed 17-07-13 16:12:00, Andrew Morton wrote:
> > >> On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
> > >>
> > >>> With users of radix_tree_preload() run from interrupt (CFQ is one such
> > >>> possible user), the following race can happen:
> > >>>
> > >>> radix_tree_preload()
> > >>> ...
> > >>> radix_tree_insert()
> > >>> radix_tree_node_alloc()
> > >>> if (rtp->nr) {
> > >>> ret = rtp->nodes[rtp->nr - 1];
> > >>> <interrupt>
> > >>> ...
> > >>> radix_tree_preload()
> > >>> ...
> > >>> radix_tree_insert()
> > >>> radix_tree_node_alloc()
> > >>> if (rtp->nr) {
> > >>> ret = rtp->nodes[rtp->nr - 1];
> > >>>
> > >>> And we give out one radix tree node twice. That clearly results in radix
> > >>> tree corruption with different results (usually OOPS) depending on which
> > >>> two users of radix tree race.
> > >>>
> > >>> Fix the problem by disabling interrupts when working with rtp variable.
> > >>> In-interrupt user can still deplete our preloaded nodes but at least we
> > >>> won't corrupt radix trees.
> > >>>
> > >>> ...
> > >>>
> > >>> There are some questions regarding this patch:
> > >>> Do we really want to allow in-interrupt users of radix_tree_preload()? CFQ
> > >>> could certainly do this in older kernels but that particular call site where I
> > >>> saw the bug hit isn't there anymore so I'm not sure this can really happen with
> > >>> recent kernels.
> > >>
> > >> Well, it was never anticipated that interrupt-time code would run
> > >> radix_tree_preload(). The whole point in the preloading was to be able
> > >> to perform GFP_KERNEL allocations before entering the spinlocked region
> > >> which needs to allocate memory.
> > >>
> > >> Doing all that from within an interrupt is daft, because the interrupt code
> > >> can't use GFP_KERNEL anyway.
> > > Fully agreed here.
> > >
> > >>> Also it is actually harmful to do preloading if you are in interrupt context
> > >>> anyway. The disadvantage of disallowing radix_tree_preload() in interrupt is
> > >>> that we would need to tweak radix_tree_node_alloc() to somehow recognize
> > >>> whether the caller wants it to use preloaded nodes or not and that callers
> > >>> would have to get it right (although maybe some magic in radix_tree_preload()
> > >>> could handle that).
> > >>>
> > >>> Opinions?
> > >>
> > >> BUG_ON(in_interrupt()) :)
> > > Or maybe WARN_ON()... But it's not so easy :) Currently radix tree code
> > > assumes that if gfp_mask doesn't have __GFP_WAIT set caller has performed
> > > radix_tree_preload(). Clearly this will stop working for in-interrupt users
> > > of radix tree. So how do we propagate the information from the caller of
> > > radix_tree_insert() down to radix_tree_node_alloc() whether the preload has
> > > been performed or not? Will we rely on in_interrupt() or use some special
> > > gfp_mask bit?
> >
> > Should have read the full thread... in_interrupt() is ugly to base
> > decisions on, imho. I'd say just use __GFP_WAIT to signal this.
> Yeah, probably that would be nicer and doable in radix_tree_preload().
> But in radix_tree_node_alloc() we get gfp_mask as set for the radix tree
> and thus it's always going to be GFP_ATOMIC in case of ioc radix tree. So
> there we have to have something different if we don't want in interrupt
> users to use preallocated nodes. As Andrew suggests maybe that's not
> necessary but then my original patch is what you end up with.

At least for the ioc functions, failure can be tolerated (and is
handled). So for this specific case, we don't need to do anything futher
than just the GFP_ATOMIC set in the tree. If that fails, we just don't
link the ioc this time.

--
Jens Axboe

2013-07-22 20:30:36

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH RFC] lib: Make radix_tree_node_alloc() irq safe

On Thu 18-07-13 14:37:15, Andrew Morton wrote:
> On Thu, 18 Jul 2013 15:09:32 +0200 Jan Kara <[email protected]> wrote:
>
> > On Wed 17-07-13 16:12:00, Andrew Morton wrote:
> > > On Tue, 16 Jul 2013 19:06:30 +0200 Jan Kara <[email protected]> wrote:
> > >
> > > BUG_ON(in_interrupt()) :)
> > Or maybe WARN_ON()... But it's not so easy :) Currently radix tree code
> > assumes that if gfp_mask doesn't have __GFP_WAIT set caller has performed
> > radix_tree_preload(). Clearly this will stop working for in-interrupt users
> > of radix tree. So how do we propagate the information from the caller of
> > radix_tree_insert() down to radix_tree_node_alloc() whether the preload has
> > been performed or not? Will we rely on in_interrupt() or use some special
> > gfp_mask bit?
>
> Well, it won't stop working. The interrupt-time
> radix_tree_node_alloc() call will try to grab a node from the cpu-local
> magazine and if that failed, will call kmem_cache_alloc(). Presumably
> the caller has passed in GFP_ATOMIC, so the kmem_cache_alloc() will use
> page reserves, which seems appropriate.
True. But then we have to make radix_tree_node_alloc() irq safe as I did
in my patch.

> This will mean that the interrupt-time node allocation will sometimes
> steal a preloaded node from process-context code. In the absolutely
> worst case, the process-context code will then need to try
> kmem_cache_alloc(), which will probably succeed anyway.
What I found somewhat nasty about stealing of preallocated nodes by
interrupt is that process-context user can do preload which succeeds so he
assumes radix_tree_node_alloc() cannot fail. If interrupt steals nodes,
radix_tree_node_alloc() can return NULL for process-context user and that
may oops or whatever (I've checked and there are users where this would
currently happen - e.g. mm/vmalloc.c:new_vmap_block()). And it will be
quite hard to track down why that happened.

So the question is more about the status of radix_tree_preload() - do we
make it just an optimization (so that allocations have higher chance of
success / stress the system less) or do we want to guarantee the caller
that nodes are there to be used? In the first case we have to fix some
users of radix_tree_preload() in the second case we have to tweak
radix_tree_node_alloc() like you suggest below.

> It's not perfect - we'd prefer that process-context node allocations
> not get stolen in this fashion. That's easily fixed with
>
> --- a/lib/radix-tree.c~a
> +++ a/lib/radix-tree.c
> @@ -207,7 +207,10 @@ radix_tree_node_alloc(struct radix_tree_
> struct radix_tree_node *ret = NULL;
> gfp_t gfp_mask = root_gfp_mask(root);
>
> - if (!(gfp_mask & __GFP_WAIT)) {
> + /*
> + * Lengthy comment goes here
> + */
> + if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> struct radix_tree_preload *rtp;
>
> /*
>
> But I don't know if it's worth it.
>
> > Secondly, CFQ has this unpleasant property that some functions are
> > sometimes called from interrupt context and sometimes not. So these
> > functions would have to check in what context they are called and either
> > perform preload or not. That's doable but it's going to be a bit ugly and
> > has to match the check in radix_tree_node_alloc() whether preload should be
> > used or not. So leaving the checking to the users of radix tree looks
> > fragile.
>
> mm... The CFQ code should be passing around a gfp_t anyway - GFP_NOIO
> or GFP_ATOMIC, depending on the calling context. So don't call
> radix_tree_preload() if it's GFP_ATOMIC.
I've checked and there are other users which can end up calling
radix_tree_preload() with GFP_ATOMIC or similar gfp masks (e.g. some
add_to_swap_cache() users). So I think it would be better to handle that
case inside a common function. Maybe we could have a separate function like
radix_tree_try_preload() which would just skip any preloading if mask
doesn't have __GFP_WAIT set and radix_tree_preload() will complain if it is
passed a gfp mask without __GFP_WAIT. Hm?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR