It's a bug in linux-next, but taking me too long to identify which
commit is "to blame", so let me throw it over to you without more
delay: I think __PageMovable() now needs to check !PageSlab().
I had made a small experimental change somewhere, rebuilt and rebooted,
was not surprised to crash once swapping and compaction came in,
but was surprised to find the crash in isolate_movable_page(),
called by compaction's isolate_migratepages_block().
page->mapping was ffffffff811303aa, which qualifies as __PageMovable(),
which expects struct movable_operations at page->mapping minus low bits.
But ffffffff811303aa was the address of SLUB's rcu_free_slab(): I have
CONFIG_CC_OPTIMIZE_FOR_SIZE=y, so function addresses may have low bits set.
Over to you! Thanks,
Hugh
On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> It's a bug in linux-next, but taking me too long to identify which
> commit is "to blame", so let me throw it over to you without more
> delay: I think __PageMovable() now needs to check !PageSlab().
>
> I had made a small experimental change somewhere, rebuilt and rebooted,
> was not surprised to crash once swapping and compaction came in,
> but was surprised to find the crash in isolate_movable_page(),
> called by compaction's isolate_migratepages_block().
>
> page->mapping was ffffffff811303aa, which qualifies as __PageMovable(),
> which expects struct movable_operations at page->mapping minus low bits.
> But ffffffff811303aa was the address of SLUB's rcu_free_slab(): I have
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y, so function addresses may have low bits set.
>
> Over to you! Thanks,
> Hugh
Wow, didn't expect this.
Thank you for report!
That should be due to commit 65505d1f2338e7
("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
as now rcu_head can use some bits that shares with mapping.
Hmm IMO we have two choices...
1. simply drop the commit as it's only for debugging (RCU folks may not like [1])
2. make __PageMovable() to use true page flag, with approach [2])
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
Thanks!
--
Thanks,
Hyeonggon
On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> > It's a bug in linux-next, but taking me too long to identify which
> > commit is "to blame", so let me throw it over to you without more
> > delay: I think __PageMovable() now needs to check !PageSlab().
> >
> > I had made a small experimental change somewhere, rebuilt and rebooted,
> > was not surprised to crash once swapping and compaction came in,
> > but was surprised to find the crash in isolate_movable_page(),
> > called by compaction's isolate_migratepages_block().
> >
> > page->mapping was ffffffff811303aa, which qualifies as __PageMovable(),
> > which expects struct movable_operations at page->mapping minus low bits.
> > But ffffffff811303aa was the address of SLUB's rcu_free_slab(): I have
> > CONFIG_CC_OPTIMIZE_FOR_SIZE=y, so function addresses may have low bits set.
> >
> > Over to you! Thanks,
> > Hugh
>
> Wow, didn't expect this.
> Thank you for report!
>
> That should be due to commit 65505d1f2338e7
> ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
> as now rcu_head can use some bits that shares with mapping.
>
> Hmm IMO we have two choices...
>
> 1. simply drop the commit as it's only for debugging (RCU folks may not like [1])
Yeah definitely don't like this option as patches are out that depend on
this (not yet merged though). :-)
> 2. make __PageMovable() to use true page flag, with approach [2])
What are the drawbacks of making it a true flag?
thanks,
- Joel
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks!
>
> --
> Thanks,
> Hyeonggon
On Wed, Sep 28, 2022 at 01:48:46PM +0000, Joel Fernandes wrote:
> On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
> > On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> > > It's a bug in linux-next, but taking me too long to identify which
> > > commit is "to blame", so let me throw it over to you without more
> > > delay: I think __PageMovable() now needs to check !PageSlab().
> > >
> > > I had made a small experimental change somewhere, rebuilt and rebooted,
> > > was not surprised to crash once swapping and compaction came in,
> > > but was surprised to find the crash in isolate_movable_page(),
> > > called by compaction's isolate_migratepages_block().
> > >
> > > page->mapping was ffffffff811303aa, which qualifies as __PageMovable(),
> > > which expects struct movable_operations at page->mapping minus low bits.
> > > But ffffffff811303aa was the address of SLUB's rcu_free_slab(): I have
> > > CONFIG_CC_OPTIMIZE_FOR_SIZE=y, so function addresses may have low bits set.
> > >
> > > Over to you! Thanks,
> > > Hugh
> >
> > Wow, didn't expect this.
> > Thank you for report!
> >
> > That should be due to commit 65505d1f2338e7
> > ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
> > as now rcu_head can use some bits that shares with mapping.
> >
> > Hmm IMO we have two choices...
> >
> > 1. simply drop the commit as it's only for debugging (RCU folks may not like [1])
>
> Yeah definitely don't like this option as patches are out that depend on
> this (not yet merged though). :-)
Yeah, better to avoid that.
> > 2. make __PageMovable() to use true page flag, with approach [2])
>
> What are the drawbacks of making it a true flag?
We don't have free bit in page flags unless we free one ;)
> thanks,
>
> - Joel
> > [1] https://lore.kernel.org/all/[email protected]/
> > [2] https://lore.kernel.org/linux-mm/[email protected]/
--
Thanks,
Hyeonggon
On 9/28/22 15:48, Joel Fernandes wrote:
> On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
>> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
>>> It's a bug in linux-next, but taking me too long to identify which
>>> commit is "to blame", so let me throw it over to you without more
>>> delay: I think __PageMovable() now needs to check !PageSlab().
When I tried that, the result wasn't really nice:
https://lore.kernel.org/all/[email protected]/
And what if there's another conflicting page "type" later. Or the
debugging variant of rcu_head in struct page itself. The __PageMovable()
is just too fragile.
>>> I had made a small experimental change somewhere, rebuilt and rebooted,
>>> was not surprised to crash once swapping and compaction came in,
>>> but was surprised to find the crash in isolate_movable_page(),
>>> called by compaction's isolate_migratepages_block().
>>>
>>> page->mapping was ffffffff811303aa, which qualifies as __PageMovable(),
>>> which expects struct movable_operations at page->mapping minus low bits.
>>> But ffffffff811303aa was the address of SLUB's rcu_free_slab(): I have
>>> CONFIG_CC_OPTIMIZE_FOR_SIZE=y, so function addresses may have low bits set.
>>>
>>> Over to you! Thanks,
>>> Hugh
>>
>> Wow, didn't expect this.
>> Thank you for report!
>>
>> That should be due to commit 65505d1f2338e7
>> ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
>> as now rcu_head can use some bits that shares with mapping.
>>
>> Hmm IMO we have two choices...
>>
>> 1. simply drop the commit as it's only for debugging (RCU folks may not like [1])
>
> Yeah definitely don't like this option as patches are out that depend on
> this (not yet merged though). :-)
But we'll have to do that for now and postpone to 6.2 I'm afraid as
merge window for 6.1 is too close to have confidence in any solution
that we came up this moment.
>> 2. make __PageMovable() to use true page flag, with approach [2])
>
> What are the drawbacks of making it a true flag?
Even if we free PageSlab, I'm sure there will be better uses of a free
page flag than __PageMovable.
3. With frozen page allocation
https://lore.kernel.org/all/[email protected]/
slab pages will have refcount 0 and compaction will skip them for that
reason. But it had other unresolved issues with page isolation code IIRC.
> thanks,
>
> - Joel
>
>
>
>
>> [1] https://lore.kernel.org/all/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Thanks!
>>
>> --
>> Thanks,
>> Hyeonggon
On Wed, 28 Sep 2022, Vlastimil Babka wrote:
> On 9/28/22 15:48, Joel Fernandes wrote:
> > On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
> >> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> >>> It's a bug in linux-next, but taking me too long to identify which
> >>> commit is "to blame", so let me throw it over to you without more
> >>> delay: I think __PageMovable() now needs to check !PageSlab().
>
> When I tried that, the result wasn't really nice:
>
> https://lore.kernel.org/all/[email protected]/
>
> And what if there's another conflicting page "type" later. Or the debugging
> variant of rcu_head in struct page itself. The __PageMovable() is just too
> fragile.
I don't disagree (and don't really know all the things you're thinking
of in there). But if it's important to rescue this feature for 6.1, a
different approach may be the very simple patch below (I met a similar
issue with OPTIMIZE_FOR_SIZE in i915 a year ago, and just remembered).
But you be the judge of it: (a) I do not know whether rcu_free_slab
is the only risky address ever stuffed into that field; and (b) I'm
clueless when it comes to those architectures (powerpc etc) where the
the address of a function is something different from the address of
the function (have I conveyed my cluelessness adequately?).
Hugh
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1953,7 +1953,12 @@ static void __free_slab(struct kmem_cach
__free_pages(folio_page(folio, 0), order);
}
-static void rcu_free_slab(struct rcu_head *h)
+/*
+ * rcu_free_slab() must be __aligned(4) because its address is saved
+ * in the rcu_head field, which coincides with page->mapping, which
+ * causes trouble if compaction mistakes it for PAGE_MAPPING_MOVABLE.
+ */
+__aligned(4) static void rcu_free_slab(struct rcu_head *h)
{
struct slab *slab = container_of(h, struct slab, rcu_head);
On Wed, Sep 28, 2022 at 06:20:10PM +0200, Vlastimil Babka wrote:
> On 9/28/22 15:48, Joel Fernandes wrote:
> > On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
> > > On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> > > > It's a bug in linux-next, but taking me too long to identify which
> > > > commit is "to blame", so let me throw it over to you without more
> > > > delay: I think __PageMovable() now needs to check !PageSlab().
>
> When I tried that, the result wasn't really nice:
>
> https://lore.kernel.org/all/[email protected]/
>
> And what if there's another conflicting page "type" later. Or the debugging
> variant of rcu_head in struct page itself. The __PageMovable() is just too
> fragile.
>
> > > > I had made a small experimental change somewhere, rebuilt and rebooted,
> > > > was not surprised to crash once swapping and compaction came in,
> > > > but was surprised to find the crash in isolate_movable_page(),
> > > > called by compaction's isolate_migratepages_block().
> > > >
> > > > page->mapping was ffffffff811303aa, which qualifies as __PageMovable(),
> > > > which expects struct movable_operations at page->mapping minus low bits.
> > > > But ffffffff811303aa was the address of SLUB's rcu_free_slab(): I have
> > > > CONFIG_CC_OPTIMIZE_FOR_SIZE=y, so function addresses may have low bits set.
> > > >
> > > > Over to you! Thanks,
> > > > Hugh
> > >
> > > Wow, didn't expect this.
> > > Thank you for report!
> > >
> > > That should be due to commit 65505d1f2338e7
> > > ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
> > > as now rcu_head can use some bits that shares with mapping.
> > >
> > > Hmm IMO we have two choices...
> > >
> > > 1. simply drop the commit as it's only for debugging (RCU folks may not like [1])
> >
> > Yeah definitely don't like this option as patches are out that depend on
> > this (not yet merged though). :-)
>
> But we'll have to do that for now and postpone to 6.2 I'm afraid as merge
> window for 6.1 is too close to have confidence in any solution that we came
> up this moment.
Agreed.
>
> > > 2. make __PageMovable() to use true page flag, with approach [2])
> >
> > What are the drawbacks of making it a true flag?
>
> Even if we free PageSlab, I'm sure there will be better uses of a free page
> flag than __PageMovable.
>
> 3. With frozen page allocation
> https://lore.kernel.org/all/[email protected]/
>
> slab pages will have refcount 0 and compaction will skip them for that
> reason. But it had other unresolved issues with page isolation code IIRC.
>
4. Always align function rcu_free_slab() with 'aligned' attribute
[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-aligned-function-attribute
[2] __aligned in include/linux/compiler_attributes.h
> > thanks,
> >
> > - Joel
> >
> >
> >
> >
> > > [1] https://lore.kernel.org/all/[email protected]/
> > > [2] https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > Thanks!
> > >
> > > --
> > > Thanks,
> > > Hyeonggon
>
--
Thanks,
Hyeonggon
> On Sep 28, 2022, at 1:56 PM, Hyeonggon Yoo <[email protected]> wrote:
>
> On Wed, Sep 28, 2022 at 06:20:10PM +0200, Vlastimil Babka wrote:
[..]
>>>> Thank you for report!
>>>>
>>>> That should be due to commit 65505d1f2338e7
>>>> ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
>>>> as now rcu_head can use some bits that shares with mapping.
>>>>
>>>> Hmm IMO we have two choices...
>>>>
>>>> 1. simply drop the commit as it's only for debugging (RCU folks may not like [1])
>>>
>>> Yeah definitely don't like this option as patches are out that depend on
>>> this (not yet merged though). :-)
>>
>> But we'll have to do that for now and postpone to 6.2 I'm afraid as merge
>> window for 6.1 is too close to have confidence in any solution that we came
>> up this moment.
I am ok with your postponing till then, gives me time to get other patches in, in the mean while :-)
Thanks again for your work on this,
- Joel
On 9/28/22 19:50, Hugh Dickins wrote:
> On Wed, 28 Sep 2022, Vlastimil Babka wrote:
>> On 9/28/22 15:48, Joel Fernandes wrote:
>> > On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
>> >> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
>> >>> It's a bug in linux-next, but taking me too long to identify which
>> >>> commit is "to blame", so let me throw it over to you without more
>> >>> delay: I think __PageMovable() now needs to check !PageSlab().
>>
>> When I tried that, the result wasn't really nice:
>>
>> https://lore.kernel.org/all/[email protected]/
>>
>> And what if there's another conflicting page "type" later. Or the debugging
>> variant of rcu_head in struct page itself. The __PageMovable() is just too
>> fragile.
>
> I don't disagree (and don't really know all the things you're thinking
> of in there). But if it's important to rescue this feature for 6.1, a
> different approach may be the very simple patch below (I met a similar
> issue with OPTIMIZE_FOR_SIZE in i915 a year ago, and just remembered).
>
> But you be the judge of it: (a) I do not know whether rcu_free_slab
> is the only risky address ever stuffed into that field; and (b) I'm
> clueless when it comes to those architectures (powerpc etc) where the
> the address of a function is something different from the address of
> the function (have I conveyed my cluelessness adequately?).
Thanks a lot Hugh! That's a sufficiently small fix (compared to the other
options) that I'm probably give it one last try.
> Hugh
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1953,7 +1953,12 @@ static void __free_slab(struct kmem_cach
> __free_pages(folio_page(folio, 0), order);
> }
>
> -static void rcu_free_slab(struct rcu_head *h)
> +/*
> + * rcu_free_slab() must be __aligned(4) because its address is saved
> + * in the rcu_head field, which coincides with page->mapping, which
> + * causes trouble if compaction mistakes it for PAGE_MAPPING_MOVABLE.
> + */
> +__aligned(4) static void rcu_free_slab(struct rcu_head *h)
> {
> struct slab *slab = container_of(h, struct slab, rcu_head);
>
> -static void rcu_free_slab(struct rcu_head *h)
> +/*
> + * rcu_free_slab() must be __aligned(4) because its address is saved
> + * in the rcu_head field, which coincides with page->mapping, which
> + * causes trouble if compaction mistakes it for PAGE_MAPPING_MOVABLE.
> + */
> +__aligned(4) static void rcu_free_slab(struct rcu_head *h)
> {
> struct slab *slab = container_of(h, struct slab, rcu_head);
>
Isn't that going to cause grief with options that align
functions on 16/32byte boundaries when adding space for
'other stuff'?
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 9/29/22 13:53, David Laight wrote:
>> -static void rcu_free_slab(struct rcu_head *h)
>> +/*
>> + * rcu_free_slab() must be __aligned(4) because its address is saved
>> + * in the rcu_head field, which coincides with page->mapping, which
>> + * causes trouble if compaction mistakes it for PAGE_MAPPING_MOVABLE.
>> + */
>> +__aligned(4) static void rcu_free_slab(struct rcu_head *h)
>> {
>> struct slab *slab = container_of(h, struct slab, rcu_head);
>>
>
> Isn't that going to cause grief with options that align
> functions on 16/32byte boundaries when adding space for
> 'other stuff'?
How is that done exactly? Also having higher alignment (16/32) is not in
conflict with asking for 4?
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
From: Vlastimil Babka
> Sent: 29 September 2022 14:01
>
> On 9/29/22 13:53, David Laight wrote:
> >> -static void rcu_free_slab(struct rcu_head *h)
> >> +/*
> >> + * rcu_free_slab() must be __aligned(4) because its address is saved
> >> + * in the rcu_head field, which coincides with page->mapping, which
> >> + * causes trouble if compaction mistakes it for PAGE_MAPPING_MOVABLE.
> >> + */
> >> +__aligned(4) static void rcu_free_slab(struct rcu_head *h)
> >> {
> >> struct slab *slab = container_of(h, struct slab, rcu_head);
> >>
> >
> > Isn't that going to cause grief with options that align
> > functions on 16/32byte boundaries when adding space for
> > 'other stuff'?
>
> How is that done exactly? Also having higher alignment (16/32) is not in
> conflict with asking for 4?
It depends on which one is actually used.
The __aligned(4) might take precedence over the default alignment.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, 29 Sep 2022, Vlastimil Babka wrote:
> On 9/28/22 19:50, Hugh Dickins wrote:
> > On Wed, 28 Sep 2022, Vlastimil Babka wrote:
> >> On 9/28/22 15:48, Joel Fernandes wrote:
> >> > On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
> >> >> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> >> >>> It's a bug in linux-next, but taking me too long to identify which
> >> >>> commit is "to blame", so let me throw it over to you without more
> >> >>> delay: I think __PageMovable() now needs to check !PageSlab().
> >>
> >> When I tried that, the result wasn't really nice:
> >>
> >> https://lore.kernel.org/all/[email protected]/
> >>
> >> And what if there's another conflicting page "type" later. Or the debugging
> >> variant of rcu_head in struct page itself. The __PageMovable() is just too
> >> fragile.
> >
> > I don't disagree (and don't really know all the things you're thinking
> > of in there). But if it's important to rescue this feature for 6.1, a
> > different approach may be the very simple patch below (I met a similar
> > issue with OPTIMIZE_FOR_SIZE in i915 a year ago, and just remembered).
> >
> > But you be the judge of it: (a) I do not know whether rcu_free_slab
> > is the only risky address ever stuffed into that field; and (b) I'm
> > clueless when it comes to those architectures (powerpc etc) where the
> > the address of a function is something different from the address of
> > the function (have I conveyed my cluelessness adequately?).
>
> Thanks a lot Hugh! That's a sufficiently small fix (compared to the other
> options) that I'm probably give it one last try.
I suddenly worried that you might be waiting on me for a Signed-off-by,
which I couldn't give until I researched my reservations (a) and (b):
but I'm pleased to see from your kernel.org tree that you've gone ahead
and folded it in - thanks.
Regarding (a): great, you've found it too, mm/slab.c's kmem_rcu_free()
looks like it needs the same __aligned(4) as mm/slub.c's rcu_free_slabi().
Regarding (b): I booted the PowerMac G5 to take a look, and dredged up
the relevant phrase "function descriptor" from depths of my memory: I
was right to consider that case, but it's not a worry - the first field
of a function descriptor structure (on all the architectures I found it)
is the function address, so the function descriptor address would be
aligned 4 or 8 anyway.
Regarding "conflicting" alignment requests: yes, I agree with you,
it would have to be a toolchain bug if when asked to align 2 and to
align 4, it chose not to align 4.
So, no worries at my end now.
Hugh
On 9/29/22 23:54, Hugh Dickins wrote:
> On Thu, 29 Sep 2022, Vlastimil Babka wrote:
>> On 9/28/22 19:50, Hugh Dickins wrote:
>> > On Wed, 28 Sep 2022, Vlastimil Babka wrote:
>> >> On 9/28/22 15:48, Joel Fernandes wrote:
>> >> > On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
>> >> >> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
>> >> >>> It's a bug in linux-next, but taking me too long to identify which
>> >> >>> commit is "to blame", so let me throw it over to you without more
>> >> >>> delay: I think __PageMovable() now needs to check !PageSlab().
>> >>
>> >> When I tried that, the result wasn't really nice:
>> >>
>> >> https://lore.kernel.org/all/[email protected]/
>> >>
>> >> And what if there's another conflicting page "type" later. Or the debugging
>> >> variant of rcu_head in struct page itself. The __PageMovable() is just too
>> >> fragile.
>> >
>> > I don't disagree (and don't really know all the things you're thinking
>> > of in there). But if it's important to rescue this feature for 6.1, a
>> > different approach may be the very simple patch below (I met a similar
>> > issue with OPTIMIZE_FOR_SIZE in i915 a year ago, and just remembered).
>> >
>> > But you be the judge of it: (a) I do not know whether rcu_free_slab
>> > is the only risky address ever stuffed into that field; and (b) I'm
>> > clueless when it comes to those architectures (powerpc etc) where the
>> > the address of a function is something different from the address of
>> > the function (have I conveyed my cluelessness adequately?).
>>
>> Thanks a lot Hugh! That's a sufficiently small fix (compared to the other
>> options) that I'm probably give it one last try.
>
> I suddenly worried that you might be waiting on me for a Signed-off-by,
> which I couldn't give until I researched my reservations (a) and (b):
> but I'm pleased to see from your kernel.org tree that you've gone ahead
> and folded it in - thanks.
Yeah could have been more explicit about that, sorry. But made the whole
thing a very last merge so I can still drop it before the pull request.
> Regarding (a): great, you've found it too, mm/slab.c's kmem_rcu_free()
> looks like it needs the same __aligned(4) as mm/slub.c's rcu_free_slabi().
Right.
> Regarding (b): I booted the PowerMac G5 to take a look, and dredged up
> the relevant phrase "function descriptor" from depths of my memory: I
> was right to consider that case, but it's not a worry - the first field
> of a function descriptor structure (on all the architectures I found it)
> is the function address, so the function descriptor address would be
> aligned 4 or 8 anyway.
Thanks. I admit I wasn't that thorough, just consulted somebody internally :)
> Regarding "conflicting" alignment requests: yes, I agree with you,
> it would have to be a toolchain bug if when asked to align 2 and to
> align 4, it chose not to align 4.
Yeah. But I still would be less worried if another __aligned(X) function
existed in the tree already. Found only data. I assume the i915 thing wasn't
fixed like this in the tree? So if there are buggy toolchains or anything,
it will be us to discover them.
So I think we still should defuse the __PageMovable() mine somehow.
> So, no worries at my end now.
> Hugh
> > > Regarding "conflicting" alignment requests: yes, I agree with you,
> > > it would have to be a toolchain bug if when asked to align 2 and to
> > > align 4, it chose not to align 4.
See https://godbolt.org/z/3nGsTaf5e
the align() directive takes precedence.
Here you only want to ensure the alignment is at least 4.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, 30 Sep 2022, Vlastimil Babka wrote:
> On 9/29/22 23:54, Hugh Dickins wrote:
> > On Thu, 29 Sep 2022, Vlastimil Babka wrote:
> >>
> >> Thanks a lot Hugh! That's a sufficiently small fix (compared to the other
> >> options) that I'm probably give it one last try.
> >
> > I suddenly worried that you might be waiting on me for a Signed-off-by,
> > which I couldn't give until I researched my reservations (a) and (b):
> > but I'm pleased to see from your kernel.org tree that you've gone ahead
> > and folded it in - thanks.
>
> Yeah could have been more explicit about that, sorry. But made the whole
> thing a very last merge so I can still drop it before the pull request.
No probs, you did the right thing.
>
> > Regarding (a): great, you've found it too, mm/slab.c's kmem_rcu_free()
> > looks like it needs the same __aligned(4) as mm/slub.c's rcu_free_slabi().
>
> Right.
>
> > Regarding (b): I booted the PowerMac G5 to take a look, and dredged up
> > the relevant phrase "function descriptor" from depths of my memory: I
> > was right to consider that case, but it's not a worry - the first field
> > of a function descriptor structure (on all the architectures I found it)
> > is the function address, so the function descriptor address would be
> > aligned 4 or 8 anyway.
>
> Thanks. I admit I wasn't that thorough, just consulted somebody internally :)
Exactly what I had hoped you would do.
>
> > Regarding "conflicting" alignment requests: yes, I agree with you,
> > it would have to be a toolchain bug if when asked to align 2 and to
> > align 4, it chose not to align 4.
>
> Yeah. But I still would be less worried if another __aligned(X) function
> existed in the tree already. Found only data. I assume the i915 thing wasn't
> fixed like this in the tree? So if there are buggy toolchains or anything,
> it will be us to discover them.
Linus put it in himself, after it had got lost over several -rcs:
5.15's cdc1e6e225e3 ("drm/i915: fix blank screen booting crashes").
Originally I'd written "__aligned(4)" explicitly, but later found i915
relied on it elsewhere since 4.9, and had an __i915_sw_fence_call for it.
But 5.17's 44505168d743 ("drm/i915: Drop stealing of bits from
i915_sw_fence function pointer") appears to have removed all that now.
I think that gives assurance that the x86 toolchains are okay;
but I imagine i915 is unlikely to be found on other architectures,
so not quite so much assurance there.
Hugh
On Fri, 30 Sep 2022, David Laight wrote:
> > > > Regarding "conflicting" alignment requests: yes, I agree with you,
> > > > it would have to be a toolchain bug if when asked to align 2 and to
> > > > align 4, it chose not to align 4.
>
> See https://godbolt.org/z/3nGsTaf5e
> the align() directive takes precedence.
>
> Here you only want to ensure the alignment is at least 4.
Sorry, I don't understand the point you are making,
nor how to deduce it from the link which you give:
I'll leave it to those who understand better.
Hugh
From: Hugh Dickins
> Sent: 30 September 2022 17:22
>
> On Fri, 30 Sep 2022, David Laight wrote:
> > > > > Regarding "conflicting" alignment requests: yes, I agree with you,
> > > > > it would have to be a toolchain bug if when asked to align 2 and to
> > > > > align 4, it chose not to align 4.
> >
> > See https://godbolt.org/z/3nGsTaf5e
> > the align() directive takes precedence.
> >
> > Here you only want to ensure the alignment is at least 4.
>
> Sorry, I don't understand the point you are making,
> nor how to deduce it from the link which you give:
> I'll leave it to those who understand better.
(I've copied Kees Cook - I think it is his patches that are
increasing the function alignment.
He'll need to find the thread history...)
IIRC -malign_functions is being used to put functions onto
16 byte boundaries to get aligned pad space before the function.
Adding __align(4) overrides this and (I think) will remove
the pad space.
I could only see the alignment directives in gcc ouput.
clang didn't seem to generate them.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, Sep 29, 2022 at 02:54:45PM -0700, Hugh Dickins wrote:
> On Thu, 29 Sep 2022, Vlastimil Babka wrote:
> > On 9/28/22 19:50, Hugh Dickins wrote:
> > > On Wed, 28 Sep 2022, Vlastimil Babka wrote:
> > >> On 9/28/22 15:48, Joel Fernandes wrote:
> > >> > On Wed, Sep 28, 2022 at 02:49:02PM +0900, Hyeonggon Yoo wrote:
> > >> >> On Tue, Sep 27, 2022 at 10:16:35PM -0700, Hugh Dickins wrote:
> > >> >>> It's a bug in linux-next, but taking me too long to identify which
> > >> >>> commit is "to blame", so let me throw it over to you without more
> > >> >>> delay: I think __PageMovable() now needs to check !PageSlab().
> > >>
> > >> When I tried that, the result wasn't really nice:
> > >>
> > >> https://lore.kernel.org/all/[email protected]/
> > >>
> > >> And what if there's another conflicting page "type" later. Or the debugging
> > >> variant of rcu_head in struct page itself. The __PageMovable() is just too
> > >> fragile.
> > >
> > > I don't disagree (and don't really know all the things you're thinking
> > > of in there). But if it's important to rescue this feature for 6.1, a
> > > different approach may be the very simple patch below (I met a similar
> > > issue with OPTIMIZE_FOR_SIZE in i915 a year ago, and just remembered).
> > >
> > > But you be the judge of it: (a) I do not know whether rcu_free_slab
> > > is the only risky address ever stuffed into that field; and (b) I'm
> > > clueless when it comes to those architectures (powerpc etc) where the
> > > the address of a function is something different from the address of
> > > the function (have I conveyed my cluelessness adequately?).
> >
> > Thanks a lot Hugh! That's a sufficiently small fix (compared to the other
> > options) that I'm probably give it one last try.
>
> I suddenly worried that you might be waiting on me for a Signed-off-by,
> which I couldn't give until I researched my reservations (a) and (b):
> but I'm pleased to see from your kernel.org tree that you've gone ahead
> and folded it in - thanks.
>
> Regarding (a): great, you've found it too, mm/slab.c's kmem_rcu_free()
> looks like it needs the same __aligned(4) as mm/slub.c's rcu_free_slabi().
Just one more thing, rcu_leak_callback too. RCU seem to use it
internally to catch double call_rcu().
And some suggestions:
- what about adding runtime WARN() on slab init code to catch
unexpected arch/toolchain issues?
- instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
--
Thanks,
Hyeonggon
On Sun, Oct 02, 2022 at 02:48:02PM +0900, Hyeonggon Yoo wrote:
> Just one more thing, rcu_leak_callback too. RCU seem to use it
> internally to catch double call_rcu().
>
> And some suggestions:
> - what about adding runtime WARN() on slab init code to catch
> unexpected arch/toolchain issues?
> - instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
I think the real problem here is that isolate_movable_page() is
insufficiently paranoid. Looking at the gyrations that GUP and the
page cache do to convince themselves that the page they got really is
the page they wanted, there are a few missing pieces (eg checking that
you actually got a refcount on _this_ page and not some random other
page you were temporarily part of a compound page with).
This patch does three things:
- Turns one of the comments into English. There are some others
which I'm still scratching my head over.
- Uses a folio to help distinguish which operations are being done
to the head vs the specific page (this is somewhat an abuse of the
folio concept, but it's acceptable)
- Add the aforementioned check that we're actually operating on the
page that we think we want to be.
- Add a check that the folio isn't secretly a slab.
We could put the slab check in PageMapping and call it after taking
the folio lock, but that seems pointless. It's the acquisition of
the refcount which stabilises the slab flag, not holding the lock.
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..a65598308c83 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -59,6 +59,7 @@
int isolate_movable_page(struct page *page, isolate_mode_t mode)
{
+ struct folio *folio = page_folio(page);
const struct movable_operations *mops;
/*
@@ -70,16 +71,23 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
* the put_page() at the end of this block will take care of
* release this page, thus avoiding a nasty leakage.
*/
- if (unlikely(!get_page_unless_zero(page)))
+ if (unlikely(!folio_try_get(folio)))
goto out;
+ /* Recheck the page is still part of the folio we just got */
+ if (unlikely(page_folio(page) != folio))
+ goto out_put;
+
/*
- * Check PageMovable before holding a PG_lock because page's owner
- * assumes anybody doesn't touch PG_lock of newly allocated page
- * so unconditionally grabbing the lock ruins page's owner side.
+ * Check movable flag before taking the folio lock because
+ * we use non-atomic bitops on newly allocated page flags so
+ * unconditionally grabbing the lock ruins page's owner side.
*/
- if (unlikely(!__PageMovable(page)))
- goto out_putpage;
+ if (unlikely(!__folio_test_movable(folio)))
+ goto out_put;
+ if (unlikely(folio_test_slab(folio)))
+ goto out_put;
+
/*
* As movable pages are not isolated from LRU lists, concurrent
* compaction threads can race against page migration functions
@@ -91,8 +99,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
* lets be sure we have the page lock
* before proceeding with the movable page isolation steps.
*/
- if (unlikely(!trylock_page(page)))
- goto out_putpage;
+ if (unlikely(!folio_trylock(folio)))
+ goto out_put;
if (!PageMovable(page) || PageIsolated(page))
goto out_no_isolated;
@@ -106,14 +114,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
/* Driver shouldn't use PG_isolated bit of page->flags */
WARN_ON_ONCE(PageIsolated(page));
SetPageIsolated(page);
- unlock_page(page);
+ folio_unlock(folio);
return 0;
out_no_isolated:
- unlock_page(page);
-out_putpage:
- put_page(page);
+ folio_unlock(folio);
+out_put:
+ folio_put(folio);
out:
return -EBUSY;
}
On Tue, Oct 04, 2022 at 11:26:33PM +0900, Hyeonggon Yoo wrote:
> > It's the acquisition of
> > the refcount which stabilises the slab flag, not holding the lock.
>
> But can you please elaborate how this prevents race between
> allocation & initialization of a slab and isolate_movable_page()?
>
> Or maybe we can handle it with frozen folio as Vlastimil suggested? ;-)
Yes, we discussed that a little yesterday. I'm hoping to have a
refreshed patchset for frozen folios out today. Some of this patch
is still needed, even if we go that route.
> > @@ -91,8 +99,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> > * lets be sure we have the page lock
> > * before proceeding with the movable page isolation steps.
> > */
> > - if (unlikely(!trylock_page(page)))
> > - goto out_putpage;
> > + if (unlikely(!folio_trylock(folio)))
> > + goto out_put;
>
> I don't know much about callers that this is trying to avoid race aginst...
>
> But for this to make sense, I think *every users* that doing their stuff with
> sub-page of a compound page should acquire folio lock and not page lock
> of sub-page, right?
There is no page lock per se. If you try to acquire the lock on a tail
page, it acquires the lock on its head page. It's been that way for a
very long time. A lot of people are confused by this, which was part of
the motivation for making it explicit with folios.
On Mon, Oct 03, 2022 at 06:00:35PM +0100, Matthew Wilcox wrote:
> On Sun, Oct 02, 2022 at 02:48:02PM +0900, Hyeonggon Yoo wrote:
> > Just one more thing, rcu_leak_callback too. RCU seem to use it
> > internally to catch double call_rcu().
> >
> > And some suggestions:
> > - what about adding runtime WARN() on slab init code to catch
> > unexpected arch/toolchain issues?
> > - instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
>
> I think the real problem here is that isolate_movable_page() is
> insufficiently paranoid. Looking at the gyrations that GUP and the
> page cache do to convince themselves that the page they got really is
> the page they wanted, there are a few missing pieces (eg checking that
> you actually got a refcount on _this_ page and not some random other
> page you were temporarily part of a compound page with).
>
> This patch does three things:
>
> - Turns one of the comments into English. There are some others
> which I'm still scratching my head over.
> - Uses a folio to help distinguish which operations are being done
> to the head vs the specific page (this is somewhat an abuse of the
> folio concept, but it's acceptable)
> - Add the aforementioned check that we're actually operating on the
> page that we think we want to be.
> - Add a check that the folio isn't secretly a slab.
>
> We could put the slab check in PageMapping and call it after taking
> the folio lock, but that seems pointless
I partially agree with this patch. I actually like it.
> It's the acquisition of
> the refcount which stabilises the slab flag, not holding the lock.
But can you please elaborate how this prevents race between
allocation & initialization of a slab and isolate_movable_page()?
Or maybe we can handle it with frozen folio as Vlastimil suggested? ;-)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a1597c92261..a65598308c83 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -59,6 +59,7 @@
>
> int isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> + struct folio *folio = page_folio(page);
> const struct movable_operations *mops;
>
> /*
> @@ -70,16 +71,23 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> * the put_page() at the end of this block will take care of
> * release this page, thus avoiding a nasty leakage.
> */
> - if (unlikely(!get_page_unless_zero(page)))
> + if (unlikely(!folio_try_get(folio)))
> goto out;
>
> + /* Recheck the page is still part of the folio we just got */
> + if (unlikely(page_folio(page) != folio))
> + goto out_put;
> +
> /*
> - * Check PageMovable before holding a PG_lock because page's owner
> - * assumes anybody doesn't touch PG_lock of newly allocated page
> - * so unconditionally grabbing the lock ruins page's owner side.
> + * Check movable flag before taking the folio lock because
> + * we use non-atomic bitops on newly allocated page flags so
> + * unconditionally grabbing the lock ruins page's owner side.
> */
> - if (unlikely(!__PageMovable(page)))
> - goto out_putpage;
> + if (unlikely(!__folio_test_movable(folio)))
> + goto out_put;
> + if (unlikely(folio_test_slab(folio)))
> + goto out_put;
> +
> /*
> * As movable pages are not isolated from LRU lists, concurrent
> * compaction threads can race against page migration functions
> @@ -91,8 +99,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> * lets be sure we have the page lock
> * before proceeding with the movable page isolation steps.
> */
> - if (unlikely(!trylock_page(page)))
> - goto out_putpage;
> + if (unlikely(!folio_trylock(folio)))
> + goto out_put;
I don't know much about callers that this is trying to avoid race aginst...
But for this to make sense, I think *every users* that doing their stuff with
sub-page of a compound page should acquire folio lock and not page lock
of sub-page, right?
> if (!PageMovable(page) || PageIsolated(page))
> goto out_no_isolated;
> @@ -106,14 +114,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> /* Driver shouldn't use PG_isolated bit of page->flags */
> WARN_ON_ONCE(PageIsolated(page));
> SetPageIsolated(page);
> - unlock_page(page);
> + folio_unlock(folio);
>
> return 0;
>
> out_no_isolated:
> - unlock_page(page);
> -out_putpage:
> - put_page(page);
> + folio_unlock(folio);
> +out_put:
> + folio_put(folio);
> out:
> return -EBUSY;
> }
--
Thanks,
Hyeonggon
On Tue, Oct 04, 2022 at 03:40:36PM +0100, Matthew Wilcox wrote:
> On Tue, Oct 04, 2022 at 11:26:33PM +0900, Hyeonggon Yoo wrote:
> > > It's the acquisition of
> > > the refcount which stabilises the slab flag, not holding the lock.
> >
> > But can you please elaborate how this prevents race between
> > allocation & initialization of a slab and isolate_movable_page()?
> >
> > Or maybe we can handle it with frozen folio as Vlastimil suggested? ;-)
>
> Yes, we discussed that a little yesterday. I'm hoping to have a
> refreshed patchset for frozen folios out today. Some of this patch
> is still needed, even if we go that route.
Good to hear that.
With that, everyting looks sane to me.
> > > @@ -91,8 +99,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> > > * lets be sure we have the page lock
> > > * before proceeding with the movable page isolation steps.
> > > */
> > > - if (unlikely(!trylock_page(page)))
> > > - goto out_putpage;
> > > + if (unlikely(!folio_trylock(folio)))
> > > + goto out_put;
> >
> > I don't know much about callers that this is trying to avoid race aginst...
> >
> > But for this to make sense, I think *every users* that doing their stuff with
> > sub-page of a compound page should acquire folio lock and not page lock
> > of sub-page, right?
>
> There is no page lock per se. If you try to acquire the lock on a tail
> page, it acquires the lock on its head page. It's been that way for a
> very long time. A lot of people are confused by this, which was part of
> the motivation for making it explicit with folios.
You are right! Reading the code, too bad
I even assumed that there was sub-page lock.
--
Thanks,
Hyeonggon
On 10/3/22 19:00, Matthew Wilcox wrote:
> On Sun, Oct 02, 2022 at 02:48:02PM +0900, Hyeonggon Yoo wrote:
>> Just one more thing, rcu_leak_callback too. RCU seem to use it
>> internally to catch double call_rcu().
>>
>> And some suggestions:
>> - what about adding runtime WARN() on slab init code to catch
>> unexpected arch/toolchain issues?
>> - instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
>
> I think the real problem here is that isolate_movable_page() is
> insufficiently paranoid. Looking at the gyrations that GUP and the
> page cache do to convince themselves that the page they got really is
> the page they wanted, there are a few missing pieces (eg checking that
> you actually got a refcount on _this_ page and not some random other
> page you were temporarily part of a compound page with).
>
> This patch does three things:
>
> - Turns one of the comments into English. There are some others
> which I'm still scratching my head over.
> - Uses a folio to help distinguish which operations are being done
> to the head vs the specific page (this is somewhat an abuse of the
> folio concept, but it's acceptable)
> - Add the aforementioned check that we're actually operating on the
> page that we think we want to be.
> - Add a check that the folio isn't secretly a slab.
>
> We could put the slab check in PageMapping and call it after taking
> the folio lock, but that seems pointless. It's the acquisition of
> the refcount which stabilises the slab flag, not holding the lock.
>
I would like to have a working safe version in -next, even if we are able
simplify it later thanks to frozen refcounts. I've made a formal patch of
yours, but I'm still convinced the slab check needs to be more paranoid so
it can't observe a false positive __folio_test_movable() while missing the
folio_test_slab(), hence I added the barriers as in my previous attempt [1].
Does that work for you and can I add your S-o-b?
[1] https://lore.kernel.org/all/[email protected]/
----8<----
From 1d481f279f07d332ea381dfd6247a292ad403ed6 Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <[email protected]>
Date: Mon, 24 Oct 2022 16:11:27 +0200
Subject: [PATCH] mm/migrate: make isolate_movable_page() skip slab pages
In the next commit we want to rearrange struct slab fields to allow a
larger rcu_head. Afterwards, the page->mapping field will overlap
with SLUB's "struct list_head slab_list", where the value of prev
pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA.
Unfortunately the bit 1 being set can confuse PageMovable() to be a
false positive and cause a GPF as reported by lkp [1].
I think the real problem here is that isolate_movable_page() is
insufficiently paranoid. Looking at the gyrations that GUP and the
page cache do to convince themselves that the page they got really is
the page they wanted, there are a few missing pieces (eg checking that
you actually got a refcount on _this_ page and not some random other
page you were temporarily part of a compound page with).
This patch does three things:
- Turns one of the comments into English. There are some others
which I'm still scratching my head over.
- Uses a folio to help distinguish which operations are being done
to the head vs the specific page (this is somewhat an abuse of the
folio concept, but it's acceptable)
- Add the aforementioned check that we're actually operating on the
page that we think we want to be.
- Add a check that the folio isn't secretly a slab.
We could put the slab check in PageMapping and call it after taking
the folio lock, but that seems pointless. It's the acquisition of
the refcount which stabilises the slab flag, not holding the lock.
[ [email protected]: add memory barriers to SLAB and SLUB's page allocation
and freeing, and their counterparts to isolate_movable_page(), to make
the checks for folio_test_slab() and __folio_test_movable() SMP safe ]
[1] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/migrate.c | 38 ++++++++++++++++++++++++++------------
mm/slab.c | 6 +++++-
mm/slub.c | 6 +++++-
3 files changed, 36 insertions(+), 14 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 1379e1912772..ad79e7c23db5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -60,6 +60,7 @@
int isolate_movable_page(struct page *page, isolate_mode_t mode)
{
+ struct folio *folio = page_folio(page);
const struct movable_operations *mops;
/*
@@ -71,16 +72,29 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
* the put_page() at the end of this block will take care of
* release this page, thus avoiding a nasty leakage.
*/
- if (unlikely(!get_page_unless_zero(page)))
+ if (unlikely(!folio_try_get(folio)))
goto out;
+ /* Recheck the page is still part of the folio we just got */
+ if (unlikely(page_folio(page) != folio))
+ goto out_put;
+
+ if (unlikely(folio_test_slab(folio)))
+ goto out_put;
+ /* Pairs with smp_wmb() in slab freeing, e.g. SLUB's __free_slab() */
+ smp_rmb();
/*
- * Check PageMovable before holding a PG_lock because page's owner
- * assumes anybody doesn't touch PG_lock of newly allocated page
- * so unconditionally grabbing the lock ruins page's owner side.
+ * Check movable flag before taking the folio lock because
+ * we use non-atomic bitops on newly allocated page flags so
+ * unconditionally grabbing the lock ruins page's owner side.
*/
- if (unlikely(!__PageMovable(page)))
- goto out_putpage;
+ if (unlikely(!__folio_test_movable(folio)))
+ goto out_put;
+ /* Pairs with smp_wmb() in slab allocation, e.g. SLUB's alloc_slab_page() */
+ smp_rmb();
+ if (unlikely(folio_test_slab(folio)))
+ goto out_put;
+
/*
* As movable pages are not isolated from LRU lists, concurrent
* compaction threads can race against page migration functions
@@ -92,8 +106,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
* lets be sure we have the page lock
* before proceeding with the movable page isolation steps.
*/
- if (unlikely(!trylock_page(page)))
- goto out_putpage;
+ if (unlikely(!folio_trylock(folio)))
+ goto out_put;
if (!PageMovable(page) || PageIsolated(page))
goto out_no_isolated;
@@ -107,14 +121,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
/* Driver shouldn't use PG_isolated bit of page->flags */
WARN_ON_ONCE(PageIsolated(page));
SetPageIsolated(page);
- unlock_page(page);
+ folio_unlock(folio);
return 0;
out_no_isolated:
- unlock_page(page);
-out_putpage:
- put_page(page);
+ folio_unlock(folio);
+out_put:
+ folio_put(folio);
out:
return -EBUSY;
}
diff --git a/mm/slab.c b/mm/slab.c
index 59c8e28f7b6a..219beb48588e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
account_slab(slab, cachep->gfporder, cachep, flags);
__folio_set_slab(folio);
+ /* Make the flag visible before any changes to folio->mapping */
+ smp_wmb();
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
slab_set_pfmemalloc(slab);
@@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
BUG_ON(!folio_test_slab(folio));
__slab_clear_pfmemalloc(slab);
- __folio_clear_slab(folio);
page_mapcount_reset(folio_page(folio, 0));
folio->mapping = NULL;
+ /* Make the mapping reset visible before clearing the flag */
+ smp_wmb();
+ __folio_clear_slab(folio);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += 1 << order;
diff --git a/mm/slub.c b/mm/slub.c
index 157527d7101b..6dc17cb915c5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
slab = folio_slab(folio);
__folio_set_slab(folio);
+ /* Make the flag visible before any changes to folio->mapping */
+ smp_wmb();
if (page_is_pfmemalloc(folio_page(folio, 0)))
slab_set_pfmemalloc(slab);
@@ -2008,8 +2010,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
}
__slab_clear_pfmemalloc(slab);
- __folio_clear_slab(folio);
folio->mapping = NULL;
+ /* Make the mapping reset visible before clearing the flag */
+ smp_wmb();
+ __folio_clear_slab(folio);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
unaccount_slab(slab, order, s);
--
2.38.0
On 10/24/22 16:35, Vlastimil Babka wrote:
> On 10/3/22 19:00, Matthew Wilcox wrote:
>> On Sun, Oct 02, 2022 at 02:48:02PM +0900, Hyeonggon Yoo wrote:
>>> Just one more thing, rcu_leak_callback too. RCU seem to use it
>>> internally to catch double call_rcu().
>>>
>>> And some suggestions:
>>> - what about adding runtime WARN() on slab init code to catch
>>> unexpected arch/toolchain issues?
>>> - instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
>>
>> I think the real problem here is that isolate_movable_page() is
>> insufficiently paranoid. Looking at the gyrations that GUP and the
>> page cache do to convince themselves that the page they got really is
>> the page they wanted, there are a few missing pieces (eg checking that
>> you actually got a refcount on _this_ page and not some random other
>> page you were temporarily part of a compound page with).
>>
>> This patch does three things:
>>
>> - Turns one of the comments into English. There are some others
>> which I'm still scratching my head over.
>> - Uses a folio to help distinguish which operations are being done
>> to the head vs the specific page (this is somewhat an abuse of the
>> folio concept, but it's acceptable)
>> - Add the aforementioned check that we're actually operating on the
>> page that we think we want to be.
>> - Add a check that the folio isn't secretly a slab.
>>
>> We could put the slab check in PageMapping and call it after taking
>> the folio lock, but that seems pointless. It's the acquisition of
>> the refcount which stabilises the slab flag, not holding the lock.
>>
>
> I would like to have a working safe version in -next, even if we are able
> simplify it later thanks to frozen refcounts. I've made a formal patch of
> yours, but I'm still convinced the slab check needs to be more paranoid so
> it can't observe a false positive __folio_test_movable() while missing the
> folio_test_slab(), hence I added the barriers as in my previous attempt [1].
> Does that work for you and can I add your S-o-b?
Tentatively the series is here for anyone interested, will send it for
proper review after the S-o-b is clarified.
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-6.2/fit_rcu_head
> [1] https://lore.kernel.org/all/[email protected]/
On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
> I would like to have a working safe version in -next, even if we are able
> simplify it later thanks to frozen refcounts. I've made a formal patch of
> yours, but I'm still convinced the slab check needs to be more paranoid so
> it can't observe a false positive __folio_test_movable() while missing the
> folio_test_slab(), hence I added the barriers as in my previous attempt [1].
> Does that work for you and can I add your S-o-b?
Thanks for picking this back up.
> +++ b/mm/slab.c
> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>
> account_slab(slab, cachep->gfporder, cachep, flags);
> __folio_set_slab(folio);
> + /* Make the flag visible before any changes to folio->mapping */
> + smp_wmb();
So what's the point of using __folio_set_slab() only to call smp_wmb()
afterwards? If we call folio_set_slab() instead, don't all the other
barriers go away? (This is a genuine question; I am bad at this kind
of reasoning). Obviously it would still need a comment.
On 10/24/22 17:06, Matthew Wilcox wrote:
> On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
>> I would like to have a working safe version in -next, even if we are able
>> simplify it later thanks to frozen refcounts. I've made a formal patch of
>> yours, but I'm still convinced the slab check needs to be more paranoid so
>> it can't observe a false positive __folio_test_movable() while missing the
>> folio_test_slab(), hence I added the barriers as in my previous attempt [1].
>> Does that work for you and can I add your S-o-b?
>
> Thanks for picking this back up.
>
>> +++ b/mm/slab.c
>> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>>
>> account_slab(slab, cachep->gfporder, cachep, flags);
>> __folio_set_slab(folio);
>> + /* Make the flag visible before any changes to folio->mapping */
>> + smp_wmb();
>
> So what's the point of using __folio_set_slab() only to call smp_wmb()
> afterwards? If we call folio_set_slab() instead, don't all the other
> barriers go away? (This is a genuine question; I am bad at this kind
> of reasoning). Obviously it would still need a comment.
AFAIU (which doesn't mean much, TBH :)) folio_set_slab() makes the setting
of the flag protected against other flags set operations so our setting is
not lost in a non-atomic RMW. But as we are the only one who can be setting
any page/folio flag here (isolate_movable_page() for sure doesn't), we don't
need it for that kind of atomicity for page/folio flags field.
And, simply changing it to folio_set_slab() would not add the sufficient
smp_wmb() semantics to order the flags write visibility against a later
write to the struct slab field that overlaps page->mapping. Only some atomic
operations have that implicit barrier, (per
Documentation/memory-barriers.txt and Documentation/atomic_bitops.txt) and
set_bit() is not one of those. So we'd still need a smp_mb__after_atomic()
AFAIU and at that point, doing the above seems less obscure to me.
(Of course if we had the reason to use folio_set_slab() for its own atomic
guarantee, then smp_mb__after_atomic() instead of smp_wmb() would be better
as on some architectures it would make the barrier no-op).
On Mon, 24 Oct 2022, Vlastimil Babka wrote:
> On 10/3/22 19:00, Matthew Wilcox wrote:
> > On Sun, Oct 02, 2022 at 02:48:02PM +0900, Hyeonggon Yoo wrote:
> >> Just one more thing, rcu_leak_callback too. RCU seem to use it
> >> internally to catch double call_rcu().
> >>
> >> And some suggestions:
> >> - what about adding runtime WARN() on slab init code to catch
> >> unexpected arch/toolchain issues?
> >> - instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
> >
> > I think the real problem here is that isolate_movable_page() is
> > insufficiently paranoid. Looking at the gyrations that GUP and the
> > page cache do to convince themselves that the page they got really is
> > the page they wanted, there are a few missing pieces (eg checking that
> > you actually got a refcount on _this_ page and not some random other
> > page you were temporarily part of a compound page with).
> >
> > This patch does three things:
> >
> > - Turns one of the comments into English. There are some others
> > which I'm still scratching my head over.
> > - Uses a folio to help distinguish which operations are being done
> > to the head vs the specific page (this is somewhat an abuse of the
> > folio concept, but it's acceptable)
> > - Add the aforementioned check that we're actually operating on the
> > page that we think we want to be.
> > - Add a check that the folio isn't secretly a slab.
> >
> > We could put the slab check in PageMapping and call it after taking
> > the folio lock, but that seems pointless. It's the acquisition of
> > the refcount which stabilises the slab flag, not holding the lock.
> >
>
> I would like to have a working safe version in -next, even if we are able
> simplify it later thanks to frozen refcounts. I've made a formal patch of
> yours, but I'm still convinced the slab check needs to be more paranoid so
> it can't observe a false positive __folio_test_movable() while missing the
> folio_test_slab(), hence I added the barriers as in my previous attempt [1].
> Does that work for you and can I add your S-o-b?
>
> [1] https://lore.kernel.org/all/[email protected]/
Ignore me, don't let me distract if you're happy with Matthew's patch
(I know little of PageMovable, and I haven't tried to understand it);
but it did look to me more like 6.2 material, and I was surprised that
you dropped the simple align(4) approach for 6.1.
Because of Hyeonggon's rcu_leak_callback() observation? That was a
good catch, but turned out to be irrelevant, because it was only for
an RCU debugging option, which would never be set up on a struct page
(well, maybe it would in a dynamically-allocated-struct-page future).
Hugh
On 10/25/22 06:19, Hugh Dickins wrote:
> On Mon, 24 Oct 2022, Vlastimil Babka wrote:
>>
>> [1] https://lore.kernel.org/all/[email protected]/
>
> Ignore me, don't let me distract if you're happy with Matthew's patch
> (I know little of PageMovable, and I haven't tried to understand it);
> but it did look to me more like 6.2 material,
Yes, it is targetted towards 6.2 now.
> and I was surprised that
> you dropped the simple align(4) approach for 6.1.
>
> Because of Hyeonggon's rcu_leak_callback() observation? That was a
> good catch, but turned out to be irrelevant, because it was only for
> an RCU debugging option, which would never be set up on a struct page
> (well, maybe it would in a dynamically-allocated-struct-page future).
It was mainly due to David's observation:
https://lore.kernel.org/all/[email protected]/
I did also conclude that align(4) indeed overrides anything set via
CONFIG_FUNCTION_ALIGNMENT (and passed by -falign-functions) in a way that a
globally configured larger alignment can be made smaller by this macro, and
it was all too risky and last minute to me, while Joel's series wasn't
targetted to 6.1 anyway so there was no urgency.
And yeah it looks like a gcc bug to me.
> Hugh
On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
[,,,]
> I would like to have a working safe version in -next, even if we are able
> simplify it later thanks to frozen refcounts. I've made a formal patch of
> yours, but I'm still convinced the slab check needs to be more paranoid so
> it can't observe a false positive __folio_test_movable() while missing the
> folio_test_slab(), hence I added the barriers as in my previous attempt [1].
> Does that work for you and can I add your S-o-b?
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> ----8<----
> From 1d481f279f07d332ea381dfd6247a292ad403ed6 Mon Sep 17 00:00:00 2001
> From: Matthew Wilcox <[email protected]>
> Date: Mon, 24 Oct 2022 16:11:27 +0200
> Subject: [PATCH] mm/migrate: make isolate_movable_page() skip slab pages
>
> In the next commit we want to rearrange struct slab fields to allow a
> larger rcu_head. Afterwards, the page->mapping field will overlap
> with SLUB's "struct list_head slab_list", where the value of prev
> pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA.
> Unfortunately the bit 1 being set can confuse PageMovable() to be a
> false positive and cause a GPF as reported by lkp [1].
>
> I think the real problem here is that isolate_movable_page() is
> insufficiently paranoid. Looking at the gyrations that GUP and the
> page cache do to convince themselves that the page they got really is
> the page they wanted, there are a few missing pieces (eg checking that
> you actually got a refcount on _this_ page and not some random other
> page you were temporarily part of a compound page with).
>
> This patch does three things:
>
> - Turns one of the comments into English. There are some others
> which I'm still scratching my head over.
> - Uses a folio to help distinguish which operations are being done
> to the head vs the specific page (this is somewhat an abuse of the
> folio concept, but it's acceptable)
> - Add the aforementioned check that we're actually operating on the
> page that we think we want to be.
> - Add a check that the folio isn't secretly a slab.
>
> We could put the slab check in PageMapping and call it after taking
> the folio lock, but that seems pointless. It's the acquisition of
> the refcount which stabilises the slab flag, not holding the lock.
>
> [ [email protected]: add memory barriers to SLAB and SLUB's page allocation
> and freeing, and their counterparts to isolate_movable_page(), to make
> the checks for folio_test_slab() and __folio_test_movable() SMP safe ]
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/migrate.c | 38 ++++++++++++++++++++++++++------------
> mm/slab.c | 6 +++++-
> mm/slub.c | 6 +++++-
> 3 files changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1379e1912772..ad79e7c23db5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -60,6 +60,7 @@
>
> int isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> + struct folio *folio = page_folio(page);
> const struct movable_operations *mops;
>
> /*
> @@ -71,16 +72,29 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> * the put_page() at the end of this block will take care of
> * release this page, thus avoiding a nasty leakage.
> */
> - if (unlikely(!get_page_unless_zero(page)))
> + if (unlikely(!folio_try_get(folio)))
> goto out;
>
> + /* Recheck the page is still part of the folio we just got */
> + if (unlikely(page_folio(page) != folio))
> + goto out_put;
> +
> + if (unlikely(folio_test_slab(folio)))
> + goto out_put;
> + /* Pairs with smp_wmb() in slab freeing, e.g. SLUB's __free_slab() */
> + smp_rmb();
> /*
> - * Check PageMovable before holding a PG_lock because page's owner
> - * assumes anybody doesn't touch PG_lock of newly allocated page
> - * so unconditionally grabbing the lock ruins page's owner side.
> + * Check movable flag before taking the folio lock because
> + * we use non-atomic bitops on newly allocated page flags so
> + * unconditionally grabbing the lock ruins page's owner side.
> */
> - if (unlikely(!__PageMovable(page)))
> - goto out_putpage;
> + if (unlikely(!__folio_test_movable(folio)))
> + goto out_put;
> + /* Pairs with smp_wmb() in slab allocation, e.g. SLUB's alloc_slab_page() */
> + smp_rmb();
> + if (unlikely(folio_test_slab(folio)))
> + goto out_put;
> +
> /*
> * As movable pages are not isolated from LRU lists, concurrent
> * compaction threads can race against page migration functions
> @@ -92,8 +106,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> * lets be sure we have the page lock
> * before proceeding with the movable page isolation steps.
> */
> - if (unlikely(!trylock_page(page)))
> - goto out_putpage;
> + if (unlikely(!folio_trylock(folio)))
> + goto out_put;
>
> if (!PageMovable(page) || PageIsolated(page))
> goto out_no_isolated;
> @@ -107,14 +121,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> /* Driver shouldn't use PG_isolated bit of page->flags */
> WARN_ON_ONCE(PageIsolated(page));
> SetPageIsolated(page);
> - unlock_page(page);
> + folio_unlock(folio);
>
> return 0;
>
> out_no_isolated:
> - unlock_page(page);
> -out_putpage:
> - put_page(page);
> + folio_unlock(folio);
> +out_put:
> + folio_put(folio);
> out:
> return -EBUSY;
> }
> diff --git a/mm/slab.c b/mm/slab.c
> index 59c8e28f7b6a..219beb48588e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>
> account_slab(slab, cachep->gfporder, cachep, flags);
> __folio_set_slab(folio);
> + /* Make the flag visible before any changes to folio->mapping */
> + smp_wmb();
> /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
> if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
> slab_set_pfmemalloc(slab);
> @@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
>
> BUG_ON(!folio_test_slab(folio));
> __slab_clear_pfmemalloc(slab);
> - __folio_clear_slab(folio);
> page_mapcount_reset(folio_page(folio, 0));
> folio->mapping = NULL;
> + /* Make the mapping reset visible before clearing the flag */
> + smp_wmb();
> + __folio_clear_slab(folio);
>
> if (current->reclaim_state)
> current->reclaim_state->reclaimed_slab += 1 << order;
> diff --git a/mm/slub.c b/mm/slub.c
> index 157527d7101b..6dc17cb915c5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>
> slab = folio_slab(folio);
> __folio_set_slab(folio);
> + /* Make the flag visible before any changes to folio->mapping */
> + smp_wmb();
> if (page_is_pfmemalloc(folio_page(folio, 0)))
> slab_set_pfmemalloc(slab);
>
> @@ -2008,8 +2010,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
> }
>
> __slab_clear_pfmemalloc(slab);
> - __folio_clear_slab(folio);
> folio->mapping = NULL;
> + /* Make the mapping reset visible before clearing the flag */
> + smp_wmb();
> + __folio_clear_slab(folio);
> if (current->reclaim_state)
> current->reclaim_state->reclaimed_slab += pages;
> unaccount_slab(slab, order, s);
> --
> 2.38.0
Do we need to try this with memory barriers before frozen refcount lands in?
It's quite complicated and IIUC there is a still theoretical race:
At isolate_movable_page: At slab alloc: At slab free:
folio = alloc_pages(flags, order)
folio_try_get()
folio_test_slab() == false
__folio_set_slab(folio)
smp_wmb()
call_rcu(&slab->rcu_head, rcu_free_slab);
smp_rmb()
__folio_test_movable() == true
folio->mapping = NULL;
smp_wmb()
__folio_clear_slab(folio);
smp_rmb()
folio_test_slab() == false
folio_trylock()
mops->isolate_page() (*crash*)
Please let me know if I'm missing something ;-)
Thanks!
--
Hyeonggon
On 10/25/22 15:47, Hyeonggon Yoo wrote:
> On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
>
> [,,,]
>
>> diff --git a/mm/slab.c b/mm/slab.c
>> index 59c8e28f7b6a..219beb48588e 100644
>> --- a/mm/slab.c
>> +++ b/mm/slab.c
>> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>>
>> account_slab(slab, cachep->gfporder, cachep, flags);
>> __folio_set_slab(folio);
>> + /* Make the flag visible before any changes to folio->mapping */
>> + smp_wmb();
>> /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
>> if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
>> slab_set_pfmemalloc(slab);
>> @@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
>>
>> BUG_ON(!folio_test_slab(folio));
>> __slab_clear_pfmemalloc(slab);
>> - __folio_clear_slab(folio);
>> page_mapcount_reset(folio_page(folio, 0));
>> folio->mapping = NULL;
>> + /* Make the mapping reset visible before clearing the flag */
>> + smp_wmb();
>> + __folio_clear_slab(folio);
>>
>> if (current->reclaim_state)
>> current->reclaim_state->reclaimed_slab += 1 << order;
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 157527d7101b..6dc17cb915c5 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>>
>> slab = folio_slab(folio);
>> __folio_set_slab(folio);
>> + /* Make the flag visible before any changes to folio->mapping */
>> + smp_wmb();
>> if (page_is_pfmemalloc(folio_page(folio, 0)))
>> slab_set_pfmemalloc(slab);
>>
>> @@ -2008,8 +2010,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
>> }
>>
>> __slab_clear_pfmemalloc(slab);
>> - __folio_clear_slab(folio);
>> folio->mapping = NULL;
>> + /* Make the mapping reset visible before clearing the flag */
>> + smp_wmb();
>> + __folio_clear_slab(folio);
>> if (current->reclaim_state)
>> current->reclaim_state->reclaimed_slab += pages;
>> unaccount_slab(slab, order, s);
>> --
>> 2.38.0
>
> Do we need to try this with memory barriers before frozen refcount lands in?
There was IIRC an unresolved issue with frozen refcount tripping the page
isolation code so I didn't want to be depending on that.
> It's quite complicated and IIUC there is a still theoretical race:
>
> At isolate_movable_page: At slab alloc: At slab free:
> folio = alloc_pages(flags, order)
>
> folio_try_get()
> folio_test_slab() == false
> __folio_set_slab(folio)
> smp_wmb()
>
> call_rcu(&slab->rcu_head, rcu_free_slab);
>
>
> smp_rmb()
> __folio_test_movable() == true
>
> folio->mapping = NULL;
> smp_wmb()
> __folio_clear_slab(folio);
> smp_rmb()
> folio_test_slab() == false
>
> folio_trylock()
There's also between above and below:
if (!PageMovable(page) || PageIsolated(page))
goto out_no_isolated;
mops = page_movable_ops(page);
If we put another smp_rmb() before the PageMovable test, could that have
helped? It would assure we observe the folio->mapping = NULL; from the "slab
free" side?
But yeah, it's getting ridiculous. Maybe there's a simpler way to check two
bits in two different bytes atomically. Or maybe it's just an impossible
task, I feel I just dunno computers at this point.
> mops->isolate_page() (*crash*)
>
>
> Please let me know if I'm missing something ;-)
> Thanks!
>
On Tue, 25 Oct 2022, Vlastimil Babka wrote:
> On 10/25/22 06:19, Hugh Dickins wrote:
> > On Mon, 24 Oct 2022, Vlastimil Babka wrote:
>
> >>
> >> [1] https://lore.kernel.org/all/[email protected]/
> >
> > Ignore me, don't let me distract if you're happy with Matthew's patch
> > (I know little of PageMovable, and I haven't tried to understand it);
> > but it did look to me more like 6.2 material,
>
> Yes, it is targetted towards 6.2 now.
That's good.
>
> > and I was surprised that
> > you dropped the simple align(4) approach for 6.1.
> >
> > Because of Hyeonggon's rcu_leak_callback() observation? That was a
> > good catch, but turned out to be irrelevant, because it was only for
> > an RCU debugging option, which would never be set up on a struct page
> > (well, maybe it would in a dynamically-allocated-struct-page future).
>
> It was mainly due to David's observation:
>
> https://lore.kernel.org/all/[email protected]/
>
> I did also conclude that align(4) indeed overrides anything set via
> CONFIG_FUNCTION_ALIGNMENT (and passed by -falign-functions) in a way that a
> globally configured larger alignment can be made smaller by this macro, and
> it was all too risky and last minute to me, while Joel's series wasn't
> targetted to 6.1 anyway so there was no urgency.
Oh, I had thought it was heading for 6.1. Yes, you have good reason
to drop the alignment trick there - thanks for verifying David's point.
I hadn't seen followup, and assumed that the earlier i915 example gave
assurance that alignment was usable (but maybe that's part of why the
i915 folks stopped doing it that way).
Thanks,
Hugh
>
> And yeah it looks like a gcc bug to me.
>
> > Hugh
On 10/25/22 16:08, Vlastimil Babka wrote:
> On 10/25/22 15:47, Hyeonggon Yoo wrote:
>> On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
>>
>> [,,,]
>>
>>> diff --git a/mm/slab.c b/mm/slab.c
>>> index 59c8e28f7b6a..219beb48588e 100644
>>> --- a/mm/slab.c
>>> +++ b/mm/slab.c
>>> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>>>
>>> account_slab(slab, cachep->gfporder, cachep, flags);
>>> __folio_set_slab(folio);
>>> + /* Make the flag visible before any changes to folio->mapping */
>>> + smp_wmb();
>>> /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
>>> if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
>>> slab_set_pfmemalloc(slab);
>>> @@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
>>>
>>> BUG_ON(!folio_test_slab(folio));
>>> __slab_clear_pfmemalloc(slab);
>>> - __folio_clear_slab(folio);
>>> page_mapcount_reset(folio_page(folio, 0));
>>> folio->mapping = NULL;
>>> + /* Make the mapping reset visible before clearing the flag */
>>> + smp_wmb();
>>> + __folio_clear_slab(folio);
>>>
>>> if (current->reclaim_state)
>>> current->reclaim_state->reclaimed_slab += 1 << order;
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 157527d7101b..6dc17cb915c5 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>>>
>>> slab = folio_slab(folio);
>>> __folio_set_slab(folio);
>>> + /* Make the flag visible before any changes to folio->mapping */
>>> + smp_wmb();
>>> if (page_is_pfmemalloc(folio_page(folio, 0)))
>>> slab_set_pfmemalloc(slab);
>>>
>>> @@ -2008,8 +2010,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
>>> }
>>>
>>> __slab_clear_pfmemalloc(slab);
>>> - __folio_clear_slab(folio);
>>> folio->mapping = NULL;
>>> + /* Make the mapping reset visible before clearing the flag */
>>> + smp_wmb();
>>> + __folio_clear_slab(folio);
>>> if (current->reclaim_state)
>>> current->reclaim_state->reclaimed_slab += pages;
>>> unaccount_slab(slab, order, s);
>>> --
>>> 2.38.0
>>
>> Do we need to try this with memory barriers before frozen refcount lands in?
>
> There was IIRC an unresolved issue with frozen refcount tripping the page
> isolation code so I didn't want to be depending on that.
>
>> It's quite complicated and IIUC there is a still theoretical race:
>>
>> At isolate_movable_page: At slab alloc: At slab free:
>> folio = alloc_pages(flags, order)
>>
>> folio_try_get()
>> folio_test_slab() == false
>> __folio_set_slab(folio)
>> smp_wmb()
>>
>> call_rcu(&slab->rcu_head, rcu_free_slab);
>>
>>
>> smp_rmb()
>> __folio_test_movable() == true
>>
>> folio->mapping = NULL;
>> smp_wmb()
>> __folio_clear_slab(folio);
>> smp_rmb()
>> folio_test_slab() == false
>>
>> folio_trylock()
>
> There's also between above and below:
>
> if (!PageMovable(page) || PageIsolated(page))
> goto out_no_isolated;
>
> mops = page_movable_ops(page);
>
> If we put another smp_rmb() before the PageMovable test, could that have
> helped? It would assure we observe the folio->mapping = NULL; from the "slab
> free" side?
>
> But yeah, it's getting ridiculous. Maybe there's a simpler way to check two
> bits in two different bytes atomically. Or maybe it's just an impossible
> task, I feel I just dunno computers at this point.
After more thought, I think I just made a mistake by doing two
folio_test_slab() tests around a single __folio_test_movable(). What I was
supposed to do was two __folio_test_movable() tests around a single
folio_test_slab()... I hope. That should take care of your scenario, or do
you see another one? Thanks.
----8----
From 5ca1c10f6411d73ad579b58d4fa10326bf77cf0a Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <[email protected]>
Date: Mon, 24 Oct 2022 16:11:27 +0200
Subject: [PATCH] mm/migrate: make isolate_movable_page() skip slab pages
In the next commit we want to rearrange struct slab fields to allow a
larger rcu_head. Afterwards, the page->mapping field will overlap
with SLUB's "struct list_head slab_list", where the value of prev
pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA.
Unfortunately the bit 1 being set can confuse PageMovable() to be a
false positive and cause a GPF as reported by lkp [1].
I think the real problem here is that isolate_movable_page() is
insufficiently paranoid. Looking at the gyrations that GUP and the
page cache do to convince themselves that the page they got really is
the page they wanted, there are a few missing pieces (eg checking that
you actually got a refcount on _this_ page and not some random other
page you were temporarily part of a compound page with).
This patch does three things:
- Turns one of the comments into English. There are some others
which I'm still scratching my head over.
- Uses a folio to help distinguish which operations are being done
to the head vs the specific page (this is somewhat an abuse of the
folio concept, but it's acceptable)
- Add the aforementioned check that we're actually operating on the
page that we think we want to be.
- Add a check that the folio isn't secretly a slab.
We could put the slab check in PageMapping and call it after taking
the folio lock, but that seems pointless. It's the acquisition of
the refcount which stabilises the slab flag, not holding the lock.
[ [email protected]: add memory barriers to SLAB and SLUB's page allocation
and freeing, and their counterparts to isolate_movable_page(), to make
the checks for folio_test_slab() and __folio_test_movable() SMP safe ]
[1] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/migrate.c | 40 ++++++++++++++++++++++++++++------------
mm/slab.c | 6 +++++-
mm/slub.c | 6 +++++-
3 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 1379e1912772..f0f58e42c1d4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -60,6 +60,7 @@
int isolate_movable_page(struct page *page, isolate_mode_t mode)
{
+ struct folio *folio = page_folio(page);
const struct movable_operations *mops;
/*
@@ -71,16 +72,31 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
* the put_page() at the end of this block will take care of
* release this page, thus avoiding a nasty leakage.
*/
- if (unlikely(!get_page_unless_zero(page)))
+ if (unlikely(!folio_try_get(folio)))
goto out;
+ /* Recheck the page is still part of the folio we just got */
+ if (unlikely(page_folio(page) != folio))
+ goto out_put;
+
/*
- * Check PageMovable before holding a PG_lock because page's owner
- * assumes anybody doesn't touch PG_lock of newly allocated page
- * so unconditionally grabbing the lock ruins page's owner side.
+ * Check movable flag before taking the folio lock because
+ * we use non-atomic bitops on newly allocated page flags so
+ * unconditionally grabbing the lock ruins page's owner side.
+ * Make sure we don't have a slab folio here as its usage of the
+ * mapping field can cause a false positive movable flag.
*/
- if (unlikely(!__PageMovable(page)))
- goto out_putpage;
+ if (unlikely(!__folio_test_movable(folio)))
+ goto out_put;
+ /* Pairs with smp_wmb() in slab allocation, e.g. SLUB's alloc_slab_page() */
+ smp_rmb();
+ if (unlikely(folio_test_slab(folio)))
+ goto out_put;
+ /* Pairs with smp_wmb() in slab freeing, e.g. SLUB's __free_slab() */
+ smp_rmb();
+ if (unlikely(!__folio_test_movable(folio)))
+ goto out_put;
+
/*
* As movable pages are not isolated from LRU lists, concurrent
* compaction threads can race against page migration functions
@@ -92,8 +108,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
* lets be sure we have the page lock
* before proceeding with the movable page isolation steps.
*/
- if (unlikely(!trylock_page(page)))
- goto out_putpage;
+ if (unlikely(!folio_trylock(folio)))
+ goto out_put;
if (!PageMovable(page) || PageIsolated(page))
goto out_no_isolated;
@@ -107,14 +123,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
/* Driver shouldn't use PG_isolated bit of page->flags */
WARN_ON_ONCE(PageIsolated(page));
SetPageIsolated(page);
- unlock_page(page);
+ folio_unlock(folio);
return 0;
out_no_isolated:
- unlock_page(page);
-out_putpage:
- put_page(page);
+ folio_unlock(folio);
+out_put:
+ folio_put(folio);
out:
return -EBUSY;
}
diff --git a/mm/slab.c b/mm/slab.c
index 59c8e28f7b6a..219beb48588e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
account_slab(slab, cachep->gfporder, cachep, flags);
__folio_set_slab(folio);
+ /* Make the flag visible before any changes to folio->mapping */
+ smp_wmb();
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
slab_set_pfmemalloc(slab);
@@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
BUG_ON(!folio_test_slab(folio));
__slab_clear_pfmemalloc(slab);
- __folio_clear_slab(folio);
page_mapcount_reset(folio_page(folio, 0));
folio->mapping = NULL;
+ /* Make the mapping reset visible before clearing the flag */
+ smp_wmb();
+ __folio_clear_slab(folio);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += 1 << order;
diff --git a/mm/slub.c b/mm/slub.c
index 99ba865afc4a..5e6519d5169c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
slab = folio_slab(folio);
__folio_set_slab(folio);
+ /* Make the flag visible before any changes to folio->mapping */
+ smp_wmb();
if (page_is_pfmemalloc(folio_page(folio, 0)))
slab_set_pfmemalloc(slab);
@@ -2000,8 +2002,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
int pages = 1 << order;
__slab_clear_pfmemalloc(slab);
- __folio_clear_slab(folio);
folio->mapping = NULL;
+ /* Make the mapping reset visible before clearing the flag */
+ smp_wmb();
+ __folio_clear_slab(folio);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
unaccount_slab(slab, order, s);
--
2.38.0
On Wed, Oct 26, 2022 at 12:52:01PM +0200, Vlastimil Babka wrote:
> On 10/25/22 16:08, Vlastimil Babka wrote:
> > On 10/25/22 15:47, Hyeonggon Yoo wrote:
> >> On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
> >>
> >> [,,,]
> >>
> >>> diff --git a/mm/slab.c b/mm/slab.c
> >>> index 59c8e28f7b6a..219beb48588e 100644
> >>> --- a/mm/slab.c
> >>> +++ b/mm/slab.c
> >>> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
> >>>
> >>> account_slab(slab, cachep->gfporder, cachep, flags);
> >>> __folio_set_slab(folio);
> >>> + /* Make the flag visible before any changes to folio->mapping */
> >>> + smp_wmb();
> >>> /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
> >>> if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
> >>> slab_set_pfmemalloc(slab);
> >>> @@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
> >>>
> >>> BUG_ON(!folio_test_slab(folio));
> >>> __slab_clear_pfmemalloc(slab);
> >>> - __folio_clear_slab(folio);
> >>> page_mapcount_reset(folio_page(folio, 0));
> >>> folio->mapping = NULL;
> >>> + /* Make the mapping reset visible before clearing the flag */
> >>> + smp_wmb();
> >>> + __folio_clear_slab(folio);
> >>>
> >>> if (current->reclaim_state)
> >>> current->reclaim_state->reclaimed_slab += 1 << order;
> >>> diff --git a/mm/slub.c b/mm/slub.c
> >>> index 157527d7101b..6dc17cb915c5 100644
> >>> --- a/mm/slub.c
> >>> +++ b/mm/slub.c
> >>> @@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> >>>
> >>> slab = folio_slab(folio);
> >>> __folio_set_slab(folio);
> >>> + /* Make the flag visible before any changes to folio->mapping */
> >>> + smp_wmb();
> >>> if (page_is_pfmemalloc(folio_page(folio, 0)))
> >>> slab_set_pfmemalloc(slab);
> >>>
> >>> @@ -2008,8 +2010,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
> >>> }
> >>>
> >>> __slab_clear_pfmemalloc(slab);
> >>> - __folio_clear_slab(folio);
> >>> folio->mapping = NULL;
> >>> + /* Make the mapping reset visible before clearing the flag */
> >>> + smp_wmb();
> >>> + __folio_clear_slab(folio);
> >>> if (current->reclaim_state)
> >>> current->reclaim_state->reclaimed_slab += pages;
> >>> unaccount_slab(slab, order, s);
> >>> --
> >>> 2.38.0
> >>
> >> Do we need to try this with memory barriers before frozen refcount lands in?
> >
> > There was IIRC an unresolved issue with frozen refcount tripping the page
> > isolation code so I didn't want to be depending on that.
Understood.
> >> It's quite complicated and IIUC there is a still theoretical race:
> >>
> >> At isolate_movable_page: At slab alloc: At slab free:
> >> folio = alloc_pages(flags, order)
> >>
> >> folio_try_get()
> >> folio_test_slab() == false
> >> __folio_set_slab(folio)
> >> smp_wmb()
> >>
> >> call_rcu(&slab->rcu_head, rcu_free_slab);
> >>
> >>
> >> smp_rmb()
> >> __folio_test_movable() == true
> >>
> >> folio->mapping = NULL;
> >> smp_wmb()
> >> __folio_clear_slab(folio);
> >> smp_rmb()
> >> folio_test_slab() == false
> >>
> >> folio_trylock()
> >
> > There's also between above and below:
> >
> > if (!PageMovable(page) || PageIsolated(page))
> > goto out_no_isolated;
> >
> > mops = page_movable_ops(page);
> >
> > If we put another smp_rmb() before the PageMovable test, could that have
> > helped? It would assure we observe the folio->mapping = NULL; from the "slab
> > free" side?
> >
> > But yeah, it's getting ridiculous. Maybe there's a simpler way to check two
> > bits in two different bytes atomically. Or maybe it's just an impossible
> > task, I feel I just dunno computers at this point.
>
> After more thought, I think I just made a mistake by doing two
> folio_test_slab() tests around a single __folio_test_movable(). What I was
> supposed to do was two __folio_test_movable() tests around a single
> folio_test_slab()... I hope. That should take care of your scenario, or do
> you see another one? Thanks.
I couldn't find one with this patch.
To best my understanding this looks at least correct to me.
Hope there is no other place that is confused
by anonymous checking for random page like here.
Thanks.
> ----8----
> From 5ca1c10f6411d73ad579b58d4fa10326bf77cf0a Mon Sep 17 00:00:00 2001
> From: Matthew Wilcox <[email protected]>
> Date: Mon, 24 Oct 2022 16:11:27 +0200
> Subject: [PATCH] mm/migrate: make isolate_movable_page() skip slab pages
>
> In the next commit we want to rearrange struct slab fields to allow a
> larger rcu_head. Afterwards, the page->mapping field will overlap
> with SLUB's "struct list_head slab_list", where the value of prev
> pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA.
> Unfortunately the bit 1 being set can confuse PageMovable() to be a
> false positive and cause a GPF as reported by lkp [1].
>
> I think the real problem here is that isolate_movable_page() is
> insufficiently paranoid. Looking at the gyrations that GUP and the
> page cache do to convince themselves that the page they got really is
> the page they wanted, there are a few missing pieces (eg checking that
> you actually got a refcount on _this_ page and not some random other
> page you were temporarily part of a compound page with).
>
> This patch does three things:
>
> - Turns one of the comments into English. There are some others
> which I'm still scratching my head over.
> - Uses a folio to help distinguish which operations are being done
> to the head vs the specific page (this is somewhat an abuse of the
> folio concept, but it's acceptable)
> - Add the aforementioned check that we're actually operating on the
> page that we think we want to be.
> - Add a check that the folio isn't secretly a slab.
>
> We could put the slab check in PageMapping and call it after taking
> the folio lock, but that seems pointless. It's the acquisition of
> the refcount which stabilises the slab flag, not holding the lock.
>
> [ [email protected]: add memory barriers to SLAB and SLUB's page allocation
> and freeing, and their counterparts to isolate_movable_page(), to make
> the checks for folio_test_slab() and __folio_test_movable() SMP safe ]
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/migrate.c | 40 ++++++++++++++++++++++++++++------------
> mm/slab.c | 6 +++++-
> mm/slub.c | 6 +++++-
> 3 files changed, 38 insertions(+), 14 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1379e1912772..f0f58e42c1d4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -60,6 +60,7 @@
>
> int isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> + struct folio *folio = page_folio(page);
> const struct movable_operations *mops;
>
> /*
> @@ -71,16 +72,31 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> * the put_page() at the end of this block will take care of
> * release this page, thus avoiding a nasty leakage.
> */
> - if (unlikely(!get_page_unless_zero(page)))
> + if (unlikely(!folio_try_get(folio)))
> goto out;
>
> + /* Recheck the page is still part of the folio we just got */
> + if (unlikely(page_folio(page) != folio))
> + goto out_put;
> +
> /*
> - * Check PageMovable before holding a PG_lock because page's owner
> - * assumes anybody doesn't touch PG_lock of newly allocated page
> - * so unconditionally grabbing the lock ruins page's owner side.
> + * Check movable flag before taking the folio lock because
> + * we use non-atomic bitops on newly allocated page flags so
> + * unconditionally grabbing the lock ruins page's owner side.
> + * Make sure we don't have a slab folio here as its usage of the
> + * mapping field can cause a false positive movable flag.
> */
> - if (unlikely(!__PageMovable(page)))
> - goto out_putpage;
> + if (unlikely(!__folio_test_movable(folio)))
> + goto out_put;
> + /* Pairs with smp_wmb() in slab allocation, e.g. SLUB's alloc_slab_page() */
> + smp_rmb();
> + if (unlikely(folio_test_slab(folio)))
> + goto out_put;
> + /* Pairs with smp_wmb() in slab freeing, e.g. SLUB's __free_slab() */
> + smp_rmb();
> + if (unlikely(!__folio_test_movable(folio)))
> + goto out_put;
> +
> /*
> * As movable pages are not isolated from LRU lists, concurrent
> * compaction threads can race against page migration functions
> @@ -92,8 +108,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> * lets be sure we have the page lock
> * before proceeding with the movable page isolation steps.
> */
> - if (unlikely(!trylock_page(page)))
> - goto out_putpage;
> + if (unlikely(!folio_trylock(folio)))
> + goto out_put;
>
> if (!PageMovable(page) || PageIsolated(page))
> goto out_no_isolated;
> @@ -107,14 +123,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
> /* Driver shouldn't use PG_isolated bit of page->flags */
> WARN_ON_ONCE(PageIsolated(page));
> SetPageIsolated(page);
> - unlock_page(page);
> + folio_unlock(folio);
>
> return 0;
>
> out_no_isolated:
> - unlock_page(page);
> -out_putpage:
> - put_page(page);
> + folio_unlock(folio);
> +out_put:
> + folio_put(folio);
> out:
> return -EBUSY;
> }
> diff --git a/mm/slab.c b/mm/slab.c
> index 59c8e28f7b6a..219beb48588e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>
> account_slab(slab, cachep->gfporder, cachep, flags);
> __folio_set_slab(folio);
> + /* Make the flag visible before any changes to folio->mapping */
> + smp_wmb();
> /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
> if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
> slab_set_pfmemalloc(slab);
> @@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
>
> BUG_ON(!folio_test_slab(folio));
> __slab_clear_pfmemalloc(slab);
> - __folio_clear_slab(folio);
> page_mapcount_reset(folio_page(folio, 0));
> folio->mapping = NULL;
> + /* Make the mapping reset visible before clearing the flag */
> + smp_wmb();
> + __folio_clear_slab(folio);
>
> if (current->reclaim_state)
> current->reclaim_state->reclaimed_slab += 1 << order;
> diff --git a/mm/slub.c b/mm/slub.c
> index 99ba865afc4a..5e6519d5169c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>
> slab = folio_slab(folio);
> __folio_set_slab(folio);
> + /* Make the flag visible before any changes to folio->mapping */
> + smp_wmb();
> if (page_is_pfmemalloc(folio_page(folio, 0)))
> slab_set_pfmemalloc(slab);
>
> @@ -2000,8 +2002,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
> int pages = 1 << order;
>
> __slab_clear_pfmemalloc(slab);
> - __folio_clear_slab(folio);
> folio->mapping = NULL;
> + /* Make the mapping reset visible before clearing the flag */
> + smp_wmb();
> + __folio_clear_slab(folio);
> if (current->reclaim_state)
> current->reclaim_state->reclaimed_slab += pages;
> unaccount_slab(slab, order, s);
> --
> 2.38.0
On 10/24/22 16:35, Vlastimil Babka wrote:
> On 10/3/22 19:00, Matthew Wilcox wrote:
>> On Sun, Oct 02, 2022 at 02:48:02PM +0900, Hyeonggon Yoo wrote:
>>> Just one more thing, rcu_leak_callback too. RCU seem to use it
>>> internally to catch double call_rcu().
>>>
>>> And some suggestions:
>>> - what about adding runtime WARN() on slab init code to catch
>>> unexpected arch/toolchain issues?
>>> - instead of 4, we may use macro definition? like (PAGE_MAPPING_FLAGS + 1)?
>>
>> I think the real problem here is that isolate_movable_page() is
>> insufficiently paranoid. Looking at the gyrations that GUP and the
>> page cache do to convince themselves that the page they got really is
>> the page they wanted, there are a few missing pieces (eg checking that
>> you actually got a refcount on _this_ page and not some random other
>> page you were temporarily part of a compound page with).
>>
>> This patch does three things:
>>
>> - Turns one of the comments into English. There are some others
>> which I'm still scratching my head over.
>> - Uses a folio to help distinguish which operations are being done
>> to the head vs the specific page (this is somewhat an abuse of the
>> folio concept, but it's acceptable)
>> - Add the aforementioned check that we're actually operating on the
>> page that we think we want to be.
>> - Add a check that the folio isn't secretly a slab.
>>
>> We could put the slab check in PageMapping and call it after taking
>> the folio lock, but that seems pointless. It's the acquisition of
>> the refcount which stabilises the slab flag, not holding the lock.
>>
>
> I would like to have a working safe version in -next, even if we are able
> simplify it later thanks to frozen refcounts. I've made a formal patch of
> yours, but I'm still convinced the slab check needs to be more paranoid so
> it can't observe a false positive __folio_test_movable() while missing the
> folio_test_slab(), hence I added the barriers as in my previous attempt [1].
> Does that work for you and can I add your S-o-b?
>
> [1] https://lore.kernel.org/all/[email protected]/
To move on, I pushed a branch based on a new version of [1] above. It
lacks Matthew's folio parts, which are not IMHO that critical right now,
so can be added later.
It's here:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-6.2/fit_rcu_head
Will also send for formal review soon.