Pages in the CMA area can have MIGRATE_ISOLATE as well as MIGRATE_CMA, so the
current is_pinnable_page() can miss CMA pages whose migrate type is
MIGRATE_ISOLATE. Those pages end up being pinned longterm by the
pin_user_pages APIs, so CMA allocation keeps failing until the pin is
released.
    CPU 0                                   CPU 1 - Task B

 cma_alloc
 alloc_contig_range
                                         pin_user_pages_fast(FOLL_LONGTERM)
 change pageblock as MIGRATE_ISOLATE
                                         internal_get_user_pages_fast
                                         lockless_pages_from_mm
                                         gup_pte_range
                                         try_grab_folio
                                         is_pinnable_page
                                           return true;
                                         So, pinned the page successfully.
 page migration failure with pinned page
                                         ..
                                         .. After 30 sec
                                         unpin_user_page(page)

 CMA allocation succeeded after 30 sec.
The CMA allocation path protects against the migration type change race
using zone->lock, but what the GUP path needs to know is just whether the
page is in the CMA area or not, rather than the exact migration type.
Thus, we don't need zone->lock; we just check whether the migration type
is either MIGRATE_ISOLATE or MIGRATE_CMA.
Adding the MIGRATE_ISOLATE check in is_pinnable_page() could cause pinning
to be rejected for pages on MIGRATE_ISOLATE pageblocks even when the page
is neither CMA nor in the movable zone, if the page is temporarily
unmovable. However, such a migration failure caused by an unexpected
temporary refcount hold is a general issue, not one that comes only from
MIGRATE_ISOLATE, and MIGRATE_ISOLATE is also a transient state, just like
other temporarily elevated refcount problems.
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
* from v3 - https://lore.kernel.org/all/[email protected]/
* Fix typo and add more description - akpm
* from v2 - https://lore.kernel.org/all/[email protected]/
* Use __READ_ONCE instead of volatile - akpm
* from v1 - https://lore.kernel.org/all/[email protected]/
* fix build warning - lkp
* fix refetching issue of migration type
* add side effect on !ZONE_MOVABLE and !MIGRATE_CMA in description - david
include/linux/mm.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6acca5cecbc5..cbf79eb790e0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1625,8 +1625,19 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
#ifdef CONFIG_MIGRATION
static inline bool is_pinnable_page(struct page *page)
{
- return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) ||
- is_zero_pfn(page_to_pfn(page));
+#ifdef CONFIG_CMA
+ /*
+ * use volatile to use local variable mt instead of
+ * refetching mt value.
+ */
+ int __mt = get_pageblock_migratetype(page);
+ int mt = __READ_ONCE(__mt);
+
+ if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
+ return false;
+#endif
+
+ return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page)));
}
#else
static inline bool is_pinnable_page(struct page *page)
--
2.36.0.512.ge40c2bad7a-goog
On 5/10/22 17:09, Minchan Kim wrote:
> On Tue, May 10, 2022 at 04:58:13PM -0700, John Hubbard wrote:
>> On 5/10/22 4:31 PM, Minchan Kim wrote:
>>>>> + int __mt = get_pageblock_migratetype(page);
>>>>> + int mt = __READ_ONCE(__mt);
>>>>
>>>> Although I saw the email discussion about this in v2, that discussion
>>>> didn't go far enough. It started with "don't use volatile", and went
>>>> on to "try __READ_ONCE() instead", but it should have continued on
>>>> to "you don't need this at all".
>>>
>>> That's really what I want to hear from experts so wanted to learn
>>> "Why". How could we prevent refetching of the mt if we don't use
>>> __READ_ONCE or volatile there?
>>>
>>>>
>>>> Because you don't. There is nothing you are racing with, and adding
>>>> __READ_ONCE() in order to avoid a completely not-going-to-happen
>>>> compiler re-invocation of a significant code block is just very wrong.
>>>>
>>>> So let's just let it go entirely. :)
>>>
>>> Yeah, once it's clear for everyone, I am happy to remove the
>>> unnecessary lines.
>>>
>>>>
>>>>> +
>>>>> + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
>>>>
>>
>> With or without __READ_ONCE() or volatile or anything else,
>> this code will do what you want. Which is: loosely check
>> for either of the above.
>>
>> What functional problem do you think you are preventing
>> with __READ_ONCE()? Because I don't see one.
>
> I discussed the issue at v1 so please take a look.
>
> https://lore.kernel.org/all/[email protected]/
I read that, but there was never any real justification there for needing
to prevent a re-read of mt, just a preference: "I'd like to keep use the local
variable mt's value in folloing conditions checks instead of refetching
the value from get_pageblock_migratetype."
But I don't believe that there is any combination of values of mt that
will cause a problem here.
I also think that once we pull in experts, they will tell us that the
compiler is not going to re-run a non-trivial function to re-fetch a
value, but I'm not one of those experts, so that's still arguable. But
imagine what the kernel code would look like if every time we call
a large function, we have to consider if it actually gets called some
arbitrary number of times, due to (anti-) optimizations by the compiler.
This seems like something that is not really happening.
thanks,
--
John Hubbard
NVIDIA
On 5/11/22 16:45, Paul E. McKenney wrote:
>>
>> Well no, because the "&" operation is a single operation on the CPU, and
>> isn't going to get split up like that.
>
> Chiming in a bit late...
Much appreciated!
>
> The usual way that this sort of thing causes trouble is if there is a
> single store instruction that changes the value from MIGRATE_ISOLATE
> to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
Doing an AND twice for "x & constant": this definitely blows my mind. Is
nothing sacred? :)
> and then combine the results. This could give a zero outcome where the
> underlying variable never had the value zero.
>
> Is this sort of thing low probability?
>
> Definitely.
>
> Isn't this sort of thing prohibited?
>
> Definitely not.
>
> So what you have will likely work for at least a while longer, but it
> is not guaranteed and it forces you to think a lot harder about what
> the current implementations of the compiler can and cannot do to you.
>
> The following LWN article goes through some of the possible optimizations
> (vandalisms?) in this area: https://lwn.net/Articles/793253/
>
hmm, I don't think we hit any of those cases, do we? Because here, the
"write" side is via a non-inline function that I just don't believe the
compiler is allowed to call twice. Or is it?
Minchan's earlier summary:
CPU 0                         CPU1

                              set_pageblock_migratetype(MIGRATE_ISOLATE)

if (get_pageblock_migrate(page) & MIGRATE_CMA)

                              set_pageblock_migratetype(MIGRATE_CMA)

if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
...where set_pageblock_migratetype() is not inline.
thanks,
--
John Hubbard
NVIDIA
> In the end, it is your code, so you get to decide how much you would
> like to keep track of what compilers get up to over time. ;-)
>
> Thanx, Paul
On Wed, May 11, 2022 at 04:57:04PM -0700, John Hubbard wrote:
> On 5/11/22 16:45, Paul E. McKenney wrote:
> > >
> > > Well no, because the "&" operation is a single operation on the CPU, and
> > > isn't going to get split up like that.
> >
> > Chiming in a bit late...
>
> Much appreciated!
>
> > The usual way that this sort of thing causes trouble is if there is a
> > single store instruction that changes the value from MIGRATE_ISOLATE
> > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
>
> Doing an AND twice for "x & constant" this definitely blows my mind. Is
> nothing sacred? :)
Apparently there is not much sacred to compiler writers in search of
additional optimizations. :-/
> > and then combine the results. This could give a zero outcome where the
> > underlying variable never had the value zero.
> >
> > Is this sort of thing low probability?
> >
> > Definitely.
> >
> > Isn't this sort of thing prohibited?
> >
> > Definitely not.
> >
> > So what you have will likely work for at least a while longer, but it
> > is not guaranteed and it forces you to think a lot harder about what
> > the current implementations of the compiler can and cannot do to you.
> >
> > The following LWN article goes through some of the possible optimizations
> > (vandalisms?) in this area: https://lwn.net/Articles/793253/
>
> hmm, I don't think we hit any of those cases, do we? Because here, the
> "write" side is via a non-inline function that I just don't believe the
> compiler is allowed to call twice. Or is it?
Not yet. But if link-time optimizations (LTO) continue their march,
I wouldn't feel safe ruling it out...
> Minchan's earlier summary:
>
> CPU 0 CPU1
>
>
> set_pageblock_migratetype(MIGRATE_ISOLATE)
>
> if (get_pageblock_migrate(page) & MIGRATE_CMA)
>
> set_pageblock_migratetype(MIGRATE_CMA)
>
> if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
>
> ...where set_pageblock_migratetype() is not inline.
...especially if the code is reorganized for whatever reason.
> thanks,
> --
> John Hubbard
> NVIDIA
But again:
> > In the end, it is your code, so you get to decide how much you would
> > like to keep track of what compilers get up to over time. ;-)
Thanx, Paul
On Wed, May 11, 2022 at 05:22:07PM -0700, Paul E. McKenney wrote:
> On Wed, May 11, 2022 at 05:12:32PM -0700, John Hubbard wrote:
> > On 5/11/22 16:57, John Hubbard wrote:
> > > On 5/11/22 16:45, Paul E. McKenney wrote:
> > > > >
> > > > > Well no, because the "&" operation is a single operation on the CPU, and
> > > > > isn't going to get split up like that.
> > > >
> > > > Chiming in a bit late...
> > >
> > > Much appreciated!
> > >
> > > >
> > > > The usual way that this sort of thing causes trouble is if there is a
> > > > single store instruction that changes the value from MIGRATE_ISOLATE
> > > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
> > >
> > > Doing an AND twice for "x & constant" this definitely blows my mind. Is
> > > nothing sacred? :)
> > >
> > > > and then combine the results. This could give a zero outcome where the
> > > > underlying variable never had the value zero.
> > > >
> > > > Is this sort of thing low probability?
> > > >
> > > > Definitely.
> > > >
> > > > Isn't this sort of thing prohibited?
> > > >
> > > > Definitely not.
> > > >
> > > > So what you have will likely work for at least a while longer, but it
> > > > is not guaranteed and it forces you to think a lot harder about what
> > > > the current implementations of the compiler can and cannot do to you.
> > > >
> > > > The following LWN article goes through some of the possible optimizations
> > > > (vandalisms?) in this area: https://lwn.net/Articles/793253/
> > > >
> > >
> > > hmm, I don't think we hit any of those cases, do we? Because here, the
> > > "write" side is via a non-inline function that I just don't believe the
> > > compiler is allowed to call twice. Or is it?
> > >
> > > Minchan's earlier summary:
> > >
> > > CPU 0                         CPU1
> > >
> > >
> > >                               set_pageblock_migratetype(MIGRATE_ISOLATE)
> > >
> > > if (get_pageblock_migrate(page) & MIGRATE_CMA)
> > >
> > >                               set_pageblock_migratetype(MIGRATE_CMA)
> > >
> > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
> > >
> > > ...where set_pageblock_migratetype() is not inline.
> > >
> > > thanks,
> >
> > Let me try to say this more clearly: I don't think that the following
> > __READ_ONCE() statement can actually help anything, given that
> > get_pageblock_migratetype() is non-inlined:
> >
> > + int __mt = get_pageblock_migratetype(page);
> > + int mt = __READ_ONCE(__mt);
> > +
> > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> > + return false;
> >
> >
> > Am I missing anything here?
>
> In the absence of future aggression from link-time optimizations (LTO),
> you are missing nothing.
A thing I want to note is that the Android kernel uses LTO full mode.
On 5/11/22 16:28, Minchan Kim wrote:
> This is delta to confirm before posting next revision.
>
> Are you okay with this one?
Yes, just maybe delete the comment:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cbf79eb790e0..7b2df6780552 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1626,14 +1626,14 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
> static inline bool is_pinnable_page(struct page *page)
> {
> #ifdef CONFIG_CMA
> + int mt = get_pageblock_migratetype(page);
> +
> /*
> - * use volatile to use local variable mt instead of
> - * refetching mt value.
> + * "&" operation would prevent compiler split up
> + * get_pageblock_migratetype two times for each
> + * condition check: refetching mt value two times.
> */
If you wanted a useful comment here, I think a description of
what is running at the same time would be good. But otherwise,
I'd just delete the entire comment, because a micro-comment
about "&" is not really worth the vertical space here IMHO.
> - int __mt = get_pageblock_migratetype(page);
> - int mt = __READ_ONCE(__mt);
> -
> - if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
> + if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
> return false;
> #endif
>
Anyway, I'm comfortable with this now:
Reviewed-by: John Hubbard <[email protected]>
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 05:34:52PM -0700, John Hubbard wrote:
> On 5/11/22 17:26, Minchan Kim wrote:
> > > > Let me try to say this more clearly: I don't think that the following
> > > > __READ_ONCE() statement can actually help anything, given that
> > > > get_pageblock_migratetype() is non-inlined:
> > > >
> > > > + int __mt = get_pageblock_migratetype(page);
> > > > + int mt = __READ_ONCE(__mt);
> > > > +
> > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> > > > + return false;
> > > >
> > > >
> > > > Am I missing anything here?
> > >
> > > In the absence of future aggression from link-time optimizations (LTO),
> > > you are missing nothing.
> >
> > A thing I want to note is Android kernel uses LTO full mode.
>
> Thanks Paul for explaining the state of things.
>
> Minchan, how about something like very close to your original draft,
> then, but with a little note, and the "&" as well:
>
> int __mt = get_pageblock_migratetype(page);
>
> /*
> * Defend against future compiler LTO features, or code refactoring
> * that inlines the above function, by forcing a single read. Because, this
> * routine races with set_pageblock_migratetype(), and we want to avoid
> * reading zero, when actually one or the other flags was set.
> */
> int mt = __READ_ONCE(__mt);
>
> if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> return false;
>
>
> ...which should make everyone comfortable and protected from the
> future sins of the compiler and linker teams? :)
This would work, but it would force a store to the stack and an immediate
reload. Which might be OK on this code path.
But using READ_ONCE() in (I think?) __get_pfnblock_flags_mask()
would likely generate the same code that is produced today.
word = READ_ONCE(bitmap[word_bitidx]);
But I could easily have missed a turn in that cascade of functions. ;-)
Or there might be some code path that really hates a READ_ONCE() in
that place.
Thanx, Paul
On Wed, May 11, 2022 at 05:26:55PM -0700, Minchan Kim wrote:
> On Wed, May 11, 2022 at 05:22:07PM -0700, Paul E. McKenney wrote:
> > On Wed, May 11, 2022 at 05:12:32PM -0700, John Hubbard wrote:
> > > On 5/11/22 16:57, John Hubbard wrote:
> > > > On 5/11/22 16:45, Paul E. McKenney wrote:
> > > > > >
> > > > > > Well no, because the "&" operation is a single operation on the CPU, and
> > > > > > isn't going to get split up like that.
> > > > >
> > > > > Chiming in a bit late...
> > > >
> > > > Much appreciated!
> > > >
> > > > >
> > > > > The usual way that this sort of thing causes trouble is if there is a
> > > > > single store instruction that changes the value from MIGRATE_ISOLATE
> > > > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
> > > >
> > > > Doing an AND twice for "x & constant" this definitely blows my mind. Is
> > > > nothing sacred? :)
> > > >
> > > > > and then combine the results. This could give a zero outcome where the
> > > > > underlying variable never had the value zero.
> > > > >
> > > > > Is this sort of thing low probability?
> > > > >
> > > > > Definitely.
> > > > >
> > > > > Isn't this sort of thing prohibited?
> > > > >
> > > > > Definitely not.
> > > > >
> > > > > So what you have will likely work for at least a while longer, but it
> > > > > is not guaranteed and it forces you to think a lot harder about what
> > > > > the current implementations of the compiler can and cannot do to you.
> > > > >
> > > > > The following LWN article goes through some of the possible optimizations
> > > > > (vandalisms?) in this area: https://lwn.net/Articles/793253/
> > > > >
> > > >
> > > > hmm, I don't think we hit any of those cases, do we? Because here, the
> > > > "write" side is via a non-inline function that I just don't believe the
> > > > compiler is allowed to call twice. Or is it?
> > > >
> > > > Minchan's earlier summary:
> > > >
> > > > CPU 0                         CPU1
> > > >
> > > >
> > > >                               set_pageblock_migratetype(MIGRATE_ISOLATE)
> > > >
> > > > if (get_pageblock_migrate(page) & MIGRATE_CMA)
> > > >
> > > >                               set_pageblock_migratetype(MIGRATE_CMA)
> > > >
> > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
> > > >
> > > > ...where set_pageblock_migratetype() is not inline.
> > > >
> > > > thanks,
> > >
> > > Let me try to say this more clearly: I don't think that the following
> > > __READ_ONCE() statement can actually help anything, given that
> > > get_pageblock_migratetype() is non-inlined:
> > >
> > > + int __mt = get_pageblock_migratetype(page);
> > > + int mt = __READ_ONCE(__mt);
> > > +
> > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> > > + return false;
> > >
> > >
> > > Am I missing anything here?
> >
> > In the absence of future aggression from link-time optimizations (LTO),
> > you are missing nothing.
>
> A thing I want to note is Android kernel uses LTO full mode.
I doubt that current LTO can do this sort of optimized inlining, at least,
not yet.
Thanx, Paul
On 5/11/22 16:57, John Hubbard wrote:
> On 5/11/22 16:45, Paul E. McKenney wrote:
>>>
>>> Well no, because the "&" operation is a single operation on the CPU, and
>>> isn't going to get split up like that.
>>
>> Chiming in a bit late...
>
> Much appreciated!
>
>>
>> The usual way that this sort of thing causes trouble is if there is a
>> single store instruction that changes the value from MIGRATE_ISOLATE
>> to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
>
> Doing an AND twice for "x & constant" this definitely blows my mind. Is
> nothing sacred? :)
>
>> and then combine the results. This could give a zero outcome where the
>> underlying variable never had the value zero.
>>
>> Is this sort of thing low probability?
>>
>> Definitely.
>>
>> Isn't this sort of thing prohibited?
>>
>> Definitely not.
>>
>> So what you have will likely work for at least a while longer, but it
>> is not guaranteed and it forces you to think a lot harder about what
>> the current implementations of the compiler can and cannot do to you.
>>
>> The following LWN article goes through some of the possible optimizations
>> (vandalisms?) in this area: https://lwn.net/Articles/793253/
>>
>
> hmm, I don't think we hit any of those cases, do we? Because here, the
> "write" side is via a non-inline function that I just don't believe the
> compiler is allowed to call twice. Or is it?
>
> Minchan's earlier summary:
>
> CPU 0 CPU1
>
>
> set_pageblock_migratetype(MIGRATE_ISOLATE)
>
> if (get_pageblock_migrate(page) & MIGRATE_CMA)
>
> set_pageblock_migratetype(MIGRATE_CMA)
>
> if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
>
> ...where set_pageblock_migratetype() is not inline.
>
> thanks,
Let me try to say this more clearly: I don't think that the following
__READ_ONCE() statement can actually help anything, given that
get_pageblock_migratetype() is non-inlined:
+ int __mt = get_pageblock_migratetype(page);
+ int mt = __READ_ONCE(__mt);
+
+ if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
+ return false;
Am I missing anything here?
thanks,
--
John Hubbard
NVIDIA
On 5/11/22 18:03, Minchan Kim wrote:
>>
>> Or there might be some code path that really hates a READ_ONCE() in
>> that place.
>
> My worry about chaning __get_pfnblock_flags_mask is it's called
> multiple hot places in mm codes so I didn't want to add overhead
> to them.
...unless it really does generate the same code as is already there,
right? Let me check that real quick.
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 03:49:06PM -0700, John Hubbard wrote:
> On 5/11/22 15:37, Minchan Kim wrote:
> > > Yes. But one thing that is still unanswered, that I think you can
> > > answer, is: even if the compiler *did* re-read the mt variable, what
> > > problems could that cause? I claim "no problems", because there is
> > > no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause
> > > problems here.
> >
> > What scenario I am concerning with __READ_ONCE so compiler
> > inlining get_pageblock_migratetype two times are
> >
> > CPU 0 CPU 1
> > alloc_contig_range
> > is_pinnable_page start_isolate_page_range
> > set_pageblock_migratetype(MIGRATE_ISOLATE)
> > if (get_pageeblock_migratetype(page) == MIGRATE_CMA)
> > so it's false
> > undo:
> > set_pageblock_migratetype(MIGRATE_CMA)
> > if (get_pageeblock_migratetype(page) == MIGRATE_ISOLATE)
> > so it's false
> >
> > In the end, CMA memory would be pinned by CPU 0 process
> > so CMA allocation keep failed until the process release the
> > refcount.
> >
>
> OK, so the code checks the wrong item each time. But the code really
> only needs to know "is either _CMA or _ISOLATE set?". And so you
Yes.
> can just sidestep the entire question by writing it like this:
>
> int mt = get_pageblock_migratetype(page);
>
> if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
> return false;
I am confused. Isn't it the same question?

                              set_pageblock_migratetype(MIGRATE_ISOLATE)
if (get_pageblock_migrate(page) & MIGRATE_CMA)
                              set_pageblock_migratetype(MIGRATE_CMA)
if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
On 5/11/22 2:46 PM, Minchan Kim wrote:
>> I read that, but there was never any real justification there for needing
>> to prevent a re-read of mt, just a preference: "I'd like to keep use the local
>> variable mt's value in folloing conditions checks instead of refetching
>> the value from get_pageblock_migratetype."
>>
>> But I don't believe that there is any combination of values of mt that
>> will cause a problem here.
>>
>> I also think that once we pull in experts, they will tell us that the
>> compiler is not going to re-run a non-trivial function to re-fetch a
>> value, but I'm not one of those experts, so that's still arguable. But
>> imagine what the kernel code would look like if every time we call
>> a large function, we have to consider if it actually gets called some
>> arbitrary number of times, due to (anti-) optimizations by the compiler.
>> This seems like something that is not really happening.
>
> Maybe, I might be paranoid since I have heard too subtle things
> about how compiler could changes high level language code so wanted
> be careful especially when we do lockless-stuff.
>
> Who cares when we change the large(?) function to small(?) function
> later on? I'd like to hear from experts to decide it.
>
Yes. But one thing that is still unanswered, that I think you can
answer, is: even if the compiler *did* re-read the mt variable, what
problems could that cause? I claim "no problems", because there is
no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause
problems here.
And if that's true, then we can leave the experts alone, because
the answer is there without knowing what happens exactly to mt.
thanks,
--
John Hubbard
NVIDIA
On 5/11/22 20:44, Minchan Kim wrote:
> Thanks for checking, John.
>
> I don't want to have the READ_ONCE in __get_pfnblock_flags_mask
> atm even though it's an extra memory dereference for specific
> architecutre and specific compiler unless other callsites *do*
> need it.
>
> We choose the choice(i.e., having __READ_ONCE in is_pinanble_
> page) for *potential* cases(e.g., aggressive LTO for future)
> and if it's an extra memory dereference atm, it would be multiple
> instructions *potentailly* as having more change or different
> compiler and/or arches now or later. It's out of scope in
> this patch.
Agreed!
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 07:18:56PM -0700, John Hubbard wrote:
> On 5/11/22 18:08, John Hubbard wrote:
> > On 5/11/22 18:03, Minchan Kim wrote:
> > > >
> > > > Or there might be some code path that really hates a READ_ONCE() in
> > > > that place.
> > >
> > > My worry about chaning __get_pfnblock_flags_mask is it's called
> > > multiple hot places in mm codes so I didn't want to add overhead
> > > to them.
> >
> > ...unless it really does generate the same code as is already there,
> > right? Let me check that real quick.
> >
>
> It does change the generated code slightly. I don't know if this will
> affect performance here or not. But just for completeness, here you go:
>
> free_one_page() originally has this (just showing the changed parts):
>
> mov 0x8(%rdx,%rax,8),%rbx
> and $0x3f,%ecx
> shr %cl,%rbx
> and $0x7,
>
>
> And after applying this diff:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0e42038382c1..df1f8e9a294f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct
> page *page,
> word_bitidx = bitidx / BITS_PER_LONG;
> bitidx &= (BITS_PER_LONG-1);
>
> - word = bitmap[word_bitidx];
> + word = READ_ONCE(bitmap[word_bitidx]);
> return (word >> bitidx) & mask;
> }
>
>
> ...it now does an extra memory dereference:
>
> lea 0x8(%rdx,%rax,8),%rax
> and $0x3f,%ecx
> mov (%rax),%rbx
> shr %cl,%rbx
> and $0x7,%ebx
>
Thanks for checking, John.

I don't want to have the READ_ONCE in __get_pfnblock_flags_mask atm,
even though the extra memory dereference shows up only for a specific
architecture and a specific compiler, unless other callsites *do*
need it.

We made this choice (i.e., having __READ_ONCE in is_pinnable_page) for
*potential* cases (e.g., aggressive LTO in the future), and while it is
an extra memory dereference atm, it could *potentially* become multiple
instructions with further changes or a different compiler and/or arch,
now or later. It's out of scope for this patch.
On Wed, May 11, 2022 at 03:25:49PM -0700, John Hubbard wrote:
> On 5/11/22 2:46 PM, Minchan Kim wrote:
> > > I read that, but there was never any real justification there for needing
> > > to prevent a re-read of mt, just a preference: "I'd like to keep use the local
> > > variable mt's value in folloing conditions checks instead of refetching
> > > the value from get_pageblock_migratetype."
> > >
> > > But I don't believe that there is any combination of values of mt that
> > > will cause a problem here.
> > >
> > > I also think that once we pull in experts, they will tell us that the
> > > compiler is not going to re-run a non-trivial function to re-fetch a
> > > value, but I'm not one of those experts, so that's still arguable. But
> > > imagine what the kernel code would look like if every time we call
> > > a large function, we have to consider if it actually gets called some
> > > arbitrary number of times, due to (anti-) optimizations by the compiler.
> > > This seems like something that is not really happening.
> >
> > Maybe, I might be paranoid since I have heard too subtle things
> > about how compiler could changes high level language code so wanted
> > be careful especially when we do lockless-stuff.
> >
> > Who cares when we change the large(?) function to small(?) function
> > later on? I'd like to hear from experts to decide it.
> >
>
> Yes. But one thing that is still unanswered, that I think you can
> answer, is: even if the compiler *did* re-read the mt variable, what
> problems could that cause? I claim "no problems", because there is
> no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause
> problems here.
The scenario I am concerned about with __READ_ONCE, where the compiler
inlines get_pageblock_migratetype() and reads it two times, is:

CPU 0                                   CPU 1
                                        alloc_contig_range
is_pinnable_page                        start_isolate_page_range
                                        set_pageblock_migratetype(MIGRATE_ISOLATE)
if (get_pageblock_migratetype(page) == MIGRATE_CMA)
    so it's false
                                        undo:
                                        set_pageblock_migratetype(MIGRATE_CMA)
if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
    so it's false

In the end, the CMA memory would be pinned by the CPU 0 process, so CMA
allocation keeps failing until the process releases the refcount.
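Roughly, this is a sketch of the behaviour I worry about if the refetch
happens. The function name is hypothetical and it is only an illustration
of the refetched form, assuming the usual kernel definitions (CONFIG_CMA
and CONFIG_MEMORY_ISOLATION enabled), not the actual patch:

	#include <linux/mm.h>
	#include <linux/mmzone.h>

	/*
	 * Hypothetical sketch: what the check behaves like if the compiler
	 * ends up calling get_pageblock_migratetype() once per comparison.
	 * With the CPU 1 transitions interleaved as above, both tests are
	 * false for a page that was always either ISOLATE or CMA.
	 */
	static bool is_pinnable_page_refetched(struct page *page)
	{
		/* CPU 1 has already switched the pageblock to MIGRATE_ISOLATE */
		if (get_pageblock_migratetype(page) == MIGRATE_CMA)
			return false;	/* not taken: it reads MIGRATE_ISOLATE */

		/* CPU 1 fails isolation and undoes the pageblock to MIGRATE_CMA */

		if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
			return false;	/* not taken: it reads MIGRATE_CMA */

		/* the CMA page is reported pinnable and gets longterm-pinned */
		return true;
	}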
>
> Any if that's true, then we can leave the experts alone, because
> the answer is there without knowing what happens exactly to mt.
>
> thanks,
>
> --
> John Hubbard
> NVIDIA
>
On 5/11/22 17:49, Paul E. McKenney wrote:
>> Thanks Paul for explaining the state of things.
>>
>> Minchan, how about something like very close to your original draft,
>> then, but with a little note, and the "&" as well:
>>
>> int __mt = get_pageblock_migratetype(page);
>>
>> /*
>> * Defend against future compiler LTO features, or code refactoring
>> * that inlines the above function, by forcing a single read. Because, this
>> * routine races with set_pageblock_migratetype(), and we want to avoid
>> * reading zero, when actually one or the other flags was set.
>> */
>> int mt = __READ_ONCE(__mt);
>>
>> if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
>> return false;
>>
>>
>> ...which should make everyone comfortable and protected from the
>> future sins of the compiler and linker teams? :)
>
> This would work, but it would force a store to the stack and an immediate
> reload. Which might be OK on this code path.
>
> But using READ_ONCE() in (I think?) __get_pfnblock_flags_mask()
> would likely generate the same code that is produced today.
>
> word = READ_ONCE(bitmap[word_bitidx]);
Ah right, I like that much, much better. The READ_ONCE is placed where
it actually clearly matters, rather than several layers removed.
>
> But I could easily have missed a turn in that cascade of functions. ;-)
>
> Or there might be some code path that really hates a READ_ONCE() in
> that place.
I certainly hope not. I see free_one_page(), among other things, calls
this. But having the correct READ_ONCE() in a central place seems worth
it, unless we know that this will cause a measurable slowdown somewhere.
thanks,
--
John Hubbard
NVIDIA
>
> Thanx, Paul
On Wed, May 11, 2022 at 05:12:32PM -0700, John Hubbard wrote:
> On 5/11/22 16:57, John Hubbard wrote:
> > On 5/11/22 16:45, Paul E. McKenney wrote:
> > > >
> > > > Well no, because the "&" operation is a single operation on the CPU, and
> > > > isn't going to get split up like that.
> > >
> > > Chiming in a bit late...
> >
> > Much appreciated!
> >
> > >
> > > The usual way that this sort of thing causes trouble is if there is a
> > > single store instruction that changes the value from MIGRATE_ISOLATE
> > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
> >
> > Doing an AND twice for "x & constant" this definitely blows my mind. Is
> > nothing sacred? :)
> >
> > > and then combine the results. This could give a zero outcome where the
> > > underlying variable never had the value zero.
> > >
> > > Is this sort of thing low probability?
> > >
> > > Definitely.
> > >
> > > Isn't this sort of thing prohibited?
> > >
> > > Definitely not.
> > >
> > > So what you have will likely work for at least a while longer, but it
> > > is not guaranteed and it forces you to think a lot harder about what
> > > the current implementations of the compiler can and cannot do to you.
> > >
> > > The following LWN article goes through some of the possible optimizations
> > > (vandalisms?) in this area: https://lwn.net/Articles/793253/
> > >
> >
> > hmm, I don't think we hit any of those cases, do we? Because here, the
> > "write" side is via a non-inline function that I just don't believe the
> > compiler is allowed to call twice. Or is it?
> >
> > Minchan's earlier summary:
> >
> > CPU 0                         CPU1
> >
> >
> >                               set_pageblock_migratetype(MIGRATE_ISOLATE)
> >
> > if (get_pageblock_migrate(page) & MIGRATE_CMA)
> >
> >                               set_pageblock_migratetype(MIGRATE_CMA)
> >
> > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
> >
> > ...where set_pageblock_migratetype() is not inline.
> >
> > thanks,
>
> Let me try to say this more clearly: I don't think that the following
> __READ_ONCE() statement can actually help anything, given that
> get_pageblock_migratetype() is non-inlined:
>
> + int __mt = get_pageblock_migratetype(page);
> + int mt = __READ_ONCE(__mt);
> +
> + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> + return false;
>
>
> Am I missing anything here?
In the absence of future aggression from link-time optimizations (LTO),
you are missing nothing.
Thanx, Paul
On Tue, May 10, 2022 at 09:32:05PM -0700, John Hubbard wrote:
> On 5/10/22 17:09, Minchan Kim wrote:
> > On Tue, May 10, 2022 at 04:58:13PM -0700, John Hubbard wrote:
> > > On 5/10/22 4:31 PM, Minchan Kim wrote:
> > > > > > + int __mt = get_pageblock_migratetype(page);
> > > > > > + int mt = __READ_ONCE(__mt);
> > > > >
> > > > > Although I saw the email discussion about this in v2, that discussion
> > > > > didn't go far enough. It started with "don't use volatile", and went
> > > > > on to "try __READ_ONCE() instead", but it should have continued on
> > > > > to "you don't need this at all".
> > > >
> > > > That's really what I want to hear from experts so wanted to learn
> > > > "Why". How could we prevent refetching of the mt if we don't use
> > > > __READ_ONCE or volatile there?
> > > >
> > > > >
> > > > > Because you don't. There is nothing you are racing with, and adding
> > > > > __READ_ONCE() in order to avoid a completely not-going-to-happen
> > > > > compiler re-invocation of a significant code block is just very wrong.
> > > > >
> > > > > So let's just let it go entirely. :)
> > > >
> > > > Yeah, once it's clear for everyone, I am happy to remove the
> > > > unnecessary lines.
> > > >
> > > > >
> > > > > > +
> > > > > > + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
> > > > >
> > >
> > > With or without __READ_ONCE() or volatile or anything else,
> > > this code will do what you want. Which is: loosely check
> > > for either of the above.
> > >
> > > What functional problem do you think you are preventing
> > > with __READ_ONCE()? Because I don't see one.
> >
> > I discussed the issue at v1 so please take a look.
> >
> > https://lore.kernel.org/all/[email protected]/
>
> I read that, but there was never any real justification there for needing
> to prevent a re-read of mt, just a preference: "I'd like to keep use the local
> variable mt's value in folloing conditions checks instead of refetching
> the value from get_pageblock_migratetype."
>
> But I don't believe that there is any combination of values of mt that
> will cause a problem here.
>
> I also think that once we pull in experts, they will tell us that the
> compiler is not going to re-run a non-trivial function to re-fetch a
> value, but I'm not one of those experts, so that's still arguable. But
> imagine what the kernel code would look like if every time we call
> a large function, we have to consider if it actually gets called some
> arbitrary number of times, due to (anti-) optimizations by the compiler.
> This seems like something that is not really happening.
Maybe I am being paranoid, since I have heard of too many subtle things
the compiler can do to high-level language code, so I wanted to be
careful, especially when we do lockless stuff.

And who would notice if the large(?) function is changed into a small(?)
function later on? I'd like to hear from the experts to decide.
On Wed, May 11, 2022 at 05:49:49PM -0700, Paul E. McKenney wrote:
> On Wed, May 11, 2022 at 05:34:52PM -0700, John Hubbard wrote:
> > On 5/11/22 17:26, Minchan Kim wrote:
> > > > > Let me try to say this more clearly: I don't think that the following
> > > > > __READ_ONCE() statement can actually help anything, given that
> > > > > get_pageblock_migratetype() is non-inlined:
> > > > >
> > > > > + int __mt = get_pageblock_migratetype(page);
> > > > > + int mt = __READ_ONCE(__mt);
> > > > > +
> > > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> > > > > + return false;
> > > > >
> > > > >
> > > > > Am I missing anything here?
> > > >
> > > > In the absence of future aggression from link-time optimizations (LTO),
> > > > you are missing nothing.
> > >
> > > A thing I want to note is Android kernel uses LTO full mode.
> >
> > Thanks Paul for explaining the state of things.
> >
> > Minchan, how about something like very close to your original draft,
> > then, but with a little note, and the "&" as well:
> >
> > int __mt = get_pageblock_migratetype(page);
> >
> > /*
> > * Defend against future compiler LTO features, or code refactoring
> > * that inlines the above function, by forcing a single read. Because, this
> > * routine races with set_pageblock_migratetype(), and we want to avoid
> > * reading zero, when actually one or the other flags was set.
> > */
> > int mt = __READ_ONCE(__mt);
> >
> > if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> > return false;
> >
> >
> > ...which should make everyone comfortable and protected from the
> > future sins of the compiler and linker teams? :)
>
> This would work, but it would force a store to the stack and an immediate
> reload. Which might be OK on this code path.
>
> But using READ_ONCE() in (I think?) __get_pfnblock_flags_mask()
> would likely generate the same code that is produced today.
>
> word = READ_ONCE(bitmap[word_bitidx]);
>
> But I could easily have missed a turn in that cascade of functions. ;-)
>
> Or there might be some code path that really hates a READ_ONCE() in
> that place.
My worry about changing __get_pfnblock_flags_mask is that it's called
from multiple hot places in the mm code, so I didn't want to add overhead
to them.
On 5/11/22 18:08, John Hubbard wrote:
> On 5/11/22 18:03, Minchan Kim wrote:
>>>
>>> Or there might be some code path that really hates a READ_ONCE() in
>>> that place.
>>
>> My worry about chaning __get_pfnblock_flags_mask is it's called
>> multiple hot places in mm codes so I didn't want to add overhead
>> to them.
>
> ...unless it really does generate the same code as is already there,
> right? Let me check that real quick.
>
It does change the generated code slightly. I don't know if this will
affect performance here or not. But just for completeness, here you go:
free_one_page() originally has this (just showing the changed parts):
mov 0x8(%rdx,%rax,8),%rbx
and $0x3f,%ecx
shr %cl,%rbx
and $0x7,
And after applying this diff:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e42038382c1..df1f8e9a294f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct
page *page,
word_bitidx = bitidx / BITS_PER_LONG;
bitidx &= (BITS_PER_LONG-1);
- word = bitmap[word_bitidx];
+ word = READ_ONCE(bitmap[word_bitidx]);
return (word >> bitidx) & mask;
}
...it now does an extra memory dereference:
lea 0x8(%rdx,%rax,8),%rax
and $0x3f,%ecx
mov (%rax),%rbx
shr %cl,%rbx
and $0x7,%ebx
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 07:18:56PM -0700, John Hubbard wrote:
> On 5/11/22 18:08, John Hubbard wrote:
> > On 5/11/22 18:03, Minchan Kim wrote:
> > > >
> > > > Or there might be some code path that really hates a READ_ONCE() in
> > > > that place.
> > >
> > > My worry about chaning __get_pfnblock_flags_mask is it's called
> > > multiple hot places in mm codes so I didn't want to add overhead
> > > to them.
> >
> > ...unless it really does generate the same code as is already there,
> > right? Let me check that real quick.
> >
>
> It does change the generated code slightly. I don't know if this will
> affect performance here or not. But just for completeness, here you go:
>
> free_one_page() originally has this (just showing the changed parts):
>
> mov 0x8(%rdx,%rax,8),%rbx
> and $0x3f,%ecx
> shr %cl,%rbx
> and $0x7,
>
>
> And after applying this diff:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0e42038382c1..df1f8e9a294f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct
> page *page,
> word_bitidx = bitidx / BITS_PER_LONG;
> bitidx &= (BITS_PER_LONG-1);
>
> - word = bitmap[word_bitidx];
> + word = READ_ONCE(bitmap[word_bitidx]);
> return (word >> bitidx) & mask;
> }
>
>
> ...it now does an extra memory dereference:
>
> lea 0x8(%rdx,%rax,8),%rax
> and $0x3f,%ecx
> mov (%rax),%rbx
> shr %cl,%rbx
> and $0x7,%ebx
That could indeed be a bad thing on a fastpath.
Thanx, Paul
On 5/11/22 15:37, Minchan Kim wrote:
>> Yes. But one thing that is still unanswered, that I think you can
>> answer, is: even if the compiler *did* re-read the mt variable, what
>> problems could that cause? I claim "no problems", because there is
>> no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause
>> problems here.
>
> What scenario I am concerning with __READ_ONCE so compiler
> inlining get_pageblock_migratetype two times are
>
> CPU 0 CPU 1
> alloc_contig_range
> is_pinnable_page start_isolate_page_range
> set_pageblock_migratetype(MIGRATE_ISOLATE)
> if (get_pageeblock_migratetype(page) == MIGRATE_CMA)
> so it's false
> undo:
> set_pageblock_migratetype(MIGRATE_CMA)
>
> if (get_pageeblock_migratetype(page) == MIGRATE_ISOLATE)
> so it's false
>
> In the end, CMA memory would be pinned by CPU 0 process
> so CMA allocation keep failed until the process release the
> refcount.
>
OK, so the code checks the wrong item each time. But the code really
only needs to know "is either _CMA or _ISOLATE set?". And so you
can just sidestep the entire question by writing it like this:
	int mt = get_pageblock_migratetype(page);

	if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
		return false;
...yes?
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 04:13:10PM -0700, John Hubbard wrote:
> On 5/11/22 16:08, Minchan Kim wrote:
> > > OK, so the code checks the wrong item each time. But the code really
> > > only needs to know "is either _CMA or _ISOLATE set?". And so you
> >
> > Yes.
> >
> > > can just sidestep the entire question by writing it like this:
> > >
> > > int mt = get_pageblock_migratetype(page);
> > >
> > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
> > > return false;
> >
> > I am confused. Isn't it same question?
> >
> > set_pageblock_migratetype(MIGRATE_ISOLATE)
> > if (get_pageblock_migrate(page) & MIGRATE_CMA)
> >
> > set_pageblock_migratetype(MIGRATE_CMA)
> >
> > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
>
> Well no, because the "&" operation is a single operation on the CPU, and
> isn't going to get split up like that.
Oh, if that's true, yeah, I could live with it.

Thanks, let me post the next revision with a comment about that.
On Wed, May 11, 2022 at 04:15:17PM -0700, Minchan Kim wrote:
> On Wed, May 11, 2022 at 04:13:10PM -0700, John Hubbard wrote:
> > On 5/11/22 16:08, Minchan Kim wrote:
> > > > OK, so the code checks the wrong item each time. But the code really
> > > > only needs to know "is either _CMA or _ISOLATE set?". And so you
> > >
> > > Yes.
> > >
> > > > can just sidestep the entire question by writing it like this:
> > > >
> > > > int mt = get_pageblock_migratetype(page);
> > > >
> > > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
> > > > return false;
> > >
> > > I am confused. Isn't it same question?
> > >
> > > set_pageblock_migratetype(MIGRATE_ISOLATE)
> > > if (get_pageblock_migrate(page) & MIGRATE_CMA)
> > >
> > > set_pageblock_migratetype(MIGRATE_CMA)
> > >
> > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
> >
> > Well no, because the "&" operation is a single operation on the CPU, and
> > isn't going to get split up like that.
>
> Oh, if that's true, yeah, I could live with it.
>
> Thanks, let me post next revision with commenting about that.
This is the delta to confirm before posting the next revision.

Are you okay with this one?
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cbf79eb790e0..7b2df6780552 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1626,14 +1626,14 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
static inline bool is_pinnable_page(struct page *page)
{
#ifdef CONFIG_CMA
+ int mt = get_pageblock_migratetype(page);
+
/*
- * use volatile to use local variable mt instead of
- * refetching mt value.
+ * "&" operation would prevent compiler split up
+ * get_pageblock_migratetype two times for each
+ * condition check: refetching mt value two times.
*/
- int __mt = get_pageblock_migratetype(page);
- int mt = __READ_ONCE(__mt);
-
- if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
+ if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
return false;
#endif
On Wed, May 11, 2022 at 04:13:10PM -0700, John Hubbard wrote:
> On 5/11/22 16:08, Minchan Kim wrote:
> > > OK, so the code checks the wrong item each time. But the code really
> > > only needs to know "is either _CMA or _ISOLATE set?". And so you
> >
> > Yes.
> >
> > > can just sidestep the entire question by writing it like this:
> > >
> > > int mt = get_pageblock_migratetype(page);
> > >
> > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
> > > return false;
> >
> > I am confused. Isn't it same question?
> >
> > set_pageblock_migratetype(MIGRATE_ISOLATE)
> > if (get_pageblock_migrate(page) & MIGRATE_CMA)
> >
> > set_pageblock_migratetype(MIGRATE_CMA)
> >
> > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
>
> Well no, because the "&" operation is a single operation on the CPU, and
> isn't going to get split up like that.
Chiming in a bit late...
The usual way that this sort of thing causes trouble is if there is a
single store instruction that changes the value from MIGRATE_ISOLATE
to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice,
and then combine the results. This could give a zero outcome where the
underlying variable never had the value zero.
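As a sketch of that transformation (with made-up bit values MT_ISOLATE and
MT_CMA, since the exact constants don't matter for the point, and a plain
global standing in for the racy pageblock word):

	#define MT_ISOLATE 0x1	/* made-up bit values, just for this sketch */
	#define MT_CMA     0x2

	extern int pageblock_bits;	/* stand-in for memory another CPU rewrites */

	/* What the source says: one fetch, one mask. */
	int source_form(void)
	{
		int mt = pageblock_bits;

		return mt & (MT_ISOLATE | MT_CMA);
	}

	/*
	 * A rewrite the compiler is permitted to make: fetch twice, AND twice,
	 * then combine.  If the first fetch sees MT_ISOLATE (so the & MT_CMA
	 * term is zero) and the racing store to MT_CMA lands before the second
	 * fetch (so the & MT_ISOLATE term is zero too), the result is zero even
	 * though the variable itself was never zero.
	 */
	int possible_compiler_form(void)
	{
		return (pageblock_bits & MT_CMA) | (pageblock_bits & MT_ISOLATE);
	}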
Is this sort of thing low probability?
Definitely.
Isn't this sort of thing prohibited?
Definitely not.
So what you have will likely work for at least a while longer, but it
is not guaranteed and it forces you to think a lot harder about what
the current implementations of the compiler can and cannot do to you.
The following LWN article goes through some of the possible optimizations
(vandalisms?) in this area: https://lwn.net/Articles/793253/
In the end, it is your code, so you get to decide how much you would
like to keep track of what compilers get up to over time. ;-)
Thanx, Paul
On 5/11/22 16:08, Minchan Kim wrote:
>> OK, so the code checks the wrong item each time. But the code really
>> only needs to know "is either _CMA or _ISOLATE set?". And so you
>
> Yes.
>
>> can just sidestep the entire question by writing it like this:
>>
>> int mt = get_pageblock_migratetype(page);
>>
>> if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA))
>> return false;
>
> I am confused. Isn't it same question?
>
> set_pageblock_migratetype(MIGRATE_ISOLATE)
> if (get_pageblock_migrate(page) & MIGRATE_CMA)
>
> set_pageblock_migratetype(MIGRATE_CMA)
>
> if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
Well no, because the "&" operation is a single operation on the CPU, and
isn't going to get split up like that.
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 05:34:52PM -0700, John Hubbard wrote:
> On 5/11/22 17:26, Minchan Kim wrote:
> > > > Let me try to say this more clearly: I don't think that the following
> > > > __READ_ONCE() statement can actually help anything, given that
> > > > get_pageblock_migratetype() is non-inlined:
> > > >
> > > > + int __mt = get_pageblock_migratetype(page);
> > > > + int mt = __READ_ONCE(__mt);
> > > > +
> > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> > > > + return false;
> > > >
> > > >
> > > > Am I missing anything here?
> > >
> > > In the absence of future aggression from link-time optimizations (LTO),
> > > you are missing nothing.
> >
> > A thing I want to note is Android kernel uses LTO full mode.
>
> Thanks Paul for explaining the state of things.
>
> Minchan, how about something like very close to your original draft,
> then, but with a little note, and the "&" as well:
>
> int __mt = get_pageblock_migratetype(page);
>
> /*
> * Defend against future compiler LTO features, or code refactoring
> * that inlines the above function, by forcing a single read. Because, this
> * routine races with set_pageblock_migratetype(), and we want to avoid
> * reading zero, when actually one or the other flags was set.
> */
> int mt = __READ_ONCE(__mt);
>
> if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
> return false;
>
>
> ...which should make everyone comfortable and protected from the
> future sins of the compiler and linker teams? :)
Will take. Thanks.
On 5/11/22 17:26, Minchan Kim wrote:
>>> Let me try to say this more clearly: I don't think that the following
>>> __READ_ONCE() statement can actually help anything, given that
>>> get_pageblock_migratetype() is non-inlined:
>>>
>>> + int __mt = get_pageblock_migratetype(page);
>>> + int mt = __READ_ONCE(__mt);
>>> +
>>> + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
>>> + return false;
>>>
>>>
>>> Am I missing anything here?
>>
>> In the absence of future aggression from link-time optimizations (LTO),
>> you are missing nothing.
>
> A thing I want to note is Android kernel uses LTO full mode.
Thanks Paul for explaining the state of things.
Minchan, how about something like very close to your original draft,
then, but with a little note, and the "&" as well:
	int __mt = get_pageblock_migratetype(page);

	/*
	 * Defend against future compiler LTO features, or code refactoring
	 * that inlines the above function, by forcing a single read. Because, this
	 * routine races with set_pageblock_migratetype(), and we want to avoid
	 * reading zero, when actually one or the other flags was set.
	 */
	int mt = __READ_ONCE(__mt);

	if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE))
		return false;
...which should make everyone comfortable and protected from the
future sins of the compiler and linker teams? :)
thanks,
--
John Hubbard
NVIDIA
On Tue, May 17, 2022 at 11:12:37AM -0700, John Hubbard wrote:
> On 5/17/22 07:00, Jason Gunthorpe wrote:
> > > > It does change the generated code slightly. I don't know if this will
> > > > affect performance here or not. But just for completeness, here you go:
> > > >
> > > > free_one_page() originally has this (just showing the changed parts):
> > > >
> > > > mov 0x8(%rdx,%rax,8),%rbx
> > > > and $0x3f,%ecx
> > > > shr %cl,%rbx
> > > > and $0x7,
> > > >
> > > >
> > > > And after applying this diff:
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 0e42038382c1..df1f8e9a294f 100644
> > > > +++ b/mm/page_alloc.c
> > > > @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct
> > > > page *page,
> > > > word_bitidx = bitidx / BITS_PER_LONG;
> > > > bitidx &= (BITS_PER_LONG-1);
> > > >
> > > > - word = bitmap[word_bitidx];
> > > > + word = READ_ONCE(bitmap[word_bitidx]);
> > > > return (word >> bitidx) & mask;
> > > > }
> > > >
> > > >
> > > > ...it now does an extra memory dereference:
> > > >
> > > > lea 0x8(%rdx,%rax,8),%rax
> > > > and $0x3f,%ecx
> > > > mov (%rax),%rbx
> > > > shr %cl,%rbx
> > > > and $0x7,%ebx
> >
> > Where is the extra memory reference? 'lea' is not a memory reference,
> > it is just some maths?
>
> If you compare this to the snippet above, you'll see that there is
> an extra mov statement, and that one dereferences a pointer from
> %rax:
>
> mov (%rax),%rbx
That is the same move as:
mov 0x8(%rdx,%rax,8),%rbx
Except that the EA calculation was done in advance and stored in rax.
lea isn't a memory reference, it is just computing the pointer value
that 0x8(%rdx,%rax,8) represents. ie the lea computes

   %rax = %rdx + %rax*8 + 0x8

which is then fed into the mov. Maybe it is an optimization to allow
one pipe to do the shr and another to do the EA - IDK, seems like a
random thing for the compiler to do.
> > IMHO putting a READ_ONCE on something that is not a memory load from
> > shared data is nonsense - if a simple == has a stability risk then so
> > does the '(word >> bitidx) & mask'.
>
> Doing something like this:
>
> int __x = y();
> int x = READ_ONCE(__x);
>
> is just awful! I agree. Really, y() should handle any barriers, because
> otherwise it really does look pointless, and people reading the code
> need something that is clearer. My first reaction was that this was
> pointless and wrong, and it turns out that that's only about 80% true:
> as long as LTO-of-the-future doesn't arrive, and as long as no one
> refactors y() to be inline.
It is always pointless, because look at the whole flow:
y():
word = bitmap[word_bitidx];
return (word >> bitidx) & mask;
int __x = y()
int x = READ_ONCE(__x)
The requirement here is for a coherent read of bitmap[word_bitidx]
that is not mangled by multi-load, load tearing or other theoretical
compiler problems. Stated more clearly: if bitmap[] has values {A,B,C},
then the result of all this in x should always be {A',B',C'} according
to the given maths, never, ever an impossible value D.
The nature of the compiler problems we are trying to prevent is that
the compiler may take a computation that seems deterministic and
corrupt it somehow by making an assumption that the memory is stable
when it is not.
For instance == may not work right if the compiler decides to do:
if (READ_ONCE(a) == 1 && READ_ONCE(a) == 1)
This also means that >> and & could malfunction. So when LTO expands
and inlines everything and you get this:
if (READ_ONCE((bitmap[word_bitidx] >> bitidx) & mask) == MIGRATE_CMA)
It is still not safe because the >> and & could have multi-loaded or
torn the read in some way that it is no longer a sane
calculation. (for instance imagine that the compiler decides to read
the value byte-at-a-time without READ_ONCE)
ie the memory may have had A racing into C, but x could have computed into
D, not A'.
Paul can correct me, but I understand we do not have a list of allowed
operations that are exempted from the READ_ONCE() requirement. ie it
is not just conditional branching that requires READ_ONCE().
This is why READ_ONCE() must always be on the memory load, because the
point is to sanitize away the uncertainty that comes with an unlocked
read of unstable memory contents. READ_ONCE() samples the value in
memory, and removes all tearing, multiload, etc. "instability" that may
affect downstream computations. In this way downstream computations
become reliable.
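A sketch of the placement difference (with a simplified stand-in for the
kernel's READ_ONCE() macro and made-up helper names; it is only an
illustration, not the kernel implementation):

	/* Simplified stand-in for the kernel's READ_ONCE(), for this sketch. */
	#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

	extern unsigned long bitmap[];	/* shared words, rewritten concurrently */

	/*
	 * Pointless placement: the plain load of bitmap[] may already have been
	 * torn or duplicated; forcing a single read of the local afterwards
	 * does not undo that.
	 */
	unsigned long flags_read_once_too_late(unsigned long idx, unsigned int shift,
					       unsigned long mask)
	{
		unsigned long word = bitmap[idx];
		unsigned long val = (word >> shift) & mask;

		return READ_ONCE(val);
	}

	/*
	 * Useful placement: sample the shared memory exactly once, so the shift
	 * and mask below operate on one stable value.
	 */
	unsigned long flags_read_once_on_load(unsigned long idx, unsigned int shift,
					      unsigned long mask)
	{
		unsigned long word = READ_ONCE(bitmap[idx]);

		return (word >> shift) & mask;
	}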
Jason
On Tue, May 17, 2022 at 01:12:02PM -0700, John Hubbard wrote:
> On 5/17/22 12:28, Jason Gunthorpe wrote:
> > > If you compare this to the snippet above, you'll see that there is
> > > an extra mov statement, and that one dereferences a pointer from
> > > %rax:
> > >
> > > mov (%rax),%rbx
> >
> > That is the same move as:
> >
> > mov 0x8(%rdx,%rax,8),%rbx
> >
> > Except that the EA calculation was done in advance and stored in rax.
> >
> > lea isn't a memory reference, it is just computing the pointer value
> > that 0x8(%rdx,%rax,8) represents. ie the lea computes
> >
> > %rax = %rdx + %rax*8 + 6
> >
> > Which is then fed into the mov. Maybe it is an optimization to allow
> > one pipe to do the shr and an other to the EA - IDK, seems like a
> > random thing for the compiler to do.
Maybe an optimization suppressed due to the volatile nature of the
load? If so, perhaps it might be considered a compiler bug. Though
it is quite difficult to get optimization bugs involving volatile
to be taken seriously.
> Apologies for getting that wrong, and thanks for walking me through the
> asm.
>
> [...]
> >
> > Paul can correct me, but I understand we do not have a list of allowed
> > operations that are exempted from the READ_ONCE() requirement. ie it
> > is not just conditional branching that requires READ_ONCE().
> >
> > This is why READ_ONCE() must always be on the memory load, because the
> > point is to sanitize away the uncertainty that comes with an unlocked
> > read of unstable memory contents. READ_ONCE() samples the value in
> > memory, and removes all tearing, multiload, etc "instability" that may
> > effect down stream computations. In this way down stream compulations
> > become reliable.
> >
> > Jason
>
> So then:
Works for me!
Acked-by: Paul E. McKenney <[email protected]>
Thanx, Paul
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0e42038382c1..b404f87e2682 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> word_bitidx = bitidx / BITS_PER_LONG;
> bitidx &= (BITS_PER_LONG-1);
>
> - word = bitmap[word_bitidx];
> + /*
> + * This races, without locks, with set_pageblock_migratetype(). Ensure
> + * a consistent (non-tearing) read of the memory array, so that results,
> + * even though racy, are not corrupted.
> + */
> + word = READ_ONCE(bitmap[word_bitidx]);
> return (word >> bitidx) & mask;
> }
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
On 5/17/22 12:28, Jason Gunthorpe wrote:
>> If you compare this to the snippet above, you'll see that there is
>> an extra mov statement, and that one dereferences a pointer from
>> %rax:
>>
>> mov (%rax),%rbx
>
> That is the same move as:
>
> mov 0x8(%rdx,%rax,8),%rbx
>
> Except that the EA calculation was done in advance and stored in rax.
>
> lea isn't a memory reference, it is just computing the pointer value
> that 0x8(%rdx,%rax,8) represents. ie the lea computes
>
> %rax = %rdx + %rax*8 + 8
>
> Which is then fed into the mov. Maybe it is an optimization to allow
> one pipe to do the shr and an other to the EA - IDK, seems like a
> random thing for the compiler to do.
Apologies for getting that wrong, and thanks for walking me through the
asm.
[...]
>
> Paul can correct me, but I understand we do not have a list of allowed
> operations that are exempted from the READ_ONCE() requirement. ie it
> is not just conditional branching that requires READ_ONCE().
>
> This is why READ_ONCE() must always be on the memory load, because the
> point is to sanitize away the uncertainty that comes with an unlocked
> read of unstable memory contents. READ_ONCE() samples the value in
> memory, and removes all tearing, multiload, etc "instability" that may
> effect down stream computations. In this way down stream compulations
> become reliable.
>
> Jason
So then:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e42038382c1..b404f87e2682 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
word_bitidx = bitidx / BITS_PER_LONG;
bitidx &= (BITS_PER_LONG-1);
- word = bitmap[word_bitidx];
+ /*
+ * This races, without locks, with set_pageblock_migratetype(). Ensure
+ * a consistent (non-tearing) read of the memory array, so that results,
+ * even though racy, are not corrupted.
+ */
+ word = READ_ONCE(bitmap[word_bitidx]);
return (word >> bitidx) & mask;
}
thanks,
--
John Hubbard
NVIDIA
On 5/17/22 07:00, Jason Gunthorpe wrote:
>>> It does change the generated code slightly. I don't know if this will
>>> affect performance here or not. But just for completeness, here you go:
>>>
>>> free_one_page() originally has this (just showing the changed parts):
>>>
>>> mov 0x8(%rdx,%rax,8),%rbx
>>> and $0x3f,%ecx
>>> shr %cl,%rbx
>>> and $0x7,
>>>
>>>
>>> And after applying this diff:
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 0e42038382c1..df1f8e9a294f 100644
>>> +++ b/mm/page_alloc.c
>>> @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct
>>> page *page,
>>> word_bitidx = bitidx / BITS_PER_LONG;
>>> bitidx &= (BITS_PER_LONG-1);
>>>
>>> - word = bitmap[word_bitidx];
>>> + word = READ_ONCE(bitmap[word_bitidx]);
>>> return (word >> bitidx) & mask;
>>> }
>>>
>>>
>>> ...it now does an extra memory dereference:
>>>
>>> lea 0x8(%rdx,%rax,8),%rax
>>> and $0x3f,%ecx
>>> mov (%rax),%rbx
>>> shr %cl,%rbx
>>> and $0x7,%ebx
>
> Where is the extra memory reference? 'lea' is not a memory reference,
> it is just some maths?
If you compare this to the snippet above, you'll see that there is
an extra mov statement, and that one dereferences a pointer from
%rax:
mov (%rax),%rbx
>
>> Thanks for checking, John.
>>
>> I don't want to have the READ_ONCE in __get_pfnblock_flags_mask
>> atm even though it's an extra memory dereference for specific
>> architecture and specific compiler unless other callsites *do*
>> need it.
>
> If a callpath is called under locking or not under locking then I
> would expect to have two call chains clearly marked what their locking
> conditions are. ie __get_pfn_block_flags_mask_unlocked() - and
__get_pfn_block_flags_mask_unlocked() would definitely clarify things,
and allow some clear documentation, good idea.
I haven't checked to see if some code could keep using the normal
__get_pfn_block_flags_mask(), but if it could, that would help with the
problem of keeping the fast path fast.
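As a rough sketch of what that split might look like (the helper names, and
the idea that the locked path can keep a plain load, are hypothetical here,
not an existing kernel API; READ_ONCE() is a stand-in define so the fragment
is self-contained):

#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))   /* stand-in */

/* Callers hold zone->lock (or otherwise exclude concurrent writers),
 * so a plain load keeps today's fast path unchanged. */
static unsigned long pfnblock_word(unsigned long *bitmap,
                                   unsigned long word_bitidx)
{
        return bitmap[word_bitidx];
}

/* Lockless callers such as GUP: force one non-mangled sample of the
 * shared word before any shifting and masking. */
static unsigned long pfnblock_word_unlocked(unsigned long *bitmap,
                                            unsigned long word_bitidx)
{
        return READ_ONCE(bitmap[word_bitidx]);
}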
> obviously clearly document and check what the locking requirements are
> of the locked path.
>
> IMHO putting a READ_ONCE on something that is not a memory load from
> shared data is nonsense - if a simple == has a stability risk then so
> does the '(word >> bitidx) & mask'.
>
> Jason
Doing something like this:
int __x = y();
int x = READ_ONCE(__x);
is just awful! I agree. Really, y() should handle any barriers, because
otherwise it really does look pointless, and people reading the code
need something that is clearer. My first reaction was that this was
pointless and wrong, and it turns out that that's only about 80% true:
as long as LTO-of-the-future doesn't arrive, and as long as no one
refactors y() to be inline.
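A standalone sketch of that fragility, with y() standing in for a helper like
get_pageblock_migratetype() (the names, values, and READ_ONCE() define below
are invented stand-ins, not the patch itself):

#include <stdio.h>

#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))   /* stand-in */

static int shared;              /* racily written by another thread elsewhere */

static int y(void)              /* imagine this lives in another file today,
                                 * so the caller sees only a function call */
{
        return shared;
}

int caller(void)
{
        int __x = y();
        int x = READ_ONCE(__x); /* annotates only the local __x */

        /*
         * The single-load behaviour rests on y()'s call boundary, not on
         * the annotation: once y() is inlined (by hand, or by LTO), nothing
         * here is a READ_ONCE() of 'shared' itself, and the concern in this
         * thread is that the compiler is then technically free to refetch
         * 'shared' for later uses of x.
         */
        return x;
}

int main(void)
{
        shared = 1;
        printf("%d\n", caller());
        return 0;
}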
thanks,
--
John Hubbard
NVIDIA
On Wed, May 11, 2022 at 08:44:43PM -0700, Minchan Kim wrote:
> On Wed, May 11, 2022 at 07:18:56PM -0700, John Hubbard wrote:
> > On 5/11/22 18:08, John Hubbard wrote:
> > > On 5/11/22 18:03, Minchan Kim wrote:
> > > > >
> > > > > Or there might be some code path that really hates a READ_ONCE() in
> > > > > that place.
> > > >
> > > > My worry about chaning __get_pfnblock_flags_mask is it's called
> > > > multiple hot places in mm codes so I didn't want to add overhead
> > > > to them.
> > >
> > > ...unless it really does generate the same code as is already there,
> > > right? Let me check that real quick.
> > >
> >
> > It does change the generated code slightly. I don't know if this will
> > affect performance here or not. But just for completeness, here you go:
> >
> > free_one_page() originally has this (just showing the changed parts):
> >
> > mov 0x8(%rdx,%rax,8),%rbx
> > and $0x3f,%ecx
> > shr %cl,%rbx
> > and $0x7,
> >
> >
> > And after applying this diff:
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 0e42038382c1..df1f8e9a294f 100644
> > +++ b/mm/page_alloc.c
> > @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct
> > page *page,
> > word_bitidx = bitidx / BITS_PER_LONG;
> > bitidx &= (BITS_PER_LONG-1);
> >
> > - word = bitmap[word_bitidx];
> > + word = READ_ONCE(bitmap[word_bitidx]);
> > return (word >> bitidx) & mask;
> > }
> >
> >
> > ...it now does an extra memory dereference:
> >
> > lea 0x8(%rdx,%rax,8),%rax
> > and $0x3f,%ecx
> > mov (%rax),%rbx
> > shr %cl,%rbx
> > and $0x7,%ebx
Where is the extra memory reference? 'lea' is not a memory reference,
it is just some maths?
> Thanks for checking, John.
>
> I don't want to have the READ_ONCE in __get_pfnblock_flags_mask
> atm even though it's an extra memory dereference for specific
> architecture and specific compiler unless other callsites *do*
> need it.
If a call path can be entered either under locking or without locking, then I
would expect two call chains, each clearly marked with its locking
conditions, ie __get_pfn_block_flags_mask_unlocked() - and
obviously clearly document and check what the locking requirements
of the locked path are.
IMHO putting a READ_ONCE on something that is not a memory load from
shared data is nonsense - if a simple == has a stability risk then so
does the '(word >> bitidx) & mask'.
Jason
On Tue, May 17, 2022 at 01:12:02PM -0700, John Hubbard wrote:
> On 5/17/22 12:28, Jason Gunthorpe wrote:
> > > If you compare this to the snippet above, you'll see that there is
> > > an extra mov statement, and that one dereferences a pointer from
> > > %rax:
> > >
> > > mov (%rax),%rbx
> >
> > That is the same move as:
> >
> > mov 0x8(%rdx,%rax,8),%rbx
> >
> > Except that the EA calculation was done in advance and stored in rax.
> >
> > lea isn't a memory reference, it is just computing the pointer value
> > that 0x8(%rdx,%rax,8) represents. ie the lea computes
> >
> > > %rax = %rdx + %rax*8 + 8
> >
> > Which is then fed into the mov. Maybe it is an optimization to allow
> > one pipe to do the shr and an other to the EA - IDK, seems like a
> > random thing for the compiler to do.
>
> Apologies for getting that wrong, and thanks for walking me through the
> asm.
>
> [...]
> >
> > Paul can correct me, but I understand we do not have a list of allowed
> > operations that are exempted from the READ_ONCE() requirement. ie it
> > is not just conditional branching that requires READ_ONCE().
> >
> > This is why READ_ONCE() must always be on the memory load, because the
> > point is to sanitize away the uncertainty that comes with an unlocked
> > read of unstable memory contents. READ_ONCE() samples the value in
> > memory, and removes all tearing, multiload, etc "instability" that may
> > effect down stream computations. In this way down stream compulations
> > become reliable.
> >
> > Jason
>
> So then:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0e42038382c1..b404f87e2682 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> word_bitidx = bitidx / BITS_PER_LONG;
> bitidx &= (BITS_PER_LONG-1);
>
> - word = bitmap[word_bitidx];
> + /*
> + * This races, without locks, with set_pageblock_migratetype(). Ensure
set_pfnblock_flags_mask would be better?
> + * a consistent (non-tearing) read of the memory array, so that results,
Thanks for proceeding with this, and for the suggestion, John.
IIUC, the load tearing wouldn't be an issue since [1] fixed it.
The concern in our discussion was that an aggressive compiler (e.g., LTO), or
code refactoring that makes the code inline in the *future*, could potentially
force a refetch (i.e., re-read) of bitmap[word_bitidx].
If so, shouldn't the comment be the one you helped with before?
/*
 * Defend against future compiler LTO features, or code refactoring
 * that inlines the above function, by forcing a single read. Because
 * re-reads of bitmap[word_bitidx] caused by inlining could cause trouble
 * for callers who believe they are working with a local copy of the value.
 */
[1] e58469bafd05, mm: page_alloc: use word-based accesses for get/set pageblock bitmaps
> + * even though racy, are not corrupted.
> + */
> + word = READ_ONCE(bitmap[word_bitidx]);
> return (word >> bitidx) & mask;
> }
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
On 5/23/22 09:33, Minchan Kim wrote:
...
>> So then:
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 0e42038382c1..b404f87e2682 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
>> word_bitidx = bitidx / BITS_PER_LONG;
>> bitidx &= (BITS_PER_LONG-1);
>>
>> - word = bitmap[word_bitidx];
>> + /*
>> + * This races, without locks, with set_pageblock_migratetype(). Ensure
>
> set_pfnblock_flags_mask would be better?
>
>> + * a consistent (non-tearing) read of the memory array, so that results,
>
> Thanks for proceeding and suggestion, John.
>
> IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
Did it? [1] fixed something, but I'm not sure we can claim that that
code is now safe against tearing in all possible cases, especially given
the recent discussion here. Specifically, having this code do a read,
then follow that up with calculations, seems correct. Anything else is
sketchy...
>
> The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring
> to make the code inline in *future* could potentially cause forcing refetching(i.e.,
> re-read) tie bitmap[word_bitidx].
>
> If so, shouldn't the comment be the one you helped before?
Well, maybe updated to something like this?
/*
* This races, without locks, with set_pageblock_migratetype(). Ensure
* a consistent (non-tearing) read of the memory array, so that results,
* even though racy, are not corrupted--even if this function is
* refactored and/or inlined.
*/
thanks,
--
John Hubbard
NVIDIA
On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
> On 5/23/22 09:33, Minchan Kim wrote:
> ...
> > > So then:
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 0e42038382c1..b404f87e2682 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> > > word_bitidx = bitidx / BITS_PER_LONG;
> > > bitidx &= (BITS_PER_LONG-1);
> > >
> > > - word = bitmap[word_bitidx];
> > > + /*
> > > + * This races, without locks, with set_pageblock_migratetype(). Ensure
> > set_pfnblock_flags_mask would be better?
> > > + * a consistent (non-tearing) read of the memory array, so that results,
> >
> > Thanks for proceeding and suggestion, John.
> >
> > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
>
> Did it? [1] fixed something, but I'm not sure we can claim that that
> code is now safe against tearing in all possible cases, especially given
> the recent discussion here. Specifically, having this code do a read,
> then follow that up with calculations, seems correct. Anything else is
The load tearing you are trying to describe in the comment would be
avoided by [1], since the bits always fall within one word, and a
word-sized access of a word-aligned address is always atomic, so there is
no load tearing problem IIUC.
Rather than the tearing problem, what we are trying to address with
READ_ONCE is preventing a refetch if the function is inlined in the
future.
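Spelled out against a simplified local copy of that function's tail (the
helper names below are invented, and READ_ONCE() is a stand-in define so the
sketch builds on its own), the two hazards being distinguished are:

#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))   /* stand-in */

static unsigned long flags_naked(unsigned long *bitmap,
                                 unsigned long word_bitidx,
                                 unsigned long bitidx, unsigned long mask)
{
        /*
         * Hazard 1 (tearing): the load itself being split - whether an
         * aligned word load can tear is what is debated in this thread.
         * Hazard 2 (refetching): bitmap[word_bitidx] being loaded once for
         * the >> and again for the &, mixing two different snapshots.
         */
        return (bitmap[word_bitidx] >> bitidx) & mask;
}

static unsigned long flags_once(unsigned long *bitmap,
                                unsigned long word_bitidx,
                                unsigned long bitidx, unsigned long mask)
{
        /* One sampled value: both hazards are taken off the table. */
        unsigned long word = READ_ONCE(bitmap[word_bitidx]);

        return (word >> bitidx) & mask;
}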
> sketchy...
>
> >
> > The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring
> > to make the code inline in *future* could potentially cause forcing refetching(i.e.,
> > re-read) tie bitmap[word_bitidx].
> >
> > If so, shouldn't the comment be the one you helped before?
>
> Well, maybe updated to something like this?
>
> /*
> * This races, without locks, with set_pageblock_migratetype(). Ensure
set_pageblock_migratetype is a higher-level function, so it would
be a better fit to say set_pfnblock_flags_mask.
> * a consistent (non-tearing) read of the memory array, so that results,
So the tearing problem should already be prevented by [1], so I am trying to
describe the refetching (or re-read) problem in the comment.
> * even though racy, are not corrupted--even if this function is
The access is already atomic, so I don't think the value could be corrupted
even if this were inlined in the future.
Please correct me if I am missing something.
> * refactored and/or inlined.
> */
On 5/23/22 10:16 PM, Minchan Kim wrote:
> On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
>> On 5/23/22 09:33, Minchan Kim wrote:
>> ...
>>>> So then:
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 0e42038382c1..b404f87e2682 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
>>>> word_bitidx = bitidx / BITS_PER_LONG;
>>>> bitidx &= (BITS_PER_LONG-1);
>>>>
>>>> - word = bitmap[word_bitidx];
>>>> + /*
>>>> + * This races, without locks, with set_pageblock_migratetype(). Ensure
>>> set_pfnblock_flags_mask would be better?
>>>> + * a consistent (non-tearing) read of the memory array, so that results,
>>>
>>> Thanks for proceeding and suggestion, John.
>>>
>>> IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
>>
>> Did it? [1] fixed something, but I'm not sure we can claim that that
>> code is now safe against tearing in all possible cases, especially given
>> the recent discussion here. Specifically, having this code do a read,
>> then follow that up with calculations, seems correct. Anything else is
>
> The load tearing you are trying to explain in the comment would be
> solved by [1] since the bits will always align on a word and accessing
> word size based on word aligned address is always atomic so there is
> no load tearing problem IIUC.
>
> Instead of the tearing problem, what we are trying to solve with
> READ_ONCE is to prevent refetching when the function would be
> inlined in the future.
>
I'm perhaps using "tearing" as too broad a term; maybe just removing
the "(non-tearing)" part would fix up the comment.
>> sketchy...
>>
>>>
>>> The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring
>>> to make the code inline in *future* could potentially cause forcing refetching(i.e.,
>>> re-read) tie bitmap[word_bitidx].
>>>
>>> If so, shouldn't the comment be the one you helped before?
>>
>> Well, maybe updated to something like this?
>>
>> /*
>> * This races, without locks, with set_pageblock_migratetype(). Ensure
>
> set_pageblock_migratetype is more upper level function so it would
> be better fit to say set_pfnblock_flags_mask.
OK
>
>> * a consistent (non-tearing) read of the memory array, so that results,
>
> So tearing problem should't already happen by [1] so I am trying to
> explain refetching(or re-read) problem in the comment.
>
>> * even though racy, are not corrupted--even if this function is
>
> The value is already atomic so I don't think it could be corrupted
> even though it would be inlined in the future.
>
> Please correct me if I miss something.
>
>> * refactored and/or inlined.
>> */
>
thanks,
--
John Hubbard
NVIDIA
On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote:
> On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
> > On 5/23/22 09:33, Minchan Kim wrote:
> > ...
> > > > So then:
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 0e42038382c1..b404f87e2682 100644
> > > > +++ b/mm/page_alloc.c
> > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> > > > word_bitidx = bitidx / BITS_PER_LONG;
> > > > bitidx &= (BITS_PER_LONG-1);
> > > >
> > > > - word = bitmap[word_bitidx];
> > > > + /*
> > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure
> > > set_pfnblock_flags_mask would be better?
> > > > + * a consistent (non-tearing) read of the memory array, so that results,
> > >
> > > Thanks for proceeding and suggestion, John.
> > >
> > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
> >
> > Did it? [1] fixed something, but I'm not sure we can claim that that
> > code is now safe against tearing in all possible cases, especially given
> > the recent discussion here. Specifically, having this code do a read,
> > then follow that up with calculations, seems correct. Anything else is
>
> The load tearing you are trying to explain in the comment would be
> solved by [1] since the bits will always align on a word and accessing
> word size based on word aligned address is always atomic so there is
> no load tearing problem IIUC.
That is not technically true. It is exactly the sort of thing
READ_ONCE is intended to guard against.
> Instead of the tearing problem, what we are trying to solve with
> READ_ONCE is to prevent refetching when the function would be
> inlined in the future.
It is the same problem, who is to say it doesn't refetch while doing
the maths?
Jason
On Tue, May 24, 2022 at 12:48:31PM -0300, Jason Gunthorpe wrote:
> On Tue, May 24, 2022 at 08:43:27AM -0700, Minchan Kim wrote:
> > On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote:
> > > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
> > > > > On 5/23/22 09:33, Minchan Kim wrote:
> > > > > ...
> > > > > > > So then:
> > > > > > >
> > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > index 0e42038382c1..b404f87e2682 100644
> > > > > > > +++ b/mm/page_alloc.c
> > > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> > > > > > > word_bitidx = bitidx / BITS_PER_LONG;
> > > > > > > bitidx &= (BITS_PER_LONG-1);
> > > > > > >
> > > > > > > - word = bitmap[word_bitidx];
> > > > > > > + /*
> > > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure
> > > > > > set_pfnblock_flags_mask would be better?
> > > > > > > + * a consistent (non-tearing) read of the memory array, so that results,
> > > > > >
> > > > > > Thanks for proceeding and suggestion, John.
> > > > > >
> > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
> > > > >
> > > > > Did it? [1] fixed something, but I'm not sure we can claim that that
> > > > > code is now safe against tearing in all possible cases, especially given
> > > > > the recent discussion here. Specifically, having this code do a read,
> > > > > then follow that up with calculations, seems correct. Anything else is
> > > >
> > > > The load tearing you are trying to explain in the comment would be
> > > > solved by [1] since the bits will always align on a word and accessing
> > > > word size based on word aligned address is always atomic so there is
> > > > no load tearing problem IIUC.
> > >
> > > That is not technically true. It is exactly the sort of thing
> > > READ_ONCE is intended to guard against.
> >
> > Oh, does word access based on the aligned address still happen
> > load tearing?
> >
> > I just referred to
> > https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759
>
> I read that as saying load tearing is technically allowed but doesn't
> happen in gcc, and so must use the _ONCE macros.
This is in fact the intent, except...
And as that passage goes on to state, there really are compilers (such
as GCC) that tear stores of constants to machine aligned/sized locations.
In short, use of the _ONCE() macros can save you a lot of pain.
> > I didn't say it doesn't refetch the value without the READ_ONCE.
> >
> > What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching"
> > issue rather than "tearing" issue in specific __get_pfnblock_flags_mask
> > context because I though there is no load-tearing issue there since
> > bitmap is word-aligned/accessed. No?
>
> It does both. AFAIK our memory model has no guarentees on what naked C
> statements will do. Tearing, multi-load, etc - it is all technically
> permitted. Use the proper accessors.
I am with Jason on this one.
In fact, I believe that any naked C-language access to mutable shared
variables should have a comment stating why the compiler cannot mangle
that access.
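A small sketch of that convention (the variable and function names are made
up, and READ_ONCE() is a stand-in define so the fragment is self-contained):

#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))   /* stand-in */

static unsigned long shared_flags;      /* written concurrently elsewhere */
static int setup_complete;              /* written once, before readers exist */

/* Lockless reader of mutable shared state: use the accessor. */
unsigned long sample_shared_flags(void)
{
        return READ_ONCE(shared_flags);
}

/* A deliberately plain access carries its justification with it. */
int is_setup_complete(void)
{
        /*
         * Plain load is acceptable here only because every write to
         * setup_complete finishes before any reader can run; this comment
         * records why compiler mangling of the access cannot matter.
         */
        return setup_complete;
}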
Thanx, Paul
On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote:
> > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
> > > On 5/23/22 09:33, Minchan Kim wrote:
> > > ...
> > > > > So then:
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 0e42038382c1..b404f87e2682 100644
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> > > > > word_bitidx = bitidx / BITS_PER_LONG;
> > > > > bitidx &= (BITS_PER_LONG-1);
> > > > >
> > > > > - word = bitmap[word_bitidx];
> > > > > + /*
> > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure
> > > > set_pfnblock_flags_mask would be better?
> > > > > + * a consistent (non-tearing) read of the memory array, so that results,
> > > >
> > > > Thanks for proceeding and suggestion, John.
> > > >
> > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
> > >
> > > Did it? [1] fixed something, but I'm not sure we can claim that that
> > > code is now safe against tearing in all possible cases, especially given
> > > the recent discussion here. Specifically, having this code do a read,
> > > then follow that up with calculations, seems correct. Anything else is
> >
> > The load tearing you are trying to explain in the comment would be
> > solved by [1] since the bits will always align on a word and accessing
> > word size based on word aligned address is always atomic so there is
> > no load tearing problem IIUC.
>
> That is not technically true. It is exactly the sort of thing
> READ_ONCE is intended to guard against.
Oh, can a word-sized access of an aligned address still suffer
load tearing?
I just referred to
https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759
>
> > Instead of the tearing problem, what we are trying to solve with
> > READ_ONCE is to prevent refetching when the function would be
> > inlined in the future.
>
> It is the same problem, who is to say it doesn't refetch while doing
> the maths?
I didn't say it doesn't refetch the value without the READ_ONCE.
What I am saying is that READ_ONCE(bitmap[word_bitidx]) prevents the "refetching"
issue rather than the "tearing" issue in the specific __get_pfnblock_flags_mask
context, because I thought there is no load-tearing issue there since the
bitmap is word-aligned/accessed. No?
If load tearing can still happen on the bitmap, it would be
better to keep the last suggestion from John.
+ /*
+ * This races, without locks, with set_pfnblock_flags_mask(). Ensure
+ * a consistent read of the memory array, so that results, even though
+ * racy, are not corrupted.
+ */
On Tue, May 24, 2022 at 09:37:28AM -0700, Paul E. McKenney wrote:
> On Tue, May 24, 2022 at 12:48:31PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 24, 2022 at 08:43:27AM -0700, Minchan Kim wrote:
> > > On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote:
> > > > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
> > > > > > On 5/23/22 09:33, Minchan Kim wrote:
> > > > > > ...
> > > > > > > > So then:
> > > > > > > >
> > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > > index 0e42038382c1..b404f87e2682 100644
> > > > > > > > +++ b/mm/page_alloc.c
> > > > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> > > > > > > > word_bitidx = bitidx / BITS_PER_LONG;
> > > > > > > > bitidx &= (BITS_PER_LONG-1);
> > > > > > > >
> > > > > > > > - word = bitmap[word_bitidx];
> > > > > > > > + /*
> > > > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure
> > > > > > > set_pfnblock_flags_mask would be better?
> > > > > > > > + * a consistent (non-tearing) read of the memory array, so that results,
> > > > > > >
> > > > > > > Thanks for proceeding and suggestion, John.
> > > > > > >
> > > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
> > > > > >
> > > > > > Did it? [1] fixed something, but I'm not sure we can claim that that
> > > > > > code is now safe against tearing in all possible cases, especially given
> > > > > > the recent discussion here. Specifically, having this code do a read,
> > > > > > then follow that up with calculations, seems correct. Anything else is
> > > > >
> > > > > The load tearing you are trying to explain in the comment would be
> > > > > solved by [1] since the bits will always align on a word and accessing
> > > > > word size based on word aligned address is always atomic so there is
> > > > > no load tearing problem IIUC.
> > > >
> > > > That is not technically true. It is exactly the sort of thing
> > > > READ_ONCE is intended to guard against.
> > >
> > > Oh, does word access based on the aligned address still happen
> > > load tearing?
> > >
> > > I just referred to
> > > https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759
> >
> > I read that as saying load tearing is technically allowed but doesn't
> > happen in gcc, and so must use the _ONCE macros.
>
> This is in fact the intent, except...
>
> And as that passage goes on to state, there really are compilers (such
> as GCC) that tear stores of constants to machine aligned/sized locations.
>
> In short, use of the _ONCE() macros can save you a lot of pain.
Thanks for the correction, Jason and Paul
>
> > > I didn't say it doesn't refetch the value without the READ_ONCE.
> > >
> > > What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching"
> > > issue rather than "tearing" issue in specific __get_pfnblock_flags_mask
> > > context because I though there is no load-tearing issue there since
> > > bitmap is word-aligned/accessed. No?
> >
> > It does both. AFAIK our memory model has no guarentees on what naked C
> > statements will do. Tearing, multi-load, etc - it is all technically
> > permitted. Use the proper accessors.
Seems like there was some misunderstanding here.
I didn't mean not to use READ_ONCE for the bitmap, but wanted to have
a more concrete comment. Since you guys corrected me that "even a word-aligned
access could be wrong without READ_ONCE", I will keep the comment John
suggested.
>
> I am with Jason on this one.
>
> In fact, I believe that any naked C-language access to mutable shared
> variables should have a comment stating why the compiler cannot mangle
> that access.
Agreed.
Thanks!
On Tue, May 24, 2022 at 08:43:27AM -0700, Minchan Kim wrote:
> On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote:
> > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote:
> > > > On 5/23/22 09:33, Minchan Kim wrote:
> > > > ...
> > > > > > So then:
> > > > > >
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index 0e42038382c1..b404f87e2682 100644
> > > > > > +++ b/mm/page_alloc.c
> > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page,
> > > > > > word_bitidx = bitidx / BITS_PER_LONG;
> > > > > > bitidx &= (BITS_PER_LONG-1);
> > > > > >
> > > > > > - word = bitmap[word_bitidx];
> > > > > > + /*
> > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure
> > > > > set_pfnblock_flags_mask would be better?
> > > > > > + * a consistent (non-tearing) read of the memory array, so that results,
> > > > >
> > > > > Thanks for proceeding and suggestion, John.
> > > > >
> > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue.
> > > >
> > > > Did it? [1] fixed something, but I'm not sure we can claim that that
> > > > code is now safe against tearing in all possible cases, especially given
> > > > the recent discussion here. Specifically, having this code do a read,
> > > > then follow that up with calculations, seems correct. Anything else is
> > >
> > > The load tearing you are trying to explain in the comment would be
> > > solved by [1] since the bits will always align on a word and accessing
> > > word size based on word aligned address is always atomic so there is
> > > no load tearing problem IIUC.
> >
> > That is not technically true. It is exactly the sort of thing
> > READ_ONCE is intended to guard against.
>
> Oh, does word access based on the aligned address still happen
> load tearing?
>
> I just referred to
> https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759
I read that as saying load tearing is technically allowed but doesn't
happen in gcc, and so one must use the _ONCE macros.
> I didn't say it doesn't refetch the value without the READ_ONCE.
>
> What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching"
> issue rather than "tearing" issue in specific __get_pfnblock_flags_mask
> context because I though there is no load-tearing issue there since
> bitmap is word-aligned/accessed. No?
It does both. AFAIK our memory model has no guarantees on what naked C
statements will do. Tearing, multi-load, etc - it is all technically
permitted. Use the proper accessors.
Jason