Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
compound_pincount_ptr is stored at first tail page now. So we should call
prep_compound_head() after the first tail page is initialized to take
advantage of the likelihood of that tail struct page being cached given
that we will read them right after in prep_compound_head().
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Joao Martins <[email protected]>
---
v2:
Don't move prep_compound_head() outside loop per Joao.
---
mm/page_alloc.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c7d99ee58b4..048df5d78add 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6771,13 +6771,18 @@ static void __ref memmap_init_compound(struct page *head,
set_page_count(page, 0);
/*
- * The first tail page stores compound_mapcount_ptr() and
- * compound_order() and the second tail page stores
- * compound_pincount_ptr(). Call prep_compound_head() after
- * the first and second tail pages have been initialized to
- * not have the data overwritten.
+ * The first tail page stores compound_mapcount_ptr(),
+ * compound_order() and compound_pincount_ptr(). Call
+ * prep_compound_head() after the first tail page have
+ * been initialized to not have the data overwritten.
+ *
+ * Note the idea to make this right after we initialize
+ * the offending tail pages is trying to take advantage
+ * of the likelihood of those tail struct pages being
+ * cached given that we will read them right after in
+ * prep_compound_head().
*/
- if (pfn == head_pfn + 2)
+ if (unlikely(pfn == head_pfn + 1))
prep_compound_head(head, order);
}
}
--
2.23.0
On Sat, Jun 11, 2022 at 10:13:52AM +0800, Miaohe Lin wrote:
> Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
> compound_pincount_ptr is stored at first tail page now. So we should call
> prep_compound_head() after the first tail page is initialized to take
> advantage of the likelihood of that tail struct page being cached given
> that we will read them right after in prep_compound_head().
>
> Signed-off-by: Miaohe Lin <[email protected]>
> Cc: Joao Martins <[email protected]>
> ---
> v2:
> Don't move prep_compound_head() outside loop per Joao.
> ---
> mm/page_alloc.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4c7d99ee58b4..048df5d78add 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6771,13 +6771,18 @@ static void __ref memmap_init_compound(struct page *head,
> set_page_count(page, 0);
>
> /*
> - * The first tail page stores compound_mapcount_ptr() and
> - * compound_order() and the second tail page stores
> - * compound_pincount_ptr(). Call prep_compound_head() after
> - * the first and second tail pages have been initialized to
> - * not have the data overwritten.
> + * The first tail page stores compound_mapcount_ptr(),
> + * compound_order() and compound_pincount_ptr(). Call
> + * prep_compound_head() after the first tail page have
> + * been initialized to not have the data overwritten.
> + *
> + * Note the idea to make this right after we initialize
> + * the offending tail pages is trying to take advantage
> + * of the likelihood of those tail struct pages being
> + * cached given that we will read them right after in
> + * prep_compound_head().
> */
> - if (pfn == head_pfn + 2)
> + if (unlikely(pfn == head_pfn + 1))
> prep_compound_head(head, order);
For me it is weird not to put this out of the loop. I saw the reason
is because of the caching suggested by Joao. But I think this is not
a hot path and putting it out of the loop may be more intuitive at least
for me. Maybe this optimization is unnecessary (maybe I am wrong).
And it will be consistent with prep_compound_page() (at least it does
not do the similar optimization) if we drop this optimization.
Hi Joao,
I am wondering is it a significant optimization for zone device memory?
I found this code existed from the 1st version you introduced. So
I suspect maybe you have some numbers, would you like to share with us?
Thanks.
On 2022/6/12 23:44, Muchun Song wrote:
> On Sat, Jun 11, 2022 at 10:13:52AM +0800, Miaohe Lin wrote:
>> Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
>> compound_pincount_ptr is stored at first tail page now. So we should call
>> prep_compound_head() after the first tail page is initialized to take
>> advantage of the likelihood of that tail struct page being cached given
>> that we will read them right after in prep_compound_head().
>>
>> Signed-off-by: Miaohe Lin <[email protected]>
>> Cc: Joao Martins <[email protected]>
>> ---
>> v2:
>> Don't move prep_compound_head() outside loop per Joao.
>> ---
>> mm/page_alloc.c | 17 +++++++++++------
>> 1 file changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4c7d99ee58b4..048df5d78add 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6771,13 +6771,18 @@ static void __ref memmap_init_compound(struct page *head,
>> set_page_count(page, 0);
>>
>> /*
>> - * The first tail page stores compound_mapcount_ptr() and
>> - * compound_order() and the second tail page stores
>> - * compound_pincount_ptr(). Call prep_compound_head() after
>> - * the first and second tail pages have been initialized to
>> - * not have the data overwritten.
>> + * The first tail page stores compound_mapcount_ptr(),
>> + * compound_order() and compound_pincount_ptr(). Call
>> + * prep_compound_head() after the first tail page have
>> + * been initialized to not have the data overwritten.
>> + *
>> + * Note the idea to make this right after we initialize
>> + * the offending tail pages is trying to take advantage
>> + * of the likelihood of those tail struct pages being
>> + * cached given that we will read them right after in
>> + * prep_compound_head().
>> */
>> - if (pfn == head_pfn + 2)
>> + if (unlikely(pfn == head_pfn + 1))
>> prep_compound_head(head, order);
>
> For me it is weird not to put this out of the loop. I saw the reason
> is because of the caching suggested by Joao. But I think this is not
> a hot path and putting it out of the loop may be more intuitive at least
> for me. Maybe this optimization is unnecessary (maybe I am wrong).
> And it will be consistent with prep_compound_page() (at least it does
> not do the similar optimization) if we drop this optimization.
This is also what I thought in the first version. :)
>
> Hi Joao,
>
> I am wondering is it a significant optimization for zone device memory?
> I found this code existed from the 1st version you introduced. So
> I suspect maybe you have some numbers, would you like to share with us?
Those numbers would be really helpful.
>
> Thanks.
Thanks!
>
> .
>
[was out the past couple days, hence the late response]
On 6/12/22 16:44, Muchun Song wrote:
> On Sat, Jun 11, 2022 at 10:13:52AM +0800, Miaohe Lin wrote:
>> Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
>> compound_pincount_ptr is stored at first tail page now. So we should call
>> prep_compound_head() after the first tail page is initialized to take
>> advantage of the likelihood of that tail struct page being cached given
>> that we will read them right after in prep_compound_head().
>>
>> Signed-off-by: Miaohe Lin <[email protected]>
>> Cc: Joao Martins <[email protected]>
>> ---
>> v2:
>> Don't move prep_compound_head() outside loop per Joao.
>> ---
>> mm/page_alloc.c | 17 +++++++++++------
>> 1 file changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4c7d99ee58b4..048df5d78add 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6771,13 +6771,18 @@ static void __ref memmap_init_compound(struct page *head,
>> set_page_count(page, 0);
>>
>> /*
>> - * The first tail page stores compound_mapcount_ptr() and
>> - * compound_order() and the second tail page stores
>> - * compound_pincount_ptr(). Call prep_compound_head() after
>> - * the first and second tail pages have been initialized to
>> - * not have the data overwritten.
>> + * The first tail page stores compound_mapcount_ptr(),
>> + * compound_order() and compound_pincount_ptr(). Call
>> + * prep_compound_head() after the first tail page have
>> + * been initialized to not have the data overwritten.
>> + *
>> + * Note the idea to make this right after we initialize
>> + * the offending tail pages is trying to take advantage
>> + * of the likelihood of those tail struct pages being
>> + * cached given that we will read them right after in
>> + * prep_compound_head().
>> */
>> - if (pfn == head_pfn + 2)
>> + if (unlikely(pfn == head_pfn + 1))
>> prep_compound_head(head, order);
>
> For me it is weird not to put this out of the loop. I saw the reason
> is because of the caching suggested by Joao. But I think this is not
> a hot path and putting it out of the loop may be more intuitive at least
> for me. Maybe this optimization is unnecessary (maybe I am wrong).
So, depending on your setup, this might actually sit in the boot path. Yes, it is at
bringup/teardown of new memory, so it does not sit in a 'hot path' and struct pages are
cold. But it is part of a function that initialiazes a whole DIMM worth of struct pages.
And PMEM dimms can be denser than RAM ones IIRC. In my case we usually have 128G PMEM
DIMMs in our servers.
> And it will be consistent with prep_compound_page() (at least it does
> not do the similar optimization) if we drop this optimization.
>
> Hi Joao,
>
> I am wondering is it a significant optimization for zone device memory?
> I found this code existed from the 1st version you introduced.
Not quite. It did not existed in the RFC. As a matter of fact the very first
version was totally ignoring anything cache related (i.e. just calling
prep_compound_page() in the loop for all struct pages after all the struct pages were
inited) until Dan suggested I fix that part because people in the past have spent time
optimizing it.
> So
> I suspect maybe you have some numbers, would you like to share with us?
>
128G DIMMs /with struct pages placed in DRAM/ with 2M page size used to take around
250-400ms. Now once you placed the struct pages in PMEM those numbers go up to 4 secs all
the way up to 8secs (there's a lot of high variance). Now imagine 12 dimms and those
numbers can get in the ranges of 3 - 4.8secs for DRAM-struct-pages, and with
PMEM-struct-pages to more than 48secs.
Note that initializing as compound pages brought those numbers closer in the middle
of the interval given that we need to do more work other than just initializing the
raw struct page. With DRAM struct pages with the vmemmap deduplication trick (which is now
default used) these got decreased down to 80-110ms per DIMM. But I actually got started
with numbers in the order of ~180-190ms per pmem DIMM (ignore cache effects). I should
note that I haven't measured /exactly/ the benefit of prep_compound_head() early calling.
But the other numbers help gauging the cache effects in this path.
Earlier (in v1) I merely expressed a minor concern. /Maybe/ this matters or maybe the cost
of prep_compound_head() outweighs the cache-miss avoidance of the succeeding two tail page
cache-lines per 2M page. Well, now it's one tail page. Nonetheless, I would expect that
this is part of the testing the submitter performs, given that this is not one of those
'no functional change' patches as written in v1 commit message :( Should that be the case,
then let's go with v1 as that certainly brings consistency with prep_compound_page().
On 2022/6/14 18:33, Joao Martins wrote:
> [was out the past couple days, hence the late response]
>
> On 6/12/22 16:44, Muchun Song wrote:
>> On Sat, Jun 11, 2022 at 10:13:52AM +0800, Miaohe Lin wrote:
>>> Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
>>> compound_pincount_ptr is stored at first tail page now. So we should call
>>> prep_compound_head() after the first tail page is initialized to take
>>> advantage of the likelihood of that tail struct page being cached given
>>> that we will read them right after in prep_compound_head().
>>>
>>> Signed-off-by: Miaohe Lin <[email protected]>
>>> Cc: Joao Martins <[email protected]>
>>> ---
>>> v2:
>>> Don't move prep_compound_head() outside loop per Joao.
>>> ---
>>> mm/page_alloc.c | 17 +++++++++++------
>>> 1 file changed, 11 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 4c7d99ee58b4..048df5d78add 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6771,13 +6771,18 @@ static void __ref memmap_init_compound(struct page *head,
>>> set_page_count(page, 0);
>>>
>>> /*
>>> - * The first tail page stores compound_mapcount_ptr() and
>>> - * compound_order() and the second tail page stores
>>> - * compound_pincount_ptr(). Call prep_compound_head() after
>>> - * the first and second tail pages have been initialized to
>>> - * not have the data overwritten.
>>> + * The first tail page stores compound_mapcount_ptr(),
>>> + * compound_order() and compound_pincount_ptr(). Call
>>> + * prep_compound_head() after the first tail page have
>>> + * been initialized to not have the data overwritten.
>>> + *
>>> + * Note the idea to make this right after we initialize
>>> + * the offending tail pages is trying to take advantage
>>> + * of the likelihood of those tail struct pages being
>>> + * cached given that we will read them right after in
>>> + * prep_compound_head().
>>> */
>>> - if (pfn == head_pfn + 2)
>>> + if (unlikely(pfn == head_pfn + 1))
>>> prep_compound_head(head, order);
>>
>> For me it is weird not to put this out of the loop. I saw the reason
>> is because of the caching suggested by Joao. But I think this is not
>> a hot path and putting it out of the loop may be more intuitive at least
>> for me. Maybe this optimization is unnecessary (maybe I am wrong).
>
> So, depending on your setup, this might actually sit in the boot path. Yes, it is at
> bringup/teardown of new memory, so it does not sit in a 'hot path' and struct pages are
> cold. But it is part of a function that initialiazes a whole DIMM worth of struct pages.
> And PMEM dimms can be denser than RAM ones IIRC. In my case we usually have 128G PMEM
> DIMMs in our servers.
>
>> And it will be consistent with prep_compound_page() (at least it does
>> not do the similar optimization) if we drop this optimization.
>>
>> Hi Joao,
>>
>> I am wondering is it a significant optimization for zone device memory?
>> I found this code existed from the 1st version you introduced.
>
> Not quite. It did not existed in the RFC. As a matter of fact the very first
> version was totally ignoring anything cache related (i.e. just calling
> prep_compound_page() in the loop for all struct pages after all the struct pages were
> inited) until Dan suggested I fix that part because people in the past have spent time
> optimizing it.
>
>> So
>> I suspect maybe you have some numbers, would you like to share with us?
>>
>
> 128G DIMMs /with struct pages placed in DRAM/ with 2M page size used to take around
> 250-400ms. Now once you placed the struct pages in PMEM those numbers go up to 4 secs all
> the way up to 8secs (there's a lot of high variance). Now imagine 12 dimms and those
> numbers can get in the ranges of 3 - 4.8secs for DRAM-struct-pages, and with
> PMEM-struct-pages to more than 48secs.
>
> Note that initializing as compound pages brought those numbers closer in the middle
> of the interval given that we need to do more work other than just initializing the
> raw struct page. With DRAM struct pages with the vmemmap deduplication trick (which is now
> default used) these got decreased down to 80-110ms per DIMM. But I actually got started
> with numbers in the order of ~180-190ms per pmem DIMM (ignore cache effects). I should
> note that I haven't measured /exactly/ the benefit of prep_compound_head() early calling.
> But the other numbers help gauging the cache effects in this path.
>
> Earlier (in v1) I merely expressed a minor concern. /Maybe/ this matters or maybe the cost
Many thanks for your detailed explanation. In v1, I thought you do have the numbers that show
the cache-miss avoidance of the succeeding two tail page cache-lines per 2M page does matter.
That's my bad. Sorry.
> of prep_compound_head() outweighs the cache-miss avoidance of the succeeding two tail page
> cache-lines per 2M page. Well, now it's one tail page. Nonetheless, I would expect that
> this is part of the testing the submitter performs, given that this is not one of those
Am I supposed to provide the numbers that show how cache effects? The number I can provide now
will be based on the emulated pmem device due to lacking of real pmem device (because we're
under control, that's a pity :( ). That number might not be wanted because the struct pages will
always be placed in DRAM. Any suggestions? It's very kind of you if you can help provide
this number. :)
> 'no functional change' patches as written in v1 commit message :( Should that be the case,
> then let's go with v1 as that certainly brings consistency with prep_compound_page().
Many thanks!
> .
>