From: Barry Song <[email protected]>
madvise and some others might need folio_pte_batch to check if a range
of PTEs are completely mapped to a large folio with contiguous physcial
addresses. Let's export it for others to use.
Cc: Lance Yang <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
-v1:
at least two jobs madv_free and madv_pageout depend on it. To avoid
conflicts and dependencies, after discussing with Lance, we prefer
this one can land earlier.
mm/internal.h | 13 +++++++++++++
mm/memory.c | 11 +----------
2 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 13b59d384845..8e2bc304f671 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
}
+/* Flags for folio_pte_batch(). */
+typedef int __bitwise fpb_t;
+
+/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
+#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
+
+/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
+#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
+
+extern int folio_pte_batch(struct folio *folio, unsigned long addr,
+ pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+ bool *any_writable);
+
void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
int nr_throttled);
static inline void acct_reclaim_writeback(struct folio *folio)
diff --git a/mm/memory.c b/mm/memory.c
index 1c45b6a42a1b..319b3be05e75 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
}
-/* Flags for folio_pte_batch(). */
-typedef int __bitwise fpb_t;
-
-/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
-#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
-
-/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
-#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
-
static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
{
if (flags & FPB_IGNORE_DIRTY)
@@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
* If "any_writable" is set, it will indicate if any other PTE besides the
* first (given) PTE is writable.
*/
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+int folio_pte_batch(struct folio *folio, unsigned long addr,
pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
bool *any_writable)
{
--
2.34.1
On Tue, Feb 27, 2024 at 3:41 PM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> madvise and some others might need folio_pte_batch to check if a range
> of PTEs are completely mapped to a large folio with contiguous physcial
> addresses. Let's export it for others to use.
>
> Cc: Lance Yang <[email protected]>
> Cc: Ryan Roberts <[email protected]>
> Cc: David Hildenbrand <[email protected]>
Hi David, Ryan,
Sorry, I realize I just made a mistake and your tags should be both
Suggested-by. Please feel
free to review the patch and give comments. I will fix the tags
together with addressing your
review comments in v2.
> Cc: Yin Fengwei <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> -v1:
> at least two jobs madv_free and madv_pageout depend on it. To avoid
> conflicts and dependencies, after discussing with Lance, we prefer
> this one can land earlier.
>
> mm/internal.h | 13 +++++++++++++
> mm/memory.c | 11 +----------
> 2 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 13b59d384845..8e2bc304f671 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> }
>
> +/* Flags for folio_pte_batch(). */
> +typedef int __bitwise fpb_t;
> +
> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> +
> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> +
> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> + bool *any_writable);
> +
> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> int nr_throttled);
> static inline void acct_reclaim_writeback(struct folio *folio)
> diff --git a/mm/memory.c b/mm/memory.c
> index 1c45b6a42a1b..319b3be05e75 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> }
>
> -/* Flags for folio_pte_batch(). */
> -typedef int __bitwise fpb_t;
> -
> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> -
> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> -
> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> {
> if (flags & FPB_IGNORE_DIRTY)
> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> * If "any_writable" is set, it will indicate if any other PTE besides the
> * first (given) PTE is writable.
> */
> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> +int folio_pte_batch(struct folio *folio, unsigned long addr,
> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> bool *any_writable)
> {
> --
> 2.34.1
>
Thanks
Barry
On 27/02/2024 02:40, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> madvise and some others might need folio_pte_batch to check if a range
> of PTEs are completely mapped to a large folio with contiguous physcial
> addresses. Let's export it for others to use.
>
> Cc: Lance Yang <[email protected]>
> Cc: Ryan Roberts <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Yin Fengwei <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> -v1:
> at least two jobs madv_free and madv_pageout depend on it. To avoid
> conflicts and dependencies, after discussing with Lance, we prefer
> this one can land earlier.
I think this will also ultimately be useful for mprotect too, though I haven't
looked at it properly yet.
>
> mm/internal.h | 13 +++++++++++++
> mm/memory.c | 11 +----------
> 2 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 13b59d384845..8e2bc304f671 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> }
>
> +/* Flags for folio_pte_batch(). */
> +typedef int __bitwise fpb_t;
> +
> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> +
> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> +
> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> + bool *any_writable);
> +
> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> int nr_throttled);
> static inline void acct_reclaim_writeback(struct folio *folio)
> diff --git a/mm/memory.c b/mm/memory.c
> index 1c45b6a42a1b..319b3be05e75 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> }
>
> -/* Flags for folio_pte_batch(). */
> -typedef int __bitwise fpb_t;
> -
> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> -
> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> -
> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> {
> if (flags & FPB_IGNORE_DIRTY)
> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> * If "any_writable" is set, it will indicate if any other PTE besides the
> * first (given) PTE is writable.
> */
David was talking in Lance's patch thread, about improving the docs for this
function now that its exported. Might be worth syncing on that.
> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> +int folio_pte_batch(struct folio *folio, unsigned long addr,
fork() is very performance sensitive. Is there a risk we are regressing
performance by making this out-of-line? Although its in the same compilation
unit so the compiler may well inline it anyway?
Either way, perhaps we are better off making it inline in the header? That would
avoid needing to rerun David's micro-benchmarks for fork() and munmap().
Thanks,
Ryan
> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> bool *any_writable)
> {
On 27/02/2024 03:18, Barry Song wrote:
> On Tue, Feb 27, 2024 at 3:41 PM Barry Song <[email protected]> wrote:
>>
>> From: Barry Song <[email protected]>
>>
>> madvise and some others might need folio_pte_batch to check if a range
>> of PTEs are completely mapped to a large folio with contiguous physcial
>> addresses. Let's export it for others to use.
>>
>> Cc: Lance Yang <[email protected]>
>> Cc: Ryan Roberts <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>
> Hi David, Ryan,
>
> Sorry, I realize I just made a mistake and your tags should be both
> Suggested-by. Please feel
> free to review the patch and give comments. I will fix the tags
> together with addressing your
> review comments in v2.
Don't worry about it. Don't feel you need to update it on my account.
>
>> Cc: Yin Fengwei <[email protected]>
>> Signed-off-by: Barry Song <[email protected]>
>> ---
>> -v1:
>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>> conflicts and dependencies, after discussing with Lance, we prefer
>> this one can land earlier.
>>
>> mm/internal.h | 13 +++++++++++++
>> mm/memory.c | 11 +----------
>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 13b59d384845..8e2bc304f671 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>> }
>>
>> +/* Flags for folio_pte_batch(). */
>> +typedef int __bitwise fpb_t;
>> +
>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>> +
>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>> +
>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> + bool *any_writable);
>> +
>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> int nr_throttled);
>> static inline void acct_reclaim_writeback(struct folio *folio)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1c45b6a42a1b..319b3be05e75 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>> }
>>
>> -/* Flags for folio_pte_batch(). */
>> -typedef int __bitwise fpb_t;
>> -
>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>> -
>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>> -
>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> {
>> if (flags & FPB_IGNORE_DIRTY)
>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> * If "any_writable" is set, it will indicate if any other PTE besides the
>> * first (given) PTE is writable.
>> */
>> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> +int folio_pte_batch(struct folio *folio, unsigned long addr,
>> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> bool *any_writable)
>> {
>> --
>> 2.34.1
>>
>
> Thanks
> Barry
On 27.02.24 10:07, Ryan Roberts wrote:
> On 27/02/2024 02:40, Barry Song wrote:
>> From: Barry Song <[email protected]>
>>
>> madvise and some others might need folio_pte_batch to check if a range
>> of PTEs are completely mapped to a large folio with contiguous physcial
>> addresses. Let's export it for others to use.
>>
>> Cc: Lance Yang <[email protected]>
>> Cc: Ryan Roberts <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Yin Fengwei <[email protected]>
>> Signed-off-by: Barry Song <[email protected]>
>> ---
>> -v1:
>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>> conflicts and dependencies, after discussing with Lance, we prefer
>> this one can land earlier.
>
> I think this will also ultimately be useful for mprotect too, though I haven't
> looked at it properly yet.
>
Yes, I think we briefly discussed that.
>>
>> mm/internal.h | 13 +++++++++++++
>> mm/memory.c | 11 +----------
>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 13b59d384845..8e2bc304f671 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>> }
>>
>> +/* Flags for folio_pte_batch(). */
>> +typedef int __bitwise fpb_t;
>> +
>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>> +
>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>> +
>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> + bool *any_writable);
>> +
>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> int nr_throttled);
>> static inline void acct_reclaim_writeback(struct folio *folio)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1c45b6a42a1b..319b3be05e75 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>> }
>>
>> -/* Flags for folio_pte_batch(). */
>> -typedef int __bitwise fpb_t;
>> -
>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>> -
>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>> -
>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> {
>> if (flags & FPB_IGNORE_DIRTY)
>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> * If "any_writable" is set, it will indicate if any other PTE besides the
>> * first (given) PTE is writable.
>> */
>
> David was talking in Lance's patch thread, about improving the docs for this
> function now that its exported. Might be worth syncing on that.
Here is my take:
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index d0b855a1837a8..098356b8805ae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
return pte_wrprotect(pte_mkold(pte));
}
-/*
+/**
+ * folio_pte_batch - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @addr: The user virtual address the first page is mapped at.
+ * @start_ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @max_nr: The maximum number of table entries to consider.
+ * @flags: Flags to modify the PTE batch semantics.
+ * @any_writable: Optional pointer to indicate whether any entry except the
+ * first one is writable.
+ *
* Detect a PTE batch: consecutive (present) PTEs that map consecutive
- * pages of the same folio.
+ * pages of the same large folio.
*
* All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
* the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
* soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
*
- * If "any_writable" is set, it will indicate if any other PTE besides the
- * first (given) PTE is writable.
+ * start_ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
*/
static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
@@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
*any_writable = false;
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+ VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
+ VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
--
2.43.2
>
>> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> +int folio_pte_batch(struct folio *folio, unsigned long addr,
>
> fork() is very performance sensitive. Is there a risk we are regressing
> performance by making this out-of-line? Although its in the same compilation
> unit so the compiler may well inline it anyway?
Easy to verify by looking at the generated asm I guess?
>
> Either way, perhaps we are better off making it inline in the header? That would
> avoid needing to rerun David's micro-benchmarks for fork() and munmap().
That way, the compiler can most certainly better optimize it also outside of mm/memory.c
--
Cheers,
David / dhildenb
On 27/02/2024 09:14, David Hildenbrand wrote:
> On 27.02.24 10:07, Ryan Roberts wrote:
>> On 27/02/2024 02:40, Barry Song wrote:
>>> From: Barry Song <[email protected]>
>>>
>>> madvise and some others might need folio_pte_batch to check if a range
>>> of PTEs are completely mapped to a large folio with contiguous physcial
>>> addresses. Let's export it for others to use.
>>>
>>> Cc: Lance Yang <[email protected]>
>>> Cc: Ryan Roberts <[email protected]>
>>> Cc: David Hildenbrand <[email protected]>
>>> Cc: Yin Fengwei <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> -v1:
>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>>> conflicts and dependencies, after discussing with Lance, we prefer
>>> this one can land earlier.
>>
>> I think this will also ultimately be useful for mprotect too, though I haven't
>> looked at it properly yet.
>>
>
> Yes, I think we briefly discussed that.
>
>>>
>>> mm/internal.h | 13 +++++++++++++
>>> mm/memory.c | 11 +----------
>>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 13b59d384845..8e2bc304f671 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>>> }
>>> +/* Flags for folio_pte_batch(). */
>>> +typedef int __bitwise fpb_t;
>>> +
>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>> +
>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>> +
>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>>> + bool *any_writable);
>>> +
>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>> int nr_throttled);
>>> static inline void acct_reclaim_writeback(struct folio *folio)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1c45b6a42a1b..319b3be05e75 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct
>>> vm_area_struct *dst_vma,
>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>> }
>>> -/* Flags for folio_pte_batch(). */
>>> -typedef int __bitwise fpb_t;
>>> -
>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>> -
>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>> -
>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>> {
>>> if (flags & FPB_IGNORE_DIRTY)
>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte,
>>> fpb_t flags)
>>> * If "any_writable" is set, it will indicate if any other PTE besides the
>>> * first (given) PTE is writable.
>>> */
>>
>> David was talking in Lance's patch thread, about improving the docs for this
>> function now that its exported. Might be worth syncing on that.
>
> Here is my take:
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d0b855a1837a8..098356b8805ae 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte,
> fpb_t flags)
> return pte_wrprotect(pte_mkold(pte));
> }
>
> -/*
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @addr: The user virtual address the first page is mapped at.
> + * @start_ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @max_nr: The maximum number of table entries to consider.
> + * @flags: Flags to modify the PTE batch semantics.
> + * @any_writable: Optional pointer to indicate whether any entry except the
> + * first one is writable.
> + *
> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same folio.
> + * pages of the same large folio.
> *
> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> *
> - * If "any_writable" is set, it will indicate if any other PTE besides the
> - * first (given) PTE is writable.
> + * start_ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single page table.
> + *
> + * Return: the number of table entries in the batch.
> */
LGTM!
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio,
> unsigned long addr,
> *any_writable = false;
>
> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
On Tue, Feb 27, 2024 at 10:14 PM David Hildenbrand <[email protected]> wrote:
>
> On 27.02.24 10:07, Ryan Roberts wrote:
> > On 27/02/2024 02:40, Barry Song wrote:
> >> From: Barry Song <[email protected]>
> >>
> >> madvise and some others might need folio_pte_batch to check if a range
> >> of PTEs are completely mapped to a large folio with contiguous physcial
> >> addresses. Let's export it for others to use.
> >>
> >> Cc: Lance Yang <[email protected]>
> >> Cc: Ryan Roberts <[email protected]>
> >> Cc: David Hildenbrand <[email protected]>
> >> Cc: Yin Fengwei <[email protected]>
> >> Signed-off-by: Barry Song <[email protected]>
> >> ---
> >> -v1:
> >> at least two jobs madv_free and madv_pageout depend on it. To avoid
> >> conflicts and dependencies, after discussing with Lance, we prefer
> >> this one can land earlier.
> >
> > I think this will also ultimately be useful for mprotect too, though I haven't
> > looked at it properly yet.
> >
>
> Yes, I think we briefly discussed that.
>
> >>
> >> mm/internal.h | 13 +++++++++++++
> >> mm/memory.c | 11 +----------
> >> 2 files changed, 14 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 13b59d384845..8e2bc304f671 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >> }
> >>
> >> +/* Flags for folio_pte_batch(). */
> >> +typedef int __bitwise fpb_t;
> >> +
> >> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >> +
> >> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >> +
> >> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >> + bool *any_writable);
> >> +
> >> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >> int nr_throttled);
> >> static inline void acct_reclaim_writeback(struct folio *folio)
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 1c45b6a42a1b..319b3be05e75 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> >> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> >> }
> >>
> >> -/* Flags for folio_pte_batch(). */
> >> -typedef int __bitwise fpb_t;
> >> -
> >> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >> -
> >> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >> -
> >> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> {
> >> if (flags & FPB_IGNORE_DIRTY)
> >> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> * If "any_writable" is set, it will indicate if any other PTE besides the
> >> * first (given) PTE is writable.
> >> */
> >
> > David was talking in Lance's patch thread, about improving the docs for this
> > function now that its exported. Might be worth syncing on that.
>
> Here is my take:
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d0b855a1837a8..098356b8805ae 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> return pte_wrprotect(pte_mkold(pte));
> }
>
> -/*
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @addr: The user virtual address the first page is mapped at.
> + * @start_ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @max_nr: The maximum number of table entries to consider.
> + * @flags: Flags to modify the PTE batch semantics.
> + * @any_writable: Optional pointer to indicate whether any entry except the
> + * first one is writable.
> + *
> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same folio.
> + * pages of the same large folio.
> *
> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> *
> - * If "any_writable" is set, it will indicate if any other PTE besides the
> - * first (given) PTE is writable.
> + * start_ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single page table.
> + *
> + * Return: the number of table entries in the batch.
> */
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> *any_writable = false;
>
> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> --
> 2.43.2
>
>
> >
> >> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> +int folio_pte_batch(struct folio *folio, unsigned long addr,
> >
> > fork() is very performance sensitive. Is there a risk we are regressing
> > performance by making this out-of-line? Although its in the same compilation
> > unit so the compiler may well inline it anyway?
>
> Easy to verify by looking at the generated asm I guess?
my aarch64-linux-gnu-gcc didn't inline it
$ aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
$ nm -S -s vmlinux.a | grep folio_pte_batch
0000000000003818 0000000000000204 T folio_pte_batch
>
> >
> > Either way, perhaps we are better off making it inline in the header? That would
> > avoid needing to rerun David's micro-benchmarks for fork() and munmap().
actually tried this before trying extern, the problem is that we have to add
others into internal.h, for example __pte_batch_clear_ignored, which
seems not API. are we comfortable to move that one to internal.h too?
>
> That way, the compiler can most certainly better optimize it also outside of mm/memory.c
>
> --
> Cheers,
>
> David / dhildenb
Thanks
Barry
On 27.02.24 10:27, Barry Song wrote:
> On Tue, Feb 27, 2024 at 10:14 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 27.02.24 10:07, Ryan Roberts wrote:
>>> On 27/02/2024 02:40, Barry Song wrote:
>>>> From: Barry Song <[email protected]>
>>>>
>>>> madvise and some others might need folio_pte_batch to check if a range
>>>> of PTEs are completely mapped to a large folio with contiguous physcial
>>>> addresses. Let's export it for others to use.
>>>>
>>>> Cc: Lance Yang <[email protected]>
>>>> Cc: Ryan Roberts <[email protected]>
>>>> Cc: David Hildenbrand <[email protected]>
>>>> Cc: Yin Fengwei <[email protected]>
>>>> Signed-off-by: Barry Song <[email protected]>
>>>> ---
>>>> -v1:
>>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>>>> conflicts and dependencies, after discussing with Lance, we prefer
>>>> this one can land earlier.
>>>
>>> I think this will also ultimately be useful for mprotect too, though I haven't
>>> looked at it properly yet.
>>>
>>
>> Yes, I think we briefly discussed that.
>>
>>>>
>>>> mm/internal.h | 13 +++++++++++++
>>>> mm/memory.c | 11 +----------
>>>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 13b59d384845..8e2bc304f671 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>>>> }
>>>>
>>>> +/* Flags for folio_pte_batch(). */
>>>> +typedef int __bitwise fpb_t;
>>>> +
>>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>> +
>>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>> +
>>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>>>> + bool *any_writable);
>>>> +
>>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> int nr_throttled);
>>>> static inline void acct_reclaim_writeback(struct folio *folio)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 1c45b6a42a1b..319b3be05e75 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>>> }
>>>>
>>>> -/* Flags for folio_pte_batch(). */
>>>> -typedef int __bitwise fpb_t;
>>>> -
>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>> -
>>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>> -
>>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>> {
>>>> if (flags & FPB_IGNORE_DIRTY)
>>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>> * If "any_writable" is set, it will indicate if any other PTE besides the
>>>> * first (given) PTE is writable.
>>>> */
>>>
>>> David was talking in Lance's patch thread, about improving the docs for this
>>> function now that its exported. Might be worth syncing on that.
>>
>> Here is my take:
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/memory.c | 22 ++++++++++++++++++----
>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d0b855a1837a8..098356b8805ae 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> return pte_wrprotect(pte_mkold(pte));
>> }
>>
>> -/*
>> +/**
>> + * folio_pte_batch - detect a PTE batch for a large folio
>> + * @folio: The large folio to detect a PTE batch for.
>> + * @addr: The user virtual address the first page is mapped at.
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @pte: Page table entry for the first page.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @flags: Flags to modify the PTE batch semantics.
>> + * @any_writable: Optional pointer to indicate whether any entry except the
>> + * first one is writable.
>> + *
>> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>> - * pages of the same folio.
>> + * pages of the same large folio.
>> *
>> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
>> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
>> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
>> *
>> - * If "any_writable" is set, it will indicate if any other PTE besides the
>> - * first (given) PTE is writable.
>> + * start_ptep must map any page of the folio. max_nr must be at least one and
>> + * must be limited by the caller so scanning cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> */
>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> *any_writable = false;
>>
>> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>>
>> nr = pte_batch_hint(start_ptep, pte);
>> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
>> --
>> 2.43.2
>>
>>
>>>
>>>> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> +int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>
>>> fork() is very performance sensitive. Is there a risk we are regressing
>>> performance by making this out-of-line? Although its in the same compilation
>>> unit so the compiler may well inline it anyway?
>>
>> Easy to verify by looking at the generated asm I guess?
>
> my aarch64-linux-gnu-gcc didn't inline it
I think on x86-64 it would inline it with "gcc (GCC) 13.2.1 20231205
(Red Hat 13.2.1-6)"
>
> $ aarch64-linux-gnu-gcc --version
> aarch64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
> Copyright (C) 2021 Free Software Foundation, Inc.
>
> $ nm -S -s vmlinux.a | grep folio_pte_batch
> 0000000000003818 0000000000000204 T folio_pte_batch
>
As it's only used on the folio_test_large() "slower" paths, likely
optimizing out the "writable" check (and possibly the flags) might not
be that important.
>>
>>>
>>> Either way, perhaps we are better off making it inline in the header? That would
>>> avoid needing to rerun David's micro-benchmarks for fork() and munmap().
>
> actually tried this before trying extern, the problem is that we have to add
> others into internal.h, for example __pte_batch_clear_ignored, which
> seems not API. are we comfortable to move that one to internal.h too?
Yes, that shouldn't stop us.
--
Cheers,
David / dhildenb
On 27.02.24 03:40, Barry Song wrote:
> From: Barry Song <[email protected]>
>
Nit: "mm: make folio_pte_batch available outside of mm/memory.c"
> madvise and some others might need folio_pte_batch to check if a range
> of PTEs are completely mapped to a large folio with contiguous physcial
s/physcial/physical/
> addresses. Let's export it for others to use.
>
Nit: "Let's make it available in mm/internal.h"
(talking about exporting and modules sounds like EXPORT_SYMBOL)
> Cc: Lance Yang <[email protected]>
> Cc: Ryan Roberts <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Yin Fengwei <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
--
Cheers,
David / dhildenb
On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
>
> On 27.02.24 10:07, Ryan Roberts wrote:
> > On 27/02/2024 02:40, Barry Song wrote:
> >> From: Barry Song <[email protected]>
> >>
> >> madvise and some others might need folio_pte_batch to check if a range
> >> of PTEs are completely mapped to a large folio with contiguous physcial
> >> addresses. Let's export it for others to use.
> >>
> >> Cc: Lance Yang <[email protected]>
> >> Cc: Ryan Roberts <[email protected]>
> >> Cc: David Hildenbrand <[email protected]>
> >> Cc: Yin Fengwei <[email protected]>
> >> Signed-off-by: Barry Song <[email protected]>
> >> ---
> >> -v1:
> >> at least two jobs madv_free and madv_pageout depend on it. To avoid
> >> conflicts and dependencies, after discussing with Lance, we prefer
> >> this one can land earlier.
> >
> > I think this will also ultimately be useful for mprotect too, though I haven't
> > looked at it properly yet.
> >
>
> Yes, I think we briefly discussed that.
>
> >>
> >> mm/internal.h | 13 +++++++++++++
> >> mm/memory.c | 11 +----------
> >> 2 files changed, 14 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 13b59d384845..8e2bc304f671 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >> }
> >>
> >> +/* Flags for folio_pte_batch(). */
> >> +typedef int __bitwise fpb_t;
> >> +
> >> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >> +
> >> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >> +
> >> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >> + bool *any_writable);
> >> +
> >> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >> int nr_throttled);
> >> static inline void acct_reclaim_writeback(struct folio *folio)
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 1c45b6a42a1b..319b3be05e75 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> >> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> >> }
> >>
> >> -/* Flags for folio_pte_batch(). */
> >> -typedef int __bitwise fpb_t;
> >> -
> >> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >> -
> >> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >> -
> >> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> {
> >> if (flags & FPB_IGNORE_DIRTY)
> >> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> * If "any_writable" is set, it will indicate if any other PTE besides the
> >> * first (given) PTE is writable.
> >> */
> >
> > David was talking in Lance's patch thread, about improving the docs for this
> > function now that its exported. Might be worth syncing on that.
>
> Here is my take:
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d0b855a1837a8..098356b8805ae 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> return pte_wrprotect(pte_mkold(pte));
> }
>
> -/*
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @addr: The user virtual address the first page is mapped at.
> + * @start_ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @max_nr: The maximum number of table entries to consider.
> + * @flags: Flags to modify the PTE batch semantics.
> + * @any_writable: Optional pointer to indicate whether any entry except the
> + * first one is writable.
> + *
> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same folio.
> + * pages of the same large folio.
> *
> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> *
> - * If "any_writable" is set, it will indicate if any other PTE besides the
> - * first (given) PTE is writable.
> + * start_ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single page table.
> + *
> + * Return: the number of table entries in the batch.
> */
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> *any_writable = false;
>
> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
Nit:
IIUC, the pte that maps to the first page.
- VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) !=
folio, folio);
+ VM_WARN_ON_FOLIO(pte_pfn(pte) != folio_pfn(folio), folio);
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> --
> 2.43.2
>
>
> >
> >> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> +int folio_pte_batch(struct folio *folio, unsigned long addr,
> >
> > fork() is very performance sensitive. Is there a risk we are regressing
> > performance by making this out-of-line? Although its in the same compilation
> > unit so the compiler may well inline it anyway?
>
> Easy to verify by looking at the generated asm I guess?
>
> >
> > Either way, perhaps we are better off making it inline in the header? That would
> > avoid needing to rerun David's micro-benchmarks for fork() and munmap().
>
> That way, the compiler can most certainly better optimize it also outside of mm/memory.c
>
> --
> Cheers,
>
> David / dhildenb
>
On 27.02.24 10:51, Lance Yang wrote:
> On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 27.02.24 10:07, Ryan Roberts wrote:
>>> On 27/02/2024 02:40, Barry Song wrote:
>>>> From: Barry Song <[email protected]>
>>>>
>>>> madvise and some others might need folio_pte_batch to check if a range
>>>> of PTEs are completely mapped to a large folio with contiguous physcial
>>>> addresses. Let's export it for others to use.
>>>>
>>>> Cc: Lance Yang <[email protected]>
>>>> Cc: Ryan Roberts <[email protected]>
>>>> Cc: David Hildenbrand <[email protected]>
>>>> Cc: Yin Fengwei <[email protected]>
>>>> Signed-off-by: Barry Song <[email protected]>
>>>> ---
>>>> -v1:
>>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>>>> conflicts and dependencies, after discussing with Lance, we prefer
>>>> this one can land earlier.
>>>
>>> I think this will also ultimately be useful for mprotect too, though I haven't
>>> looked at it properly yet.
>>>
>>
>> Yes, I think we briefly discussed that.
>>
>>>>
>>>> mm/internal.h | 13 +++++++++++++
>>>> mm/memory.c | 11 +----------
>>>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 13b59d384845..8e2bc304f671 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>>>> }
>>>>
>>>> +/* Flags for folio_pte_batch(). */
>>>> +typedef int __bitwise fpb_t;
>>>> +
>>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>> +
>>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>> +
>>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>>>> + bool *any_writable);
>>>> +
>>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> int nr_throttled);
>>>> static inline void acct_reclaim_writeback(struct folio *folio)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 1c45b6a42a1b..319b3be05e75 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>>> }
>>>>
>>>> -/* Flags for folio_pte_batch(). */
>>>> -typedef int __bitwise fpb_t;
>>>> -
>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>> -
>>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>> -
>>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>> {
>>>> if (flags & FPB_IGNORE_DIRTY)
>>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>> * If "any_writable" is set, it will indicate if any other PTE besides the
>>>> * first (given) PTE is writable.
>>>> */
>>>
>>> David was talking in Lance's patch thread, about improving the docs for this
>>> function now that its exported. Might be worth syncing on that.
>>
>> Here is my take:
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/memory.c | 22 ++++++++++++++++++----
>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d0b855a1837a8..098356b8805ae 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> return pte_wrprotect(pte_mkold(pte));
>> }
>>
>> -/*
>> +/**
>> + * folio_pte_batch - detect a PTE batch for a large folio
>> + * @folio: The large folio to detect a PTE batch for.
>> + * @addr: The user virtual address the first page is mapped at.
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @pte: Page table entry for the first page.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @flags: Flags to modify the PTE batch semantics.
>> + * @any_writable: Optional pointer to indicate whether any entry except the
>> + * first one is writable.
>> + *
>> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>> - * pages of the same folio.
>> + * pages of the same large folio.
>> *
>> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
>> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
>> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
>> *
>> - * If "any_writable" is set, it will indicate if any other PTE besides the
>> - * first (given) PTE is writable.
>> + * start_ptep must map any page of the folio. max_nr must be at least one and
>> + * must be limited by the caller so scanning cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> */
>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> *any_writable = false;
>>
>> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>
> Nit:
> IIUC, the pte that maps to the first page.
> - VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) !=
> folio, folio);
> + VM_WARN_ON_FOLIO(pte_pfn(pte) != folio_pfn(folio), folio);
That would only work if the PTE would map the very first subpage of the
folio, not any subpage?
--
Cheers,
David / dhildenb
On Tue, Feb 27, 2024 at 5:53 PM David Hildenbrand <[email protected]> wrote:
>
> On 27.02.24 10:51, Lance Yang wrote:
> > On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <david@redhatcom> wrote:
> >>
> >> On 27.02.24 10:07, Ryan Roberts wrote:
> >>> On 27/02/2024 02:40, Barry Song wrote:
> >>>> From: Barry Song <[email protected]>
> >>>>
> >>>> madvise and some others might need folio_pte_batch to check if a range
> >>>> of PTEs are completely mapped to a large folio with contiguous physcial
> >>>> addresses. Let's export it for others to use.
> >>>>
> >>>> Cc: Lance Yang <[email protected]>
> >>>> Cc: Ryan Roberts <[email protected]>
> >>>> Cc: David Hildenbrand <[email protected]>
> >>>> Cc: Yin Fengwei <[email protected]>
> >>>> Signed-off-by: Barry Song <[email protected]>
> >>>> ---
> >>>> -v1:
> >>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
> >>>> conflicts and dependencies, after discussing with Lance, we prefer
> >>>> this one can land earlier.
> >>>
> >>> I think this will also ultimately be useful for mprotect too, though I haven't
> >>> looked at it properly yet.
> >>>
> >>
> >> Yes, I think we briefly discussed that.
> >>
> >>>>
> >>>> mm/internal.h | 13 +++++++++++++
> >>>> mm/memory.c | 11 +----------
> >>>> 2 files changed, 14 insertions(+), 10 deletions(-)
> >>>>
> >>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>> index 13b59d384845..8e2bc304f671 100644
> >>>> --- a/mm/internal.h
> >>>> +++ b/mm/internal.h
> >>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >>>> }
> >>>>
> >>>> +/* Flags for folio_pte_batch(). */
> >>>> +typedef int __bitwise fpb_t;
> >>>> +
> >>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >>>> +
> >>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >>>> +
> >>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >>>> + bool *any_writable);
> >>>> +
> >>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >>>> int nr_throttled);
> >>>> static inline void acct_reclaim_writeback(struct folio *folio)
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 1c45b6a42a1b..319b3be05e75 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> >>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> >>>> }
> >>>>
> >>>> -/* Flags for folio_pte_batch(). */
> >>>> -typedef int __bitwise fpb_t;
> >>>> -
> >>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >>>> -
> >>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >>>> -
> >>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>>> {
> >>>> if (flags & FPB_IGNORE_DIRTY)
> >>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>>> * If "any_writable" is set, it will indicate if any other PTE besides the
> >>>> * first (given) PTE is writable.
> >>>> */
> >>>
> >>> David was talking in Lance's patch thread, about improving the docs for this
> >>> function now that its exported. Might be worth syncing on that.
> >>
> >> Here is my take:
> >>
> >> Signed-off-by: David Hildenbrand <[email protected]>
> >> ---
> >> mm/memory.c | 22 ++++++++++++++++++----
> >> 1 file changed, 18 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index d0b855a1837a8..098356b8805ae 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> return pte_wrprotect(pte_mkold(pte));
> >> }
> >>
> >> -/*
> >> +/**
> >> + * folio_pte_batch - detect a PTE batch for a large folio
> >> + * @folio: The large folio to detect a PTE batch for.
> >> + * @addr: The user virtual address the first page is mapped at.
> >> + * @start_ptep: Page table pointer for the first entry.
> >> + * @pte: Page table entry for the first page.
> >> + * @max_nr: The maximum number of table entries to consider.
> >> + * @flags: Flags to modify the PTE batch semantics.
> >> + * @any_writable: Optional pointer to indicate whether any entry except the
> >> + * first one is writable.
> >> + *
> >> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> >> - * pages of the same folio.
> >> + * pages of the same large folio.
> >> *
> >> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> >> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> >> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> >> *
> >> - * If "any_writable" is set, it will indicate if any other PTE besides the
> >> - * first (given) PTE is writable.
> >> + * start_ptep must map any page of the folio. max_nr must be at least one and
> >> + * must be limited by the caller so scanning cannot exceed a single page table.
> >> + *
> >> + * Return: the number of table entries in the batch.
> >> */
> >> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> *any_writable = false;
> >>
> >> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> >> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> >> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
> >
> > Nit:
> > IIUC, the pte that maps to the first page.
> > - VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) !=
> > folio, folio);
> > + VM_WARN_ON_FOLIO(pte_pfn(pte) != folio_pfn(folio), folio);
>
> That would only work if the PTE would map the very first subpage of the
> folio, not any subpage?
You're right. I got it.
>
> --
> Cheers,
>
> David / dhildenb
>
On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
>
> On 27.02.24 10:07, Ryan Roberts wrote:
> > On 27/02/2024 02:40, Barry Song wrote:
> >> From: Barry Song <[email protected]>
> >>
> >> madvise and some others might need folio_pte_batch to check if a range
> >> of PTEs are completely mapped to a large folio with contiguous physcial
> >> addresses. Let's export it for others to use.
> >>
> >> Cc: Lance Yang <[email protected]>
> >> Cc: Ryan Roberts <[email protected]>
> >> Cc: David Hildenbrand <[email protected]>
> >> Cc: Yin Fengwei <[email protected]>
> >> Signed-off-by: Barry Song <[email protected]>
> >> ---
> >> -v1:
> >> at least two jobs madv_free and madv_pageout depend on it. To avoid
> >> conflicts and dependencies, after discussing with Lance, we prefer
> >> this one can land earlier.
> >
> > I think this will also ultimately be useful for mprotect too, though I haven't
> > looked at it properly yet.
> >
>
> Yes, I think we briefly discussed that.
>
> >>
> >> mm/internal.h | 13 +++++++++++++
> >> mm/memory.c | 11 +----------
> >> 2 files changed, 14 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 13b59d384845..8e2bc304f671 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >> }
> >>
> >> +/* Flags for folio_pte_batch(). */
> >> +typedef int __bitwise fpb_t;
> >> +
> >> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >> +
> >> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >> +
> >> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >> + bool *any_writable);
> >> +
> >> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >> int nr_throttled);
> >> static inline void acct_reclaim_writeback(struct folio *folio)
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 1c45b6a42a1b..319b3be05e75 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> >> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> >> }
> >>
> >> -/* Flags for folio_pte_batch(). */
> >> -typedef int __bitwise fpb_t;
> >> -
> >> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >> -
> >> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >> -
> >> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> {
> >> if (flags & FPB_IGNORE_DIRTY)
> >> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >> * If "any_writable" is set, it will indicate if any other PTE besides the
> >> * first (given) PTE is writable.
> >> */
> >
> > David was talking in Lance's patch thread, about improving the docs for this
> > function now that its exported. Might be worth syncing on that.
>
> Here is my take:
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d0b855a1837a8..098356b8805ae 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> return pte_wrprotect(pte_mkold(pte));
> }
>
> -/*
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @addr: The user virtual address the first page is mapped at.
> + * @start_ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
Nit:
- * @pte: Page table entry for the first page.
+ * @pte: Page table entry for the first page that must be the first subpage of
+ * the folio excluding arm64 for now.
IIUC, pte_batch_hint is always 1 excluding arm64 for now.
I'm not sure if this modification will be helpful?
Thanks,
Lance
> + * @max_nr: The maximum number of table entries to consider.
> + * @flags: Flags to modify the PTE batch semantics.
> + * @any_writable: Optional pointer to indicate whether any entry except the
> + * first one is writable.
> + *
> * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same folio.
> + * pages of the same large folio.
> *
> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> *
> - * If "any_writable" is set, it will indicate if any other PTE besides the
> - * first (given) PTE is writable.
> + * start_ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single page table.
> + *
> + * Return: the number of table entries in the batch.
> */
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> @@ -996,6 +1008,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> *any_writable = false;
>
> VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> + VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> + VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> --
> 2.43.2
>
>
> >
> >> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >> +int folio_pte_batch(struct folio *folio, unsigned long addr,
> >
> > fork() is very performance sensitive. Is there a risk we are regressing
> > performance by making this out-of-line? Although its in the same compilation
> > unit so the compiler may well inline it anyway?
>
> Easy to verify by looking at the generated asm I guess?
>
> >
> > Either way, perhaps we are better off making it inline in the header? That would
> > avoid needing to rerun David's micro-benchmarks for fork() and munmap().
>
> That way, the compiler can most certainly better optimize it also outside of mm/memory.c
>
> --
> Cheers,
>
> David / dhildenb
>
On 27.02.24 11:21, Lance Yang wrote:
> On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 27.02.24 10:07, Ryan Roberts wrote:
>>> On 27/02/2024 02:40, Barry Song wrote:
>>>> From: Barry Song <[email protected]>
>>>>
>>>> madvise and some others might need folio_pte_batch to check if a range
>>>> of PTEs are completely mapped to a large folio with contiguous physcial
>>>> addresses. Let's export it for others to use.
>>>>
>>>> Cc: Lance Yang <[email protected]>
>>>> Cc: Ryan Roberts <[email protected]>
>>>> Cc: David Hildenbrand <[email protected]>
>>>> Cc: Yin Fengwei <[email protected]>
>>>> Signed-off-by: Barry Song <[email protected]>
>>>> ---
>>>> -v1:
>>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>>>> conflicts and dependencies, after discussing with Lance, we prefer
>>>> this one can land earlier.
>>>
>>> I think this will also ultimately be useful for mprotect too, though I haven't
>>> looked at it properly yet.
>>>
>>
>> Yes, I think we briefly discussed that.
>>
>>>>
>>>> mm/internal.h | 13 +++++++++++++
>>>> mm/memory.c | 11 +----------
>>>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 13b59d384845..8e2bc304f671 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>>>> }
>>>>
>>>> +/* Flags for folio_pte_batch(). */
>>>> +typedef int __bitwise fpb_t;
>>>> +
>>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>> +
>>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>> +
>>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>>>> + bool *any_writable);
>>>> +
>>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> int nr_throttled);
>>>> static inline void acct_reclaim_writeback(struct folio *folio)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 1c45b6a42a1b..319b3be05e75 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>>> }
>>>>
>>>> -/* Flags for folio_pte_batch(). */
>>>> -typedef int __bitwise fpb_t;
>>>> -
>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>> -
>>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>> -
>>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>> {
>>>> if (flags & FPB_IGNORE_DIRTY)
>>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>> * If "any_writable" is set, it will indicate if any other PTE besides the
>>>> * first (given) PTE is writable.
>>>> */
>>>
>>> David was talking in Lance's patch thread, about improving the docs for this
>>> function now that its exported. Might be worth syncing on that.
>>
>> Here is my take:
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/memory.c | 22 ++++++++++++++++++----
>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d0b855a1837a8..098356b8805ae 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>> return pte_wrprotect(pte_mkold(pte));
>> }
>>
>> -/*
>> +/**
>> + * folio_pte_batch - detect a PTE batch for a large folio
>> + * @folio: The large folio to detect a PTE batch for.
>> + * @addr: The user virtual address the first page is mapped at.
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @pte: Page table entry for the first page.
>
> Nit:
>
> - * @pte: Page table entry for the first page.
> + * @pte: Page table entry for the first page that must be the first subpage of
> + * the folio excluding arm64 for now.
>
> IIUC, pte_batch_hint is always 1 excluding arm64 for now.
> I'm not sure if this modification will be helpful?
IIRC, Ryan made sure that this also works when passing another subpage,
after when cont-pte is set. Otherwise this would already be broken for
fork/zap.
So I don't think this comment would actually be correct.
--
Cheers,
David / dhildenb
On 27/02/2024 10:30, David Hildenbrand wrote:
> On 27.02.24 11:21, Lance Yang wrote:
>> On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 27.02.24 10:07, Ryan Roberts wrote:
>>>> On 27/02/2024 02:40, Barry Song wrote:
>>>>> From: Barry Song <[email protected]>
>>>>>
>>>>> madvise and some others might need folio_pte_batch to check if a range
>>>>> of PTEs are completely mapped to a large folio with contiguous physcial
>>>>> addresses. Let's export it for others to use.
>>>>>
>>>>> Cc: Lance Yang <[email protected]>
>>>>> Cc: Ryan Roberts <[email protected]>
>>>>> Cc: David Hildenbrand <[email protected]>
>>>>> Cc: Yin Fengwei <[email protected]>
>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>> ---
>>>>> -v1:
>>>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
>>>>> conflicts and dependencies, after discussing with Lance, we prefer
>>>>> this one can land earlier.
>>>>
>>>> I think this will also ultimately be useful for mprotect too, though I haven't
>>>> looked at it properly yet.
>>>>
>>>
>>> Yes, I think we briefly discussed that.
>>>
>>>>>
>>>>> mm/internal.h | 13 +++++++++++++
>>>>> mm/memory.c | 11 +----------
>>>>> 2 files changed, 14 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index 13b59d384845..8e2bc304f671 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>>>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>>>>> }
>>>>>
>>>>> +/* Flags for folio_pte_batch(). */
>>>>> +typedef int __bitwise fpb_t;
>>>>> +
>>>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>>> +
>>>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>>> +
>>>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>>>>> + bool *any_writable);
>>>>> +
>>>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>>> int nr_throttled);
>>>>> static inline void acct_reclaim_writeback(struct folio *folio)
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 1c45b6a42a1b..319b3be05e75 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct
>>>>> vm_area_struct *dst_vma,
>>>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>>>> }
>>>>>
>>>>> -/* Flags for folio_pte_batch(). */
>>>>> -typedef int __bitwise fpb_t;
>>>>> -
>>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
>>>>> -
>>>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
>>>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
>>>>> -
>>>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>>> {
>>>>> if (flags & FPB_IGNORE_DIRTY)
>>>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t
>>>>> pte, fpb_t flags)
>>>>> * If "any_writable" is set, it will indicate if any other PTE besides the
>>>>> * first (given) PTE is writable.
>>>>> */
>>>>
>>>> David was talking in Lance's patch thread, about improving the docs for this
>>>> function now that its exported. Might be worth syncing on that.
>>>
>>> Here is my take:
>>>
>>> Signed-off-by: David Hildenbrand <[email protected]>
>>> ---
>>> mm/memory.c | 22 ++++++++++++++++++----
>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index d0b855a1837a8..098356b8805ae 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t
>>> pte, fpb_t flags)
>>> return pte_wrprotect(pte_mkold(pte));
>>> }
>>>
>>> -/*
>>> +/**
>>> + * folio_pte_batch - detect a PTE batch for a large folio
>>> + * @folio: The large folio to detect a PTE batch for.
>>> + * @addr: The user virtual address the first page is mapped at.
>>> + * @start_ptep: Page table pointer for the first entry.
>>> + * @pte: Page table entry for the first page.
>>
>> Nit:
>>
>> - * @pte: Page table entry for the first page.
>> + * @pte: Page table entry for the first page that must be the first subpage of
>> + * the folio excluding arm64 for now.
>>
>> IIUC, pte_batch_hint is always 1 excluding arm64 for now.
>> I'm not sure if this modification will be helpful?
>
> IIRC, Ryan made sure that this also works when passing another subpage, after
> when cont-pte is set. Otherwise this would already be broken for fork/zap.
>
> So I don't think this comment would actually be correct.
Indeed, the spec for the function is exactly the same for arm64 as for other
arches. It's just that arm64 can accelerate the implementation by skipping
forward to the next contpte boundary when the current pte is part of a contpte
block.
There is no requirement for pte (or addr or start_ptep) to point to the first
subpage of a folio - they can point to any subpage.
pte, addr and start_ptep must all refer to the same entry, but I think that's
clear from the existing text.
Thanks, Ryan, Barry, David!
Best,
Lance
On Tue, Feb 27, 2024 at 6:53 PM Ryan Roberts <[email protected]> wrote:
>
> On 27/02/2024 10:30, David Hildenbrand wrote:
> > On 27.02.24 11:21, Lance Yang wrote:
> >> On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 27.02.24 10:07, Ryan Roberts wrote:
> >>>> On 27/02/2024 02:40, Barry Song wrote:
> >>>>> From: Barry Song <[email protected]>
> >>>>>
> >>>>> madvise and some others might need folio_pte_batch to check if a range
> >>>>> of PTEs are completely mapped to a large folio with contiguous physcial
> >>>>> addresses. Let's export it for others to use.
> >>>>>
> >>>>> Cc: Lance Yang <[email protected]>
> >>>>> Cc: Ryan Roberts <[email protected]>
> >>>>> Cc: David Hildenbrand <[email protected]>
> >>>>> Cc: Yin Fengwei <[email protected]>
> >>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>> ---
> >>>>> -v1:
> >>>>> at least two jobs madv_free and madv_pageout depend on it. To avoid
> >>>>> conflicts and dependencies, after discussing with Lance, we prefer
> >>>>> this one can land earlier.
> >>>>
> >>>> I think this will also ultimately be useful for mprotect too, though I haven't
> >>>> looked at it properly yet.
> >>>>
> >>>
> >>> Yes, I think we briefly discussed that.
> >>>
> >>>>>
> >>>>> mm/internal.h | 13 +++++++++++++
> >>>>> mm/memory.c | 11 +----------
> >>>>> 2 files changed, 14 insertions(+), 10 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>> index 13b59d384845..8e2bc304f671 100644
> >>>>> --- a/mm/internal.h
> >>>>> +++ b/mm/internal.h
> >>>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >>>>> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >>>>> }
> >>>>>
> >>>>> +/* Flags for folio_pte_batch(). */
> >>>>> +typedef int __bitwise fpb_t;
> >>>>> +
> >>>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >>>>> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >>>>> +
> >>>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >>>>> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >>>>> +
> >>>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>>>> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >>>>> + bool *any_writable);
> >>>>> +
> >>>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >>>>> int nr_throttled);
> >>>>> static inline void acct_reclaim_writeback(struct folio *folio)
> >>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>> index 1c45b6a42a1b..319b3be05e75 100644
> >>>>> --- a/mm/memory.c
> >>>>> +++ b/mm/memory.c
> >>>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct
> >>>>> vm_area_struct *dst_vma,
> >>>>> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> >>>>> }
> >>>>>
> >>>>> -/* Flags for folio_pte_batch(). */
> >>>>> -typedef int __bitwise fpb_t;
> >>>>> -
> >>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >>>>> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> >>>>> -
> >>>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >>>>> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> >>>>> -
> >>>>> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>>>> {
> >>>>> if (flags & FPB_IGNORE_DIRTY)
> >>>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t
> >>>>> pte, fpb_t flags)
> >>>>> * If "any_writable" is set, it will indicate if any other PTE besides the
> >>>>> * first (given) PTE is writable.
> >>>>> */
> >>>>
> >>>> David was talking in Lance's patch thread, about improving the docs for this
> >>>> function now that its exported. Might be worth syncing on that.
> >>>
> >>> Here is my take:
> >>>
> >>> Signed-off-by: David Hildenbrand <[email protected]>
> >>> ---
> >>> mm/memory.c | 22 ++++++++++++++++++----
> >>> 1 file changed, 18 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index d0b855a1837a8..098356b8805ae 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t
> >>> pte, fpb_t flags)
> >>> return pte_wrprotect(pte_mkold(pte));
> >>> }
> >>>
> >>> -/*
> >>> +/**
> >>> + * folio_pte_batch - detect a PTE batch for a large folio
> >>> + * @folio: The large folio to detect a PTE batch for.
> >>> + * @addr: The user virtual address the first page is mapped at.
> >>> + * @start_ptep: Page table pointer for the first entry.
> >>> + * @pte: Page table entry for the first page.
> >>
> >> Nit:
> >>
> >> - * @pte: Page table entry for the first page.
> >> + * @pte: Page table entry for the first page that must be the first subpage of
> >> + * the folio excluding arm64 for now.
> >>
> >> IIUC, pte_batch_hint is always 1 excluding arm64 for now.
> >> I'm not sure if this modification will be helpful?
> >
> > IIRC, Ryan made sure that this also works when passing another subpage, after
> > when cont-pte is set. Otherwise this would already be broken for fork/zap.
> >
> > So I don't think this comment would actually be correct.
>
> Indeed, the spec for the function is exactly the same for arm64 as for other
> arches. It's just that arm64 can accelerate the implementation by skipping
> forward to the next contpte boundary when the current pte is part of a contpte
> block.
>
> There is no requirement for pte (or addr or start_ptep) to point to the first
> subpage of a folio - they can point to any subpage.
>
> pte, addr and start_ptep must all refer to the same entry, but I think that's
> clear from the existing text.
>
>
On Tue, Feb 27, 2024 at 11:22 PM Lance Yang <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 27.02.24 10:07, Ryan Roberts wrote:
> > > On 27/02/2024 02:40, Barry Song wrote:
> > >> From: Barry Song <[email protected]>
> > >>
> > >> madvise and some others might need folio_pte_batch to check if a range
> > >> of PTEs are completely mapped to a large folio with contiguous physcial
> > >> addresses. Let's export it for others to use.
> > >>
> > >> Cc: Lance Yang <[email protected]>
> > >> Cc: Ryan Roberts <[email protected]>
> > >> Cc: David Hildenbrand <[email protected]>
> > >> Cc: Yin Fengwei <[email protected]>
> > >> Signed-off-by: Barry Song <[email protected]>
> > >> ---
> > >> -v1:
> > >> at least two jobs madv_free and madv_pageout depend on it. To avoid
> > >> conflicts and dependencies, after discussing with Lance, we prefer
> > >> this one can land earlier.
> > >
> > > I think this will also ultimately be useful for mprotect too, though I haven't
> > > looked at it properly yet.
> > >
> >
> > Yes, I think we briefly discussed that.
> >
> > >>
> > >> mm/internal.h | 13 +++++++++++++
> > >> mm/memory.c | 11 +----------
> > >> 2 files changed, 14 insertions(+), 10 deletions(-)
> > >>
> > >> diff --git a/mm/internal.h b/mm/internal.h
> > >> index 13b59d384845..8e2bc304f671 100644
> > >> --- a/mm/internal.h
> > >> +++ b/mm/internal.h
> > >> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> > >> return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> > >> }
> > >>
> > >> +/* Flags for folio_pte_batch(). */
> > >> +typedef int __bitwise fpb_t;
> > >> +
> > >> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> > >> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> > >> +
> > >> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> > >> +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> > >> +
> > >> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> > >> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > >> + bool *any_writable);
> > >> +
> > >> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> > >> int nr_throttled);
> > >> static inline void acct_reclaim_writeback(struct folio *folio)
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index 1c45b6a42a1b..319b3be05e75 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> > >> set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> > >> }
> > >>
> > >> -/* Flags for folio_pte_batch(). */
> > >> -typedef int __bitwise fpb_t;
> > >> -
> > >> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> > >> -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> > >> -
> > >> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> > >> -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
> > >> -
> > >> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >> {
> > >> if (flags & FPB_IGNORE_DIRTY)
> > >> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >> * If "any_writable" is set, it will indicate if any other PTE besides the
> > >> * first (given) PTE is writable.
> > >> */
> > >
> > > David was talking in Lance's patch thread, about improving the docs for this
> > > function now that its exported. Might be worth syncing on that.
> >
> > Here is my take:
> >
> > Signed-off-by: David Hildenbrand <[email protected]>
> > ---
> > mm/memory.c | 22 ++++++++++++++++++----
> > 1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d0b855a1837a8..098356b8805ae 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > return pte_wrprotect(pte_mkold(pte));
> > }
> >
> > -/*
> > +/**
> > + * folio_pte_batch - detect a PTE batch for a large folio
> > + * @folio: The large folio to detect a PTE batch for.
> > + * @addr: The user virtual address the first page is mapped at.
> > + * @start_ptep: Page table pointer for the first entry.
> > + * @pte: Page table entry for the first page.
>
> Nit:
>
> - * @pte: Page table entry for the first page.
> + * @pte: Page table entry for the first page that must be the first subpage of
> + * the folio excluding arm64 for now.
>
> IIUC, pte_batch_hint is always 1 excluding arm64 for now.
> I'm not sure if this modification will be helpful?
I don't understand how this will be different for arm64 and others.
It seems pte_batch_hint with one value > 1 only helps move the
PTE pointer faster to finish the call.
Thanks
Barry
On Wed, Feb 28, 2024 at 5:00 AM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 03:40:50PM +1300, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > madvise and some others might need folio_pte_batch to check if a range
> > of PTEs are completely mapped to a large folio with contiguous physcial
> > addresses. Let's export it for others to use.
>
> It doesn't look exported to me in the patch (and that's a good thing!).
>
> But even for making it non-static you probably want to include that in
> the series actually making use of it.
at least two parallel jobs[1][2] (maybe more ) need it right now.
Getting this one pulled in early
will help build a common base for them and avoid duplicates&conflicts in them.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/20240225123215.86503-1-ioworker0@gmailcom/
>
> > +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> > + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > + bool *any_writable);
>
> no need for the extern here.
Yes. this has been moved to internal.h as "static inline" in v2:
https://lore.kernel.org/linux-mm/[email protected]/
Thanks
Barry
On Tue, Feb 27, 2024 at 03:40:50PM +1300, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> madvise and some others might need folio_pte_batch to check if a range
> of PTEs are completely mapped to a large folio with contiguous physcial
> addresses. Let's export it for others to use.
It doesn't look exported to me in the patch (and that's a good thing!).
But even for making it non-static you probably want to include that in
the series actually making use of it.
> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> + bool *any_writable);
no need for the extern here.
On Wed, Feb 28, 2024 at 3:02 AM Barry Song <[email protected]> wrote:
>
> On Wed, Feb 28, 2024 at 5:00 AM Christoph Hellwig <[email protected]> wrote:
> >
> > On Tue, Feb 27, 2024 at 03:40:50PM +1300, Barry Song wrote:
> > > From: Barry Song <[email protected]>
> > >
> > > madvise and some others might need folio_pte_batch to check if a range
> > > of PTEs are completely mapped to a large folio with contiguous physcial
> > > addresses. Let's export it for others to use.
> >
> > It doesn't look exported to me in the patch (and that's a good thing!).
> >
> > But even for making it non-static you probably want to include that in
> > the series actually making use of it.
>
> at least two parallel jobs[1][2] (maybe more ) need it right now.
+1
After kernel support for anonymous multi-size THP, PTE-mapped THP will
no longer be the exception. IMO, folio_pte_batch() is widely useful for checking
whether we're mapping all subpages of the large folio or not.
Thanks,
Lance
> Getting this one pulled in early
> will help build a common base for them and avoid duplicates&conflicts in them.
>
> [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmailcom/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > > +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> > > + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > > + bool *any_writable);
> >
> > no need for the extern here.
>
> Yes. this has been moved to internal.h as "static inline" in v2:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks
> Barry