2023-09-12 07:56:06

by Matthew Wilcox

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)

On Mon, Sep 11, 2023 at 01:22:51PM -0700, Dave Hansen wrote:
> On 9/11/23 12:12, Matthew Wilcox wrote:
> > On Mon, Sep 11, 2023 at 09:55:37AM -0700, Dave Hansen wrote:
> >> On 9/11/23 09:44, Matthew Wilcox wrote:
> >>> After fixing your two typos, this assembles to 176 bytes more code than
> >>> my version. Not sure that's great.
> >> Maybe I'm a fool, but 176 bytes of text bloat isn't scaring me off too
> >> much. I'd much rather have that than another window into x86 goofiness
> >> to maintain.
> >>
> >> Does that 176 bytes translate into meaningful performance, or is it just
> >> a bunch of register bit twiddling that the CPU will sail through?
> > I'm ... not sure how to tell. It's 1120 bytes vs 944 bytes and crawling
> > through that much x86 assembly isn't my idea of a great time. I can
> > send you objdump -dr for all three options if you like? Maybe there's
> > a quick way to compare them that I've never known about.
>
> Working patches would be great if you've got 'em handy, plus your
> .config and generally what compiler you're on.

gcc (Debian 13.2.0-2) 13.2.0

I don't think there's anything particularly strange about my .config

If you compile this patch as-is, you'll get your preferred code.
Remove the #define DH and you get mine.

I would say that 176 bytes is 3 cachelines of I$, which isn't free,
even if all the insns in it can be executed while the CPU is waiting
for cache misses. This ought to be a pretty tight loop anyway; we're
just filling in adjacent PTEs. There may not be many spare cycles
for "free" uops to execute.

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d6ad98ca1288..c9781b8b14af 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -955,6 +955,14 @@ static inline int pte_same(pte_t a, pte_t b)
 	return a.pte == b.pte;
 }
 
+static inline pte_t pte_next(pte_t pte)
+{
+	if (__pte_needs_invert(pte_val(pte)))
+		return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
+	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+}
+#define pte_next pte_next
+
 static inline int pte_present(pte_t a)
 {
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1fba072b3dac..25333cf3c865 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -205,6 +205,10 @@ static inline int pmd_young(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode()	do {} while (0)
 #endif
 
+#ifndef pte_next
+#define pte_next(pte) (__pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT)))
+#endif
+
 #ifndef set_ptes
 /**
  * set_ptes - Map consecutive pages to a contiguous range of addresses.
@@ -223,6 +227,11 @@ static inline int pmd_young(pmd_t pmd)
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
+#define DH
+#ifdef DH
+	pgprot_t prot = pte_pgprot(pte);
+	unsigned long pfn = pte_pfn(pte);
+#endif
 	page_table_check_ptes_set(mm, ptep, pte, nr);
 
 	arch_enter_lazy_mmu_mode();
@@ -231,7 +240,12 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		if (--nr == 0)
			break;
 		ptep++;
-		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+#ifdef DH
+		pfn++;
+		pte = pfn_pte(pfn, prot);
+#else
+		pte = pte_next(pte);
+#endif
 	}
 	arch_leave_lazy_mmu_mode();
 }
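
(For context: the invert test in pte_next() is there because of the L1TF
mitigation, under which a non-zero, non-present PTE stores its PFN bits
inverted so that a stale PTE speculatively loaded by the CPU points at
non-existent memory. The helper is roughly this, simplified from
arch/x86/include/asm/pgtable-invert.h:

static inline bool __pte_needs_invert(u64 val)
{
	/* Any non-zero PTE that is not present holds an inverted PFN. */
	return val && !(val & _PAGE_PRESENT);
}

That's why pte_next() has to subtract rather than add for such PTEs.)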


2023-09-12 19:57:48

by Dave Hansen

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)

On 9/11/23 21:59, Matthew Wilcox wrote:
> On Mon, Sep 11, 2023 at 01:22:51PM -0700, Dave Hansen wrote:
>> On 9/11/23 12:12, Matthew Wilcox wrote:
>>> On Mon, Sep 11, 2023 at 09:55:37AM -0700, Dave Hansen wrote:
>>>> On 9/11/23 09:44, Matthew Wilcox wrote:
>>>>> After fixing your two typos, this assembles to 176 bytes more code than
>>>>> my version. Not sure that's great.
>>>> Maybe I'm a fool, but 176 bytes of text bloat isn't scaring me off too
>>>> much. I'd much rather have that than another window into x86 goofiness
>>>> to maintain.
>>>>
>>>> Does that 176 bytes translate into meaningful performance, or is it just
>>>> a bunch of register bit twiddling that the CPU will sail through?
>>> I'm ... not sure how to tell. It's 1120 bytes vs 944 bytes and crawling
>>> through that much x86 assembly isn't my idea of a great time. I can
>>> send you objdump -dr for all three options if you like? Maybe there's
>>> a quick way to compare them that I've never known about.
>> Working patches would be great if you've got 'em handy, plus your
>> .config and generally what compiler you're on.
> gcc (Debian 13.2.0-2) 13.2.0
>
> I don't think there's anything particularly strange about my .config
>
> If you compile this patch as-is, you'll get your preferred code.
> Remove the #define DH and you get mine.
>
> I would say that 176 bytes is 3 cachelines of I$, which isn't free,
> even if all the insns in it can be executed while the CPU is waiting
> for cache misses. This ought to be a pretty tight loop anyway; we're
> just filling in adjacent PTEs. There may not be many spare cycles
> for "free" uops to execute.

Thanks for that!

I went poking at it a bit. One remarkable thing is how many pv_ops
calls there are. Those are definitely keeping the compiler from helping
us out here too much.

Your version has 9 pv_ops calls while mine has 6. So mine may have more
instructions in _this_ function, but it could easily be made up for by
call overhead and extra instructions in the pv_ops.
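
To illustrate, the indirection looks roughly like this (simplified from
arch/x86/include/asm/paravirt.h, with the patching machinery elided):

/* Under CONFIG_PARAVIRT, set_pte() becomes an indirect call via pv_ops. */
static inline void set_pte(pte_t *ptep, pte_t pte)
{
	PVOP_VCALL2(mmu.set_pte, ptep, pte);
}

Every trip around the set_ptes() loop funnels through one of these, and the
compiler can't optimize across an indirect call it can't see into.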

Also, I went looking for a way to poke at set_ptes() and profile it a
bit and get some actual numbers. It seems like in most cases it would
be limited to use via fault-around. Is there some other way to poke at
it easily?
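
One untested way to poke at it from userspace: mmap a page-cache-warm file
and touch one byte per page, so fault-around should map a batch of PTEs per
fault. A sketch (the file name and 4k page size are arbitrary assumptions):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	volatile char sum = 0;
	char *p;
	off_t i;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);	/* any large, already-cached file */
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;
	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	for (i = 0; i < st.st_size; i += 4096)	/* one touch per page */
		sum += p[i];
	munmap(p, st.st_size);
	close(fd);
	return 0;
}

But that still only reaches set_ptes() through the fault-around path.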

So, in the end, I see code which is not (as far as I can see) in a hot
path, and (again, to me) there's no compelling performance argument one
way or another.

I still like my version. *Known* simplicity and uniformity win out in
my book over unknown performance benefits.

But fixing the bug is the most important thing. I don't feel strongly
enough about it to NAK your version either.

2023-09-13 00:02:32

by Dave Hansen

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)

On 9/11/23 21:59, Matthew Wilcox wrote:
> I don't think there's anything particularly strange about my .config

I just saw some DEBUG_VM #ifdefs around the area and wondered if any of
them were to blame for the bloat.

2023-09-14 14:48:50

by Yin, Fengwei

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)

Hi Matthew,

On 9/12/23 12:59, Matthew Wilcox wrote:
> On Mon, Sep 11, 2023 at 01:22:51PM -0700, Dave Hansen wrote:
>> On 9/11/23 12:12, Matthew Wilcox wrote:
>>> On Mon, Sep 11, 2023 at 09:55:37AM -0700, Dave Hansen wrote:
>>>> On 9/11/23 09:44, Matthew Wilcox wrote:
>>>>> After fixing your two typos, this assembles to 176 bytes more code than
>>>>> my version. Not sure that's great.
>>>> Maybe I'm a fool, but 176 bytes of text bloat isn't scaring me off too
>>>> much. I'd much rather have that than another window into x86 goofiness
>>>> to maintain.
>>>>
>>>> Does that 176 bytes translate into meaningful performance, or is it just
>>>> a bunch of register bit twiddling that the CPU will sail through?
>>> I'm ... not sure how to tell. It's 1120 bytes vs 944 bytes and crawling
>>> through that much x86 assembly isn't my idea of a great time. I can
>>> send you objdump -dr for all three options if you like? Maybe there's
>>> a quick way to compare them that I've never known about.
>>
>> Working patches would be great if you've got 'em handy, plus your
>> .config and generally what compiler you're on.
>
> gcc (Debian 13.2.0-2) 13.2.0
>
> I don't think there's anything particularly strange about my .config
>
> If you compile this patch as-is, you'll get your preferred code.
> Remove the #define DH and you get mine.
>
> I would say that 176 bytes is 3 cachelines of I$, which isn't free,
> even if all the insns in it can be executed while the CPU is waiting
> for cache misses. This ought to be a pretty tight loop anyway; we're
> just filling in adjacent PTEs. There may not be many spare cycles
> for "free" uops to execute.
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index d6ad98ca1288..c9781b8b14af 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -955,6 +955,14 @@ static inline int pte_same(pte_t a, pte_t b)
>  	return a.pte == b.pte;
>  }
>  
> +static inline pte_t pte_next(pte_t pte)
> +{
> +	if (__pte_needs_invert(pte_val(pte)))
> +		return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
> +	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> +}
> +#define pte_next pte_next
> +
>  static inline int pte_present(pte_t a)
>  {
>  	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 1fba072b3dac..25333cf3c865 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -205,6 +205,10 @@ static inline int pmd_young(pmd_t pmd)
>  #define arch_flush_lazy_mmu_mode()	do {} while (0)
>  #endif
>  
> +#ifndef pte_next
> +#define pte_next(pte) (__pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT)))
> +#endif
> +
>  #ifndef set_ptes
>  /**
>   * set_ptes - Map consecutive pages to a contiguous range of addresses.
> @@ -223,6 +227,11 @@ static inline int pmd_young(pmd_t pmd)
>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  		pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> +#define DH
> +#ifdef DH
> +	pgprot_t prot = pte_pgprot(pte);
> +	unsigned long pfn = pte_pfn(pte);
> +#endif
>  	page_table_check_ptes_set(mm, ptep, pte, nr);
>  
>  	arch_enter_lazy_mmu_mode();
> @@ -231,7 +240,12 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  		if (--nr == 0)
>  			break;
>  		ptep++;
> -		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> +#ifdef DH
> +		pfn++;
> +		pte = pfn_pte(pfn, prot);
> +#else
> +		pte = pte_next(pte);
> +#endif
>  	}
>  	arch_leave_lazy_mmu_mode();
>  }

I checked the commit message of 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c:

    The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
    pte/pmd/pud_pfn undo it.

    This assumes that no code path touches the PFN part of a PTE directly
    without using these primitives.

So maybe we should always use these APIs, even if we make an x86-specific
set_ptes()?
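
For reference, the primitives in question handle the inversion roughly like
this (simplified from arch/x86/include/asm/pgtable-invert.h and pgtable.h,
not the exact kernel code):

static inline u64 protnone_mask(u64 val)
{
	/* A non-zero, non-present PTE stores its PFN bits inverted. */
	return __pte_needs_invert(val) ? ~0ull : 0;
}

static inline unsigned long pte_pfn(pte_t pte)
{
	phys_addr_t pfn = pte_val(pte);

	/* Undo the inversion, if any, before extracting the PFN. */
	pfn ^= protnone_mask(pfn);
	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
}

Any open-coded arithmetic on the raw PTE value, like the generic pte_next(),
bypasses that masking.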

I will find a test machine to measure the performance difference of these two
versions by using xfs + will-it-scale. Will keep you guys updated.


Regards
Yin, Fengwei

2023-09-19 04:28:53

by Yin, Fengwei

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)

Hi Matthew,

On 9/14/23 15:33, Yin Fengwei wrote:
> Hi Matthew,
>
> On 9/12/23 12:59, Matthew Wilcox wrote:
>> On Mon, Sep 11, 2023 at 01:22:51PM -0700, Dave Hansen wrote:
>>> On 9/11/23 12:12, Matthew Wilcox wrote:
>>>> On Mon, Sep 11, 2023 at 09:55:37AM -0700, Dave Hansen wrote:
>>>>> On 9/11/23 09:44, Matthew Wilcox wrote:
>>>>>> After fixing your two typos, this assembles to 176 bytes more code than
>>>>>> my version. Not sure that's great.
>>>>> Maybe I'm a fool, but 176 bytes of text bloat isn't scaring me off too
>>>>> much. I'd much rather have that than another window into x86 goofiness
>>>>> to maintain.
>>>>>
>>>>> Does that 176 bytes translate into meaningful performance, or is it just
>>>>> a bunch of register bit twiddling that the CPU will sail through?
>>>> I'm ... not sure how to tell. It's 1120 bytes vs 944 bytes and crawling
>>>> through that much x86 assembly isn't my idea of a great time. I can
>>>> send you objdump -dr for all three options if you like? Maybe there's
>>>> a quick way to compare them that I've never known about.
>>>
>>> Working patches would be great if you've got 'em handy, plus your
>>> .config and generally what compiler you're on.
>>
>> gcc (Debian 13.2.0-2) 13.2.0
>>
>> I don't think there's anything particularly strange about my .config
>>
>> If you compile this patch as-is, you'll get your preferred code.
>> Remove the #define DH and you get mine.
>>
>> I would say that 176 bytes is 3 cachelines of I$, which isn't free,
>> even if all the insns in it can be executed while the CPU is waiting
>> for cache misses. This ought to be a pretty tight loop anyway; we're
>> just filling in adjacent PTEs. There may not be many spare cycles
>> for "free" uops to execute.
>>
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index d6ad98ca1288..c9781b8b14af 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -955,6 +955,14 @@ static inline int pte_same(pte_t a, pte_t b)
>>  	return a.pte == b.pte;
>>  }
>>  
>> +static inline pte_t pte_next(pte_t pte)
>> +{
>> +	if (__pte_needs_invert(pte_val(pte)))
>> +		return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
>> +	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
>> +}
>> +#define pte_next pte_next
>> +
>>  static inline int pte_present(pte_t a)
>>  {
>>  	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 1fba072b3dac..25333cf3c865 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -205,6 +205,10 @@ static inline int pmd_young(pmd_t pmd)
>>  #define arch_flush_lazy_mmu_mode()	do {} while (0)
>>  #endif
>>  
>> +#ifndef pte_next
>> +#define pte_next(pte) (__pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT)))
>> +#endif
>> +
>>  #ifndef set_ptes
>>  /**
>>   * set_ptes - Map consecutive pages to a contiguous range of addresses.
>> @@ -223,6 +227,11 @@ static inline int pmd_young(pmd_t pmd)
>>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>  		pte_t *ptep, pte_t pte, unsigned int nr)
>>  {
>> +#define DH
>> +#ifdef DH
>> +	pgprot_t prot = pte_pgprot(pte);
>> +	unsigned long pfn = pte_pfn(pte);
>> +#endif
>>  	page_table_check_ptes_set(mm, ptep, pte, nr);
>>  
>>  	arch_enter_lazy_mmu_mode();
>> @@ -231,7 +240,12 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>  		if (--nr == 0)
>>  			break;
>>  		ptep++;
>> -		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
>> +#ifdef DH
>> +		pfn++;
>> +		pte = pfn_pte(pfn, prot);
>> +#else
>> +		pte = pte_next(pte);
>> +#endif
>>  	}
>>  	arch_leave_lazy_mmu_mode();
>>  }
>
> I checked the commit message of 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c:
>
>     The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
>     pte/pmd/pud_pfn undo it.
>
>     This assumes that no code path touches the PFN part of a PTE directly
>     without using these primitives.
>
> So maybe we should always use these APIs, even if we make an x86-specific
> set_ptes()?
>
> I will find a test machine to measure the performance difference of these two
> versions by using xfs + will-it-scale. Will keep you guys updated.
I'd like to move this bug fix forward. Based on the test results here:
https://lore.kernel.org/linux-mm/[email protected]/
there is only a very small performance delta between your version and Dave's.

What do you think about merging Dave's version? Or do I need to collect
more data? Thanks.


Regards
Yin, Fengwei

>
>
> Regards
> Yin, Fengwei

2023-09-19 17:18:48

by Dave Hansen

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)

On 9/18/23 18:11, Yin Fengwei wrote:
>> I will find a test machine to measure the performance difference of these two
>> versions by using xfs + will-it-scale. Will keep you guys updated.
> I'd like to move this bug fix forward. Based on the test results here:
> https://lore.kernel.org/linux-mm/[email protected]/
> there is only a very small performance delta between your version and Dave's.
>
> What do you think about merging Dave's version? Or do I need to collect
> more data? Thanks.

I honestly don't feel that strongly about my version versus Matthew's.
I like mine, but I'll happily ack either approach.

The thing I care about the most is getting the bug fixed ... quickly. :)

2023-09-20 04:40:57

by Yin, Fengwei

Subject: Re: [syzbot] [mm?] BUG: Bad page map (7)



On 9/20/23 00:11, Dave Hansen wrote:
> On 9/18/23 18:11, Yin Fengwei wrote:
>>> I will find a test machine to measure the performance difference of these two
>>> versions by using xfs + will-it-scale. Will keep you guys updated.
>> I'd like to move this bug fix forward. Based on the test results here:
>> https://lore.kernel.org/linux-mm/[email protected]/
>> there is only a very small performance delta between your version and Dave's.
>>
>> What do you think about merging Dave's version? Or do I need to collect
>> more data? Thanks.
>
> I honestly don't feel that strongly about my version versus Matthew's.
> I like mine, but I'll happily ack either approach.
>
> The thing I care about the most is getting the bug fixed ... quickly. :)
Same on my side.

Since the performance delta is very small, I think we should follow the
commit message of 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c:

    The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
    pte/pmd/pud_pfn undo it.

    This assumes that no code path touches the PFN part of a PTE directly
    without using these primitives.


Regards
Yin, Fengwei