Hi,
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().
The bug doesn't lead to user-visible misbehaviour in the current kernel,
but fixing this would be critical for future work on THP: both huge-ext4
and THP swap-out rely on proper dirty tracking.
Unfortunately, there's no way to address the issue in a generic way. We
need to fix all architectures that support THP one by one.
All architectures that support THP have to provide an atomic
pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is
used, the architecture needs to provide an atomic pmdp_mknotpresent().
I've fixed the issue for x86, but I need help with the rest.
So far THP is supported on 8 architectures. Power and S390 already
provide an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5
architectures are left:
- arc;
- arm;
- arm64;
- mips;
- sparc -- it has a custom pmdp_invalidate(), but it's racy too.
Please help me with them.
Kirill A. Shutemov (3):
x86/mm: Provide pmdp_mknotpresent() helper
mm: Do not lose dirty and access bits in pmdp_invalidate()
mm, thp: Do not lose dirty bit in __split_huge_pmd_locked()
arch/x86/include/asm/pgtable-3level.h | 17 +++++++++++++++++
arch/x86/include/asm/pgtable.h | 13 +++++++++++++
mm/huge_memory.c | 13 +++++++++----
mm/pgtable-generic.c | 3 +--
4 files changed, 40 insertions(+), 6 deletions(-)
--
2.11.0
Until pmdp_invalidate() is called, the pmd entry is present and the CPU
can update it, setting the dirty bit. Currently, we transfer the dirty
bit to the page too early, so there is a window when we can miss it.
Let's call SetPageDirty() after pmdp_invalidate().
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..c4ee5c890910 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1928,7 +1928,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t _pmd;
- bool young, write, dirty, soft_dirty;
+ bool young, write, soft_dirty;
unsigned long addr;
int i;
@@ -1965,7 +1965,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
young = pmd_young(*pmd);
- dirty = pmd_dirty(*pmd);
soft_dirty = pmd_soft_dirty(*pmd);
pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -1995,8 +1994,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
}
- if (dirty)
- SetPageDirty(page + i);
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
@@ -2046,6 +2043,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* pmd_populate.
*/
pmdp_invalidate(vma, haddr, pmd);
+
+ /*
+ * Transfer the dirty bit to the page after the pmd is invalidated,
+ * so the CPU cannot set it under us.
+ */
+ if (pmd_dirty(*pmd))
+ SetPageDirty(page);
+
pmd_populate(mm, pmd, pgtable);
if (freeze) {
--
2.11.0
We need an atomic way to make a pmd page table entry not-present.
This is required to implement a pmdp_invalidate() that doesn't lose
dirty or access bits.
On x86, we need to clear two bits -- _PAGE_PRESENT and _PAGE_PROTNONE --
to make the entry non-present. The implementation uses a cmpxchg() loop
to do this atomically.
PAE requires special treatment to avoid an expensive cmpxchg8b(). Both
bits are in the lower part of the entry, so we can use a 4-byte cmpxchg()
on this part of the page table entry.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 17 +++++++++++++++++
arch/x86/include/asm/pgtable.h | 13 +++++++++++++
2 files changed, 30 insertions(+)
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 50d35e3185f5..b6efa955ecd0 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -176,8 +176,25 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
return res.pmd;
}
+
+#define pmdp_mknotpresent pmdp_mknotpresent
+static inline void pmdp_mknotpresent(pmd_t *pmdp)
+{
+ union split_pmd *p, old, new;
+
+ p = (union split_pmd *)pmdp;
+ do {
+ old = *p;
+ new.pmd = pmd_mknotpresent(old.pmd);
+ } while (cmpxchg(&p->pmd_low, old.pmd_low, new.pmd_low) != old.pmd_low);
+}
#else
#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+
+static inline void pmdp_mknotpresent(pmd_t *pmdp)
+{
+ *pmdp = pmd_mknotpresent(*pmdp);
+}
#endif
#ifdef CONFIG_SMP
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f5af95a0c6b8..576420df12b8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1092,6 +1092,19 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}
+#ifndef pmdp_mknotpresent
+#define pmdp_mknotpresent pmdp_mknotpresent
+static inline void pmdp_mknotpresent(pmd_t *pmdp)
+{
+ pmd_t old, new;
+
+ do {
+ old = *pmdp;
+ new = pmd_mknotpresent(old);
+ } while (pmd_val(cmpxchg(pmdp, old, new)) != pmd_val(old));
+}
+#endif
+
/*
* clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
*
--
2.11.0
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().
The bug doesn't lead to user-visible misbehaviour in the current kernel.
Losing the access bit can lead to sub-optimal reclaim behaviour for THP,
but nothing destructive.
Losing the dirty bit is not a big deal either: we make the page dirty
unconditionally on splitting a huge page.
The fix is critical for future work on THP: both huge-ext4 and THP
swap-out rely on proper dirty tracking.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
---
mm/pgtable-generic.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..68094fa190d1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -182,8 +182,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_t entry = *pmdp;
- set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
+ pmdp_mknotpresent(pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
#endif
--
2.11.0
Hi Kirill,
On Wed, 14 Jun 2017 16:51:40 +0300
"Kirill A. Shutemov" <[email protected]> wrote:
> Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
> dirty and access bits if the CPU sets them after the pmdp dereference,
> but before set_pmd_at().
>
> The bug doesn't lead to user-visible misbehaviour in the current kernel,
> but fixing this would be critical for future work on THP: both huge-ext4
> and THP swap-out rely on proper dirty tracking.
>
> Unfortunately, there's no way to address the issue in a generic way. We
> need to fix all architectures that support THP one by one.
>
> All architectures that support THP have to provide an atomic
> pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is
> used, the architecture needs to provide an atomic pmdp_mknotpresent().
>
> I've fixed the issue for x86, but I need help with the rest.
>
> So far THP is supported on 8 architectures. Power and S390 already
> provide an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5
> architectures are left:
For s390 the pmdp_invalidate() is atomic only in regard to the dirty and
referenced bits because we use a fault-driven approach for this, no?
More specifically the update via the pmdp_xchg_direct() function is protected
by the page table lock, the update on the pmd entry itself does *not* have
to be atomic (for s390).
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Wed, 14 Jun 2017 16:51:43 +0300
"Kirill A. Shutemov" <[email protected]> wrote:
> Until pmdp_invalidate() is called, the pmd entry is present and the CPU
> can update it, setting the dirty bit. Currently, we transfer the dirty
> bit to the page too early, so there is a window when we can miss it.
>
> Let's call SetPageDirty() after pmdp_invalidate().
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ...
> @@ -2046,6 +2043,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * pmd_populate.
> */
> pmdp_invalidate(vma, haddr, pmd);
> +
> + /*
> + * Transfer the dirty bit to the page after the pmd is invalidated,
> + * so the CPU cannot set it under us.
> + */
> + if (pmd_dirty(*pmd))
> + SetPageDirty(page);
> +
> pmd_populate(mm, pmd, pgtable);
>
> if (freeze) {
That won't work on s390. After pmdp_invalidate the pmd entry is gone;
it has been replaced with _SEGMENT_ENTRY_EMPTY. This includes the
dirty and referenced bits. The old scheme is:
entry = *pmd;
pmdp_invalidate(vma, addr, pmd);
if (pmd_dirty(entry))
...
Could we change pmdp_invalidate to make it return the old pmd entry?
The pmdp_xchg_direct() function already returns it; for s390 that would
be an easy change. The above code snippet would change like this:
entry = pmdp_invalidate(vma, addr, pmd);
if (pmd_dirty(entry))
...
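Something like this, completely untested, keeping the current s390
semantics of replacing the entry with _SEGMENT_ENTRY_EMPTY:

static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
				    unsigned long addr, pmd_t *pmdp)
{
	/* pmdp_xchg_direct() already returns the old entry atomically */
	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp,
				__pmd(_SEGMENT_ENTRY_EMPTY));
}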
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> Hi,
>
> Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
> dirty and access bits if the CPU sets them after the pmdp dereference,
> but before set_pmd_at().
>
> The bug doesn't lead to user-visible misbehaviour in the current kernel,
> but fixing this would be critical for future work on THP: both huge-ext4
> and THP swap-out rely on proper dirty tracking.
>
> Unfortunately, there's no way to address the issue in a generic way. We
> need to fix all architectures that support THP one by one.
>
> All architectures that support THP have to provide an atomic
> pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is
> used, the architecture needs to provide an atomic pmdp_mknotpresent().
>
> I've fixed the issue for x86, but I need help with the rest.
>
> So far THP is supported on 8 architectures. Power and S390 already
> provide an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5
> architectures are left:
>
> - arc;
> - arm;
> - arm64;
> - mips;
> - sparc -- it has a custom pmdp_invalidate(), but it's racy too.
>
> Please help me with them.
>
> Kirill A. Shutemov (3):
> x86/mm: Provide pmdp_mknotpresent() helper
> mm: Do not lose dirty and access bits in pmdp_invalidate()
> mm, thp: Do not lose dirty bit in __split_huge_pmd_locked()
>
But in __split_huge_pmd_locked() we collected the dirty bit early. So
even if we made pmdp_invalidate() atomic, if the pmd entry was marked
dirty after we collected the dirty bit, we would still lose it, right?
Maybe we should relook at the pmd PTE update interface. We really need
an interface that can update pmd entries such that we don't clear them
in between. IMHO, we can avoid pmdp_invalidate() completely if we can
switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t). We
also need this interface to avoid the madvise race fixed by
https://lkml.kernel.org/r/[email protected]
The usage of pmdp_invalidate() while splitting the pmd also needs
updated documentation. In the earlier version of THP, we were required
to keep the pmd present and marked as splitting, so that code paths
could wait till the splitting was done.
With the current design, we can ideally mark the pmdp not present early
on, right? As long as we hold the pmd lock, a parallel fault will try to
mark the pmd accessed and wait on the pmd lock. On taking the lock, it
will find the pmd modified and should retry the access, right?
-aneesh
On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> Until pmdp_invalidate() is called, the pmd entry is present and the CPU
> can update it, setting the dirty bit. Currently, we transfer the dirty
> bit to the page too early, so there is a window when we can miss it.
>
> Let's call SetPageDirty() after pmdp_invalidate().
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a84909cf20d3..c4ee5c890910 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1928,7 +1928,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t _pmd;
> - bool young, write, dirty, soft_dirty;
> + bool young, write, soft_dirty;
> unsigned long addr;
> int i;
>
> @@ -1965,7 +1965,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> page_ref_add(page, HPAGE_PMD_NR - 1);
> write = pmd_write(*pmd);
> young = pmd_young(*pmd);
> - dirty = pmd_dirty(*pmd);
> soft_dirty = pmd_soft_dirty(*pmd);
>
> pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1994,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> if (soft_dirty)
> entry = pte_mksoft_dirty(entry);
> }
> - if (dirty)
> - SetPageDirty(page + i);
> pte = pte_offset_map(&_pmd, addr);
> BUG_ON(!pte_none(*pte));
> set_pte_at(mm, addr, pte, entry);
> @@ -2046,6 +2043,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * pmd_populate.
> */
> pmdp_invalidate(vma, haddr, pmd);
> +
> + /*
> + * Transfer the dirty bit to the page after the pmd is invalidated,
> + * so the CPU cannot set it under us.
> + */
> + if (pmd_dirty(*pmd))
> + SetPageDirty(page);
> +
> pmd_populate(mm, pmd, pgtable);
>
You fixed the dirty bit loss I discussed in my previous mail here.
thanks
-aneesh
Hello,
On Wed, Jun 14, 2017 at 04:18:57PM +0200, Martin Schwidefsky wrote:
> Could we change pmdp_invalidate to make it return the old pmd entry?
That to me seems the simplest fix to avoid losing the dirty bit.
I earlier suggested replacing pmdp_invalidate with something like
old_pmd = pmdp_establish(pmd_mknotpresent(pmd)) (the TLB flush could
then be conditional on the old pmd being present). Making
pmdp_invalidate return the old pmd entry would be mostly equivalent
to that.
The advantage of not changing pmdp_invalidate is that we could skip the
more costly xchg in __split_huge_pmd_locked and madvise_free_huge_pmd,
so perhaps there's a point in keeping a variant of pmdp_invalidate that
doesn't use xchg internally (and in turn can't return the old pmd value
atomically).
If we don't want new messy names like pmdp_establish, we could have a
__pmdp_invalidate that returns void, and a pmdp_invalidate that returns
the old pmd and uses xchg (it'd also be backwards compatible as far as
the callers are concerned). So those places that don't need the old
value returned and can skip the xchg could simply
s/pmdp_invalidate/__pmdp_invalidate/ to optimize.
One way or another, for change_huge_pmd I think we need an xchg like in
native_pmdp_get_and_clear, but one that sets the pmd to
pmd_mknotpresent(pmd) instead of zero. And this whole issue originates
because change_huge_pmd(prot_numa = 1) and madvise_free_huge_pmd both
run concurrently with the mmap_sem held for reading.
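For example, with a pmdp_invalidate that returns the old entry,
change_huge_pmd could do something like this (untested sketch):

	/* old entry is returned atomically, dirty/access bits included */
	entry = pmdp_invalidate(vma, addr, pmd);
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmd, entry);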
In the earlier email on this topic, I also mentioned the concern about
the _notify mmu notifier invalidate that got dropped silently with the
s/pmdp_huge_get_and_clear_notify/pmdp_invalidate/ conversion, but I
later noticed the mmu notifier invalidate is already covered by the
caller. So change_huge_pmd should have called pmdp_huge_get_and_clear
in the first place and the _notify prefix in the old code was a
mistake as far as I can tell. So we can focus only on the dirty bit
retention issue.
Thanks,
Andrea
On Wed, Jun 14, 2017 at 04:51:41PM +0300, Kirill A. Shutemov wrote:
> We need an atomic way to make a pmd page table entry not-present.
> This is required to implement a pmdp_invalidate() that doesn't lose
> dirty or access bits.
What does the cmpxchg() loop achieve compared to an xchg() that returns
the old value (potentially with the dirty bit set when it was not set
before we called xchg)?
> index f5af95a0c6b8..576420df12b8 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1092,6 +1092,19 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> }
>
> +#ifndef pmdp_mknotpresent
> +#define pmdp_mknotpresent pmdp_mknotpresent
> +static inline void pmdp_mknotpresent(pmd_t *pmdp)
> +{
> + pmd_t old, new;
> +
> + do {
> + old = *pmdp;
> + new = pmd_mknotpresent(old);
> + } while (pmd_val(cmpxchg(pmdp, old, new)) != pmd_val(old));
> +}
> +#endif
Isn't it faster to do xchg(&xp->pmd, pmd_mknotpresent(pmd)) and have
the pmdp_invalidate caller set the dirty bit in the page if it was found
set in the returned old pmd value (skipping the loop and the cmpxchg)?
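Something like this, untested and 64-bit only (PAE would still need the
4-byte trick from this patch to avoid cmpxchg8b):

static inline pmd_t pmdp_mknotpresent(pmd_t *pmdp)
{
	pmd_t pmd = *pmdp;

	/*
	 * xchg() returns whatever was in the entry at the time of the
	 * exchange, so a dirty bit set by another CPU after the read
	 * above is still reported to the caller in the return value.
	 * The not-present value we install can't be modified further
	 * by the hardware.
	 */
	return native_make_pmd(xchg(&pmdp->pmd,
				    pmd_val(pmd_mknotpresent(pmd))));
}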
Thanks,
Andrea
Hi Aneesh,
On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
> On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> >Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
> >dirty and access bits if the CPU sets them after the pmdp dereference,
> >but before set_pmd_at().
> >
> >The bug doesn't lead to user-visible misbehaviour in the current kernel,
> >but fixing this would be critical for future work on THP: both huge-ext4
> >and THP swap-out rely on proper dirty tracking.
> >
> >Unfortunately, there's no way to address the issue in a generic way. We
> >need to fix all architectures that support THP one by one.
> >
> >All architectures that support THP have to provide an atomic
> >pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is
> >used, the architecture needs to provide an atomic pmdp_mknotpresent().
> >
> >I've fixed the issue for x86, but I need help with the rest.
> >
> >So far THP is supported on 8 architectures. Power and S390 already
> >provide an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5
> >architectures are left:
> >
> > - arc;
> > - arm;
> > - arm64;
> > - mips;
> > - sparc -- it has a custom pmdp_invalidate(), but it's racy too.
> >
> >Please help me with them.
> >
> >Kirill A. Shutemov (3):
> > x86/mm: Provide pmdp_mknotpresent() helper
> > mm: Do not lose dirty and access bits in pmdp_invalidate()
> > mm, thp: Do not lose dirty bit in __split_huge_pmd_locked()
> >
>
>
> But in __split_huge_pmd_locked() we collected the dirty bit early. So
> even if we made pmdp_invalidate() atomic, if the pmd entry was marked
> dirty after we collected the dirty bit, we would still lose it, right?
>
>
> Maybe we should relook at the pmd PTE update interface. We really need
> an interface that can update pmd entries such that we don't clear them
> in between. IMHO, we can avoid pmdp_invalidate() completely if we can
> switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t). We
> also need this interface to avoid the madvise race fixed by
There's a good chance I'm not following your suggestion here, but it's
probably worth me pointing out that swizzling a page table entry from a
block mapping (e.g. a huge page mapped at the PMD level) to a table entry
(e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
problems on ARM, including amalgamation of TLB entries and fatal aborts.
So we really need to go via an invalid entry, with appropriate TLB
invalidation before installing the new entry.
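Roughly, the only safe order is the one __split_huge_pmd_locked()
already follows:

	/* break-before-make: never switch block -> table directly */
	pmdp_invalidate(vma, haddr, pmd);	/* invalid entry + TLB flush */
	pmd_populate(mm, pmd, pgtable);		/* only now install the table */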
Will
On 06/14/2017 06:55 PM, Will Deacon wrote:
>>
>> Maybe we should relook at the pmd PTE update interface. We really need
>> an interface that can update pmd entries such that we don't clear them
>> in between. IMHO, we can avoid pmdp_invalidate() completely if we can
>> switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t). We
>> also need this interface to avoid the madvise race fixed by
>
> There's a good chance I'm not following your suggestion here, but it's
> probably worth me pointing out that swizzling a page table entry from a
> block mapping (e.g. a huge page mapped at the PMD level) to a table entry
> (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
> problems on ARM, including amalgamation of TLB entries and fatal aborts.
AFAIK some AMD x86_64 CPUs had the same problem and generated MCEs,
and on Intel there are some restrictions on when you can do that. See
the large comment in __split_huge_pmd_locked().
> So we really need to go via an invalid entry, with appropriate TLB
> invalidation before installing the new entry.
>
> Will
>
On Wednesday 14 June 2017 10:25 PM, Will Deacon wrote:
> Hi Aneesh,
>
> On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
>> On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
>>> Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
>>> dirty and access bits if the CPU sets them after the pmdp dereference,
>>> but before set_pmd_at().
>>>
>>> The bug doesn't lead to user-visible misbehaviour in the current kernel,
>>> but fixing this would be critical for future work on THP: both huge-ext4
>>> and THP swap-out rely on proper dirty tracking.
>>>
>>> Unfortunately, there's no way to address the issue in a generic way. We
>>> need to fix all architectures that support THP one by one.
>>>
>>> All architectures that support THP have to provide an atomic
>>> pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is
>>> used, the architecture needs to provide an atomic pmdp_mknotpresent().
>>>
>>> I've fixed the issue for x86, but I need help with the rest.
>>>
>>> So far THP is supported on 8 architectures. Power and S390 already
>>> provide an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5
>>> architectures are left:
>>>
>>> - arc;
>>> - arm;
>>> - arm64;
>>> - mips;
>>> - sparc -- it has a custom pmdp_invalidate(), but it's racy too.
>>>
>>> Please help me with them.
>>>
>>> Kirill A. Shutemov (3):
>>> x86/mm: Provide pmdp_mknotpresent() helper
>>> mm: Do not lose dirty and access bits in pmdp_invalidate()
>>> mm, thp: Do not lose dirty bit in __split_huge_pmd_locked()
>>>
>>
>>
>> But in __split_huge_pmd_locked() we collected the dirty bit early. So
>> even if we made pmdp_invalidate() atomic, if the pmd entry was marked
>> dirty after we collected the dirty bit, we would still lose it, right?
>>
>>
>> Maybe we should relook at the pmd PTE update interface. We really need
>> an interface that can update pmd entries such that we don't clear them
>> in between. IMHO, we can avoid pmdp_invalidate() completely if we can
>> switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t). We
>> also need this interface to avoid the madvise race fixed by
>
> There's a good chance I'm not following your suggestion here, but it's
> probably worth me pointing out that swizzling a page table entry from a
> block mapping (e.g. a huge page mapped at the PMD level) to a table entry
> (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
> problems on ARM, including amalgamation of TLB entries and fatal aborts.
>
> So we really need to go via an invalid entry, with appropriate TLB
> invalidation before installing the new entry.
>
I am not suggesting we don't do the invalidate (the need for that is
documented in __split_huge_pmd_locked()). I am suggesting we need a new
interface, something like Andrea suggested:
old_pmd = pmdp_establish(pmd_mknotpresent());
instead of pmdp_invalidate(). We can then use this in scenarios where we
want to update pmd PTE entries, where right now we go through a
pmdp_clear and set_pmd path. We should really not do that for THP entries.
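Something like this untested sketch, with pmdp_establish() as the
strawman name (variables as in the existing protection-change paths):

	/* today: a window where *pmdp is pmd_none() and A/D updates get lost */
	entry = pmdp_huge_get_and_clear(mm, addr, pmdp);
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmdp, entry);

	/* proposed: replace the entry atomically, no cleared window */
	old_pmd = pmdp_establish(pmd_modify(*pmdp, newprot));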
W.r.t pmdp_invalidate() usage, I was wondering whether we can do that
early in __split_huge_pmd_locked().
-aneesh
On Wednesday 14 June 2017 10:30 PM, Vlastimil Babka wrote:
> On 06/14/2017 06:55 PM, Will Deacon wrote:
>>>
>>> Maybe we should relook at the pmd PTE update interface. We really need
>>> an interface that can update pmd entries such that we don't clear them
>>> in between. IMHO, we can avoid pmdp_invalidate() completely if we can
>>> switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t). We
>>> also need this interface to avoid the madvise race fixed by
>>
>> There's a good chance I'm not following your suggestion here, but it's
>> probably worth me pointing out that swizzling a page table entry from a
>> block mapping (e.g. a huge page mapped at the PMD level) to a table entry
>> (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
>> problems on ARM, including amalgamation of TLB entries and fatal aborts.
>
> AFAIK some AMD x86_64 CPUs had the same problem and generated MCEs,
> and on Intel there are some restrictions on when you can do that. See
> the large comment in __split_huge_pmd_locked().
>
I was wondering whether we can do pmdp_establish(pgtable); and document
all quirks needed for that in the per-arch implementation of
pmdp_establish(). We could also then switch all the
pmdp_clear/set_pmd_at() usage to pmdp_establish().
-aneesh
On Thursday 15 June 2017 06:35 AM, Aneesh Kumar K.V wrote:
>
> W.r.t pmdp_invalidate() usage, I was wondering whether we can do that
> early in __split_huge_pmd_locked().
>
BTW, by moving pmdp_invalidate() early, we can then get rid of
pmdp_huge_split_prepare(vma, haddr, pmd).
-aneesh
Hi Kirill,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170615-115540
base: git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x071-06130444 (attached as .config)
compiler: gcc-6 (Debian 6.3.0-18) 6.3.0 20170516
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All error/warnings (new ones prefixed by >>):
In file included from include/linux/mm.h:70:0,
from include/linux/memcontrol.h:29,
from include/linux/swap.h:8,
from include/linux/suspend.h:4,
from arch/x86/kernel/asm-offsets.c:12:
>> arch/x86/include/asm/pgtable.h:1096:27: error: redefinition of 'pmdp_mknotpresent'
#define pmdp_mknotpresent pmdp_mknotpresent
^
>> arch/x86/include/asm/pgtable.h:1097:20: note: in expansion of macro 'pmdp_mknotpresent'
static inline void pmdp_mknotpresent(pmd_t *pmdp)
^~~~~~~~~~~~~~~~~
In file included from arch/x86/include/asm/pgtable_32.h:43:0,
from arch/x86/include/asm/pgtable.h:604,
from include/linux/mm.h:70,
from include/linux/memcontrol.h:29,
from include/linux/swap.h:8,
from include/linux/suspend.h:4,
from arch/x86/kernel/asm-offsets.c:12:
arch/x86/include/asm/pgtable-3level.h:194:20: note: previous definition of 'pmdp_mknotpresent' was here
static inline void pmdp_mknotpresent(pmd_t *pmdp)
^~~~~~~~~~~~~~~~~
make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [sub-make] Error 2
vim +/pmdp_mknotpresent +1096 arch/x86/include/asm/pgtable.h
1090 unsigned long addr, pmd_t *pmdp)
1091 {
1092 clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
1093 }
1094
1095 #ifndef pmdp_mknotpresent
> 1096 #define pmdp_mknotpresent pmdp_mknotpresent
> 1097 static inline void pmdp_mknotpresent(pmd_t *pmdp)
1098 {
1099 pmd_t old, new;
1100
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Wed, Jun 14, 2017 at 05:31:31PM +0200, Andrea Arcangeli wrote:
> Hello,
>
> On Wed, Jun 14, 2017 at 04:18:57PM +0200, Martin Schwidefsky wrote:
> > Could we change pmdp_invalidate to make it return the old pmd entry?
>
> That to me seems the simplest fix to avoid losing the dirty bit.
>
> I earlier suggested replacing pmdp_invalidate with something like
> old_pmd = pmdp_establish(pmd_mknotpresent(pmd)) (the TLB flush could
> then be conditional on the old pmd being present). Making
> pmdp_invalidate return the old pmd entry would be mostly equivalent
> to that.
>
> The advantage of not changing pmdp_invalidate is that we could skip the
> more costly xchg in __split_huge_pmd_locked and madvise_free_huge_pmd,
> so perhaps there's a point in keeping a variant of pmdp_invalidate that
> doesn't use xchg internally (and in turn can't return the old pmd value
> atomically).
>
> If we don't want new messy names like pmdp_establish, we could have a
> __pmdp_invalidate that returns void, and a pmdp_invalidate that returns
> the old pmd and uses xchg (it'd also be backwards compatible as far as
> the callers are concerned). So those places that don't need the old
> value returned and can skip the xchg could simply
> s/pmdp_invalidate/__pmdp_invalidate/ to optimize.
We have a few pmdp_invalidate() callers:
- clear_soft_dirty_pmd();
- madvise_free_huge_pmd();
- change_huge_pmd();
- __split_huge_pmd_locked();
Only madvise_free_huge_pmd() doesn't care about the old pmd.
__split_huge_pmd_locked() actually needs to check the dirty bit after
pmdp_invalidate(), see patch 3/3 of the patchset.
I don't think it's worth introducing one more primitive only for
madvise_free_huge_pmd().
I'll stick with a single pmdp_invalidate() that returns the old value.
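In __split_huge_pmd_locked() (see 3/3) the dirty-bit transfer would then
read the returned value instead of dereferencing the invalidated pmd,
roughly (untested):

	old_pmd = pmdp_invalidate(vma, haddr, pmd);
	if (pmd_dirty(old_pmd))
		SetPageDirty(page);
	pmd_populate(mm, pmd, pgtable);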
--
Kirill A. Shutemov
On Thu, Jun 15, 2017 at 06:35:21AM +0530, Aneesh Kumar K.V wrote:
>
>
> On Wednesday 14 June 2017 10:25 PM, Will Deacon wrote:
> > Hi Aneesh,
> >
> > On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
> > > On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> > > > Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
> > > > dirty and access bits if the CPU sets them after the pmdp dereference,
> > > > but before set_pmd_at().
> > > >
> > > > The bug doesn't lead to user-visible misbehaviour in the current kernel,
> > > > but fixing this would be critical for future work on THP: both huge-ext4
> > > > and THP swap-out rely on proper dirty tracking.
> > > >
> > > > Unfortunately, there's no way to address the issue in a generic way. We
> > > > need to fix all architectures that support THP one by one.
> > > >
> > > > All architectures that support THP have to provide an atomic
> > > > pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is
> > > > used, the architecture needs to provide an atomic pmdp_mknotpresent().
> > > >
> > > > I've fixed the issue for x86, but I need help with the rest.
> > > >
> > > > So far THP is supported on 8 architectures. Power and S390 already
> > > > provide an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5
> > > > architectures are left:
> > > >
> > > > - arc;
> > > > - arm;
> > > > - arm64;
> > > > - mips;
> > > > - sparc -- it has a custom pmdp_invalidate(), but it's racy too.
> > > >
> > > > Please help me with them.
> > > >
> > > > Kirill A. Shutemov (3):
> > > > x86/mm: Provide pmdp_mknotpresent() helper
> > > > mm: Do not lose dirty and access bits in pmdp_invalidate()
> > > > mm, thp: Do not lose dirty bit in __split_huge_pmd_locked()
> > > >
> > >
> > >
> > > But in __split_huge_pmd_locked() we collected the dirty bit early. So
> > > even if we made pmdp_invalidate() atomic, if the pmd entry was marked
> > > dirty after we collected the dirty bit, we would still lose it, right?
> > >
> > >
> > > Maybe we should relook at the pmd PTE update interface. We really need
> > > an interface that can update pmd entries such that we don't clear them
> > > in between. IMHO, we can avoid pmdp_invalidate() completely if we can
> > > switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t). We
> > > also need this interface to avoid the madvise race fixed by
> >
> > There's a good chance I'm not following your suggestion here, but it's
> > probably worth me pointing out that swizzling a page table entry from a
> > block mapping (e.g. a huge page mapped at the PMD level) to a table entry
> > (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
> > problems on ARM, including amalgamation of TLB entries and fatal aborts.
> >
> > So we really need to go via an invalid entry, with appropriate TLB
> > invalidation before installing the new entry.
> >
>
> I am not suggesting we don't do the invalidate (the need for that is
> documented in __split_huge_pmd_locked()). I am suggesting we need a new
> interface, something like Andrea suggested:
>
> old_pmd = pmdp_establish(pmd_mknotpresent());
>
> instead of pmdp_invalidate(). We can then use this in scenarios where we
> want to update pmd PTE entries, where right now we go through a pmdp_clear
> and set_pmd path. We should really not do that for THP entries.
Which cases are you talking about? When do we need to clear the pmd and
set it later?
--
Kirill A. Shutemov
Hi Kirill,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170615-115540
base: git://git.cmpxchg.org/linux-mmotm.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm64
All errors (new ones prefixed by >>):
mm/pgtable-generic.c: In function 'pmdp_invalidate':
>> mm/pgtable-generic.c:185:2: error: implicit declaration of function 'pmdp_mknotpresent' [-Werror=implicit-function-declaration]
pmdp_mknotpresent(pmdp);
^~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/pmdp_mknotpresent +185 mm/pgtable-generic.c
179 #endif
180
181 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
182 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
183 pmd_t *pmdp)
184 {
> 185 pmdp_mknotpresent(pmdp);
186 flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
187 }
188 #endif
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Thursday 15 June 2017 02:18 PM, Kirill A. Shutemov wrote:
>> I am not suggesting we don't do the invalidate (the need for that is
>> documented in __split_huge_pmd_locked()). I am suggesting we need a new
>> interface, something like Andrea suggested:
>>
>> old_pmd = pmdp_establish(pmd_mknotpresent());
>>
>> instead of pmdp_invalidate(). We can then use this in scenarios where we
>> want to update pmd PTE entries, where right now we go through a pmdp_clear
>> and set_pmd path. We should really not do that for THP entries.
>
> Which cases are you talking about? When do we need to clear the pmd and
> set it later?
>
With the latest upstream I am finding this usage when we mark the pte
clean in page_mkclean_one(). Also there is a similar usage in
migrate_misplaced_transhuge_page(). I haven't really verified whether
they cause any race. But my suggestion is that we should avoid the usage
of set_pmd_at() unless we are creating a new pmd PTE entry. If we can
provide pmdp_establish() we can achieve that easily.
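For the page_mkclean_one() case, an untested sketch with the same
strawman pmdp_establish() would be:

	/* wrprotect + clean the pmd without ever clearing it */
	pmd_t old = pmdp_establish(pmd_wrprotect(pmd_mkclean(*pmdp)));
	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	if (pmd_dirty(old))
		set_page_dirty(page);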
-aneesh