Hi,
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference, but
before set_pmd_at().
The bug doesn't lead to user-visible misbehaviour in the current kernel, but
fixing it is critical for future work on THP: both huge-ext4 and THP
swap-out rely on proper dirty tracking.
Unfortunately, there's no way to address the issue in a generic way. We need to
fix all architectures that support THP one by one.
Every architecture that supports THP has to provide an atomic
pmdp_invalidate() that returns the previous value of the entry.
If the generic implementation of pmdp_invalidate() is used, the architecture
needs to provide an atomic pmdp_establish().
pmdp_establish() is not used outside the generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.
I've fixed the issue for x86, but I need help with the rest.
So far THP is supported on seven architectures besides x86:
- arc;
- arm;
- arm64;
- mips;
- power;
- s390;
- sparc;
Please help me with them.
v2:
- Introduce pmdp_establish(), instead of pmdp_mknotpresent();
- Change pmdp_invalidate() to return the previous value of the pmd;
arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
arch/x86/include/asm/pgtable.h | 14 ++++++++++++++
fs/proc/task_mmu.c | 8 ++++----
include/asm-generic/pgtable.h | 2 +-
mm/huge_memory.c | 29 ++++++++++++-----------------
mm/pgtable-generic.c | 9 +++++----
6 files changed, 54 insertions(+), 26 deletions(-)
--
2.11.0
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference, but
before set_pmd_at().
The bug doesn't lead to user-visible misbehaviour in the current kernel.
Losing the access bit can lead to sub-optimal reclaim behaviour for THP,
but nothing destructive.
Losing the dirty bit is not a big deal either: we make the page dirty
unconditionally on splitting a huge page.
The fix is critical for future work on THP: both huge-ext4 and THP swap-out
rely on proper dirty tracking.
This patch changes pmdp_invalidate() to make the entry non-present atomically
and to return the previous value of the entry. This value can be used to check
whether the CPU set dirty/accessed bits under us.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
---
include/asm-generic/pgtable.h | 2 +-
mm/pgtable-generic.c | 9 +++++----
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7dfa767dc680..ece5e399567a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -309,7 +309,7 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
#endif
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..148fe36f61a7 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -179,12 +179,13 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
#endif
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_t entry = *pmdp;
- set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
- flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
+ if (pmd_present(old))
+ flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ return old;
}
#endif
--
2.11.0
This patch uses the modified pmdp_invalidate(), which returns the previous
value of the pmd, to transfer dirty and accessed bits.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 8 ++++----
mm/huge_memory.c | 29 ++++++++++++-----------------
2 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33d99b1..f2fc1ef5bba2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- pmd_t pmd = *pmdp;
+ pmd_t old, pmd = *pmdp;
/* See comment in change_huge_pmd() */
- pmdp_invalidate(vma, addr, pmdp);
- if (pmd_dirty(*pmdp))
+ old = pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(old))
pmd = pmd_mkdirty(pmd);
- if (pmd_young(*pmdp))
+ if (pmd_young(old))
pmd = pmd_mkyoung(pmd);
pmd = pmd_wrprotect(pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..0433e73531bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* pmdp_invalidate() is required to make sure we don't miss
* dirty/young flags set by hardware.
*/
- entry = *pmd;
- pmdp_invalidate(vma, addr, pmd);
-
- /*
- * Recover dirty/young flags. It relies on pmdp_invalidate to not
- * corrupt them.
- */
- if (pmd_dirty(*pmd))
- entry = pmd_mkdirty(entry);
- if (pmd_young(*pmd))
- entry = pmd_mkyoung(entry);
+ entry = pmdp_invalidate(vma, addr, pmd);
entry = pmd_modify(entry, newprot);
if (preserve_write)
@@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct mm_struct *mm = vma->vm_mm;
struct page *page;
pgtable_t pgtable;
- pmd_t _pmd;
- bool young, write, dirty, soft_dirty;
+ pmd_t old, _pmd;
+ bool young, write, soft_dirty;
unsigned long addr;
int i;
@@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
young = pmd_young(*pmd);
- dirty = pmd_dirty(*pmd);
soft_dirty = pmd_soft_dirty(*pmd);
pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
}
- if (dirty)
- SetPageDirty(page + i);
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
@@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* and finally we write the non-huge version of the pmd entry with
* pmd_populate.
*/
- pmdp_invalidate(vma, haddr, pmd);
+ old = pmdp_invalidate(vma, haddr, pmd);
+
+ /*
+ * Transfer the dirty bit using the value returned by pmdp_invalidate()
+ * to make sure we don't race with a CPU that can set the bit under us.
+ */
+ if (pmd_dirty(old))
+ SetPageDirty(page);
+
pmd_populate(mm, pmd, pgtable);
if (freeze) {
--
2.11.0
We need an atomic way to set up a pmd page table entry, avoiding races with
the CPU setting dirty/accessed bits. This is required to implement a
pmdp_invalidate() that doesn't lose these bits.
On PAE we have to use cmpxchg8b, as we cannot assume what the value of the new
pmd is, and setting it up half by half can expose a broken, corrupted entry to
the CPU.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
arch/x86/include/asm/pgtable.h | 14 ++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 50d35e3185f5..471c8a851363 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -180,6 +180,24 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
#endif
+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
+{
+ pmd_t old;
+
+ /*
+ * We cannot assume what the value of the pmd is here, so there's no
+ * easy way to set it half by half. We have to fall back to cmpxchg64.
+ */
+ do {
+ old = *pmdp;
+ } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
+
+ return old;
+}
+#endif
+
#ifdef CONFIG_SMP
union split_pud {
struct {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f5af95a0c6b8..a924fc6a96b9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}
+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
+{
+ if (IS_ENABLED(CONFIG_SMP)) {
+ return xchg(pmdp, pmd);
+ } else {
+ pmd_t old = *pmdp;
+ *pmdp = pmd;
+ return old;
+ }
+}
+#endif
+
/*
* clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
*
--
2.11.0
Hi Kirill,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170616-030455
base: git://git.cmpxchg.org/linux-mmotm.git master
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390
All errors (new ones prefixed by >>):
fs/proc/task_mmu.c: In function 'clear_soft_dirty_pmd':
>> fs/proc/task_mmu.c:912:6: error: void value not ignored as it ought to be
old = pmdp_invalidate(vma, addr, pmdp);
^
vim +912 fs/proc/task_mmu.c
906 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
907 unsigned long addr, pmd_t *pmdp)
908 {
909 pmd_t old, pmd = *pmdp;
910
911 /* See comment in change_huge_pmd() */
> 912 old = pmdp_invalidate(vma, addr, pmdp);
913 if (pmd_dirty(old))
914 pmd = pmd_mkdirty(pmd);
915 if (pmd_young(old))
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Kirill,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170616-030455
base: git://git.cmpxchg.org/linux-mmotm.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm64
All errors (new ones prefixed by >>):
mm/pgtable-generic.c: In function 'pmdp_invalidate':
>> mm/pgtable-generic.c:185:14: error: implicit declaration of function 'pmdp_establish' [-Werror=implicit-function-declaration]
pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
^~~~~~~~~~~~~~
>> mm/pgtable-generic.c:185:14: error: invalid initializer
cc1: some warnings being treated as errors
vim +/pmdp_establish +185 mm/pgtable-generic.c
179 #endif
180
181 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
182 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
183 pmd_t *pmdp)
184 {
> 185 pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
186 if (pmd_present(old))
187 flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
188 return old;
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Kirill,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170616-030455
base: git://git.cmpxchg.org/linux-mmotm.git master
config: s390-defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390
All errors (new ones prefixed by >>):
mm/huge_memory.c: In function 'change_huge_pmd':
>> mm/huge_memory.c:1780:8: error: void value not ignored as it ought to be
entry = pmdp_invalidate(vma, addr, pmd);
^
mm/huge_memory.c: In function '__split_huge_pmd_locked':
mm/huge_memory.c:2035:6: error: void value not ignored as it ought to be
old = pmdp_invalidate(vma, haddr, pmd);
^
vim +1780 mm/huge_memory.c
1774 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
1775 * which may break userspace.
1776 *
1777 * pmdp_invalidate() is required to make sure we don't miss
1778 * dirty/young flags set by hardware.
1779 */
> 1780 entry = pmdp_invalidate(vma, addr, pmd);
1781
1782 entry = pmd_modify(entry, newprot);
1783 if (preserve_write)
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hello,
On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> to transfer dirty and accessed bits.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/proc/task_mmu.c | 8 ++++----
> mm/huge_memory.c | 29 ++++++++++++-----------------
> 2 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33d99b1..f2fc1ef5bba2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> unsigned long addr, pmd_t *pmdp)
> {
> - pmd_t pmd = *pmdp;
> + pmd_t old, pmd = *pmdp;
>
> /* See comment in change_huge_pmd() */
> - pmdp_invalidate(vma, addr, pmdp);
> - if (pmd_dirty(*pmdp))
> + old = pmdp_invalidate(vma, addr, pmdp);
> + if (pmd_dirty(old))
> pmd = pmd_mkdirty(pmd);
> - if (pmd_young(*pmdp))
> + if (pmd_young(old))
> pmd = pmd_mkyoung(pmd);
>
> pmd = pmd_wrprotect(pmd);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a84909cf20d3..0433e73531bf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> * pmdp_invalidate() is required to make sure we don't miss
> * dirty/young flags set by hardware.
> */
> - entry = *pmd;
> - pmdp_invalidate(vma, addr, pmd);
> -
> - /*
> - * Recover dirty/young flags. It relies on pmdp_invalidate to not
> - * corrupt them.
> - */
> - if (pmd_dirty(*pmd))
> - entry = pmd_mkdirty(entry);
> - if (pmd_young(*pmd))
> - entry = pmd_mkyoung(entry);
> + entry = pmdp_invalidate(vma, addr, pmd);
>
> entry = pmd_modify(entry, newprot);
> if (preserve_write)
> @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct mm_struct *mm = vma->vm_mm;
> struct page *page;
> pgtable_t pgtable;
> - pmd_t _pmd;
> - bool young, write, dirty, soft_dirty;
> + pmd_t old, _pmd;
> + bool young, write, soft_dirty;
> unsigned long addr;
> int i;
>
> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> page_ref_add(page, HPAGE_PMD_NR - 1);
> write = pmd_write(*pmd);
> young = pmd_young(*pmd);
> - dirty = pmd_dirty(*pmd);
> soft_dirty = pmd_soft_dirty(*pmd);
>
> pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> if (soft_dirty)
> entry = pte_mksoft_dirty(entry);
> }
> - if (dirty)
> - SetPageDirty(page + i);
> pte = pte_offset_map(&_pmd, addr);
> BUG_ON(!pte_none(*pte));
> set_pte_at(mm, addr, pte, entry);
> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * and finally we write the non-huge version of the pmd entry with
> * pmd_populate.
> */
> - pmdp_invalidate(vma, haddr, pmd);
> + old = pmdp_invalidate(vma, haddr, pmd);
> +
> + /*
> + * Transfer dirty bit using value returned by pmd_invalidate() to be
> + * sure we don't race with CPU that can set the bit under us.
> + */
> + if (pmd_dirty(old))
> + SetPageDirty(page);
> +
Looking at this, it seems that without this patch MADV_FREE has been broken,
because it can lose the dirty bit due to the early check. Right?
If so, isn't it a candidate for -stable?
"Kirill A. Shutemov" <[email protected]> writes:
> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> to transfer dirty and accessed bits.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/proc/task_mmu.c | 8 ++++----
> mm/huge_memory.c | 29 ++++++++++++-----------------
> 2 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33d99b1..f2fc1ef5bba2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
.....
> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> page_ref_add(page, HPAGE_PMD_NR - 1);
> write = pmd_write(*pmd);
> young = pmd_young(*pmd);
> - dirty = pmd_dirty(*pmd);
> soft_dirty = pmd_soft_dirty(*pmd);
>
> pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> if (soft_dirty)
> entry = pte_mksoft_dirty(entry);
> }
> - if (dirty)
> - SetPageDirty(page + i);
> pte = pte_offset_map(&_pmd, addr);
> BUG_ON(!pte_none(*pte));
> set_pte_at(mm, addr, pte, entry);
> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * and finally we write the non-huge version of the pmd entry with
> * pmd_populate.
> */
> - pmdp_invalidate(vma, haddr, pmd);
> + old = pmdp_invalidate(vma, haddr, pmd);
> +
> + /*
> + * Transfer dirty bit using value returned by pmd_invalidate() to be
> + * sure we don't race with CPU that can set the bit under us.
> + */
> + if (pmd_dirty(old))
> + SetPageDirty(page);
> +
> pmd_populate(mm, pmd, pgtable);
>
> if (freeze) {
Can we invalidate the pmd early here? I.e., do pmdp_invalidate() instead of
pmdp_huge_split_prepare()?
-aneesh
On Fri, Jun 16, 2017 at 12:02:50PM +0900, Minchan Kim wrote:
> Hello,
>
> On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> > This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> > to transfer dirty and accessed bits.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 8 ++++----
> > mm/huge_memory.c | 29 ++++++++++++-----------------
> > 2 files changed, 16 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> > unsigned long addr, pmd_t *pmdp)
> > {
> > - pmd_t pmd = *pmdp;
> > + pmd_t old, pmd = *pmdp;
> >
> > /* See comment in change_huge_pmd() */
> > - pmdp_invalidate(vma, addr, pmdp);
> > - if (pmd_dirty(*pmdp))
> > + old = pmdp_invalidate(vma, addr, pmdp);
> > + if (pmd_dirty(old))
> > pmd = pmd_mkdirty(pmd);
> > - if (pmd_young(*pmdp))
> > + if (pmd_young(old))
> > pmd = pmd_mkyoung(pmd);
> >
> > pmd = pmd_wrprotect(pmd);
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a84909cf20d3..0433e73531bf 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > * pmdp_invalidate() is required to make sure we don't miss
> > * dirty/young flags set by hardware.
> > */
> > - entry = *pmd;
> > - pmdp_invalidate(vma, addr, pmd);
> > -
> > - /*
> > - * Recover dirty/young flags. It relies on pmdp_invalidate to not
> > - * corrupt them.
> > - */
> > - if (pmd_dirty(*pmd))
> > - entry = pmd_mkdirty(entry);
> > - if (pmd_young(*pmd))
> > - entry = pmd_mkyoung(entry);
> > + entry = pmdp_invalidate(vma, addr, pmd);
> >
> > entry = pmd_modify(entry, newprot);
> > if (preserve_write)
> > @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > struct mm_struct *mm = vma->vm_mm;
> > struct page *page;
> > pgtable_t pgtable;
> > - pmd_t _pmd;
> > - bool young, write, dirty, soft_dirty;
> > + pmd_t old, _pmd;
> > + bool young, write, soft_dirty;
> > unsigned long addr;
> > int i;
> >
> > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > page_ref_add(page, HPAGE_PMD_NR - 1);
> > write = pmd_write(*pmd);
> > young = pmd_young(*pmd);
> > - dirty = pmd_dirty(*pmd);
> > soft_dirty = pmd_soft_dirty(*pmd);
> >
> > pmdp_huge_split_prepare(vma, haddr, pmd);
> > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > if (soft_dirty)
> > entry = pte_mksoft_dirty(entry);
> > }
> > - if (dirty)
> > - SetPageDirty(page + i);
> > pte = pte_offset_map(&_pmd, addr);
> > BUG_ON(!pte_none(*pte));
> > set_pte_at(mm, addr, pte, entry);
> > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > * and finally we write the non-huge version of the pmd entry with
> > * pmd_populate.
> > */
> > - pmdp_invalidate(vma, haddr, pmd);
> > + old = pmdp_invalidate(vma, haddr, pmd);
> > +
> > + /*
> > + * Transfer dirty bit using value returned by pmd_invalidate() to be
> > + * sure we don't race with CPU that can set the bit under us.
> > + */
> > + if (pmd_dirty(old))
> > + SetPageDirty(page);
> > +
>
> When I see this, without this patch, MADV_FREE has been broken because
> it can lose dirty bit by early checking. Right?
> If so, isn't it a candidate for -stable?
Actually, I don't see how MADV_FREE is supposed to work here: vmscan splits
THP on reclaim, and split_huge_page() would set the dirty bit unconditionally,
so MADV_FREE seems to have no effect on THP.
Or have I missed anything?
--
Kirill A. Shutemov
On Fri, Jun 16, 2017 at 05:01:30PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> > to transfer dirty and accessed bits.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 8 ++++----
> > mm/huge_memory.c | 29 ++++++++++++-----------------
> > 2 files changed, 16 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
>
> .....
>
> > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > page_ref_add(page, HPAGE_PMD_NR - 1);
> > write = pmd_write(*pmd);
> > young = pmd_young(*pmd);
> > - dirty = pmd_dirty(*pmd);
> > soft_dirty = pmd_soft_dirty(*pmd);
> >
> > pmdp_huge_split_prepare(vma, haddr, pmd);
> > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > if (soft_dirty)
> > entry = pte_mksoft_dirty(entry);
> > }
> > - if (dirty)
> > - SetPageDirty(page + i);
> > pte = pte_offset_map(&_pmd, addr);
> > BUG_ON(!pte_none(*pte));
> > set_pte_at(mm, addr, pte, entry);
> > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > * and finally we write the non-huge version of the pmd entry with
> > * pmd_populate.
> > */
> > - pmdp_invalidate(vma, haddr, pmd);
> > + old = pmdp_invalidate(vma, haddr, pmd);
> > +
> > + /*
> > + * Transfer dirty bit using value returned by pmd_invalidate() to be
> > + * sure we don't race with CPU that can set the bit under us.
> > + */
> > + if (pmd_dirty(old))
> > + SetPageDirty(page);
> > +
> > pmd_populate(mm, pmd, pgtable);
> >
> > if (freeze) {
>
>
> Can we invalidate the pmd early here ? ie, do pmdp_invalidate instead of
> pmdp_huge_split_prepare() ?
I think we can. But it means we would block access to the page for longer
than necessary on most architectures. I guess it's not a big deal.
Maybe as a separate patch on top of this patchset? Aneesh, would you take
care of this?
--
Kirill A. Shutemov
Hello Kirill,
On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> + pmd_t old;
> +
> + /*
> + * We cannot assume what is value of pmd here, so there's no easy way
> + * to set if half by half. We have to fall back to cmpxchg64.
> + */
> + {
> + old = *pmdp;
> + } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> +
> + return old;
> +}
I see further margin for optimization here (although it only matters for
32-bit PAE x86).
pmd is stable so we could do:

	if (!(pmd & _PAGE_PRESENT)) {
		cast to split_pmd and use xchg on pmd_low like
		native_pmdp_get_and_clear, and copy pmd_high non-atomically
	} else {
		the above cmpxchg64 loop
	}
Now, thinking about the above, I had second thoughts on whether
pmdp_establish is the right interface, and whether we shouldn't replace it
with a pmdp_mknotpresent instead, to skip the pmd & _PAGE_PRESENT check that
will always be true in practice. pmdp_mknotpresent would call
pmd_mknotpresent internally, wouldn't have to check for pmd & _PAGE_PRESENT,
and would have no cons on x86-64.
On Thu, Jun 15, 2017 at 05:52:23PM +0300, Kirill A. Shutemov wrote:
> -void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmdp)
> {
> - pmd_t entry = *pmdp;
> - set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> - flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
> + if (pmd_present(old))
> + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + return old;
> }
> #endif
The pmd_present() check added above is superfluous, because there's no
point in calling pmdp_invalidate() if the pmd is not present (present as in
pmd_present()) already. pmd_present() returns true if _PAGE_PSE is set,
and it was always set before calling pmdp_invalidate().
It looks like we could skip the flush if _PAGE_PRESENT is not set
(i.e., for example, if the pmd is PROTNONE), but that's not what the above
pmd_present() check does.
On Fri, Jun 16, 2017 at 04:19:08PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 12:02:50PM +0900, Minchan Kim wrote:
> > Hello,
> >
> > On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> > > This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> > > to transfer dirty and accessed bits.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > ---
> > > fs/proc/task_mmu.c | 8 ++++----
> > > mm/huge_memory.c | 29 ++++++++++++-----------------
> > > 2 files changed, 16 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > > static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> > > unsigned long addr, pmd_t *pmdp)
> > > {
> > > - pmd_t pmd = *pmdp;
> > > + pmd_t old, pmd = *pmdp;
> > >
> > > /* See comment in change_huge_pmd() */
> > > - pmdp_invalidate(vma, addr, pmdp);
> > > - if (pmd_dirty(*pmdp))
> > > + old = pmdp_invalidate(vma, addr, pmdp);
> > > + if (pmd_dirty(old))
> > > pmd = pmd_mkdirty(pmd);
> > > - if (pmd_young(*pmdp))
> > > + if (pmd_young(old))
> > > pmd = pmd_mkyoung(pmd);
> > >
> > > pmd = pmd_wrprotect(pmd);
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index a84909cf20d3..0433e73531bf 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > > * pmdp_invalidate() is required to make sure we don't miss
> > > * dirty/young flags set by hardware.
> > > */
> > > - entry = *pmd;
> > > - pmdp_invalidate(vma, addr, pmd);
> > > -
> > > - /*
> > > - * Recover dirty/young flags. It relies on pmdp_invalidate to not
> > > - * corrupt them.
> > > - */
> > > - if (pmd_dirty(*pmd))
> > > - entry = pmd_mkdirty(entry);
> > > - if (pmd_young(*pmd))
> > > - entry = pmd_mkyoung(entry);
> > > + entry = pmdp_invalidate(vma, addr, pmd);
> > >
> > > entry = pmd_modify(entry, newprot);
> > > if (preserve_write)
> > > @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > struct mm_struct *mm = vma->vm_mm;
> > > struct page *page;
> > > pgtable_t pgtable;
> > > - pmd_t _pmd;
> > > - bool young, write, dirty, soft_dirty;
> > > + pmd_t old, _pmd;
> > > + bool young, write, soft_dirty;
> > > unsigned long addr;
> > > int i;
> > >
> > > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > page_ref_add(page, HPAGE_PMD_NR - 1);
> > > write = pmd_write(*pmd);
> > > young = pmd_young(*pmd);
> > > - dirty = pmd_dirty(*pmd);
> > > soft_dirty = pmd_soft_dirty(*pmd);
> > >
> > > pmdp_huge_split_prepare(vma, haddr, pmd);
> > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > if (soft_dirty)
> > > entry = pte_mksoft_dirty(entry);
> > > }
> > > - if (dirty)
> > > - SetPageDirty(page + i);
> > > pte = pte_offset_map(&_pmd, addr);
> > > BUG_ON(!pte_none(*pte));
> > > set_pte_at(mm, addr, pte, entry);
> > > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > * and finally we write the non-huge version of the pmd entry with
> > > * pmd_populate.
> > > */
> > > - pmdp_invalidate(vma, haddr, pmd);
> > > + old = pmdp_invalidate(vma, haddr, pmd);
> > > +
> > > + /*
> > > + * Transfer dirty bit using value returned by pmd_invalidate() to be
> > > + * sure we don't race with CPU that can set the bit under us.
> > > + */
> > > + if (pmd_dirty(old))
> > > + SetPageDirty(page);
> > > +
> >
> > When I see this, without this patch, MADV_FREE has been broken because
> > it can lose dirty bit by early checking. Right?
> > If so, isn't it a candidate for -stable?
>
> Actually, I don't see how MADV_FREE supposed to work: vmscan splits THP on
> reclaim and split_huge_page() would set unconditionally, so MADV_FREE
> seems no effect on THP.
Does split_huge_page() set PG_dirty on all subpages unconditionally?
If that's true, then yes, it doesn't break MADV_FREE. However, I didn't spot
that piece of code. What I found is just __split_huge_page_tail(), which
sets PG_dirty on a subpage if the head page is dirty. IOW, if the head
page is not dirty, the tail pages will be clean, too.
Could you point out which routine sets PG_dirty on all subpages unconditionally?
Thanks.
Hello Minchan,
On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > if (soft_dirty)
> > > > entry = pte_mksoft_dirty(entry);
> > > > }
> > > > - if (dirty)
> > > > - SetPageDirty(page + i);
> > > > pte = pte_offset_map(&_pmd, addr);
[..]
>
> split_huge_page set PG_dirty to all subpages unconditionally?
> If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> that piece of code. What I found one is just __split_huge_page_tail
> which set PG_dirty to subpage if head page is dirty. IOW, if the head
> page is not dirty, tail page will be clean, too.
> Could you point out what routine set PG_dirty to all subpages unconditionally?
On a side note, the snippet deleted above was useless: as long as
there's one hugepmd left to split, the physical page has to still be
compound and huge, and as long as that's the case the tail pages'
PG_dirty bit is meaningless (even if set, it's going to be clobbered
during the physical split).
In short, PG_dirty is only meaningful in the head as long as the page is
compound. The physical split in __split_huge_page_tail() transfers the
head value to the tails like you mentioned; that's all as far as I can
tell.
Hi Andrea,
On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> Hello Minchan,
>
> On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > if (soft_dirty)
> > > > > entry = pte_mksoft_dirty(entry);
> > > > > }
> > > > > - if (dirty)
> > > > > - SetPageDirty(page + i);
> > > > > pte = pte_offset_map(&_pmd, addr);
> [..]
> >
> > split_huge_page set PG_dirty to all subpages unconditionally?
> > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > that piece of code. What I found one is just __split_huge_page_tail
> > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > page is not dirty, tail page will be clean, too.
> > Could you point out what routine set PG_dirty to all subpages unconditionally?
>
> On a side note the snippet deleted above was useless, as long as
> there's one left hugepmd to split, the physical page has to be still
> compound and huge and as long as that's the case the tail pages
> PG_dirty bit is meaningless (even if set, it's going to be clobbered
> during the physical split).
I got that while reviewing this patch. That's why I didn't argue that
this patch would break MADV_FREE by deleting the routine which propagates
the dirty bit to the ptes of subpages. However, although it's useless, I prefer
not to remove the transfer of the dirty bit, because it would help MADV_FREE
users who want to use smaps to know how many pages are not freeable
(i.e., dirtied) since MADV_FREE, even though it is not 100% correct.
>
> In short PG_dirty is only meaningful in the head as long as it's
> compound. The physical split in __split_huge_page_tail transfer the
> head value to the tails like you mentioned, that's all as far as I can
> tell.
Thanks for the comment. Then this patch fixes an MADV_FREE bug:
the dirty bit has been lost by being transferred too early.
Thanks.
On Friday 16 June 2017 06:51 PM, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 05:01:30PM +0530, Aneesh Kumar K.V wrote:
>> "Kirill A. Shutemov" <[email protected]> writes:
>>
>>> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
>>> to transfer dirty and accessed bits.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> ---
>>> fs/proc/task_mmu.c | 8 ++++----
>>> mm/huge_memory.c | 29 ++++++++++++-----------------
>>> 2 files changed, 16 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index f0c8b33d99b1..f2fc1ef5bba2 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>
>> .....
>>
>>> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> page_ref_add(page, HPAGE_PMD_NR - 1);
>>> write = pmd_write(*pmd);
>>> young = pmd_young(*pmd);
>>> - dirty = pmd_dirty(*pmd);
>>> soft_dirty = pmd_soft_dirty(*pmd);
>>>
>>> pmdp_huge_split_prepare(vma, haddr, pmd);
>>> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> if (soft_dirty)
>>> entry = pte_mksoft_dirty(entry);
>>> }
>>> - if (dirty)
>>> - SetPageDirty(page + i);
>>> pte = pte_offset_map(&_pmd, addr);
>>> BUG_ON(!pte_none(*pte));
>>> set_pte_at(mm, addr, pte, entry);
>>> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> * and finally we write the non-huge version of the pmd entry with
>>> * pmd_populate.
>>> */
>>> - pmdp_invalidate(vma, haddr, pmd);
>>> + old = pmdp_invalidate(vma, haddr, pmd);
>>> +
>>> + /*
>>> + * Transfer dirty bit using value returned by pmd_invalidate() to be
>>> + * sure we don't race with CPU that can set the bit under us.
>>> + */
>>> + if (pmd_dirty(old))
>>> + SetPageDirty(page);
>>> +
>>> pmd_populate(mm, pmd, pgtable);
>>>
>>> if (freeze) {
>>
>>
>> Can we invalidate the pmd early here ? ie, do pmdp_invalidate instead of
>> pmdp_huge_split_prepare() ?
>
> I think we can. But it means we would block access to the page for longer
> than necessary on most architectures. I guess it's not a big deal.
>
> Maybe as a separate patch on top of this patchset? Aneesh, would you take
> care of this?
>
>
Yes, I can do that.
-aneesh
On Thu, 15 Jun 2017 17:52:22 +0300
"Kirill A. Shutemov" <[email protected]> wrote:
> We need an atomic way to set up a pmd page table entry, avoiding races with
> the CPU setting dirty/accessed bits. This is required to implement a
> pmdp_invalidate() that doesn't lose these bits.
>
> On PAE we have to use cmpxchg8b, as we cannot assume what the value of the
> new pmd is; setting it up half-by-half can expose a broken, corrupted entry to the CPU.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> ---
> arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
> arch/x86/include/asm/pgtable.h | 14 ++++++++++++++
> 2 files changed, 32 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index f5af95a0c6b8..a924fc6a96b9 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> }
>
> +#ifndef pmdp_establish
> +#define pmdp_establish pmdp_establish
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> + if (IS_ENABLED(CONFIG_SMP)) {
> + return xchg(pmdp, pmd);
> + } else {
> + pmd_t old = *pmdp;
> + *pmdp = pmd;
> + return old;
> + }
> +}
> +#endif
> +
> /*
> * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
> *
For the s390 version of the pmdp_establish function we need the mm to be able
to do the TLB flush correctly. Can we please add a "struct vm_area_struct *vma"
argument to pmdp_establish analog to pmdp_invalidate?
The s390 patch would then look like this:
--
>From 4d4641249d5e826c21c522d149553e89d73fcd4f Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <[email protected]>
Date: Mon, 19 Jun 2017 07:40:11 +0200
Subject: [PATCH] s390/mm: add pmdp_establish
Define the pmdp_establish function to replace a pmd entry with a new
one and return the old value.
Signed-off-by: Martin Schwidefsky <[email protected]>
---
arch/s390/include/asm/pgtable.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index bb59a0aa3249..dedeecd5455c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1511,6 +1511,13 @@ static inline void pmdp_invalidate(struct vm_area_struct *vma,
pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
}
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+ pmd_t *pmdp, pmd_t pmd)
+{
+ return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
+}
+#define pmdp_establish pmdp_establish
+
#define __HAVE_ARCH_PMDP_SET_WRPROTECT
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
--
2.11.2
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Fri, Jun 16, 2017 at 03:36:00PM +0200, Andrea Arcangeli wrote:
> Hello Krill,
>
> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > + pmd_t old;
> > +
> > + /*
> > + * We cannot assume what is value of pmd here, so there's no easy way
> > + * to set if half by half. We have to fall back to cmpxchg64.
> > + */
> > + {
> > + old = *pmdp;
> > + } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> > +
> > + return old;
> > +}
>
> I see further margin for optimization here (although it's only for PAE
> x32..).
>
> pmd is stable so we could do:
>
> if (!(pmd & _PAGE_PRESENT)) {
> cast to split_pmd and use xchg on pmd_low like
> native_pmdp_get_and_clear and copy pmd_high non atomically
> } else {
> the above cmpxchg64 loop
> }
>
> Now thinking about the above I had a second thought if pmdp_establish
> is the right interface and if we shouldn't replace pmdp_establish with
> pmdp_mknotpresent instead to skip the pmd & _PAGE_PRESENT check that
> will always be true in practice, so pmdp_mknotpresent will call
> internally pmd_mknotpresent and it won't have to check for pmd &
> _PAGE_PRESENT and it would have no cons on x86-64.
With your proposed optimization, the compiler is in a good position to
eliminate the cmpxchg loop for trivial cases, as we have in the
pmdp_invalidate() case. It can see that the pmd always has the present
bit cleared.
I'll keep the more flexible interface for now. We'll see if anybody runs
into more problems with it.
--
Kirill A. Shutemov
On Mon, Jun 19, 2017 at 07:48:01AM +0200, Martin Schwidefsky wrote:
> On Thu, 15 Jun 2017 17:52:22 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > We need an atomic way to setup pmd page table entry, avoiding races with
> > CPU setting dirty/accessed bits. This is required to implement
> > pmdp_invalidate() that doesn't loose these bits.
> >
> > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > setting it up half-by-half can expose broken corrupted entry to CPU.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > ---
> > arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
> > arch/x86/include/asm/pgtable.h | 14 ++++++++++++++
> > 2 files changed, 32 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index f5af95a0c6b8..a924fc6a96b9 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> > clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> > }
> >
> > +#ifndef pmdp_establish
> > +#define pmdp_establish pmdp_establish
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > + if (IS_ENABLED(CONFIG_SMP)) {
> > + return xchg(pmdp, pmd);
> > + } else {
> > + pmd_t old = *pmdp;
> > + *pmdp = pmd;
> > + return old;
> > + }
> > +}
> > +#endif
> > +
> > /*
> > * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
> > *
>
> For the s390 version of the pmdp_establish function we need the mm to be able
> to do the TLB flush correctly. Can we please add a "struct vm_area_struct *vma"
> argument to pmdp_establish analog to pmdp_invalidate?
>
> The s390 patch would then look like this:
> --
> From 4d4641249d5e826c21c522d149553e89d73fcd4f Mon Sep 17 00:00:00 2001
> From: Martin Schwidefsky <[email protected]>
> Date: Mon, 19 Jun 2017 07:40:11 +0200
> Subject: [PATCH] s390/mm: add pmdp_establish
>
> Define the pmdp_establish function to replace a pmd entry with a new
> one and return the old value.
>
> Signed-off-by: Martin Schwidefsky <[email protected]>
> ---
> arch/s390/include/asm/pgtable.h | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index bb59a0aa3249..dedeecd5455c 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1511,6 +1511,13 @@ static inline void pmdp_invalidate(struct vm_area_struct *vma,
> pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
> }
>
> +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> + pmd_t *pmdp, pmd_t pmd)
> +{
> + return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
I guess you need the address too :-P.
I'll change the prototype of pmdp_establish() and apply your patch.
--
Kirill A. Shutemov
On Mon, 19 Jun 2017 15:48:19 +0300
"Kirill A. Shutemov" <[email protected]> wrote:
> On Mon, Jun 19, 2017 at 07:48:01AM +0200, Martin Schwidefsky wrote:
> > On Thu, 15 Jun 2017 17:52:22 +0300
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > CPU setting dirty/accessed bits. This is required to implement
> > > pmdp_invalidate() that doesn't loose these bits.
> > >
> > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: H. Peter Anvin <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> > > ---
> > > arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
> > > arch/x86/include/asm/pgtable.h | 14 ++++++++++++++
> > > 2 files changed, 32 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > > index f5af95a0c6b8..a924fc6a96b9 100644
> > > --- a/arch/x86/include/asm/pgtable.h
> > > +++ b/arch/x86/include/asm/pgtable.h
> > > @@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> > > clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> > > }
> > >
> > > +#ifndef pmdp_establish
> > > +#define pmdp_establish pmdp_establish
> > > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > > +{
> > > + if (IS_ENABLED(CONFIG_SMP)) {
> > > + return xchg(pmdp, pmd);
> > > + } else {
> > > + pmd_t old = *pmdp;
> > > + *pmdp = pmd;
> > > + return old;
> > > + }
> > > +}
> > > +#endif
> > > +
> > > /*
> > > * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
> > > *
> >
> > For the s390 version of the pmdp_establish function we need the mm to be able
> > to do the TLB flush correctly. Can we please add a "struct vm_area_struct *vma"
> > argument to pmdp_establish analog to pmdp_invalidate?
> >
> > The s390 patch would then look like this:
> > --
> > From 4d4641249d5e826c21c522d149553e89d73fcd4f Mon Sep 17 00:00:00 2001
> > From: Martin Schwidefsky <[email protected]>
> > Date: Mon, 19 Jun 2017 07:40:11 +0200
> > Subject: [PATCH] s390/mm: add pmdp_establish
> >
> > Define the pmdp_establish function to replace a pmd entry with a new
> > one and return the old value.
> >
> > Signed-off-by: Martin Schwidefsky <[email protected]>
> > ---
> > arch/s390/include/asm/pgtable.h | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index bb59a0aa3249..dedeecd5455c 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1511,6 +1511,13 @@ static inline void pmdp_invalidate(struct vm_area_struct *vma,
> > pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
> > }
> >
> > +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > + pmd_t *pmdp, pmd_t pmd)
> > +{
> > + return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
>
> I guess, you need address too :-P.
>
> I'll change prototype of pmdp_establish() and apply your patch.
Ahh, yes. vma + addr please ;-)
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Fri, Jun 16, 2017 at 03:40:41PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 15, 2017 at 05:52:23PM +0300, Kirill A. Shutemov wrote:
> > -void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > +pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > pmd_t *pmdp)
> > {
> > - pmd_t entry = *pmdp;
> > - set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> > - flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> > + pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
> > + if (pmd_present(old))
> > + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> > + return old;
> > }
> > #endif
>
> The pmd_present() check added above is superflous because there's no
> point to call pmdp_invalidate if the pmd is not present (present as in
> pmd_present) already. pmd_present returns true if _PAGE_PSE is set
> and it was always set before calling pmdp_invalidate.
>
> It looks like we could skip the flush if _PAGE_PRESENT is not set
> (i.e. for example if the pmd is PROTNONE) but that's not what the above
> pmd_present will do.
You are right. It seems we don't have a generic way to check whether the
entry is present to the CPU.
I guess I'll drop the check then.
--
Kirill A. Shutemov
On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> Hi Andrea,
>
> On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > Hello Minchan,
> >
> > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > if (soft_dirty)
> > > > > > entry = pte_mksoft_dirty(entry);
> > > > > > }
> > > > > > - if (dirty)
> > > > > > - SetPageDirty(page + i);
> > > > > > pte = pte_offset_map(&_pmd, addr);
> > [..]
> > >
> > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > that piece of code. What I found one is just __split_huge_page_tail
> > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > page is not dirty, tail page will be clean, too.
> > > Could you point out what routine set PG_dirty to all subpages unconditionally?
When I wrote this code, I considered that we may want to track dirty
status on a per-4k basis for file-backed THPs.
> > On a side note the snippet deleted above was useless, as long as
> > there's one left hugepmd to split, the physical page has to be still
> > compound and huge and as long as that's the case the tail pages
> > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > during the physical split).
>
> I got it during reviewing this patch. That's why I didn't argue
> this patch would break MADV_FREE by deleting routine which propagate
> dirty to pte of subpages. However, although it's useless, I prefer
> not removing the transfer of dirty bit. Because it would help MADV_FREE
> users who want to use smaps to know how many of pages are not freeable
> (i.e, dirtied) since MADV_FREE although it is not 100% correct.
>
> >
> > In short PG_dirty is only meaningful in the head as long as it's
> > compound. The physical split in __split_huge_page_tail transfer the
> > head value to the tails like you mentioned, that's all as far as I can
> > tell.
>
> Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
> which has lost dirty bit by transferring dirty bit too early.
Erghh. I've misread the splitting code. Yes, it's not unconditional. So we
fix an actual bug.
But I'm not sure it's a subject for -stable. I haven't seen any bug reports
that can be attributed to it.
--
Kirill A. Shutemov
Hi Kirill,
On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> We need an atomic way to setup pmd page table entry, avoiding races with
> CPU setting dirty/accessed bits. This is required to implement
> pmdp_invalidate() that doesn't loose these bits.
>
> On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> setting it up half-by-half can expose broken corrupted entry to CPU.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
I'll look at this from the arm64 perspective. It would be good if we can
have a generic atomic implementation based on cmpxchg64 but I need to
look at the details first.
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> + pmd_t old;
> +
> + /*
> + * We cannot assume what is value of pmd here, so there's no easy way
> + * to set if half by half. We have to fall back to cmpxchg64.
> + */
> + {
BTW, you are missing a "do" here (and it probably compiles just fine
without it, though with different behaviour).
> + old = *pmdp;
> + } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> +
> + return old;
> +}
--
Catalin
On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> Hi Kirill,
>
> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > We need an atomic way to setup pmd page table entry, avoiding races with
> > CPU setting dirty/accessed bits. This is required to implement
> > pmdp_invalidate() that doesn't loose these bits.
> >
> > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > setting it up half-by-half can expose broken corrupted entry to CPU.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
>
> I'll look at this from the arm64 perspective. It would be good if we can
> have a generic atomic implementation based on cmpxchg64 but I need to
> look at the details first.
Unfortunately, I'm not sure it's possible.
The format of a page table is defined per-arch. We cannot assume much about
it in generic code.
I guess we could make it compile by casting to 'unsigned long', but is it
useful?
Every architecture maintainer would still have to validate that this
assumption holds for their architecture.
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > + pmd_t old;
> > +
> > + /*
> > + * We cannot assume what is value of pmd here, so there's no easy way
> > + * to set if half by half. We have to fall back to cmpxchg64.
> > + */
> > + {
>
> BTW, you are missing a "do" here (and it probably compiles just fine
> without it, though different behaviour).
Ouch. Thanks.
Hm, what are the semantics of the construct without a "do"?
>
> > + old = *pmdp;
> > + } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> > +
> > + return old;
> > +}
>
> --
> Catalin
>
--
Kirill A. Shutemov
On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > CPU setting dirty/accessed bits. This is required to implement
> > > pmdp_invalidate() that doesn't loose these bits.
> > >
> > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: H. Peter Anvin <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> >
> > I'll look at this from the arm64 perspective. It would be good if we can
> > have a generic atomic implementation based on cmpxchg64 but I need to
> > look at the details first.
>
> Unfortunately, I'm not sure it's possbile.
>
> The format of a page table is defined per-arch. We cannot assume much about
> it in generic code.
>
> I guess we could make it compile by casting to 'unsigned long', but is it
> useful?
> Every architecture manintainer still has to validate that this assumption
> is valid for the architecture.
You are right, not much gained in doing this.
Maybe a stupid question but can we not implement pmdp_invalidate() with
something like pmdp_get_and_clear() (usually reusing the ptep_*
equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
In my quick grep on pmdp_invalidate, it seems to be followed by
set_pmd_at() or pmd_populate() already and the *pmd value after
mknotpresent isn't any different from 0 to the hardware (at least on
ARM). That's unless Linux expects to see some non-zero value here if
walking the page tables on another CPU.
> > > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > > +{
> > > + pmd_t old;
> > > +
> > > + /*
> > > + * We cannot assume what is value of pmd here, so there's no easy way
> > > + * to set if half by half. We have to fall back to cmpxchg64.
> > > + */
> > > + {
> >
> > BTW, you are missing a "do" here (and it probably compiles just fine
> > without it, though different behaviour).
>
> Ouch. Thanks.
>
> Hm, what is semantics of the construct without a "do"?
You can just ignore the brackets:
old = *pmdp;
while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd)
;
--
Catalin
Kirill A. Shutemov <[email protected]> wrote:
> We need an atomic way to setup pmd page table entry, avoiding races with
> CPU setting dirty/accessed bits. This is required to implement
> pmdp_invalidate() that doesn't loose these bits.
>
> On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> setting it up half-by-half can expose broken corrupted entry to CPU.
...
>
> +#ifndef pmdp_establish
> +#define pmdp_establish pmdp_establish
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> + if (IS_ENABLED(CONFIG_SMP)) {
> + return xchg(pmdp, pmd);
> + } else {
> + pmd_t old = *pmdp;
> + *pmdp = pmd;
I think you may want to use WRITE_ONCE() here - otherwise nobody guarantees
that the compiler will not split writes to *pmdp. Although the kernel uses
similar code for setting PTEs and PMDs, I think it is best to start
fixing it. Obviously, you might need a different code path for 32-bit
kernels.
Regards,
Nadav
On Mon, Jun 19, 2017 at 06:09:12PM +0100, Catalin Marinas wrote:
> On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > CPU setting dirty/accessed bits. This is required to implement
> > > > pmdp_invalidate() that doesn't loose these bits.
> > > >
> > > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > Cc: Ingo Molnar <[email protected]>
> > > > Cc: H. Peter Anvin <[email protected]>
> > > > Cc: Thomas Gleixner <[email protected]>
> > >
> > > I'll look at this from the arm64 perspective. It would be good if we can
> > > have a generic atomic implementation based on cmpxchg64 but I need to
> > > look at the details first.
> >
> > Unfortunately, I'm not sure it's possbile.
> >
> > The format of a page table is defined per-arch. We cannot assume much about
> > it in generic code.
> >
> > I guess we could make it compile by casting to 'unsigned long', but is it
> > useful?
> > Every architecture manintainer still has to validate that this assumption
> > is valid for the architecture.
>
> You are right, not much gained in doing this.
>
> Maybe a stupid question but can we not implement pmdp_invalidate() with
> something like pmdp_get_and_clear() (usually reusing the ptep_*
> equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
>
> In my quick grep on pmdp_invalidate, it seems to be followed by
> set_pmd_at() or pmd_populate() already and the *pmd value after
> mknotpresent isn't any different from 0 to the hardware (at least on
> ARM). That's unless Linux expects to see some non-zero value here if
> walking the page tables on another CPU.
The whole reason to have pmdp_invalidate() in the first place is to never
have the pmd clear in the middle. Otherwise we would race with MADV_DONTNEED.
See commit ced108037c2a for an example of such a race.
--
Kirill A. Shutemov
On Mon, Jun 19, 2017 at 10:11:35AM -0700, Nadav Amit wrote:
> Kirill A. Shutemov <[email protected]> wrote:
>
> > We need an atomic way to setup pmd page table entry, avoiding races with
> > CPU setting dirty/accessed bits. This is required to implement
> > pmdp_invalidate() that doesn't loose these bits.
> >
> > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > setting it up half-by-half can expose broken corrupted entry to CPU.
>
> ...
>
> >
> > +#ifndef pmdp_establish
> > +#define pmdp_establish pmdp_establish
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > + if (IS_ENABLED(CONFIG_SMP)) {
> > + return xchg(pmdp, pmd);
> > + } else {
> > + pmd_t old = *pmdp;
> > + *pmdp = pmd;
>
> I think you may want to use WRITE_ONCE() here - otherwise nobody guarantees
> that the compiler will not split writes to *pmdp. Although the kernel uses
> similar code to setting PTEs and PMDs, I think that it is best to start
> fixing it. Obviously, you might need a different code path for 32-bit
> kernels.
This code is for 2-level paging on 32-bit machines and for 4-level paging
on 64-bit machines. In both cases sizeof(pmd_t) == sizeof(unsigned long).
A sane compiler can't screw anything up here -- a store of a long is one
shot. The compiler can still issue a duplicate of the store, but there's
no harm in that.
The entry is guaranteed to be stable once the ptl is released, and the CPU
can't see it half-updated.
--
Kirill A. Shutemov
Hello Kirill,
On Mon, Jun 19, 2017 at 05:03:23PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> > Hi Andrea,
> >
> > On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > > Hello Minchan,
> > >
> > > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > > if (soft_dirty)
> > > > > > > entry = pte_mksoft_dirty(entry);
> > > > > > > }
> > > > > > > - if (dirty)
> > > > > > > - SetPageDirty(page + i);
> > > > > > > pte = pte_offset_map(&_pmd, addr);
> > > [..]
> > > >
> > > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > > that piece of code. What I found one is just __split_huge_page_tail
> > > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > > page is not dirty, tail page will be clean, too.
> > > > Could you point out what routine set PG_dirty to all subpages unconditionally?
>
> When I wrote this code, I considered that we may want to track dirty
> status on per-4k basis for file-backed THPs.
>
> > > On a side note the snippet deleted above was useless, as long as
> > > there's one left hugepmd to split, the physical page has to be still
> > > compound and huge and as long as that's the case the tail pages
> > > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > > during the physical split).
> >
> > I got it during reviewing this patch. That's why I didn't argue
> > this patch would break MADV_FREE by deleting routine which propagate
> > dirty to pte of subpages. However, although it's useless, I prefer
> > not removing the transfer of dirty bit. Because it would help MADV_FREE
> > users who want to use smaps to know how many of pages are not freeable
> > (i.e, dirtied) since MADV_FREE although it is not 100% correct.
> >
> > >
> > > In short PG_dirty is only meaningful in the head as long as it's
> > > compound. The physical split in __split_huge_page_tail transfer the
> > > head value to the tails like you mentioned, that's all as far as I can
> > > tell.
> >
> > Thanks for the comment. Then, this patch fixes MADV_FREE's bug of
> > losing the dirty bit by transferring it too early.
>
> Erghh. I've misread the splitting code. Yes, it's not unconditional. So we fix
> an actual bug.
>
> But I'm not sure it's a subject for -stable. I haven't seen any bug reports
> that can be attributed to the bug.
Okay, I'm not against it, but please rewrite the changelog to indicate that it
fixes the problem. One more thing: as I mentioned, I don't want to remove the
pmd dirty bit -> PG_dirty propagation to subpages, because it would be
helpful for MADV_FREE users.
Thanks.
>
> --
> Kirill A. Shutemov
On Tue, Jun 20, 2017 at 11:52:08AM +0900, Minchan Kim wrote:
> Hello Kirill,
>
> On Mon, Jun 19, 2017 at 05:03:23PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> > > Hi Andrea,
> > >
> > > On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > > > Hello Minchan,
> > > >
> > > > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > > > if (soft_dirty)
> > > > > > > > entry = pte_mksoft_dirty(entry);
> > > > > > > > }
> > > > > > > > - if (dirty)
> > > > > > > > - SetPageDirty(page + i);
> > > > > > > > pte = pte_offset_map(&_pmd, addr);
> > > > [..]
> > > > >
> > > > > Does split_huge_page set PG_dirty on all subpages unconditionally?
> > > > > If so, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > > > that piece of code. What I found is just __split_huge_page_tail,
> > > > > which sets PG_dirty on a subpage if the head page is dirty. IOW, if the
> > > > > head page is not dirty, the tail page will be clean, too.
> > > > > Could you point out what routine sets PG_dirty on all subpages unconditionally?
> >
> > When I wrote this code, I considered that we may want to track dirty
> > status on a per-4k basis for file-backed THPs.
> >
> > > > On a side note the snippet deleted above was useless, as long as
> > > > there's one left hugepmd to split, the physical page has to be still
> > > > compound and huge and as long as that's the case the tail pages
> > > > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > > > during the physical split).
> > >
> > > I got it while reviewing this patch. That's why I didn't argue that
> > > this patch would break MADV_FREE by deleting the routine which propagates
> > > dirty to the ptes of subpages. However, although it's useless, I prefer
> > > not removing the transfer of the dirty bit, because it would help MADV_FREE
> > > users who want to use smaps to know how many pages are not freeable
> > > (i.e., dirtied) since MADV_FREE, although it is not 100% correct.
> > >
> > > >
> > > > In short PG_dirty is only meaningful in the head as long as it's
> > > > compound. The physical split in __split_huge_page_tail transfers the
> > > > head value to the tails like you mentioned, that's all as far as I can
> > > > tell.
> > >
> > > Thanks for the comment. Then, this patch fixes MADV_FREE's bug of
> > > losing the dirty bit by transferring it too early.
> >
> > Erghh. I've misread the splitting code. Yes, it's not unconditional. So we fix
> > an actual bug.
> >
> > But I'm not sure it's a subject for -stable. I haven't seen any bug reports
> > that can be attributed to the bug.
>
> Okay, I'm not against it, but please rewrite the changelog to indicate that it
> fixes the problem. One more thing: as I mentioned, I don't want to remove the
> pmd dirty bit -> PG_dirty propagation to subpages, because it would be
> helpful for MADV_FREE users.
Oops, I misread the smaps accounting code, so there is no problem with removing
the useless propagation part I added for MADV_FREE.
Thanks.
On Tue, Jun 20, 2017 at 12:52:10AM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 19, 2017 at 06:09:12PM +0100, Catalin Marinas wrote:
> > On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > pmdp_invalidate() that doesn't lose these bits.
> > > > >
> > > > > On PAE we have to use cmpxchg8b as we cannot assume what the value of the new
> > > > > pmd is, and setting it up half-by-half can expose a broken, corrupted entry to the CPU.
> > > > >
> > > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > > Cc: Ingo Molnar <[email protected]>
> > > > > Cc: H. Peter Anvin <[email protected]>
> > > > > Cc: Thomas Gleixner <[email protected]>
> > > >
> > > > I'll look at this from the arm64 perspective. It would be good if we can
> > > > have a generic atomic implementation based on cmpxchg64 but I need to
> > > > look at the details first.
> > >
> > > Unfortunately, I'm not sure it's possible.
> > >
> > > The format of a page table is defined per-arch. We cannot assume much about
> > > it in generic code.
> > >
> > > I guess we could make it compile by casting to 'unsigned long', but is it
> > > useful?
> > > Every architecture maintainer still has to validate that this assumption
> > > is valid for the architecture.
> >
> > You are right, not much gained in doing this.
> >
> > Maybe a stupid question but can we not implement pmdp_invalidate() with
> > something like pmdp_get_and_clear() (usually reusing the ptep_*
> > equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
> >
> > In my quick grep on pmdp_invalidate, it seems to be followed by
> > set_pmd_at() or pmd_populate() already and the *pmd value after
> > mknotpresent isn't any different from 0 to the hardware (at least on
> > ARM). That's unless Linux expects to see some non-zero value here if
> > walking the page tables on another CPU.
>
> The whole reason to have pmdp_invalidate() in the first place is to never make
> the pmd clear in the middle. Otherwise we will get a race with MADV_DONTNEED.
> See ced108037c2a for an example of such race.
Thanks for the explanation. So you basically just want to set a !present
and !none pmd. I noticed that with your proposed pmdp_invalidate(),
pmdp_establish(pmd_mknotpresent(*pmdp)) could set a stale *pmdp (with
the present bit cleared) temporarily until updated with what
pmdp_establish() returned. Is there a risk of racing with other parts of
the kernel? I guess not since the pmd is !present.
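To illustrate the invariant in a userspace sketch (this is not kernel code: the pmd is modelled as a bare 64-bit word with bit 0 standing in for the present bit, and all names here are invented for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy pmd: bit 0 models the present bit. */
#define PMD_PRESENT 0x1ULL

/* Atomic establish: install 'pmd' and return the old entry in one step. */
static uint64_t pmdp_establish_model(_Atomic uint64_t *pmdp, uint64_t pmd)
{
	return atomic_exchange(pmdp, pmd);
}

/*
 * Invalidate via establish: the entry goes straight from present to
 * !present-but-!none, never through zero, so a concurrent pmd_none()
 * check (the MADV_DONTNEED hazard above) can never fire, and any
 * dirty/accessed bits set before the swap are captured in the
 * returned old value rather than lost.
 */
static uint64_t pmdp_invalidate_model(_Atomic uint64_t *pmdp)
{
	uint64_t old = atomic_load(pmdp);

	return pmdp_establish_model(pmdp, old & ~PMD_PRESENT);
}
```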
For arm64, I don't see the point of a cmpxchg, so something like below
would do (it needs proper testing though):
-------------8<---------------------------
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c213fdbd056c..8fe1dad9100a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -39,6 +39,7 @@
 #ifndef __ASSEMBLY__
+#include <asm/cmpxchg.h>
 #include <asm/fixmap.h>
 #include <linux/mmdebug.h>
@@ -683,6 +684,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 {
 	ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
 }
+
+static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
+{
+	return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
+}
 #endif
 #endif /* CONFIG_ARM64_HW_AFDBM */
-------------8<---------------------------
--
Catalin
On Tue, Jun 20, 2017 at 04:54:38PM +0100, Catalin Marinas wrote:
> On Tue, Jun 20, 2017 at 12:52:10AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jun 19, 2017 at 06:09:12PM +0100, Catalin Marinas wrote:
> > > On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> > > > On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > pmdp_invalidate() that doesn't lose these bits.
> > > > > >
> > > > > > On PAE we have to use cmpxchg8b as we cannot assume what the value of the new
> > > > > > pmd is, and setting it up half-by-half can expose a broken, corrupted entry to the CPU.
> > > > > >
> > > > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > > > Cc: Ingo Molnar <[email protected]>
> > > > > > Cc: H. Peter Anvin <[email protected]>
> > > > > > Cc: Thomas Gleixner <[email protected]>
> > > > >
> > > > > I'll look at this from the arm64 perspective. It would be good if we can
> > > > > have a generic atomic implementation based on cmpxchg64 but I need to
> > > > > look at the details first.
> > > >
> > > > Unfortunately, I'm not sure it's possible.
> > > >
> > > > The format of a page table is defined per-arch. We cannot assume much about
> > > > it in generic code.
> > > >
> > > > I guess we could make it compile by casting to 'unsigned long', but is it
> > > > useful?
> > > > Every architecture maintainer still has to validate that this assumption
> > > > is valid for the architecture.
> > >
> > > You are right, not much gained in doing this.
> > >
> > > Maybe a stupid question but can we not implement pmdp_invalidate() with
> > > something like pmdp_get_and_clear() (usually reusing the ptep_*
> > > equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
> > >
> > > In my quick grep on pmdp_invalidate, it seems to be followed by
> > > set_pmd_at() or pmd_populate() already and the *pmd value after
> > > mknotpresent isn't any different from 0 to the hardware (at least on
> > > ARM). That's unless Linux expects to see some non-zero value here if
> > > walking the page tables on another CPU.
> >
> > The whole reason to have pmdp_invalidate() in the first place is to never make
> > the pmd clear in the middle. Otherwise we will get a race with MADV_DONTNEED.
> > See ced108037c2a for an example of such race.
>
> Thanks for the explanation. So you basically just want to set a !present
> and !none pmd. I noticed that with your proposed pmdp_invalidate(),
> pmdp_establish(pmd_mknotpresent(*pmdp)) could set a stale *pmdp (with
> the present bit cleared) temporarily until updated with what
> pmdp_establish() returned. Is there a risk of racing with other parts of
> the kernel? I guess not since the pmd is !present.
I don't see such a risk. Other parts of the kernel would see a non-present pmd
and would have to take the ptl to do anything meaningful with it.
The pmdp_invalidate() caller has to hold the ptl too, so the race is excluded.
> For arm64, I don't see the point of a cmpxchg, so something like below
> would do (it needs proper testing though):
Right. cmpxchg is required for x86 PAE, as it has sizeof(pmd_t) >
sizeof(long). We don't have 8-byte xchg() there.
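A userspace sketch of what the cmpxchg loop buys on PAE (again modelling the pmd as a plain 64-bit word; the function name is invented and this is not the actual x86 implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Emulated 8-byte xchg via a compare-exchange loop, which is what
 * cmpxchg8b enables on 32-bit x86 PAE where sizeof(pmd_t) > sizeof(long).
 * If another agent updates the entry between our read and our swap
 * (e.g. the CPU setting dirty/accessed bits), the CAS fails, 'old' is
 * refreshed with the current value, and we retry, so no concurrently
 * set bit is ever dropped.
 */
static uint64_t pmdp_establish_pae_model(_Atomic uint64_t *pmdp, uint64_t pmd)
{
	uint64_t old = atomic_load(pmdp);

	while (!atomic_compare_exchange_weak(pmdp, &old, pmd))
		;	/* 'old' now holds the fresh value; try again */
	return old;
}
```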
Thanks for the patch. I assume I can use your signed-off-by, right?
Any chance you could help me with arm too?
--
Kirill A. Shutemov
On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jun 20, 2017 at 04:54:38PM +0100, Catalin Marinas wrote:
> > For arm64, I don't see the point of a cmpxchg, so something like below
> > would do (it needs proper testing though):
>
> Right. cmpxchg is required for x86 PAE, as it has sizeof(pmd_t) >
> sizeof(long). We don't have 8-byte xchg() there.
>
> Thanks for the patch. I assume I can use your signed-off-by, right?
Yes. And maybe some text (well, I just copied yours):
---------------8<--------------
arm64: Provide pmdp_establish() helper
We need an atomic way to setup pmd page table entry, avoiding races with
CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.
Signed-off-by: Catalin Marinas <[email protected]>
---------------8<--------------
> Any chance you could help me with arm too?
I'll have a look.
--
Catalin
On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > pmdp_invalidate() that doesn't lose these bits.
[...]
> Any chance you could help me with arm too?
On arm (ARMv7 with LPAE) we don't have hardware updates of the
access/dirty bits, so a generic implementation would suffice. I didn't
find one in your patches, so here's an untested version:
static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
				   pmd_t *pmdp, pmd_t pmd)
{
	pmd_t old_pmd = *pmdp;
	set_pmd_at(mm, address, pmdp, pmd);
	return old_pmd;
}
--
Catalin
On Wed, Jun 21, 2017 at 12:27:02PM +0100, Catalin Marinas wrote:
> On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > > pmdp_invalidate() that doesn't lose these bits.
> [...]
> > Any chance you could help me with arm too?
>
> On arm (ARMv7 with LPAE) we don't have hardware updates of the
> access/dirty bits, so a generic implementation would suffice. I didn't
> find one in your patches, so here's an untested version:
>
> static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> 				   pmd_t *pmdp, pmd_t pmd)
> {
> 	pmd_t old_pmd = *pmdp;
> 	set_pmd_at(mm, address, pmdp, pmd);
> 	return old_pmd;
> }
Thanks, I'll integrate this into the patchset.
--
Kirill A. Shutemov
On 06/21/2017 04:27 AM, Catalin Marinas wrote:
> On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
>>>>>>> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
>>>>>>>> We need an atomic way to setup pmd page table entry, avoiding races with
>>>>>>>> CPU setting dirty/accessed bits. This is required to implement
>>>>>>>> pmdp_invalidate() that doesn't lose these bits.
> [...]
>> Any chance you could help me with arm too?
> On arm (ARMv7 with LPAE) we don't have hardware updates of the
> access/dirty bits, so a generic implementation would suffice. I didn't
> find one in your patches, so here's an untested version:
>
> static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> 				   pmd_t *pmdp, pmd_t pmd)
> {
> 	pmd_t old_pmd = *pmdp;
> 	set_pmd_at(mm, address, pmdp, pmd);
> 	return old_pmd;
> }
So it seems the discussions have settled down and pmdp_establish() can be
implemented in a generic way as above, which will suffice if the arch doesn't
have a special need. It would be nice to add a comment above the generic
version that it only needs to be implemented if hardware sets the
accessed/dirty bits!
Then nothing special is needed for ARC, right?
-Vineet
On Wed, Jun 21, 2017 at 08:49:03AM -0700, Vineet Gupta wrote:
> On 06/21/2017 04:27 AM, Catalin Marinas wrote:
> > On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > > > pmdp_invalidate() that doesn't lose these bits.
> > [...]
> > > Any chance you could help me with arm too?
> > On arm (ARMv7 with LPAE) we don't have hardware updates of the
> > access/dirty bits, so a generic implementation would suffice. I didn't
> > find one in your patches, so here's an untested version:
> >
> > static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> > 				   pmd_t *pmdp, pmd_t pmd)
> > {
> > 	pmd_t old_pmd = *pmdp;
> > 	set_pmd_at(mm, address, pmdp, pmd);
> > 	return old_pmd;
> > }
>
> So it seems the discussions have settled down and pmdp_establish() can be
> implemented in a generic way as above, which will suffice if the arch doesn't
> have a special need. It would be nice to add a comment above the generic
> version that it only needs to be implemented if hardware sets the
> accessed/dirty bits!
>
> Then nothing special is needed for ARC, right?
I will define the generic version as Catalin proposed, with a comment, but
under the name generic_pmdp_establish. An arch can make use of it with
#define pmdp_establish generic_pmdp_establish
I don't want it to be used by default without attention from the architecture
maintainer. It can lead to unnoticed breakage if THP gets enabled on a new
arch.
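For reference, a userspace sketch of what the opted-in generic version amounts to (the pmd is modelled as a bare uint64_t and all names here are invented; this is not the kernel header):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of generic_pmdp_establish(): a plain read-then-write, which is
 * only safe where hardware never sets accessed/dirty bits behind the
 * kernel's back (the ARMv7-LPAE / ARC situation discussed above).
 */
static uint64_t generic_pmdp_establish_model(uint64_t *pmdp, uint64_t pmd)
{
	uint64_t old_pmd = *pmdp;

	*pmdp = pmd;
	return old_pmd;
}

/*
 * An "arch" opts in explicitly rather than inheriting the helper by
 * default, mirroring "#define pmdp_establish generic_pmdp_establish".
 */
#define arc_pmdp_establish_model generic_pmdp_establish_model
```

The explicit #define keeps the decision visible in the arch headers, so enabling THP on a new architecture forces a deliberate choice rather than silently picking up the non-atomic version.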
--
Kirill A. Shutemov
On 06/21/2017 10:16 AM, Kirill A. Shutemov wrote:
> On Wed, Jun 21, 2017 at 08:49:03AM -0700, Vineet Gupta wrote:
>> On 06/21/2017 04:27 AM, Catalin Marinas wrote:
>>> On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
>>>>>>>>> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
>>>>>>>>>> We need an atomic way to setup pmd page table entry, avoiding races with
>>>>>>>>>> CPU setting dirty/accessed bits. This is required to implement
>>>>>>>>>> pmdp_invalidate() that doesn't lose these bits.
>>> [...]
>>>> Any chance you could help me with arm too?
>>> On arm (ARMv7 with LPAE) we don't have hardware updates of the
>>> access/dirty bits, so a generic implementation would suffice. I didn't
>>> find one in your patches, so here's an untested version:
>>>
>>> static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
>>> 				   pmd_t *pmdp, pmd_t pmd)
>>> {
>>> 	pmd_t old_pmd = *pmdp;
>>> 	set_pmd_at(mm, address, pmdp, pmd);
>>> 	return old_pmd;
>>> }
>> So it seems the discussions have settled down and pmdp_establish() can be
>> implemented in a generic way as above, which will suffice if the arch doesn't
>> have a special need. It would be nice to add a comment above the generic
>> version that it only needs to be implemented if hardware sets the
>> accessed/dirty bits!
>>
>> Then nothing special is needed for ARC, right?
> I will define the generic version as Catalin proposed, with a comment, but
> under the name generic_pmdp_establish. An arch can make use of it with
>
> #define pmdp_establish generic_pmdp_establish
Can you do that for ARC in your next posting, or do you want me to once you
have posted it?
> I don't want it to be used by default without attention from the architecture
> maintainer. It can lead to unnoticed breakage if THP gets enabled on a new
> arch.
Makes sense!
-Vineet
On Wed, Jun 21, 2017 at 10:20:47AM -0700, Vineet Gupta wrote:
> On 06/21/2017 10:16 AM, Kirill A. Shutemov wrote:
> > On Wed, Jun 21, 2017 at 08:49:03AM -0700, Vineet Gupta wrote:
> > > On 06/21/2017 04:27 AM, Catalin Marinas wrote:
> > > > On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > > > > > pmdp_invalidate() that doesn't lose these bits.
> > > > [...]
> > > > > Any chance you could help me with arm too?
> > > > On arm (ARMv7 with LPAE) we don't have hardware updates of the
> > > > access/dirty bits, so a generic implementation would suffice. I didn't
> > > > find one in your patches, so here's an untested version:
> > > >
> > > > static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> > > > 				   pmd_t *pmdp, pmd_t pmd)
> > > > {
> > > > 	pmd_t old_pmd = *pmdp;
> > > > 	set_pmd_at(mm, address, pmdp, pmd);
> > > > 	return old_pmd;
> > > > }
> > > So it seems the discussions have settled down and pmdp_establish() can be
> > > implemented in a generic way as above, which will suffice if the arch doesn't
> > > have a special need. It would be nice to add a comment above the generic
> > > version that it only needs to be implemented if hardware sets the
> > > accessed/dirty bits!
> > >
> > > Then nothing special is needed for ARC, right?
> > I will define the generic version as Catalin proposed, with a comment, but
> > under the name generic_pmdp_establish. An arch can make use of it with
> >
> > #define pmdp_establish generic_pmdp_establish
>
> Can you do that for ARC in your next posting, or do you want me to once you
> have posted it?
I'll do this.
--
Kirill A. Shutemov