2020-06-15 22:22:38

by Peter Xu

[permalink] [raw]
Subject: [PATCH 00/25] mm: Page fault accounting cleanups

Gerald Schaefer reported an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):

https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

When I was looking at the issue, I actually found some other trivial issues too
related to the page fault accounting, namely:

- Incorrect accountings

- The major issue that was reported by Gerald, that we did multiple
accounting when the page fault retried while we should only do it once
(for either perf event accounting or per-task accounting).

- In many archs, major fault is only accounted when the 1st page fault is a
major fault, or the last page fault is a major fault. Ideally we should
account the page fault as a major fault as long as any of the page fault
retries is a major fault and keep the same behavior across archs.

- Missing accountings of perf events (PERF_COUNT_SW_PAGE_FAULTS[_[MAJ|MIN]]).

- Doing accounting with mmap_sem: not a big deal, but logically we can move
it after releasing the mmap_sem because accounting does not need it.

This series tries to address all of them by introducing mm_fault_accounting()
first, so that we move all the page fault accounting into the common code base,
then call it properly from arch pf handlers just like handle_mm_fault(). There
are some special cases, e.g., the um arch does not have "regs" in the fault
handler, so it's still using the manual statistics.

For each of the patch that fixes a specific arch, I'm CCing the maintainers and
the arch list if there is. Besides, I only lightly tested this series on x86.

Please have a look, thanks.

Peter Xu (25):
mm/um: Fix extra accounting for page fault retries
mm: Introduce mm_fault_accounting()
mm/alpha: Use mm_fault_accounting()
mm/arc: Use mm_fault_accounting()
mm/arm: Use mm_fault_accounting()
mm/arm64: Use mm_fault_accounting()
mm/csky: Use mm_fault_accounting()
mm/hexagon: Use mm_fault_accounting()
mm/ia64: Use mm_fault_accounting()
mm/m68k: Use mm_fault_accounting()
mm/microblaze: Use mm_fault_accounting()
mm/mips: Use mm_fault_accounting()
mm/nds32: Use mm_fault_accounting()
mm/nios2: Use mm_fault_accounting()
mm/openrisc: Use mm_fault_accounting()
mm/parisc: Use mm_fault_accounting()
mm/powerpc: Use mm_fault_accounting()
mm/riscv: Use mm_fault_accounting()
mm/s390: Use mm_fault_accounting()
mm/sh: Use mm_fault_accounting()
mm/sparc32: Use mm_fault_accounting()
mm/sparc64: Use mm_fault_accounting()
mm/unicore32: Use mm_fault_accounting()
mm/x86: Use mm_fault_accounting()
mm/xtensa: Use mm_fault_accounting()

arch/alpha/mm/fault.c | 9 ++++-----
arch/arc/mm/fault.c | 15 +++------------
arch/arm/mm/fault.c | 21 ++++-----------------
arch/arm64/mm/fault.c | 17 ++---------------
arch/csky/mm/fault.c | 13 +------------
arch/hexagon/mm/vm_fault.c | 9 +++------
arch/ia64/mm/fault.c | 8 +++-----
arch/m68k/mm/fault.c | 13 +++----------
arch/microblaze/mm/fault.c | 8 +++-----
arch/mips/mm/fault.c | 14 +++-----------
arch/nds32/mm/fault.c | 19 +++----------------
arch/nios2/mm/fault.c | 13 +++----------
arch/openrisc/mm/fault.c | 8 +++-----
arch/parisc/mm/fault.c | 8 +++-----
arch/powerpc/mm/fault.c | 13 ++++---------
arch/riscv/mm/fault.c | 21 +++------------------
arch/s390/mm/fault.c | 21 +++++----------------
arch/sh/mm/fault.c | 15 +++------------
arch/sparc/mm/fault_32.c | 16 +++-------------
arch/sparc/mm/fault_64.c | 16 ++++------------
arch/um/kernel/trap.c | 13 +++++++------
arch/unicore32/mm/fault.c | 15 +++++++--------
arch/x86/mm/fault.c | 10 +---------
arch/xtensa/mm/fault.c | 14 +++-----------
include/linux/mm.h | 2 ++
mm/memory.c | 18 ++++++++++++++++++
26 files changed, 101 insertions(+), 248 deletions(-)

--
2.26.2


2020-06-15 22:25:21

by Peter Xu

[permalink] [raw]
Subject: [PATCH 19/25] mm/s390: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Heiko Carstens <[email protected]>
CC: Vasily Gorbik <[email protected]>
CC: Christian Borntraeger <[email protected]>
CC: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/s390/mm/fault.c | 21 +++++----------------
1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index dedc28be27ab..8ca207635b59 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
unsigned long trans_exc_code;
unsigned long address;
unsigned int flags;
- vm_fault_t fault;
+ vm_fault_t fault, major = 0;

tsk = current;
/*
@@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
}

address = trans_exc_code & __FAIL_ADDR_MASK;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
flags = FAULT_FLAG_DEFAULT;
if (user_mode(regs))
flags |= FAULT_FLAG_USER;
@@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
* the fault.
*/
fault = handle_mm_fault(vma, address, flags);
+ major |= fault & VM_FAULT_MAJOR;
if (fault_signal_pending(fault, regs)) {
fault = VM_FAULT_SIGNAL;
if (flags & FAULT_FLAG_RETRY_NOWAIT)
@@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
if (unlikely(fault & VM_FAULT_ERROR))
goto out_up;

- /*
- * Major/minor page fault accounting is only done on the
- * initial attempt. If we go through a retry, it is extremely
- * likely that the page will be found in page cache at that point.
- */
if (flags & FAULT_FLAG_ALLOW_RETRY) {
- if (fault & VM_FAULT_MAJOR) {
- tsk->maj_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
- regs, address);
- } else {
- tsk->min_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
- regs, address);
- }
if (fault & VM_FAULT_RETRY) {
if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
(flags & FAULT_FLAG_RETRY_NOWAIT)) {
@@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
goto retry;
}
}
+
+ mm_fault_accounting(tsk, regs, address, major);
+
if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
address = __gmap_link(gmap, current->thread.gmap_addr,
address);
--
2.26.2

2020-06-15 22:25:28

by Peter Xu

[permalink] [raw]
Subject: [PATCH 20/25] mm/sh: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Yoshinori Sato <[email protected]>
CC: Rich Felker <[email protected]>
CC: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/sh/mm/fault.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 5f23d7907597..06b232973488 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -379,7 +379,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct * vma;
- vm_fault_t fault;
+ vm_fault_t fault, major = 0;
unsigned int flags = FAULT_FLAG_DEFAULT;

tsk = current;
@@ -412,8 +412,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
if ((regs->sr & SR_IMASK) != SR_IMASK)
local_irq_enable();

- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
/*
* If we're in an interrupt, have no user context or are running
* with pagefaults disabled then we must not take the fault:
@@ -465,21 +463,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
* the fault.
*/
fault = handle_mm_fault(vma, address, flags);
+ major |= fault & VM_FAULT_MAJOR;

if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
if (mm_fault_error(regs, error_code, address, fault))
return;

if (flags & FAULT_FLAG_ALLOW_RETRY) {
- if (fault & VM_FAULT_MAJOR) {
- tsk->maj_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
- regs, address);
- } else {
- tsk->min_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
- regs, address);
- }
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;

@@ -493,4 +483,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
}

up_read(&mm->mmap_sem);
+ mm_fault_accounting(tsk, regs, address, major);
}
--
2.26.2

2020-06-15 22:25:43

by Peter Xu

[permalink] [raw]
Subject: [PATCH 24/25] mm/x86: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

CC: Dave Hansen <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: [email protected]
CC: H. Peter Anvin <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/mm/fault.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..28635b4e324c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,8 +1365,6 @@ void do_user_addr_fault(struct pt_regs *regs,
local_irq_enable();
}

- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
if (hw_error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (hw_error_code & X86_PF_INSTR)
@@ -1493,13 +1491,7 @@ void do_user_addr_fault(struct pt_regs *regs,
* Major/minor page fault accounting. If any of the events
* returned VM_FAULT_MAJOR, we account it as a major fault.
*/
- if (major) {
- tsk->maj_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
- } else {
- tsk->min_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
- }
+ mm_fault_accounting(tsk, regs, address, major);

check_v8086_mode(regs, address, tsk);
}
--
2.26.2

2020-06-15 22:25:49

by Peter Xu

[permalink] [raw]
Subject: [PATCH 25/25] mm/xtensa: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Chris Zankel <[email protected]>
CC: Max Filippov <[email protected]>
CC: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/xtensa/mm/fault.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e7172bd53ced..511a2c5f9857 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -42,7 +42,7 @@ void do_page_fault(struct pt_regs *regs)
int code;

int is_write, is_exec;
- vm_fault_t fault;
+ vm_fault_t fault, major = 0;
unsigned int flags = FAULT_FLAG_DEFAULT;

code = SEGV_MAPERR;
@@ -109,6 +109,7 @@ void do_page_fault(struct pt_regs *regs)
* the fault.
*/
fault = handle_mm_fault(vma, address, flags);
+ major |= fault & VM_FAULT_MAJOR;

if (fault_signal_pending(fault, regs))
return;
@@ -123,10 +124,6 @@ void do_page_fault(struct pt_regs *regs)
BUG();
}
if (flags & FAULT_FLAG_ALLOW_RETRY) {
- if (fault & VM_FAULT_MAJOR)
- current->maj_flt++;
- else
- current->min_flt++;
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;

@@ -140,12 +137,7 @@ void do_page_fault(struct pt_regs *regs)
}

up_read(&mm->mmap_sem);
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
- if (flags & VM_FAULT_MAJOR)
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
- else
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-
+ mm_fault_accounting(current, regs, address, major);
return;

/* Something tried to access memory that isn't in our memory map..
--
2.26.2

2020-06-15 22:26:34

by Peter Xu

[permalink] [raw]
Subject: [PATCH 22/25] mm/sparc64: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: David S. Miller <[email protected]>
CC: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/sparc/mm/fault_64.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index c0c0dd471b6b..61be0a0d79c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -269,7 +269,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
struct vm_area_struct *vma;
unsigned int insn = 0;
int si_code, fault_code;
- vm_fault_t fault;
+ vm_fault_t fault, major = 0;
unsigned long address, mm_rss;
unsigned int flags = FAULT_FLAG_DEFAULT;

@@ -317,8 +317,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
if (faulthandler_disabled() || !mm)
goto intr_or_no_mm;

- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
if (!down_read_trylock(&mm->mmap_sem)) {
if ((regs->tstate & TSTATE_PRIV) &&
!search_exception_tables(regs->tpc)) {
@@ -424,6 +422,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
}

fault = handle_mm_fault(vma, address, flags);
+ major |= fault & VM_FAULT_MAJOR;

if (fault_signal_pending(fault, regs))
goto exit_exception;
@@ -439,15 +438,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
}

if (flags & FAULT_FLAG_ALLOW_RETRY) {
- if (fault & VM_FAULT_MAJOR) {
- current->maj_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
- 1, regs, address);
- } else {
- current->min_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
- 1, regs, address);
- }
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;

@@ -461,6 +451,8 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
}
up_read(&mm->mmap_sem);

+ mm_fault_accounting(current, regs, address, major);
+
mm_rss = get_mm_rss(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
mm_rss -= (mm->context.thp_pte_count * (HPAGE_SIZE / PAGE_SIZE));
--
2.26.2

2020-06-15 22:26:43

by Peter Xu

[permalink] [raw]
Subject: [PATCH 21/25] mm/sparc32: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: David S. Miller <[email protected]>
CC: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/sparc/mm/fault_32.c | 16 +++-------------
1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index f6e0e601f857..299e6e241a1c 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -167,7 +167,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
unsigned long g2;
int from_user = !(regs->psr & PSR_PS);
int code;
- vm_fault_t fault;
+ vm_fault_t fault, major = 0;
unsigned int flags = FAULT_FLAG_DEFAULT;

if (text_fault)
@@ -192,9 +192,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
*/
if (pagefault_disabled() || !mm)
goto no_context;
-
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
retry:
down_read(&mm->mmap_sem);

@@ -236,6 +233,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
* the fault.
*/
fault = handle_mm_fault(vma, address, flags);
+ major |= fault & VM_FAULT_MAJOR;

if (fault_signal_pending(fault, regs))
return;
@@ -251,15 +249,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
}

if (flags & FAULT_FLAG_ALLOW_RETRY) {
- if (fault & VM_FAULT_MAJOR) {
- current->maj_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
- 1, regs, address);
- } else {
- current->min_flt++;
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
- 1, regs, address);
- }
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;

@@ -273,6 +262,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
}

up_read(&mm->mmap_sem);
+ mm_fault_accounting(current, regs, address, major);
return;

/*
--
2.26.2

2020-06-15 22:27:32

by Peter Xu

[permalink] [raw]
Subject: [PATCH 23/25] mm/unicore32: Use mm_fault_accounting()

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will be accounted too when the config has
CONFIG_PERF_EVENTS defined.

CC: Guan Xuetao <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/unicore32/mm/fault.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3022104aa613..240b38e81ed6 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -201,7 +201,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
struct task_struct *tsk;
struct mm_struct *mm;
int sig, code;
- vm_fault_t fault;
+ vm_fault_t fault, major = 0;
unsigned int flags = FAULT_FLAG_DEFAULT;

tsk = current;
@@ -245,6 +245,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
}

fault = __do_pf(mm, addr, fsr, flags, tsk);
+ major |= fault & VM_FAULT_MAJOR;

/* If we need to retry but a fatal signal is pending, handle the
* signal first. We do not need to release the mmap_sem because
@@ -254,10 +255,6 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
return 0;

if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
- if (fault & VM_FAULT_MAJOR)
- tsk->maj_flt++;
- else
- tsk->min_flt++;
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;
goto retry;
@@ -267,11 +264,13 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
up_read(&mm->mmap_sem);

/*
- * Handle the "normal" case first - VM_FAULT_MAJOR
+ * Handle the "normal" case first
*/
- if (likely(!(fault &
- (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+ if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+ VM_FAULT_BADACCESS)))) {
+ mm_fault_accounting(tsk, regs, addr, major);
return 0;
+ }

/*
* If we are in kernel mode at this point, we
--
2.26.2

2020-06-15 23:15:46

by Max Filippov

[permalink] [raw]
Subject: Re: [PATCH 25/25] mm/xtensa: Use mm_fault_accounting()

On Mon, Jun 15, 2020 at 3:23 PM Peter Xu <[email protected]> wrote:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Chris Zankel <[email protected]>
> CC: Max Filippov <[email protected]>
> CC: [email protected]
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/xtensa/mm/fault.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)

Acked-by: Max Filippov <[email protected]>

--
Thanks.
-- Max

2020-06-16 16:01:48

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()

On Mon, Jun 15, 2020 at 06:23:02PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Heiko Carstens <[email protected]>
> CC: Vasily Gorbik <[email protected]>
> CC: Christian Borntraeger <[email protected]>
> CC: [email protected]
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/s390/mm/fault.c | 21 +++++----------------
> 1 file changed, 5 insertions(+), 16 deletions(-)
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index dedc28be27ab..8ca207635b59 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> unsigned long trans_exc_code;
> unsigned long address;
> unsigned int flags;
> - vm_fault_t fault;
> + vm_fault_t fault, major = 0;
>
> tsk = current;
> /*
> @@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> }
>
> address = trans_exc_code & __FAIL_ADDR_MASK;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> flags = FAULT_FLAG_DEFAULT;
> if (user_mode(regs))
> flags |= FAULT_FLAG_USER;
> @@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> * the fault.
> */
> fault = handle_mm_fault(vma, address, flags);
> + major |= fault & VM_FAULT_MAJOR;
> if (fault_signal_pending(fault, regs)) {
> fault = VM_FAULT_SIGNAL;
> if (flags & FAULT_FLAG_RETRY_NOWAIT)
> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> if (unlikely(fault & VM_FAULT_ERROR))
> goto out_up;
>
> - /*
> - * Major/minor page fault accounting is only done on the
> - * initial attempt. If we go through a retry, it is extremely
> - * likely that the page will be found in page cache at that point.
> - */
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> - if (fault & VM_FAULT_MAJOR) {
> - tsk->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> - regs, address);
> - } else {
> - tsk->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> - regs, address);
> - }
> if (fault & VM_FAULT_RETRY) {
> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> (flags & FAULT_FLAG_RETRY_NOWAIT)) {

Seems like the call to mm_fault_accounting() will be missed if
we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
jumps to "out_up"...

> @@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> goto retry;
> }
> }
> +
> + mm_fault_accounting(tsk, regs, address, major);
> +
> if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
> address = __gmap_link(gmap, current->thread.gmap_addr,
> address);
> --
> 2.26.2
>

2020-06-16 16:37:35

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()

Hi, Alexander,

On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> > @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > if (unlikely(fault & VM_FAULT_ERROR))
> > goto out_up;
> >
> > - /*
> > - * Major/minor page fault accounting is only done on the
> > - * initial attempt. If we go through a retry, it is extremely
> > - * likely that the page will be found in page cache at that point.
> > - */
> > if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > - if (fault & VM_FAULT_MAJOR) {
> > - tsk->maj_flt++;
> > - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> > - regs, address);
> > - } else {
> > - tsk->min_flt++;
> > - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> > - regs, address);
> > - }
> > if (fault & VM_FAULT_RETRY) {
> > if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> > (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>
> Seems like the call to mm_fault_accounting() will be missed if
> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> jumps to "out_up"...

This is true as a functional change. However that also means that we've got a
VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
than handled correctly (for instance, due to some try_lock failed during the
fault process).

To me, that case should not be counted as a page fault at all? Or we might get
the same duplicated accounting when the page fault retried from a higher stack.

Thanks,

--
Peter Xu

2020-06-16 18:58:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/25] mm: Page fault accounting cleanups

On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <[email protected]> wrote:
>
> This series tries to address all of them by introducing mm_fault_accounting()
> first, so that we move all the page fault accounting into the common code base,
> then call it properly from arch pf handlers just like handle_mm_fault().

Hmm.

So having looked at this a bit more, I'd actually like to go even
further, and just get rid of the per-architecture code _entirely_.

Here's a straw-man patch to the generic code - the idea is mostly laid
out in the comment that I'm just quoting here directly too:

/*
* Do accounting in the common code, to avoid unnecessary
* architecture differences or duplicated code.
*
* We arbitrarily make the rules be:
*
* - faults that never even got here (because the address
* wasn't valid). That includes arch_vma_access_permitted()
* failing above.
*
* So this is expressly not a "this many hardware page
* faults" counter. Use the hw profiling for that.
*
* - incomplete faults (ie RETRY) do not count (see above).
* They will only count once completed.
*
* - the fault counts as a "major" fault when the final
* successful fault is VM_FAULT_MAJOR, or if it was a
* retry (which implies that we couldn't handle it
* immediately previously).
*
* - if the fault is done for GUP, regs wil be NULL and
* no accounting will be done (but you _could_ pass in
* your own regs and it would be accounted to the thread
* doing the fault, not to the target!)
*/

the code itself in the patch is

(a) pretty trivial and self-evident

(b) INCOMPLETE

that (b) is worth noting: this patch won't compile on its own. It
intentionally leaves all the users without the new 'regs' argument,
because you obviously simply need to remove all the code that
currently tries to do any accounting.

Comments?

This is a bigger change, but I think it might be worth it to _really_
consolidate the major/minor logic.

One detail worth noting: I do wonder if we should put the

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

just in the arch code at the top of the fault handling, and consider
it entirely unrelated to the major/minor fault handling. The
major/minor faults fundamnetally are about successes. But the plain
PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
things that never even get to this point at all.

I'm not convinced it's useful to have three SW events that are defined
to be A=B+C.

But this does *not* do that part. It' sreally just an RFC patch.

Linus


Attachments:
patch (2.85 kB)

2020-06-16 21:05:37

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 00/25] mm: Page fault accounting cleanups

On Tue, Jun 16, 2020 at 11:55:17AM -0700, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <[email protected]> wrote:
> >
> > This series tries to address all of them by introducing mm_fault_accounting()
> > first, so that we move all the page fault accounting into the common code base,
> > then call it properly from arch pf handlers just like handle_mm_fault().
>
> Hmm.
>
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.
>
> Here's a straw-man patch to the generic code - the idea is mostly laid
> out in the comment that I'm just quoting here directly too:
>
> /*
> * Do accounting in the common code, to avoid unnecessary
> * architecture differences or duplicated code.
> *
> * We arbitrarily make the rules be:
> *
> * - faults that never even got here (because the address
> * wasn't valid). That includes arch_vma_access_permitted()
> * failing above.
> *
> * So this is expressly not a "this many hardware page
> * faults" counter. Use the hw profiling for that.
> *
> * - incomplete faults (ie RETRY) do not count (see above).
> * They will only count once completed.
> *
> * - the fault counts as a "major" fault when the final
> * successful fault is VM_FAULT_MAJOR, or if it was a
> * retry (which implies that we couldn't handle it
> * immediately previously).
> *
> * - if the fault is done for GUP, regs wil be NULL and
> * no accounting will be done (but you _could_ pass in
> * your own regs and it would be accounted to the thread
> * doing the fault, not to the target!)
> */
>
> the code itself in the patch is
>
> (a) pretty trivial and self-evident
>
> (b) INCOMPLETE
>
> that (b) is worth noting: this patch won't compile on its own. It
> intentionally leaves all the users without the new 'regs' argument,
> because you obviously simply need to remove all the code that
> currently tries to do any accounting.
>
> Comments?

Looks clean to me. The definition of "major faults" will slightly change even
for those who has done the "major |= fault & MAJOR" operations before, but at
least I can't see anything bad about that either...

To make things easier, we can use the 1st patch to introduce this change,
however pass regs==NULL at the callers to never trigger this accounting. Then
we can still use one patch for each arch to do the final convertions.

>
> This is a bigger change, but I think it might be worth it to _really_
> consolidate the major/minor logic.
>
> One detail worth noting: I do wonder if we should put the
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamnetally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.
>
> I'm not convinced it's useful to have three SW events that are defined
> to be A=B+C.

IMHO it's still common to have a "total" statistics in softwares even if each
of the subsets are accounted separately. Here it's just a bit special because
there're only two elements so the addition is so straightforward. It seems a
trade-off on whether we'd like to do the accounting of errornous faults, or we
want to make it cleaner by put them altogether but only successful page faults.
I slightly preferred the latter due to the fact that I failed to find great
usefulness out of keeping error fault accountings, but no strong opinions..

Thanks,

--
Peter Xu

2020-06-17 00:57:13

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 00/25] mm: Page fault accounting cleanups

Linus Torvalds <[email protected]> writes:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <[email protected]> wrote:
>> This series tries to address all of them by introducing mm_fault_accounting()
>> first, so that we move all the page fault accounting into the common code base,
>> then call it properly from arch pf handlers just like handle_mm_fault().
>
> Hmm.
>
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.

<snip>

> One detail worth noting: I do wonder if we should put the
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamnetally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.

Yeah I think we should keep it in the arch code at roughly the top.

If it's moved to the end you could have a process spinning taking bad
page faults (and fixing them up), and see no sign of it from the perf
page fault counters.

cheers

2020-06-17 06:22:19

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()



On 16.06.20 18:35, Peter Xu wrote:
> Hi, Alexander,
>
> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>> if (unlikely(fault & VM_FAULT_ERROR))
>>> goto out_up;
>>>
>>> - /*
>>> - * Major/minor page fault accounting is only done on the
>>> - * initial attempt. If we go through a retry, it is extremely
>>> - * likely that the page will be found in page cache at that point.
>>> - */
>>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>> - if (fault & VM_FAULT_MAJOR) {
>>> - tsk->maj_flt++;
>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>> - regs, address);
>>> - } else {
>>> - tsk->min_flt++;
>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>> - regs, address);
>>> - }
>>> if (fault & VM_FAULT_RETRY) {
>>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>>
>> Seems like the call to mm_fault_accounting() will be missed if
>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>> jumps to "out_up"...
>
> This is true as a functional change. However that also means that we've got a
> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> than handled correctly (for instance, due to some try_lock failed during the
> fault process).
>
> To me, that case should not be counted as a page fault at all? Or we might get
> the same duplicated accounting when the page fault retried from a higher stack.
>
> Thanks

This case below (the one with the gmap) is the KVM case for doing a so called
pseudo page fault to our guests. (we notify our guests about major host page
faults and let it reschedule to something else instead of halting the vcpu).
This is being resolved with either gup or fixup_user_fault asynchronously by
KVM code (this can also be sync when the guest does not match some conditions)
We do not change the counters in that code as far as I can tell so we should
continue to do it here.

(see arch/s390/kvm/kvm-s390.c
static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
{
[...]
} else if (current->thread.gmap_pfault) {
trace_kvm_s390_major_guest_pfault(vcpu);
current->thread.gmap_pfault = 0;
if (kvm_arch_setup_async_pf(vcpu))
return 0;
return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
}

2020-06-17 08:08:21

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 00/25] mm: Page fault accounting cleanups

On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> Linus Torvalds <[email protected]> writes:
> > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <[email protected]> wrote:
> >> This series tries to address all of them by introducing mm_fault_accounting()
> >> first, so that we move all the page fault accounting into the common code base,
> >> then call it properly from arch pf handlers just like handle_mm_fault().
> >
> > Hmm.
> >
> > So having looked at this a bit more, I'd actually like to go even
> > further, and just get rid of the per-architecture code _entirely_.
>
> <snip>
>
> > One detail worth noting: I do wonder if we should put the
> >
> > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> >
> > just in the arch code at the top of the fault handling, and consider
> > it entirely unrelated to the major/minor fault handling. The
> > major/minor faults fundamnetally are about successes. But the plain
> > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > things that never even get to this point at all.
>
> Yeah I think we should keep it in the arch code at roughly the top.

I agree. It's a nice idea to consolidate the code, but I don't see that
it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
changing the semantics (and a potentially less useful way. Of course,
moving more of do_page_fault() out of the arch code would be great, but
that's a much bigger effort.

> If it's moved to the end you could have a process spinning taking bad
> page faults (and fixing them up), and see no sign of it from the perf
> page fault counters.

The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
if _all_ of the following are true:

1. The fault isn't handled by kprobes
2. The pagefault handler is enabled
3. We have an mm (current->mm)
4. The fault isn't an unexpected kernel fault on a user address (we oops
and kill the task in this case)

Which loosely corresponds to "we took a fault on a user address that it
looks like we can handle".

That said, I'm happy to tweak this if it brings us into line with other
architectures.

Will

2020-06-17 16:08:33

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()

Hi, Christian,

On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
>
>
> On 16.06.20 18:35, Peter Xu wrote:
> > Hi, Alexander,
> >
> > On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>> if (unlikely(fault & VM_FAULT_ERROR))
> >>> goto out_up;
> >>>
> >>> - /*
> >>> - * Major/minor page fault accounting is only done on the
> >>> - * initial attempt. If we go through a retry, it is extremely
> >>> - * likely that the page will be found in page cache at that point.
> >>> - */
> >>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>> - if (fault & VM_FAULT_MAJOR) {
> >>> - tsk->maj_flt++;
> >>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>> - regs, address);
> >>> - } else {
> >>> - tsk->min_flt++;
> >>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>> - regs, address);
> >>> - }
> >>> if (fault & VM_FAULT_RETRY) {
> >>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {

[1]

> >>
> >> Seems like the call to mm_fault_accounting() will be missed if
> >> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >> jumps to "out_up"...
> >
> > This is true as a functional change. However that also means that we've got a
> > VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> > than handled correctly (for instance, due to some try_lock failed during the
> > fault process).
> >
> > To me, that case should not be counted as a page fault at all? Or we might get
> > the same duplicated accounting when the page fault retried from a higher stack.
> >
> > Thanks
>
> This case below (the one with the gmap) is the KVM case for doing a so called
> pseudo page fault to our guests. (we notify our guests about major host page
> faults and let it reschedule to something else instead of halting the vcpu).
> This is being resolved with either gup or fixup_user_fault asynchronously by
> KVM code (this can also be sync when the guest does not match some conditions)
> We do not change the counters in that code as far as I can tell so we should
> continue to do it here.
>
> (see arch/s390/kvm/kvm-s390.c
> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> {
> [...]
> } else if (current->thread.gmap_pfault) {
> trace_kvm_s390_major_guest_pfault(vcpu);
> current->thread.gmap_pfault = 0;
> if (kvm_arch_setup_async_pf(vcpu))
> return 0;
> return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> }

Please correct me if I'm wrong... but I still think what this patch does is the
right thing to do.

Note again that IMHO when reached [1] above it means the page fault is not
handled correctly so we need to fallback to KVM async page fault, then we
shouldn't increment the accountings until it's finally handled correctly. That
final accounting should be done in the async pf path in gup code where the page
fault is handled:

kvm_arch_fault_in_page
gmap_fault
fixup_user_fault

Where in fixup_user_fault() we have:

if (tsk) {
if (major)
tsk->maj_flt++;
else
tsk->min_flt++;
}

Thanks,

--
Peter Xu

2020-06-17 16:13:20

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 00/25] mm: Page fault accounting cleanups

On Wed, Jun 17, 2020 at 09:04:06AM +0100, Will Deacon wrote:
> On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> > Linus Torvalds <[email protected]> writes:
> > > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <[email protected]> wrote:
> > >> This series tries to address all of them by introducing mm_fault_accounting()
> > >> first, so that we move all the page fault accounting into the common code base,
> > >> then call it properly from arch pf handlers just like handle_mm_fault().
> > >
> > > Hmm.
> > >
> > > So having looked at this a bit more, I'd actually like to go even
> > > further, and just get rid of the per-architecture code _entirely_.
> >
> > <snip>
> >
> > > One detail worth noting: I do wonder if we should put the
> > >
> > > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> > >
> > > just in the arch code at the top of the fault handling, and consider
> > > it entirely unrelated to the major/minor fault handling. The
> > > major/minor faults fundamnetally are about successes. But the plain
> > > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > > things that never even get to this point at all.
> >
> > Yeah I think we should keep it in the arch code at roughly the top.
>
> I agree. It's a nice idea to consolidate the code, but I don't see that
> it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
> changing the semantics (and a potentially less useful way. Of course,
> moving more of do_page_fault() out of the arch code would be great, but
> that's a much bigger effort.
>
> > If it's moved to the end you could have a process spinning taking bad
> > page faults (and fixing them up), and see no sign of it from the perf
> > page fault counters.
>
> The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
> if _all_ of the following are true:
>
> 1. The fault isn't handled by kprobes
> 2. The pagefault handler is enabled
> 3. We have an mm (current->mm)
> 4. The fault isn't an unexpected kernel fault on a user address (we oops
> and kill the task in this case)
>
> Which loosely corresponds to "we took a fault on a user address that it
> looks like we can handle".
>
> That said, I'm happy to tweak this if it brings us into line with other
> architectures.

I see. I'll keep the semantics for PERF_COUNT_SW_PAGE_FAULTS in the next
version. Thanks for all the comments!

--
Peter Xu

2020-06-17 16:18:38

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()



On 17.06.20 18:06, Peter Xu wrote:
> Hi, Christian,
>
> On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 16.06.20 18:35, Peter Xu wrote:
>>> Hi, Alexander,
>>>
>>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>> if (unlikely(fault & VM_FAULT_ERROR))
>>>>> goto out_up;
>>>>>
>>>>> - /*
>>>>> - * Major/minor page fault accounting is only done on the
>>>>> - * initial attempt. If we go through a retry, it is extremely
>>>>> - * likely that the page will be found in page cache at that point.
>>>>> - */
>>>>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>>> - if (fault & VM_FAULT_MAJOR) {
>>>>> - tsk->maj_flt++;
>>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>>>> - regs, address);
>>>>> - } else {
>>>>> - tsk->min_flt++;
>>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>>>> - regs, address);
>>>>> - }
>>>>> if (fault & VM_FAULT_RETRY) {
>>>>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>>>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>
> [1]
>
>>>>
>>>> Seems like the call to mm_fault_accounting() will be missed if
>>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>>>> jumps to "out_up"...
>>>
>>> This is true as a functional change. However that also means that we've got a
>>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
>>> than handled correctly (for instance, due to some try_lock failed during the
>>> fault process).
>>>
>>> To me, that case should not be counted as a page fault at all? Or we might get
>>> the same duplicated accounting when the page fault retried from a higher stack.
>>>
>>> Thanks
>>
>> This case below (the one with the gmap) is the KVM case for doing a so called
>> pseudo page fault to our guests. (we notify our guests about major host page
>> faults and let it reschedule to something else instead of halting the vcpu).
>> This is being resolved with either gup or fixup_user_fault asynchronously by
>> KVM code (this can also be sync when the guest does not match some conditions)
>> We do not change the counters in that code as far as I can tell so we should
>> continue to do it here.
>>
>> (see arch/s390/kvm/kvm-s390.c
>> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
>> {
>> [...]
>> } else if (current->thread.gmap_pfault) {
>> trace_kvm_s390_major_guest_pfault(vcpu);
>> current->thread.gmap_pfault = 0;
>> if (kvm_arch_setup_async_pf(vcpu))
>> return 0;
>> return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
>> }
>
> Please correct me if I'm wrong... but I still think what this patch does is the
> right thing to do.
>
> Note again that IMHO when reached [1] above it means the page fault is not
> handled correctly so we need to fallback to KVM async page fault, then we
> shouldn't increment the accountings until it's finally handled correctly. That
> final accounting should be done in the async pf path in gup code where the page
> fault is handled:
>
> kvm_arch_fault_in_page
> gmap_fault
> fixup_user_fault
>
> Where in fixup_user_fault() we have:
>
> if (tsk) {
> if (major)
> tsk->maj_flt++;
> else
> tsk->min_flt++;
> }
>

Right that case does work. Its the case where we do not inject a pseudo pagefault and
instead fall back to synchronous fault-in.
What is about the other case:

kvm_setup_async_pf
->workqueue
async_pf_execute
get_user_pages_remote

Does get_user_pages_remote do the accounting as well? I cant see that.

2020-06-17 16:47:44

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()

On Wed, Jun 17, 2020 at 06:14:52PM +0200, Christian Borntraeger wrote:
>
>
> On 17.06.20 18:06, Peter Xu wrote:
> > Hi, Christian,
> >
> > On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 16.06.20 18:35, Peter Xu wrote:
> >>> Hi, Alexander,
> >>>
> >>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>>> if (unlikely(fault & VM_FAULT_ERROR))
> >>>>> goto out_up;
> >>>>>
> >>>>> - /*
> >>>>> - * Major/minor page fault accounting is only done on the
> >>>>> - * initial attempt. If we go through a retry, it is extremely
> >>>>> - * likely that the page will be found in page cache at that point.
> >>>>> - */
> >>>>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>>> - if (fault & VM_FAULT_MAJOR) {
> >>>>> - tsk->maj_flt++;
> >>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>>>> - regs, address);
> >>>>> - } else {
> >>>>> - tsk->min_flt++;
> >>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>>>> - regs, address);
> >>>>> - }
> >>>>> if (fault & VM_FAULT_RETRY) {
> >>>>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> >
> > [1]
> >
> >>>>
> >>>> Seems like the call to mm_fault_accounting() will be missed if
> >>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >>>> jumps to "out_up"...
> >>>
> >>> This is true as a functional change. However that also means that we've got a
> >>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> >>> than handled correctly (for instance, due to some try_lock failed during the
> >>> fault process).
> >>>
> >>> To me, that case should not be counted as a page fault at all? Or we might get
> >>> the same duplicated accounting when the page fault retried from a higher stack.
> >>>
> >>> Thanks
> >>
> >> This case below (the one with the gmap) is the KVM case for doing a so called
> >> pseudo page fault to our guests. (we notify our guests about major host page
> >> faults and let it reschedule to something else instead of halting the vcpu).
> >> This is being resolved with either gup or fixup_user_fault asynchronously by
> >> KVM code (this can also be sync when the guest does not match some conditions)
> >> We do not change the counters in that code as far as I can tell so we should
> >> continue to do it here.
> >>
> >> (see arch/s390/kvm/kvm-s390.c
> >> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> >> {
> >> [...]
> >> } else if (current->thread.gmap_pfault) {
> >> trace_kvm_s390_major_guest_pfault(vcpu);
> >> current->thread.gmap_pfault = 0;
> >> if (kvm_arch_setup_async_pf(vcpu))
> >> return 0;
> >> return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> >> }
> >
> > Please correct me if I'm wrong... but I still think what this patch does is the
> > right thing to do.
> >
> > Note again that IMHO when reached [1] above it means the page fault is not
> > handled correctly so we need to fallback to KVM async page fault, then we
> > shouldn't increment the accountings until it's finally handled correctly. That
> > final accounting should be done in the async pf path in gup code where the page
> > fault is handled:
> >
> > kvm_arch_fault_in_page
> > gmap_fault
> > fixup_user_fault
> >
> > Where in fixup_user_fault() we have:
> >
> > if (tsk) {
> > if (major)
> > tsk->maj_flt++;
> > else
> > tsk->min_flt++;
> > }
> >
>
> Right that case does work. Its the case where we do not inject a pseudo pagefault and
> instead fall back to synchronous fault-in.
> What is about the other case:
>
> kvm_setup_async_pf
> ->workqueue
> async_pf_execute
> get_user_pages_remote
>
> Does get_user_pages_remote do the accounting as well? I cant see that.
>

It should, with:

get_user_pages_remote
__get_user_pages_remote
__get_user_pages_locked
__get_user_pages
faultin_page

Where the accounting is done in faultin_page().

We probably need to also move the accounting in faultin_page() to be after the
retry check too, but that should be irrelevant to this patch.

Thanks!

--
Peter Xu

2020-07-20 21:28:57

by Rich Felker

[permalink] [raw]
Subject: Re: [PATCH 20/25] mm/sh: Use mm_fault_accounting()

On Mon, Jun 15, 2020 at 06:23:06PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Yoshinori Sato <[email protected]>
> CC: Rich Felker <[email protected]>
> CC: [email protected]
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/sh/mm/fault.c | 15 +++------------
> 1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
> index 5f23d7907597..06b232973488 100644
> --- a/arch/sh/mm/fault.c
> +++ b/arch/sh/mm/fault.c
> @@ -379,7 +379,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> struct task_struct *tsk;
> struct mm_struct *mm;
> struct vm_area_struct * vma;
> - vm_fault_t fault;
> + vm_fault_t fault, major = 0;
> unsigned int flags = FAULT_FLAG_DEFAULT;
>
> tsk = current;
> @@ -412,8 +412,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> if ((regs->sr & SR_IMASK) != SR_IMASK)
> local_irq_enable();
>
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> -
> /*
> * If we're in an interrupt, have no user context or are running
> * with pagefaults disabled then we must not take the fault:
> @@ -465,21 +463,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> * the fault.
> */
> fault = handle_mm_fault(vma, address, flags);
> + major |= fault & VM_FAULT_MAJOR;
>
> if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
> if (mm_fault_error(regs, error_code, address, fault))
> return;
>
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> - if (fault & VM_FAULT_MAJOR) {
> - tsk->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> - regs, address);
> - } else {
> - tsk->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> - regs, address);
> - }
> if (fault & VM_FAULT_RETRY) {
> flags |= FAULT_FLAG_TRIED;
>
> @@ -493,4 +483,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> }
>
> up_read(&mm->mmap_sem);
> + mm_fault_accounting(tsk, regs, address, major);
> }
> --
> 2.26.2

What's the status on the rest of this series? Do you need any action
from my side (arch/sh) at this time?

Rich

2020-07-20 22:09:08

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 20/25] mm/sh: Use mm_fault_accounting()

Hi, Rich,

On Mon, Jul 20, 2020 at 05:25:24PM -0400, Rich Felker wrote:
> What's the status on the rest of this series? Do you need any action
> from my side (arch/sh) at this time?

This series should have been queued by Andrew in -next. So iiuc no further
action is needed.

Thanks for asking,

--
Peter Xu