2020-01-17 03:46:34

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v2 2/5] mm: introduce external memory hinting API

There is usecase that System Management Software(SMS) want to give
a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
in the case of Android, it is the ActivityManagerService.

It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces new syscall process_madvise(2).
It uses pidfd of an external processs to give the hint.

int process_madvise(int pidfd, void *addr, size_t length, int advise,
unsigned long flag);

Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the
API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky. Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.
Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear hear the usecase and
review it for each hint. It's more safe for maintainace rather than
introducing a buggy syscall but hard to fix it later.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/madvise.c | 63 +++++++++++++++++++++
21 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index e56950f23b49..776c61803315 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -477,3 +477,4 @@
# 545 reserved for clone3
546 common watch_devices sys_watch_devices
547 common openat2 sys_openat2
+548 common process_madvise sys_process_madvise
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 7fb2f4d59210..a43381542276 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -451,3 +451,4 @@
435 common clone3 sys_clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 8aa00ccb0b96..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 438
+#define __NR_compat_syscalls 439
#endif

#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 31f0ce25719e..e3643d7fecc3 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -883,6 +883,8 @@ __SYSCALL(__NR_clone3, sys_clone3)
__SYSCALL(__NR_watch_devices, sys_watch_devices)
#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_process_madvise 438
+__SYSCALL(__NR_process_madvise, process_madvise)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index b9aa59931905..c156abc9a298 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -358,3 +358,4 @@
# 435 reserved for clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 868c1ef89d35..5b6034b6650f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
# 435 reserved for clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 544b4cef18b3..4bef584af09c 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
435 common clone3 sys_clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 05e8aee5dae7..7061b2103438 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -376,3 +376,4 @@
435 n32 clone3 __sys_clone3
436 n32 watch_devices sys_watch_devices
437 n32 openat2 sys_openat2
+438 n32 process_madivse sys_process_madvise
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 24d6c01328fb..84042d57fbfb 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -352,3 +352,4 @@
435 n64 clone3 __sys_clone3
436 n64 watch_devices sys_watch_devices
437 n64 openat2 sys_openat2
+438 n64 process_madvise sys_process_madvise
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 4b5f77a4e1a2..5bfd359c7e6f 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
435 common clone3 sys_clone3_wrapper
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 9716dc85a517..ffa0e679aca0 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -519,3 +519,4 @@
435 nospu clone3 ppc_clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 7da330f8b03e..c301717216ca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
435 common clone3 sys_clone3 sys_clone3
436 common watch_devices sys_watch_devices sys_watch_devices
437 common openat2 sys_openat2 sys_openat2
+438 common process_madvise sys_process_madvise sys_process_madvise
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index bb7e68e25337..b8f15701f69f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
# 435 reserved for clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 646a1fad7218..7ea95f37b222 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -483,3 +483,4 @@
# 435 reserved for clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 57c53acee290..76a2c266fe7e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
435 i386 clone3 sys_clone3 __ia32_sys_clone3
436 i386 watch_devices sys_watch_devices __ia32_sys_watch_devices
437 i386 openat2 sys_openat2 __ia32_sys_openat2
+438 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1dd8d21f6500..b697cd8620cb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
435 common clone3 __x64_sys_clone3/ptregs
436 common watch_devices __x64_sys_watch_devices
437 common openat2 __x64_sys_openat2
+438 common process_madvise __x64_sys_process_madvise

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 0f48ab7bd75b..2e9813ecfd7d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -408,3 +408,4 @@
435 common clone3 sys_clone3
436 common watch_devices sys_watch_devices
437 common openat2 sys_openat2
+438 common process_madvise sys_process_madvise
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 433c8c85636e..1b58a11ff49f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,6 +877,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+ size_t len, int behavior, unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 33f3856a9c3c..4a49fbaea013 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -856,8 +856,11 @@ __SYSCALL(__NR_watch_devices, sys_watch_devices)
#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)

+#define __NR_process_madvise 438
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
+
#undef __NR_syscalls
-#define __NR_syscalls 438
+#define __NR_syscalls 439

/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0e9b275260f8..10ce5eac8b4b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -281,6 +281,7 @@ COND_SYSCALL(mlockall);
COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 0c901de531e4..59b5cc99ef61 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -17,6 +17,7 @@
#include <linux/falloc.h>
#include <linux/fadvise.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -993,6 +994,18 @@ madvise_behavior_valid(int behavior)
}
}

+static bool
+process_madvise_behavior_valid(int behavior)
+{
+ switch (behavior) {
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ return true;
+ default:
+ return false;
+ }
+}
+
/*
* madvise_common - request behavior hint to address range of the target process
*
@@ -1151,6 +1164,10 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
* MADV_DONTDUMP - the application wants to prevent pages in the given range
* from being included in its core dump.
* MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ * MADV_COLD - the application uses the memory less so the kernel can deactivate
+ * the memory to evict them quickly when the memory pressure happen.
+ * MADV_PAGEOUT - the application uses the memroy very rarely so kernel can
+ * page out the memory instantly.
*
* return values:
* zero - success
@@ -1169,3 +1186,49 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
return madvise_common(current, current->mm, start, len_in, behavior);
}
+
+SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
+ size_t, len_in, int, behavior, unsigned long, flags)
+{
+ int ret;
+ struct fd f;
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ if (!process_madvise_behavior_valid(behavior))
+ return -EINVAL;
+
+ f = fdget(pidfd);
+ if (!f.file)
+ return -EBADF;
+
+ pid = pidfd_pid(f.file);
+ if (IS_ERR(pid)) {
+ ret = PTR_ERR(pid);
+ goto fdput;
+ }
+
+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ ret = -ESRCH;
+ goto fdput;
+ }
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ if (IS_ERR_OR_NULL(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ goto release_task;
+ }
+
+ ret = madvise_common(task, mm, start, len_in, behavior);
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+fdput:
+ fdput(f);
+ return ret;
+}
--
2.25.0.rc1.283.g88dfdc4193-goog


2020-01-17 11:53:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> There is usecase that System Management Software(SMS) want to give
> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> in the case of Android, it is the ActivityManagerService.
>
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces new syscall process_madvise(2).
> It uses pidfd of an external processs to give the hint.
>
> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long flag);
>
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrace the process could use it successfully.
> The flag argument is reserved for future use if we need to extend the
> API.
>
> I think supporting all hints madvise has/will supported/support to
> process_madvise is rather risky. Because we are not sure all hints make
> sense from external process and implementation for the hint may rely on
> the caller being in the current context so it could be error-prone.
> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>
> If someone want to add other hints, we could hear hear the usecase and
> review it for each hint. It's more safe for maintainace rather than
> introducing a buggy syscall but hard to fix it later.

I have brought this up when we discussed this in the past but there is
no reflection on that here so let me bring that up again.

I believe that the interface has an inherent problem that it is racy.
The external entity needs to know the address space layout of the target
process to do anyhing useful on it. The address space is however under
the full control of the target process though and the external entity
has no means to find out that the layout has changed. So
time-to-check-time-to-act is an inherent problem.

This is a serious design flaw and it should be explained why it doesn't
matter or how to use the interface properly to prevent that problem.
--
Michal Hocko
SUSE Labs

2020-01-17 15:59:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > There is usecase that System Management Software(SMS) want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > in the case of Android, it is the ActivityManagerService.
> >
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> >
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It uses pidfd of an external processs to give the hint.
> >
> > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > unsigned long flag);
> >
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > The flag argument is reserved for future use if we need to extend the
> > API.
> >
> > I think supporting all hints madvise has/will supported/support to
> > process_madvise is rather risky. Because we are not sure all hints make
> > sense from external process and implementation for the hint may rely on
> > the caller being in the current context so it could be error-prone.
> > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> >
> > If someone want to add other hints, we could hear hear the usecase and
> > review it for each hint. It's more safe for maintainace rather than
> > introducing a buggy syscall but hard to fix it later.
>
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again.
>
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
>
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

I agree, it looks flawed.

Also I don't see what System Management Software can generically do on
sub-process level. I mean how can it decide which part of address space is
less important than other.

I see how a manager can indicate that this process (or a group of
processes) is less important than other, but on per-addres-range basis?

--
Kirill A. Shutemov

2020-01-17 17:28:06

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > There is usecase that System Management Software(SMS) want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > in the case of Android, it is the ActivityManagerService.
> >
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> >
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It uses pidfd of an external processs to give the hint.
> >
> > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > unsigned long flag);
> >
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > The flag argument is reserved for future use if we need to extend the
> > API.
> >
> > I think supporting all hints madvise has/will supported/support to
> > process_madvise is rather risky. Because we are not sure all hints make
> > sense from external process and implementation for the hint may rely on
> > the caller being in the current context so it could be error-prone.
> > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> >
> > If someone want to add other hints, we could hear hear the usecase and
> > review it for each hint. It's more safe for maintainace rather than
> > introducing a buggy syscall but hard to fix it later.
>
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again.
>
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
>
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

Sorry for the missing that part.

It's not a particular problem of this API because other APIs already have
done with that(e.g., move_pages, process_vm_writev).

Point is userspace has several ways for the control of target process
like SIGSTOP, cgroup freezer or even no need to control since platform
is already aware of that the process will never run until he grant it
or it's resilient even though the race happens.

In future, if we want to support more fine-grained consistency model
like memory layout, we could provide some API to get cookie(e.g.,
seq count which is updated whenever vma of the process changes). And then
we could feed the cookie to process_madvise's last argument so that
it can fail if founds it's not matched.
For that API, Daniel already posted RFC - process_getinfo[1].
https://lore.kernel.org/lkml/[email protected]/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

2020-01-17 17:34:00

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > There is usecase that System Management Software(SMS) want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > in the case of Android, it is the ActivityManagerService.
> > >
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > >
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It uses pidfd of an external processs to give the hint.
> > >
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > unsigned long flag);
> > >
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > The flag argument is reserved for future use if we need to extend the
> > > API.
> > >
> > > I think supporting all hints madvise has/will supported/support to
> > > process_madvise is rather risky. Because we are not sure all hints make
> > > sense from external process and implementation for the hint may rely on
> > > the caller being in the current context so it could be error-prone.
> > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > >
> > > If someone want to add other hints, we could hear hear the usecase and
> > > review it for each hint. It's more safe for maintainace rather than
> > > introducing a buggy syscall but hard to fix it later.
> >
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again.
> >
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> >
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
>
> I agree, it looks flawed.
>
> Also I don't see what System Management Software can generically do on
> sub-process level. I mean how can it decide which part of address space is
> less important than other.
>
> I see how a manager can indicate that this process (or a group of
> processes) is less important than other, but on per-addres-range basis?

For example, memory ranges shared by several processes or critical for the
latency, we could avoid those ranges to be cold/pageout to prevent
unncecessary CPU burning/paging.

I also think people don't want to give an KSM hint to non-mergeable area.

2020-01-17 21:28:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > There is usecase that System Management Software(SMS) want to give
> > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > in the case of Android, it is the ActivityManagerService.
> > > >
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > >
> > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > It uses pidfd of an external processs to give the hint.
> > > >
> > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > unsigned long flag);
> > > >
> > > > Since it could affect other process's address range, only privileged
> > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > gives it the right to ptrace the process could use it successfully.
> > > > The flag argument is reserved for future use if we need to extend the
> > > > API.
> > > >
> > > > I think supporting all hints madvise has/will supported/support to
> > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > sense from external process and implementation for the hint may rely on
> > > > the caller being in the current context so it could be error-prone.
> > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > >
> > > > If someone want to add other hints, we could hear hear the usecase and
> > > > review it for each hint. It's more safe for maintainace rather than
> > > > introducing a buggy syscall but hard to fix it later.
> > >
> > > I have brought this up when we discussed this in the past but there is
> > > no reflection on that here so let me bring that up again.
> > >
> > > I believe that the interface has an inherent problem that it is racy.
> > > The external entity needs to know the address space layout of the target
> > > process to do anyhing useful on it. The address space is however under
> > > the full control of the target process though and the external entity
> > > has no means to find out that the layout has changed. So
> > > time-to-check-time-to-act is an inherent problem.
> > >
> > > This is a serious design flaw and it should be explained why it doesn't
> > > matter or how to use the interface properly to prevent that problem.
> >
> > I agree, it looks flawed.
> >
> > Also I don't see what System Management Software can generically do on
> > sub-process level. I mean how can it decide which part of address space is
> > less important than other.
> >
> > I see how a manager can indicate that this process (or a group of
> > processes) is less important than other, but on per-addres-range basis?
>
> For example, memory ranges shared by several processes or critical for the
> latency, we could avoid those ranges to be cold/pageout to prevent
> unncecessary CPU burning/paging.

Hmm.. I still don't see why any external entity has a better (or any)
knowledge about the matter. The process has to do this, no?

> I also think people don't want to give an KSM hint to non-mergeable area.

And how the manager knows which data is mergable?

If you are intimate enough with the process' internal state feel free to
inject syscall into the process with ptrace. Why bother with half-measures?

--
Kirill A. Shutemov

2020-01-18 09:43:11

by SeongJae Park

[permalink] [raw]
Subject: Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Sat, 18 Jan 2020 00:26:53 +0300 "Kirill A. Shutemov" <[email protected]> wrote:

> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > >
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > >
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > >
> > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > unsigned long flag);
> > > > >
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > >
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > >
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > >
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again.
> > > >
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > >
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > >
> > > I agree, it looks flawed.
> > >
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > >
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> >
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
>
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?
>
> > I also think people don't want to give an KSM hint to non-mergeable area.
>
> And how the manager knows which data is mergable?

Couldn't 'idle_page_tracking' like features could be used by the external
manager processes to know that?


Thanks,
SeongJae Park

>
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?
>
> --
> Kirill A. Shutemov
>
>

2020-01-19 16:15:44

by Sandeep Patil

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > >
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > >
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > >
> > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > unsigned long flag);
> > > > >
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > >
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > >
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > >
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again.
> > > >
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > >
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > >
> > > I agree, it looks flawed.
> > >
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > >
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> >
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
>
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?

FWIW, I totally agree with the time-to-check-time-to-react problem. However,
I'd like to clarify the ActivityManager/SystemServer case (I'll call it
SystemServer from now on)

For Android, every application (including the special SystemServer) are forked
from Zygote. The reason ofcourse is to share as many libraries and classes between
the two as possible to benefit from the preloading during boot.

After applications start, (almost) all of the APIs end up calling into this
SystemServer process over IPC (binder) and back to the application.

In a fully running system, the SystemServer monitors every single process
periodically to calculate their PSS / RSS and also decides which process is
"important" to the user for interactivity.

So, because of how these processes start _and_ the fact that the SystemServer
is looping to monitor each process, it does tend to *know* which address
range of the application is not used / useful.

Besides, we can never rely on applications to clean things up themselves.
We've had the "hey app1, the system is low on memory, please trim your
memory usage down" notifications for a long time[1]. They rely on
applications honoring the broadcasts and very few do.

So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant memory in
these applications will be useful.


- ssp

1. https://developer.android.com/topic/performance/memory


>
> > I also think people don't want to give an KSM hint to non-mergeable area.
>
> And how the manager knows which data is mergable?
>
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?
>
> --
> Kirill A. Shutemov

2020-01-20 07:59:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Sun 19-01-20 08:14:31, [email protected] wrote:
> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > > There is usecase that System Management Software(SMS) want to give
> > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > > in the case of Android, it is the ActivityManagerService.
> > > > > >
> > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > > any app involvement.
> > > > > >
> > > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > > It uses pidfd of an external processs to give the hint.
> > > > > >
> > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > > unsigned long flag);
> > > > > >
> > > > > > Since it could affect other process's address range, only privileged
> > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > > gives it the right to ptrace the process could use it successfully.
> > > > > > The flag argument is reserved for future use if we need to extend the
> > > > > > API.
> > > > > >
> > > > > > I think supporting all hints madvise has/will supported/support to
> > > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > > sense from external process and implementation for the hint may rely on
> > > > > > the caller being in the current context so it could be error-prone.
> > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > >
> > > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > > introducing a buggy syscall but hard to fix it later.
> > > > >
> > > > > I have brought this up when we discussed this in the past but there is
> > > > > no reflection on that here so let me bring that up again.
> > > > >
> > > > > I believe that the interface has an inherent problem that it is racy.
> > > > > The external entity needs to know the address space layout of the target
> > > > > process to do anyhing useful on it. The address space is however under
> > > > > the full control of the target process though and the external entity
> > > > > has no means to find out that the layout has changed. So
> > > > > time-to-check-time-to-act is an inherent problem.
> > > > >
> > > > > This is a serious design flaw and it should be explained why it doesn't
> > > > > matter or how to use the interface properly to prevent that problem.
> > > >
> > > > I agree, it looks flawed.
> > > >
> > > > Also I don't see what System Management Software can generically do on
> > > > sub-process level. I mean how can it decide which part of address space is
> > > > less important than other.
> > > >
> > > > I see how a manager can indicate that this process (or a group of
> > > > processes) is less important than other, but on per-addres-range basis?
> > >
> > > For example, memory ranges shared by several processes or critical for the
> > > latency, we could avoid those ranges to be cold/pageout to prevent
> > > unncecessary CPU burning/paging.
> >
> > Hmm.. I still don't see why any external entity has a better (or any)
> > knowledge about the matter. The process has to do this, no?
>
> FWIW, I totally agree with the time-to-check-time-to-react problem. However,
> I'd like to clarify the ActivityManager/SystemServer case (I'll call it
> SystemServer from now on)
>
> For Android, every application (including the special SystemServer) are forked
> from Zygote. The reason ofcourse is to share as many libraries and classes between
> the two as possible to benefit from the preloading during boot.
>
> After applications start, (almost) all of the APIs end up calling into this
> SystemServer process over IPC (binder) and back to the application.
>
> In a fully running system, the SystemServer monitors every single process
> periodically to calculate their PSS / RSS and also decides which process is
> "important" to the user for interactivity.
>
> So, because of how these processes start _and_ the fact that the SystemServer
> is looping to monitor each process, it does tend to *know* which address
> range of the application is not used / useful.
>
> Besides, we can never rely on applications to clean things up themselves.
> We've had the "hey app1, the system is low on memory, please trim your
> memory usage down" notifications for a long time[1]. They rely on
> applications honoring the broadcasts and very few do.
>
> So, if we want to avoid the inevitable killing of the application and
> restarting it, some way to be able to tell the OS about unimportant memory in
> these applications will be useful.

This is a useful information that should be a part of the changelog. I
do see how the current form of the API might fit into Android model
without many problems. But we are not designing an API for a single
usecase, right? In a highly cooperative environments you can use ptrace
code injection as mentioned by Kirill. Or is there any fundamental
problem about that?

The interface really has to be robust to future potential usecases.
--
Michal Hocko
SUSE Labs

2020-01-20 08:04:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Fri 17-01-20 09:25:42, Minchan Kim wrote:
> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > There is usecase that System Management Software(SMS) want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > in the case of Android, it is the ActivityManagerService.
> > >
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > >
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It uses pidfd of an external processs to give the hint.
> > >
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > unsigned long flag);
> > >
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > The flag argument is reserved for future use if we need to extend the
> > > API.
> > >
> > > I think supporting all hints madvise has/will supported/support to
> > > process_madvise is rather risky. Because we are not sure all hints make
> > > sense from external process and implementation for the hint may rely on
> > > the caller being in the current context so it could be error-prone.
> > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > >
> > > If someone want to add other hints, we could hear hear the usecase and
> > > review it for each hint. It's more safe for maintainace rather than
> > > introducing a buggy syscall but hard to fix it later.
> >
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again.
> >
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> >
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
>
> Sorry for the missing that part.
>
> It's not a particular problem of this API because other APIs already have
> done with that(e.g., move_pages, process_vm_writev).

I am sorry but this is not really an argument.

> Point is userspace has several ways for the control of target process
> like SIGSTOP, cgroup freezer or even no need to control since platform
> is already aware of that the process will never run until he grant it
> or it's resilient even though the race happens.

If you have that level of control then you can simply inject the code
via ptrace and you do not need a new syscall in the first place.

> In future, if we want to support more fine-grained consistency model
> like memory layout, we could provide some API to get cookie(e.g.,
> seq count which is updated whenever vma of the process changes). And then
> we could feed the cookie to process_madvise's last argument so that
> it can fail if founds it's not matched.
> For that API, Daniel already posted RFC - process_getinfo[1].
> https://lore.kernel.org/lkml/[email protected]/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

So why do not we start with a clean API since the beginning? I do agree
that a remote madvise is an interesting feature and it opens gates to
all sorts of userspace memory management which is not possible this
days. But the syscall has to have a reasonable semantic to allow that.
We cannot simply start with a half proken symantic first based on an
Android usecase and then hit the wall as soon as others with a different
user space model want to use it as well.
--
Michal Hocko
SUSE Labs

2020-01-20 10:25:58

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On 17.01.2020 14:52, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
>> There is usecase that System Management Software(SMS) want to give
>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
>> in the case of Android, it is the ActivityManagerService.
>>
>> It's similar in spirit to madvise(MADV_WONTNEED), but the information
>> required to make the reclaim decision is not known to the app. Instead,
>> it is known to the centralized userspace daemon(ActivityManagerService),
>> and that daemon must be able to initiate reclaim on its own without
>> any app involvement.
>>
>> To solve the issue, this patch introduces new syscall process_madvise(2).
>> It uses pidfd of an external processs to give the hint.
>>
>> int process_madvise(int pidfd, void *addr, size_t length, int advise,
>> unsigned long flag);
>>
>> Since it could affect other process's address range, only privileged
>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
>> gives it the right to ptrace the process could use it successfully.
>> The flag argument is reserved for future use if we need to extend the
>> API.
>>
>> I think supporting all hints madvise has/will supported/support to
>> process_madvise is rather risky. Because we are not sure all hints make
>> sense from external process and implementation for the hint may rely on
>> the caller being in the current context so it could be error-prone.
>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>>
>> If someone want to add other hints, we could hear hear the usecase and
>> review it for each hint. It's more safe for maintainace rather than
>> introducing a buggy syscall but hard to fix it later.
>
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again.
>
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
>
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

Really, any address space manipulation, where more than one process is
involved, is racy. Even two threads on common memory need a synchronization
to manage mappings in a sane way. Managing memory from two processes
is the same in principle, and the only difference is that another level
of synchronization is required.

I'm not fan and user for this feature (at least for now, nobody knows
what will be in the future), but design flaw argument does not look
fully valid.

Regards,
Kirill

2020-01-20 10:41:32

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On 20.01.2020 10:58, Michal Hocko wrote:
> On Sun 19-01-20 08:14:31, [email protected] wrote:
>> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
>>> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
>>>> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
>>>>> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
>>>>>> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
>>>>>>> There is usecase that System Management Software(SMS) want to give
>>>>>>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
>>>>>>> in the case of Android, it is the ActivityManagerService.
>>>>>>>
>>>>>>> It's similar in spirit to madvise(MADV_WONTNEED), but the information
>>>>>>> required to make the reclaim decision is not known to the app. Instead,
>>>>>>> it is known to the centralized userspace daemon(ActivityManagerService),
>>>>>>> and that daemon must be able to initiate reclaim on its own without
>>>>>>> any app involvement.
>>>>>>>
>>>>>>> To solve the issue, this patch introduces new syscall process_madvise(2).
>>>>>>> It uses pidfd of an external processs to give the hint.
>>>>>>>
>>>>>>> int process_madvise(int pidfd, void *addr, size_t length, int advise,
>>>>>>> unsigned long flag);
>>>>>>>
>>>>>>> Since it could affect other process's address range, only privileged
>>>>>>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
>>>>>>> gives it the right to ptrace the process could use it successfully.
>>>>>>> The flag argument is reserved for future use if we need to extend the
>>>>>>> API.
>>>>>>>
>>>>>>> I think supporting all hints madvise has/will supported/support to
>>>>>>> process_madvise is rather risky. Because we are not sure all hints make
>>>>>>> sense from external process and implementation for the hint may rely on
>>>>>>> the caller being in the current context so it could be error-prone.
>>>>>>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>>>>>>>
>>>>>>> If someone want to add other hints, we could hear hear the usecase and
>>>>>>> review it for each hint. It's more safe for maintainace rather than
>>>>>>> introducing a buggy syscall but hard to fix it later.
>>>>>>
>>>>>> I have brought this up when we discussed this in the past but there is
>>>>>> no reflection on that here so let me bring that up again.
>>>>>>
>>>>>> I believe that the interface has an inherent problem that it is racy.
>>>>>> The external entity needs to know the address space layout of the target
>>>>>> process to do anyhing useful on it. The address space is however under
>>>>>> the full control of the target process though and the external entity
>>>>>> has no means to find out that the layout has changed. So
>>>>>> time-to-check-time-to-act is an inherent problem.
>>>>>>
>>>>>> This is a serious design flaw and it should be explained why it doesn't
>>>>>> matter or how to use the interface properly to prevent that problem.
>>>>>
>>>>> I agree, it looks flawed.
>>>>>
>>>>> Also I don't see what System Management Software can generically do on
>>>>> sub-process level. I mean how can it decide which part of address space is
>>>>> less important than other.
>>>>>
>>>>> I see how a manager can indicate that this process (or a group of
>>>>> processes) is less important than other, but on per-addres-range basis?
>>>>
>>>> For example, memory ranges shared by several processes or critical for the
>>>> latency, we could avoid those ranges to be cold/pageout to prevent
>>>> unncecessary CPU burning/paging.
>>>
>>> Hmm.. I still don't see why any external entity has a better (or any)
>>> knowledge about the matter. The process has to do this, no?
>>
>> FWIW, I totally agree with the time-to-check-time-to-react problem. However,
>> I'd like to clarify the ActivityManager/SystemServer case (I'll call it
>> SystemServer from now on)
>>
>> For Android, every application (including the special SystemServer) are forked
>> from Zygote. The reason ofcourse is to share as many libraries and classes between
>> the two as possible to benefit from the preloading during boot.
>>
>> After applications start, (almost) all of the APIs end up calling into this
>> SystemServer process over IPC (binder) and back to the application.
>>
>> In a fully running system, the SystemServer monitors every single process
>> periodically to calculate their PSS / RSS and also decides which process is
>> "important" to the user for interactivity.
>>
>> So, because of how these processes start _and_ the fact that the SystemServer
>> is looping to monitor each process, it does tend to *know* which address
>> range of the application is not used / useful.
>>
>> Besides, we can never rely on applications to clean things up themselves.
>> We've had the "hey app1, the system is low on memory, please trim your
>> memory usage down" notifications for a long time[1]. They rely on
>> applications honoring the broadcasts and very few do.
>>
>> So, if we want to avoid the inevitable killing of the application and
>> restarting it, some way to be able to tell the OS about unimportant memory in
>> these applications will be useful.
>
> This is a useful information that should be a part of the changelog. I
> do see how the current form of the API might fit into Android model
> without many problems. But we are not designing an API for a single
> usecase, right? In a highly cooperative environments you can use ptrace
> code injection as mentioned by Kirill. Or is there any fundamental
> problem about that?

There could be only problems with multi-threads applications, which
poll the state of another threads (and exit with error, when they
found that someone is traced). But this may be workarounded by freezer
cgroup making all the thread group is frozen. It should guarantee
consistent state of the whole process after attaching.

Kirill

2020-01-20 11:28:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> On 17.01.2020 14:52, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> >> There is usecase that System Management Software(SMS) want to give
> >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> >> in the case of Android, it is the ActivityManagerService.
> >>
> >> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> >> required to make the reclaim decision is not known to the app. Instead,
> >> it is known to the centralized userspace daemon(ActivityManagerService),
> >> and that daemon must be able to initiate reclaim on its own without
> >> any app involvement.
> >>
> >> To solve the issue, this patch introduces new syscall process_madvise(2).
> >> It uses pidfd of an external processs to give the hint.
> >>
> >> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> >> unsigned long flag);
> >>
> >> Since it could affect other process's address range, only privileged
> >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> >> gives it the right to ptrace the process could use it successfully.
> >> The flag argument is reserved for future use if we need to extend the
> >> API.
> >>
> >> I think supporting all hints madvise has/will supported/support to
> >> process_madvise is rather risky. Because we are not sure all hints make
> >> sense from external process and implementation for the hint may rely on
> >> the caller being in the current context so it could be error-prone.
> >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> >>
> >> If someone want to add other hints, we could hear hear the usecase and
> >> review it for each hint. It's more safe for maintainace rather than
> >> introducing a buggy syscall but hard to fix it later.
> >
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again.
> >
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> >
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
>
> Really, any address space manipulation, where more than one process is
> involved, is racy.

They are, indeed. But that is not the point I wanted to make.

> Even two threads on common memory need a synchronization
> to manage mappings in a sane way. Managing memory from two processes
> is the same in principle, and the only difference is that another level
> of synchronization is required.

Well, not really. The operation might simply attempt to perform an
operation on a specific memory area and get a failure if it doesn't
reference the same object anymore. What I think we need is some form of
a handle to operate on. In the past we have discussed several
directions. I was proposing /proc/self/map_anon/ (analogous to
map_files) where you could inspect anonymous memory and get a file
handle for it. madvise would then operate on the fd and then there
shouldn't be a real problem to revalidate that the object is still
valid. But there was no general enthusiasm about that approach. There
are likely some land mines on the way.

There was another approach mentioned by Minchan in other email in this
thread.

What I want to say is that I believe a remove madvise can have a
sensible semantic even without a strong synchronization between the
monitor and the target task. We just have to make sure that the monitor
never operates on a different object then it believes it acts on.

On the other hand, there will never be any way to make this interface
reasonable for destructive operations though because the content of the
memory needs a strong synchronization IMHO.
--
Michal Hocko
SUSE Labs

2020-01-20 12:40:47

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> > On 17.01.2020 14:52, Michal Hocko wrote:
> > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > >> There is usecase that System Management Software(SMS) want to give
> > >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > >> in the case of Android, it is the ActivityManagerService.
> > >>
> > >> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > >> required to make the reclaim decision is not known to the app. Instead,
> > >> it is known to the centralized userspace daemon(ActivityManagerService),
> > >> and that daemon must be able to initiate reclaim on its own without
> > >> any app involvement.
> > >>
> > >> To solve the issue, this patch introduces new syscall process_madvise(2).
> > >> It uses pidfd of an external processs to give the hint.
> > >>
> > >> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > >> unsigned long flag);
> > >>
> > >> Since it could affect other process's address range, only privileged
> > >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > >> gives it the right to ptrace the process could use it successfully.
> > >> The flag argument is reserved for future use if we need to extend the
> > >> API.
> > >>
> > >> I think supporting all hints madvise has/will supported/support to
> > >> process_madvise is rather risky. Because we are not sure all hints make
> > >> sense from external process and implementation for the hint may rely on
> > >> the caller being in the current context so it could be error-prone.
> > >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > >>
> > >> If someone want to add other hints, we could hear hear the usecase and
> > >> review it for each hint. It's more safe for maintainace rather than
> > >> introducing a buggy syscall but hard to fix it later.
> > >
> > > I have brought this up when we discussed this in the past but there is
> > > no reflection on that here so let me bring that up again.
> > >
> > > I believe that the interface has an inherent problem that it is racy.
> > > The external entity needs to know the address space layout of the target
> > > process to do anyhing useful on it. The address space is however under
> > > the full control of the target process though and the external entity
> > > has no means to find out that the layout has changed. So
> > > time-to-check-time-to-act is an inherent problem.
> > >
> > > This is a serious design flaw and it should be explained why it doesn't
> > > matter or how to use the interface properly to prevent that problem.
> >
> > Really, any address space manipulation, where more than one process is
> > involved, is racy.
>
> They are, indeed. But that is not the point I wanted to make.
>
> > Even two threads on common memory need a synchronization
> > to manage mappings in a sane way. Managing memory from two processes
> > is the same in principle, and the only difference is that another level
> > of synchronization is required.
>
> Well, not really. The operation might simply attempt to perform an
> operation on a specific memory area and get a failure if it doesn't
> reference the same object anymore. What I think we need is some form of
> a handle to operate on. In the past we have discussed several
> directions. I was proposing /proc/self/map_anon/ (analogous to
> map_files) where you could inspect anonymous memory and get a file
> handle for it. madvise would then operate on the fd and then there
> shouldn't be a real problem to revalidate that the object is still
> valid. But there was no general enthusiasm about that approach. There
> are likely some land mines on the way.

Converting anon memory to file-backed is bad idea and going to backfire.
It will break many things around rmap (for THP in particular). We have
assumption that an anon page cannot be mapped twice in the same process.
Breaking rules around memory inheritance is not helpful too.

It is a major re-design of the subsystem and I don't see any real need for
this.

--
Kirill A. Shutemov

2020-01-20 13:25:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
[...]
> > > Even two threads on common memory need a synchronization
> > > to manage mappings in a sane way. Managing memory from two processes
> > > is the same in principle, and the only difference is that another level
> > > of synchronization is required.
> >
> > Well, not really. The operation might simply attempt to perform an
> > operation on a specific memory area and get a failure if it doesn't
> > reference the same object anymore. What I think we need is some form of
> > a handle to operate on. In the past we have discussed several
> > directions. I was proposing /proc/self/map_anon/ (analogous to
> > map_files) where you could inspect anonymous memory and get a file
> > handle for it. madvise would then operate on the fd and then there
> > shouldn't be a real problem to revalidate that the object is still
> > valid. But there was no general enthusiasm about that approach. There
> > are likely some land mines on the way.
>
> Converting anon memory to file-backed is bad idea and going to backfire.

I didn't mean to convert. I meant to expose that information via proc
the same way we do for file backed mappings. That shouldn't really
require to re-design the way how anonymous vma work IMO. But I haven't
tried that so there might be many gotchas there.

There are obvious things to think about though. Such fd cannot be sent
to other processes (SCM stuff), mmap of the file would have to be
disallowed and many others I am not aware of. I am not even pushing this
direction because I am not convinced about how viable it is myself. But
it would sound like a nice extension of the existing mechanism we have
and a file based madvise sounds attractive to me as well because we
already have that.
--
Michal Hocko
SUSE Labs

2020-01-20 14:23:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> [...]
> > > > Even two threads on common memory need a synchronization
> > > > to manage mappings in a sane way. Managing memory from two processes
> > > > is the same in principle, and the only difference is that another level
> > > > of synchronization is required.
> > >
> > > Well, not really. The operation might simply attempt to perform an
> > > operation on a specific memory area and get a failure if it doesn't
> > > reference the same object anymore. What I think we need is some form of
> > > a handle to operate on. In the past we have discussed several
> > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > map_files) where you could inspect anonymous memory and get a file
> > > handle for it. madvise would then operate on the fd and then there
> > > shouldn't be a real problem to revalidate that the object is still
> > > valid. But there was no general enthusiasm about that approach. There
> > > are likely some land mines on the way.
> >
> > Converting anon memory to file-backed is bad idea and going to backfire.
>
> I didn't mean to convert. I meant to expose that information via proc
> the same way we do for file backed mappings. That shouldn't really
> require to re-design the way how anonymous vma work IMO. But I haven't
> tried that so there might be many gotchas there.
>
> There are obvious things to think about though. Such fd cannot be sent
> to other processes (SCM stuff), mmap of the file would have to be
> disallowed and many others I am not aware of. I am not even pushing this
> direction because I am not convinced about how viable it is myself. But
> it would sound like a nice extension of the existing mechanism we have
> and a file based madvise sounds attractive to me as well because we
> already have that.

If the fd cannot be passed around or mmaped what does it represent?
And how is it different from plain address?

--
Kirill A. Shutemov

2020-01-20 15:47:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon 20-01-20 17:21:32, Kirill A. Shutemov wrote:
> On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> > On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> > [...]
> > > > > Even two threads on common memory need a synchronization
> > > > > to manage mappings in a sane way. Managing memory from two processes
> > > > > is the same in principle, and the only difference is that another level
> > > > > of synchronization is required.
> > > >
> > > > Well, not really. The operation might simply attempt to perform an
> > > > operation on a specific memory area and get a failure if it doesn't
> > > > reference the same object anymore. What I think we need is some form of
> > > > a handle to operate on. In the past we have discussed several
> > > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > > map_files) where you could inspect anonymous memory and get a file
> > > > handle for it. madvise would then operate on the fd and then there
> > > > shouldn't be a real problem to revalidate that the object is still
> > > > valid. But there was no general enthusiasm about that approach. There
> > > > are likely some land mines on the way.
> > >
> > > Converting anon memory to file-backed is bad idea and going to backfire.
> >
> > I didn't mean to convert. I meant to expose that information via proc
> > the same way we do for file backed mappings. That shouldn't really
> > require to re-design the way how anonymous vma work IMO. But I haven't
> > tried that so there might be many gotchas there.
> >
> > There are obvious things to think about though. Such fd cannot be sent
> > to other processes (SCM stuff), mmap of the file would have to be
> > disallowed and many others I am not aware of. I am not even pushing this
> > direction because I am not convinced about how viable it is myself. But
> > it would sound like a nice extension of the existing mechanism we have
> > and a file based madvise sounds attractive to me as well because we
> > already have that.
>
> If the fd cannot be passed around or mmaped what does it represent?

Because it is a cookie maintained by the kernel.

> And how is it different from plain address?

Kernel can revalidate that the given fd is denoting the vma it was
created for and simply fail with ENOENT or whatever suits if somebody
did unmap and mmap to the same address.
--
Michal Hocko
SUSE Labs

2020-01-21 18:12:31

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > >
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > >
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > >
> > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > unsigned long flag);
> > > > >
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > >
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > >
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > >
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again.
> > > >
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > >
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > >
> > > I agree, it looks flawed.
> > >
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > >
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> >
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
>
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?

I think Sandeep already gave enough information in other thread.

>
> > I also think people don't want to give an KSM hint to non-mergeable area.
>
> And how the manager knows which data is mergable?

Oleksandr, could you say your thought why you need address range based
API?

>
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?

Concern is we want to act the hint in caller's context, not calle because
calle is usually very limited in cpuset/cgroups or even freezed state so
they couldn't act by themselves quick enough, which makes many problems.
One of efforts to solve the issue was "Expedited memory reclaim"

https://lwn.net/Articles/785709/

That could be also a good candidate for process_madvise API.

2020-01-21 18:33:34

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> On Sun 19-01-20 08:14:31, [email protected] wrote:
> > On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > > > There is usecase that System Management Software(SMS) want to give
> > > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > > > in the case of Android, it is the ActivityManagerService.
> > > > > > >
> > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > > > any app involvement.
> > > > > > >
> > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > > > It uses pidfd of an external processs to give the hint.
> > > > > > >
> > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > > > unsigned long flag);
> > > > > > >
> > > > > > > Since it could affect other process's address range, only privileged
> > > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > > > gives it the right to ptrace the process could use it successfully.
> > > > > > > The flag argument is reserved for future use if we need to extend the
> > > > > > > API.
> > > > > > >
> > > > > > > I think supporting all hints madvise has/will supported/support to
> > > > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > > > sense from external process and implementation for the hint may rely on
> > > > > > > the caller being in the current context so it could be error-prone.
> > > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > > >
> > > > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > > > introducing a buggy syscall but hard to fix it later.
> > > > > >
> > > > > > I have brought this up when we discussed this in the past but there is
> > > > > > no reflection on that here so let me bring that up again.
> > > > > >
> > > > > > I believe that the interface has an inherent problem that it is racy.
> > > > > > The external entity needs to know the address space layout of the target
> > > > > > process to do anyhing useful on it. The address space is however under
> > > > > > the full control of the target process though and the external entity
> > > > > > has no means to find out that the layout has changed. So
> > > > > > time-to-check-time-to-act is an inherent problem.
> > > > > >
> > > > > > This is a serious design flaw and it should be explained why it doesn't
> > > > > > matter or how to use the interface properly to prevent that problem.
> > > > >
> > > > > I agree, it looks flawed.
> > > > >
> > > > > Also I don't see what System Management Software can generically do on
> > > > > sub-process level. I mean how can it decide which part of address space is
> > > > > less important than other.
> > > > >
> > > > > I see how a manager can indicate that this process (or a group of
> > > > > processes) is less important than other, but on per-addres-range basis?
> > > >
> > > > For example, memory ranges shared by several processes or critical for the
> > > > latency, we could avoid those ranges to be cold/pageout to prevent
> > > > unncecessary CPU burning/paging.
> > >
> > > Hmm.. I still don't see why any external entity has a better (or any)
> > > knowledge about the matter. The process has to do this, no?
> >
> > FWIW, I totally agree with the time-to-check-time-to-react problem. However,
> > I'd like to clarify the ActivityManager/SystemServer case (I'll call it
> > SystemServer from now on)
> >
> > For Android, every application (including the special SystemServer) are forked
> > from Zygote. The reason ofcourse is to share as many libraries and classes between
> > the two as possible to benefit from the preloading during boot.
> >
> > After applications start, (almost) all of the APIs end up calling into this
> > SystemServer process over IPC (binder) and back to the application.
> >
> > In a fully running system, the SystemServer monitors every single process
> > periodically to calculate their PSS / RSS and also decides which process is
> > "important" to the user for interactivity.
> >
> > So, because of how these processes start _and_ the fact that the SystemServer
> > is looping to monitor each process, it does tend to *know* which address
> > range of the application is not used / useful.
> >
> > Besides, we can never rely on applications to clean things up themselves.
> > We've had the "hey app1, the system is low on memory, please trim your
> > memory usage down" notifications for a long time[1]. They rely on
> > applications honoring the broadcasts and very few do.
> >
> > So, if we want to avoid the inevitable killing of the application and
> > restarting it, some way to be able to tell the OS about unimportant memory in
> > these applications will be useful.

Thanks for adding more useful description, Sandeep.

>
> This is a useful information that should be a part of the changelog. I

I will incldue it in next respin.

> do see how the current form of the API might fit into Android model
> without many problems. But we are not designing an API for a single
> usecase, right? In a highly cooperative environments you can use ptrace
> code injection as mentioned by Kirill. Or is there any fundamental
> problem about that?

I replied it at Kirill's thread so let's discuss there.

>
> The interface really has to be robust to future potential usecases.

I do understand your concern but for me, it's chicken and egg problem.
We usually do best effort to make something perfect as far as possible
but we also don't do over-engineering without real usecase from the
beginning.

I already told you how we could synchronize among processes and potential
way to be extended Daniel suggested(That's why current API has extra field
for the cookie) even though we don't need it right now.

If you want to suggest the other way, please explain why your idea is
better and why we need it at this moment. I don't think that is a
blocker for the progress of this API since we already have several ways
to synchronize processes.

2020-01-21 18:44:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> [...]
> > > > Even two threads on common memory need a synchronization
> > > > to manage mappings in a sane way. Managing memory from two processes
> > > > is the same in principle, and the only difference is that another level
> > > > of synchronization is required.
> > >
> > > Well, not really. The operation might simply attempt to perform an
> > > operation on a specific memory area and get a failure if it doesn't
> > > reference the same object anymore. What I think we need is some form of
> > > a handle to operate on. In the past we have discussed several
> > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > map_files) where you could inspect anonymous memory and get a file
> > > handle for it. madvise would then operate on the fd and then there
> > > shouldn't be a real problem to revalidate that the object is still
> > > valid. But there was no general enthusiasm about that approach. There
> > > are likely some land mines on the way.
> >
> > Converting anon memory to file-backed is bad idea and going to backfire.
>
> I didn't mean to convert. I meant to expose that information via proc
> the same way we do for file backed mappings. That shouldn't really
> require to re-design the way how anonymous vma work IMO. But I haven't
> tried that so there might be many gotchas there.
>
> There are obvious things to think about though. Such fd cannot be sent
> to other processes (SCM stuff), mmap of the file would have to be
> disallowed and many others I am not aware of. I am not even pushing this
> direction because I am not convinced about how viable it is myself. But
> it would sound like a nice extension of the existing mechanism we have
> and a file based madvise sounds attractive to me as well because we
> already have that.

I am not a fan of fd based approach but I already reserved last argument
of the API as extendable field so we could use the field as "fd" when we
really need that kinds of fine-grained synchronization model if it's not
enough with SGISTOP, freezer and so.

2020-01-22 08:30:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
[...]
> > The interface really has to be robust to future potential usecases.
>
> I do understand your concern but for me, it's chicken and egg problem.
> We usually do best effort to make something perfect as far as possible
> but we also don't do over-engineering without real usecase from the
> beginning.
>
> I already told you how we could synchronize among processes and potential
> way to be extended Daniel suggested(That's why current API has extra field
> for the cookie) even though we don't need it right now.

If you can synchronize with the target task then you do not need a
remote interface. Just use ptrace and you are done with it.

> If you want to suggest the other way, please explain why your idea is
> better and why we need it at this moment.

I believe I have explained my concerns and why they matter. All you are
saying is that you do not care because your particular usecase doesn't
care. And that is a first signal of a future disaster when we end up
with a broken and unfixable interface we have to maintain for ever.

I will not go as far as to nack this but you should seriously think
about other potential usecases and how they would work and what we are
going to do when a first non-cooperative userspace memory management
usecase materializes.
--
Michal Hocko
SUSE Labs

2020-01-22 09:52:41

by SeongJae Park

[permalink] [raw]
Subject: Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <[email protected]> wrote:

> On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> [...]
> > > The interface really has to be robust to future potential usecases.
> >
> > I do understand your concern but for me, it's chicken and egg problem.
> > We usually do best effort to make something perfect as far as possible
> > but we also don't do over-engineering without real usecase from the
> > beginning.
> >
> > I already told you how we could synchronize among processes and potential
> > way to be extended Daniel suggested(That's why current API has extra field
> > for the cookie) even though we don't need it right now.
>
> If you can synchronize with the target task then you do not need a
> remote interface. Just use ptrace and you are done with it.
>
> > If you want to suggest the other way, please explain why your idea is
> > better and why we need it at this moment.
>
> I believe I have explained my concerns and why they matter. All you are
> saying is that you do not care because your particular usecase doesn't
> care. And that is a first signal of a future disaster when we end up
> with a broken and unfixable interface we have to maintain for ever.
>
> I will not go as far as to nack this but you should seriously think
> about other potential usecases and how they would work and what we are
> going to do when a first non-cooperative userspace memory management
> usecase materializes.

Beside of the specific environment of Android, I think there are many ways to
know the address space layout and access patterns of other processes. The
idle_page_tracking might be an example that widelay available.

Of course, the information might not strictly correct due to the timing issue,
but could be still worth to be used under some extreme situations, such as
memory pressure or fragmentation. For the same reason, ptrace() would not be
sufficient, as we have no perfect control, but only some level of control that
would be useful under specific situations.

I assume the users of this systemcall would understand the tradeoff and make
decisions. Also, as the users already have the right to do the tradeoff, I
think it's fair. In other words, I think the caller has both the power and the
responsibility to deal with the time-to-check-time-to-react problem.

Nonetheless, I also agree this is important concern and the patch would be
better if it adds more detailed documentation regarding this issue.

If I'm missing some points or you have different opinions, please let me know.


Thanks,
SeongJae Park

> --
> Michal Hocko
> SUSE Labs
>

2020-01-22 10:04:09

by Michal Hocko

[permalink] [raw]
Subject: Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed 22-01-20 10:36:24, SeongJae Park wrote:
> On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> > [...]
> > > > The interface really has to be robust to future potential usecases.
> > >
> > > I do understand your concern but for me, it's chicken and egg problem.
> > > We usually do best effort to make something perfect as far as possible
> > > but we also don't do over-engineering without real usecase from the
> > > beginning.
> > >
> > > I already told you how we could synchronize among processes and potential
> > > way to be extended Daniel suggested(That's why current API has extra field
> > > for the cookie) even though we don't need it right now.
> >
> > If you can synchronize with the target task then you do not need a
> > remote interface. Just use ptrace and you are done with it.
> >
> > > If you want to suggest the other way, please explain why your idea is
> > > better and why we need it at this moment.
> >
> > I believe I have explained my concerns and why they matter. All you are
> > saying is that you do not care because your particular usecase doesn't
> > care. And that is a first signal of a future disaster when we end up
> > with a broken and unfixable interface we have to maintain for ever.
> >
> > I will not go as far as to nack this but you should seriously think
> > about other potential usecases and how they would work and what we are
> > going to do when a first non-cooperative userspace memory management
> > usecase materializes.
>
> Beside of the specific environment of Android, I think there are many ways to
> know the address space layout and access patterns of other processes. The
> idle_page_tracking might be an example that widelay available.
>
> Of course, the information might not strictly correct due to the timing issue,
> but could be still worth to be used under some extreme situations, such as
> memory pressure or fragmentation. For the same reason, ptrace() would not be
> sufficient, as we have no perfect control, but only some level of control that
> would be useful under specific situations.

I am not sure I see your point. I am talking about races where a remote
task is operating on a completely different object because the one it
checked for has been unmapped and new one mapped over it. Memory
pressure or a fragmentation will not change the object itself. Sure the
memory might be reclaimed but that should be completely OK unless I am
missing something.

> I assume the users of this systemcall would understand the tradeoff and make
> decisions.

I disagree. My experience tells me that users tend to squeeze the
maximum and beyond and hope they get what they want.

> Also, as the users already have the right to do the tradeoff, I
> think it's fair. In other words, I think the caller has both the power and the
> responsibility to deal with the time-to-check-time-to-react problem.
>
> Nonetheless, I also agree this is important concern and the patch would be
> better if it adds more detailed documentation regarding this issue.

If there is _really_ a strong consensus that the racy interface is
reasonable then it absolutely has to be described with a clearly state
that those races might result in hard to predict behavior unless all
tasks sharing the address space are blocked between the check and the
madvise call.
--
Michal Hocko
SUSE Labs

2020-01-22 10:45:54

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

Hello.

On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote:
> > > I also think people don't want to give an KSM hint to non-mergeable area.
> >
> > And how the manager knows which data is mergable?
>
> Oleksandr, could you say your thought why you need address range based
> API?

It seems I've overlooked an important piece of this submission: one
cannot apply the hint to all the anonymous mapping regardless of address
range. For KSM I'd rather either have a possibility to hint all the
anonymous mappings, or, as it was suggested previously, be able to iterate
over existing mappings using some (fd-based?) API.

--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer

2020-01-22 13:30:02

by SeongJae Park

[permalink] [raw]
Subject: Re: Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed, 22 Jan 2020 11:02:33 +0100 Michal Hocko <[email protected]> wrote:

> On Wed 22-01-20 10:36:24, SeongJae Park wrote:
> > On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <[email protected]> wrote:
> >
> > > On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > The interface really has to be robust to future potential usecases.
> > > >
> > > > I do understand your concern but for me, it's chicken and egg problem.
> > > > We usually do best effort to make something perfect as far as possible
> > > > but we also don't do over-engineering without real usecase from the
> > > > beginning.
> > > >
> > > > I already told you how we could synchronize among processes and potential
> > > > way to be extended Daniel suggested(That's why current API has extra field
> > > > for the cookie) even though we don't need it right now.
> > >
> > > If you can synchronize with the target task then you do not need a
> > > remote interface. Just use ptrace and you are done with it.
> > >
> > > > If you want to suggest the other way, please explain why your idea is
> > > > better and why we need it at this moment.
> > >
> > > I believe I have explained my concerns and why they matter. All you are
> > > saying is that you do not care because your particular usecase doesn't
> > > care. And that is a first signal of a future disaster when we end up
> > > with a broken and unfixable interface we have to maintain for ever.
> > >
> > > I will not go as far as to nack this but you should seriously think
> > > about other potential usecases and how they would work and what we are
> > > going to do when a first non-cooperative userspace memory management
> > > usecase materializes.
> >
> > Beside of the specific environment of Android, I think there are many ways to
> > know the address space layout and access patterns of other processes. The
> > idle_page_tracking might be an example that widelay available.
> >
> > Of course, the information might not strictly correct due to the timing issue,
> > but could be still worth to be used under some extreme situations, such as
> > memory pressure or fragmentation. For the same reason, ptrace() would not be
> > sufficient, as we have no perfect control, but only some level of control that
> > would be useful under specific situations.
>
> I am not sure I see your point. I am talking about races where a remote
> task is operating on a completely different object because the one it
> checked for has been unmapped and new one mapped over it. Memory
> pressure or a fragmentation will not change the object itself. Sure the
> memory might be reclaimed but that should be completely OK unless I am
> missing something.

Thank you for pointing out your concerns in more detail. I was assuming a case
using MADV_PAGEOUT or MADV_HUGEPAGE like hints under access frequency
monitoring for better performance under memory pressure or fragmentation,
respectively. Under the race, such hints might incur some performance
degradation, but no critical problem such as SEGV. I previously implemented
such optimization for research purpose and it was worthy. Nonetheless, it was
just a research purpose hack.

MADV_FREE like hints might result in SEGV and thus of course should be avoided.
But, to my perspective, the 4 hints madvise_process() is currently supporting
(COLD, PAGEOUT, MERGEABLE, UNMERGEABLE) are not too risky even under the race.
That's why I said the incorrect information could be worth to be used under
some extreme situations.

>
> > I assume the users of this systemcall would understand the tradeoff and make
> > decisions.
>
> I disagree. My experience tells me that users tend to squeeze the
> maximum and beyond and hope they get what they want.
>
> > Also, as the users already have the right to do the tradeoff, I
> > think it's fair. In other words, I think the caller has both the power and the
> > responsibility to deal with the time-to-check-time-to-react problem.
> >
> > Nonetheless, I also agree this is important concern and the patch would be
> > better if it adds more detailed documentation regarding this issue.
>
> If there is _really_ a strong consensus that the racy interface is
> reasonable then it absolutely has to be described with a clearly state
> that those races might result in hard to predict behavior unless all
> tasks sharing the address space are blocked between the check and the
> madvise call.

So, it's still too risky to simply believe users to do the things well on their
responsibility, but a strong real consensus on needs and clear description
might justify this. I also agreed.


Thanks,
SeongJae Park

> --
> Michal Hocko
> SUSE Labs
>

2020-01-23 01:42:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed, Jan 22, 2020 at 09:28:53AM +0100, Michal Hocko wrote:
> On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> [...]
> > > The interface really has to be robust to future potential usecases.
> >
> > I do understand your concern but for me, it's chicken and egg problem.
> > We usually do best effort to make something perfect as far as possible
> > but we also don't do over-engineering without real usecase from the
> > beginning.
> >
> > I already told you how we could synchronize among processes and potential
> > way to be extended Daniel suggested(That's why current API has extra field
> > for the cookie) even though we don't need it right now.
>
> If you can synchronize with the target task then you do not need a
> remote interface. Just use ptrace and you are done with it.

As I mentioned in other reply, we want to do in caller's context, not
callee's one because target processes stay in little cores, which are
much slower than the core the manager lives in.
The other reason is the apps are already freezed so they couldn't response
by ptrace.

>
> > If you want to suggest the other way, please explain why your idea is
> > better and why we need it at this moment.
>
> I believe I have explained my concerns and why they matter. All you are
> saying is that you do not care because your particular usecase doesn't
> care. And that is a first signal of a future disaster when we end up
> with a broken and unfixable interface we have to maintain for ever.

We already had suggested cookie and fd based approaches so I reserved a
argument for that to make the API extendable.
Thing is currently it's a just optimization idea since we have several
ways to sychronize processes(e.g., signal, cgroup freezer, userfaultfd
and so). It's a just matter of granularity, not necessary one we should
introduce it from the beginnig.
If someone needs that kinds of fine-grained consistency, we could extend
it then. And that's the usual way we make progress when we couldn't
know the future.

What do you want to see further?

2020-01-23 01:45:35

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed, Jan 22, 2020 at 11:44:24AM +0100, Oleksandr Natalenko wrote:
> Hello.
>
> On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote:
> > > > I also think people don't want to give an KSM hint to non-mergeable area.
> > >
> > > And how the manager knows which data is mergable?
> >
> > Oleksandr, could you say your thought why you need address range based
> > API?
>
> It seems I've overlooked an important piece of this submission: one
> cannot apply the hint to all the anonymous mapping regardless of address
> range. For KSM I'd rather either have a possibility to hint all the
> anonymous mappings, or, as it was suggested previously, be able to iterate
> over existing mappings using some (fd-based?) API.

Thing is how you could identify a certan range is better for KSM than
others from external process?

2020-01-23 07:30:39

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed, Jan 22, 2020 at 05:43:16PM -0800, Minchan Kim wrote:
> > It seems I've overlooked an important piece of this submission: one
> > cannot apply the hint to all the anonymous mapping regardless of address
> > range. For KSM I'd rather either have a possibility to hint all the
> > anonymous mappings, or, as it was suggested previously, be able to iterate
> > over existing mappings using some (fd-based?) API.
>
> Thing is how you could identify a certan range is better for KSM than
> others from external process?

I think the info like this is kinda available via /proc/pid/smaps. It
lists the ranges and the vmflags. But using it raises 2 concerns: one is
the absence of guarantee the mappings won't change after smaps is read
and the second one is that there's no separate vmflag for marking a vma
as non-meregable (and IIRC from previous attempts on addressing this,
we've already exhausted all the flags on 32-bit arches, so it is not
something that can be trivially addressed).

--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer

2020-01-23 09:16:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: introduce external memory hinting API

On Wed 22-01-20 17:41:31, Minchan Kim wrote:
[...]
> What do you want to see further?

Either a consensus that it is sufficient to have an inherently racy
interface that requires some form of external sychronization or
a robust interface that can cope with races in a sensible way.
And no, having a flag for future extension is definitely not the
way how a new API should be added. Seriously!

--
Michal Hocko
SUSE Labs