2020-03-02 19:39:32

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v7 0/7] introduce memory hinting API for external process

Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that,
application could give hints to kernel what memory range are preferred to be
reclaimed. However, in some platform(e.g., Android), the information
required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without any app
involvement.

To solve the concern, this patch introduces new syscall - process_madvise(2).
Bascially, it's same with madvise(2) syscall but it has some differences.

1. It needs pidfd of target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment.
Other hints in madvise will be opened when there are explicit requests from
community to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for other process's address
space.

For more detail of the new API, please see "mm: introduce external memory hinting API"
description in this patchset.

* from v6 - https://lore.kernel.org/linux-api/[email protected]/
* fix comments and descriptions - Suren
* Add Reviewed-by - Suren
* fix build break reported by 0-day

* from v5 - https://lore.kernel.org/linux-mm/[email protected]/
* use null task and requestor's mm for io_madvise - Jann and Jens
* use right commit description for moving pidfd_get_pid - Christoph

* from v4 - https://lore.kernel.org/linux-mm/[email protected]/
* pass mm down to functions, not accessing task->mm - Jann
* clean up - Alexander
* add Reviewed-by - Alexander, SeongJae
* patch reordering

* from v3 - https://lore.kernel.org/linux-mm/[email protected]/
* verify task->mm aftere access_mm - Oleg
* split some patches for easy review - Alexander
* clean up fatal signal checking - Suren

* from v2 - https://lore.kernel.org/linux-mm/[email protected]/
* check signal callee and caller to bail out - Kirill Tkhai
* put more clarification for justification of new API

* from v1 - https://lore.kernel.org/linux-mm/[email protected]/
* fix syscall number - SeongJae
* use get_pid_task - Kirill Tkhai
* extend API to support pid as well as pidfd - Kirill Tkhai

Minchan Kim (5):
mm: pass task and mm to do_madvise
mm: introduce external memory hinting API
mm: check fatal signal pending of target process
pid: move pidfd_get_pid function to pid.c
mm: support both pid and pidfd for process_madvise

Oleksandr Natalenko (2):
mm/madvise: employ mmget_still_valid for write lock
mm/madvise: allow KSM hints for remote API

arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/io_uring.c | 2 +-
include/linux/mm.h | 3 +-
include/linux/pid.h | 1 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/exit.c | 17 ---
kernel/pid.c | 17 +++
kernel/sys_ni.c | 1 +
mm/madvise.c | 144 ++++++++++++++++----
26 files changed, 167 insertions(+), 44 deletions(-)

--
2.25.0.265.gbab2e86ba0-goog


2020-03-02 19:39:57

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

From: Oleksandr Natalenko <[email protected]>

It all began with the fact that KSM works only on memory that is marked
by madvise(). And the only way to get around that is to either:

* use LD_PRELOAD; or
* patch the kernel with something like UKSM or PKSM.

(i skip ptrace can of worms here intentionally)

To overcome this restriction, lets employ a new remote madvise API. This
can be used by some small userspace helper daemon that will do auto-KSM
job for us.

I think of two major consumers of remote KSM hints:

* hosts, that run containers, especially similar ones and especially in
a trusted environment, sharing the same runtime like Node.js;

* heavy applications, that can be run in multiple instances, not
limited to opensource ones like Firefox, but also those that cannot be
modified since they are binary-only and, maybe, statically linked.

Speaking of statistics, more numbers can be found in the very first
submission, that is related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the amount of tabs.

1 FF instance with 15 tabs:

$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
410

2 FF instances, second one has 12 tabs (all the tabs are different):

$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
592

At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.

[1] https://lore.kernel.org/patchwork/patch/1012142/

Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Oleksandr Natalenko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index e77c6c1fad34..f4fa962ee74d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
switch (behavior) {
case MADV_COLD:
case MADV_PAGEOUT:
+#ifdef CONFIG_KSM
+ case MADV_MERGEABLE:
+ case MADV_UNMERGEABLE:
+#endif
return true;
default:
return false;
--
2.25.0.265.gbab2e86ba0-goog

2020-03-02 19:40:23

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

There is a demand[1] to support pid as well pidfd for process_madvise
to reduce unnecessary syscall to get pidfd if the user has control of
the target process(ie, they could guarantee the process is not gone
or pid is not reused).

This patch aims for supporting both options like waitid(2). So, the
syscall is currently,

int process_madvise(int which, pid_t pid, void *addr,
size_t length, int advise, unsigned long flag);

@which is actually idtype_t for userspace libray and currently,
it supports P_PID and P_PIDFD.

[1] https://lore.kernel.org/linux-mm/[email protected]/

Cc: Christian Brauner <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Kirill Tkhai <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/syscalls.h | 3 ++-
mm/madvise.c | 34 ++++++++++++++++++++++------------
2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e4cd2c2f8bb4..f5ada20e2943 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -876,7 +876,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
-asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+
+asmlinkage long sys_process_madvise(int which, pid_t pid, unsigned long start,
size_t len, int behavior, unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
diff --git a/mm/madvise.c b/mm/madvise.c
index 6543f2bfc3d8..e794367f681e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1182,11 +1182,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return do_madvise(current, current->mm, start, len_in, behavior);
}

-SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
size_t, len_in, int, behavior, unsigned long, flags)
{
int ret;
- struct fd f;
struct pid *pid;
struct task_struct *task;
struct mm_struct *mm;
@@ -1197,20 +1196,31 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
if (!process_madvise_behavior_valid(behavior))
return -EINVAL;

- f = fdget(pidfd);
- if (!f.file)
- return -EBADF;
+ switch (which) {
+ case P_PID:
+ if (upid <= 0)
+ return -EINVAL;
+
+ pid = find_get_pid(upid);
+ if (!pid)
+ return -ESRCH;
+ break;
+ case P_PIDFD:
+ if (upid < 0)
+ return -EINVAL;

- pid = pidfd_pid(f.file);
- if (IS_ERR(pid)) {
- ret = PTR_ERR(pid);
- goto fdput;
+ pid = pidfd_get_pid(upid);
+ if (IS_ERR(pid))
+ return PTR_ERR(pid);
+ break;
+ default:
+ return -EINVAL;
}

task = get_pid_task(pid, PIDTYPE_PID);
if (!task) {
ret = -ESRCH;
- goto fdput;
+ goto put_pid;
}

mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
@@ -1223,7 +1233,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
mmput(mm);
release_task:
put_task_struct(task);
-fdput:
- fdput(f);
+put_pid:
+ put_pid(pid);
return ret;
}
--
2.25.0.265.gbab2e86ba0-goog

2020-03-02 21:16:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 0/7] introduce memory hinting API for external process

On Mon, 2 Mar 2020 11:36:23 -0800 Minchan Kim <[email protected]> wrote:

> Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that,
> application could give hints to kernel what memory range are preferred to be
> reclaimed. However, in some platform(e.g., Android), the information
> required to make the hinting decision is not known to the app.
> Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without any app
> involvement.
>
> To solve the concern, this patch introduces new syscall - process_madvise(2).
> Bascially, it's same with madvise(2) syscall but it has some differences.
>
> 1. It needs pidfd of target process to provide the hint
> 2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment.
> Other hints in madvise will be opened when there are explicit requests from
> community to prevent unexpected bugs we couldn't support.
> 3. Only privileged processes can do something for other process's address
> space.
>
> For more detail of the new API, please see "mm: introduce external memory hinting API"
> description in this patchset.

Thanks, I grabbed these.

I massaged the patch titles significantly - mainly to alert readers to
the fact that we're proposing a new syscall.

Is a manpage for process_madvise(2) being prepared?

2020-03-02 21:44:00

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v7 0/7] introduce memory hinting API for external process

On Mon, Mar 02, 2020 at 01:16:18PM -0800, Andrew Morton wrote:
> On Mon, 2 Mar 2020 11:36:23 -0800 Minchan Kim <[email protected]> wrote:
>
> > Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that,
> > application could give hints to kernel what memory range are preferred to be
> > reclaimed. However, in some platform(e.g., Android), the information
> > required to make the hinting decision is not known to the app.
> > Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without any app
> > involvement.
> >
> > To solve the concern, this patch introduces new syscall - process_madvise(2).
> > Bascially, it's same with madvise(2) syscall but it has some differences.
> >
> > 1. It needs pidfd of target process to provide the hint
> > 2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment.
> > Other hints in madvise will be opened when there are explicit requests from
> > community to prevent unexpected bugs we couldn't support.
> > 3. Only privileged processes can do something for other process's address
> > space.
> >
> > For more detail of the new API, please see "mm: introduce external memory hinting API"
> > description in this patchset.
>
> Thanks, I grabbed these.
>
> I massaged the patch titles significantly - mainly to alert readers to
> the fact that we're proposing a new syscall.
>
> Is a manpage for process_madvise(2) being prepared?

I will prepare it, Thanks!

2020-03-06 11:16:22

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

On 3/2/20 8:36 PM, Minchan Kim wrote:
> There is a demand[1] to support pid as well pidfd for process_madvise
> to reduce unnecessary syscall to get pidfd if the user has control of
> the target process(ie, they could guarantee the process is not gone
> or pid is not reused).
>
> This patch aims for supporting both options like waitid(2). So, the
> syscall is currently,
>
> int process_madvise(int which, pid_t pid, void *addr,
> size_t length, int advise, unsigned long flag);

This is again halfway between kernel and userspace description, so if we stick
to userspace then it's:

int process_madvise(idtype_t idtype, id_t id, void *addr,
size_t length, int advice, unsigned long flags);


> @which is actually idtype_t for userspace libray and currently,
> it supports P_PID and P_PIDFD.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Cc: Christian Brauner <[email protected]>
> Reviewed-by: Suren Baghdasaryan <[email protected]>
> Suggested-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: Vlastimil Babka <[email protected]>

2020-03-06 13:14:42

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On 3/2/20 8:36 PM, Minchan Kim wrote:
> From: Oleksandr Natalenko <[email protected]>
>
> It all began with the fact that KSM works only on memory that is marked
> by madvise(). And the only way to get around that is to either:
>
> * use LD_PRELOAD; or
> * patch the kernel with something like UKSM or PKSM.
>
> (i skip ptrace can of worms here intentionally)
>
> To overcome this restriction, lets employ a new remote madvise API. This
> can be used by some small userspace helper daemon that will do auto-KSM
> job for us.
>
> I think of two major consumers of remote KSM hints:
>
> * hosts, that run containers, especially similar ones and especially in
> a trusted environment, sharing the same runtime like Node.js;
>
> * heavy applications, that can be run in multiple instances, not
> limited to opensource ones like Firefox, but also those that cannot be
> modified since they are binary-only and, maybe, statically linked.
>
> Speaking of statistics, more numbers can be found in the very first
> submission, that is related to this one [1]. For my current setup with
> two Firefox instances I get 100 to 200 MiB saved for the second instance
> depending on the amount of tabs.
>
> 1 FF instance with 15 tabs:
>
> $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> 410
>
> 2 FF instances, second one has 12 tabs (all the tabs are different):
>
> $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> 592
>
> At the very moment I do not have specific numbers for containerised
> workload, but those should be comparable in case the containers share
> similar/same runtime.
>
> [1] https://lore.kernel.org/patchwork/patch/1012142/
>
> Reviewed-by: SeongJae Park <[email protected]>
> Signed-off-by: Oleksandr Natalenko <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

This will lead to one process calling unmerge_ksm_pages() of another. There's a
(signal_pending(current)) test there, should it check also the other task,
analogically to task 3?
Then break_ksm() is fine as it is, as ksmd also calls it, right?

> ---
> mm/madvise.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e77c6c1fad34..f4fa962ee74d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
> switch (behavior) {
> case MADV_COLD:
> case MADV_PAGEOUT:
> +#ifdef CONFIG_KSM
> + case MADV_MERGEABLE:
> + case MADV_UNMERGEABLE:
> +#endif
> return true;
> default:
> return false;
>

2020-03-06 13:43:19

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote:
> On 3/2/20 8:36 PM, Minchan Kim wrote:
> > From: Oleksandr Natalenko <[email protected]>
> >
> > It all began with the fact that KSM works only on memory that is marked
> > by madvise(). And the only way to get around that is to either:
> >
> > * use LD_PRELOAD; or
> > * patch the kernel with something like UKSM or PKSM.
> >
> > (i skip ptrace can of worms here intentionally)
> >
> > To overcome this restriction, lets employ a new remote madvise API. This
> > can be used by some small userspace helper daemon that will do auto-KSM
> > job for us.
> >
> > I think of two major consumers of remote KSM hints:
> >
> > * hosts, that run containers, especially similar ones and especially in
> > a trusted environment, sharing the same runtime like Node.js;
> >
> > * heavy applications, that can be run in multiple instances, not
> > limited to opensource ones like Firefox, but also those that cannot be
> > modified since they are binary-only and, maybe, statically linked.
> >
> > Speaking of statistics, more numbers can be found in the very first
> > submission, that is related to this one [1]. For my current setup with
> > two Firefox instances I get 100 to 200 MiB saved for the second instance
> > depending on the amount of tabs.
> >
> > 1 FF instance with 15 tabs:
> >
> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > 410
> >
> > 2 FF instances, second one has 12 tabs (all the tabs are different):
> >
> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > 592
> >
> > At the very moment I do not have specific numbers for containerised
> > workload, but those should be comparable in case the containers share
> > similar/same runtime.
> >
> > [1] https://lore.kernel.org/patchwork/patch/1012142/
> >
> > Reviewed-by: SeongJae Park <[email protected]>
> > Signed-off-by: Oleksandr Natalenko <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> This will lead to one process calling unmerge_ksm_pages() of another. There's a
> (signal_pending(current)) test there, should it check also the other task,
> analogically to task 3?

Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else?

> Then break_ksm() is fine as it is, as ksmd also calls it, right?

I think break_ksm() cares only about mmap_sem protection, so we should
be fine here.

>
> > ---
> > mm/madvise.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index e77c6c1fad34..f4fa962ee74d 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
> > switch (behavior) {
> > case MADV_COLD:
> > case MADV_PAGEOUT:
> > +#ifdef CONFIG_KSM
> > + case MADV_MERGEABLE:
> > + case MADV_UNMERGEABLE:
> > +#endif
> > return true;
> > default:
> > return false;
> >
>

--
Best regards,
Oleksandr Natalenko (post-factum)
Principal Software Maintenance Engineer

2020-03-06 16:10:15

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On 3/6/20 2:41 PM, Oleksandr Natalenko wrote:
> On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote:
>> On 3/2/20 8:36 PM, Minchan Kim wrote:
>> > From: Oleksandr Natalenko <[email protected]>
>> >
>> > It all began with the fact that KSM works only on memory that is marked
>> > by madvise(). And the only way to get around that is to either:
>> >
>> > * use LD_PRELOAD; or
>> > * patch the kernel with something like UKSM or PKSM.
>> >
>> > (i skip ptrace can of worms here intentionally)
>> >
>> > To overcome this restriction, lets employ a new remote madvise API. This
>> > can be used by some small userspace helper daemon that will do auto-KSM
>> > job for us.
>> >
>> > I think of two major consumers of remote KSM hints:
>> >
>> > * hosts, that run containers, especially similar ones and especially in
>> > a trusted environment, sharing the same runtime like Node.js;

Ah, I forgot to ask, given the discussion of races in patch 2 (Question 2),
where android can stop the tasks to apply the madvise hints in a race-free
manner, how does that work for remote KSM hints in your scenarios, especially
the one above?

>> >
>> > * heavy applications, that can be run in multiple instances, not
>> > limited to opensource ones like Firefox, but also those that cannot be
>> > modified since they are binary-only and, maybe, statically linked.
>> >
>> > Speaking of statistics, more numbers can be found in the very first
>> > submission, that is related to this one [1]. For my current setup with
>> > two Firefox instances I get 100 to 200 MiB saved for the second instance
>> > depending on the amount of tabs.
>> >
>> > 1 FF instance with 15 tabs:
>> >
>> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>> > 410
>> >
>> > 2 FF instances, second one has 12 tabs (all the tabs are different):
>> >
>> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>> > 592
>> >
>> > At the very moment I do not have specific numbers for containerised
>> > workload, but those should be comparable in case the containers share
>> > similar/same runtime.
>> >
>> > [1] https://lore.kernel.org/patchwork/patch/1012142/
>> >
>> > Reviewed-by: SeongJae Park <[email protected]>
>> > Signed-off-by: Oleksandr Natalenko <[email protected]>
>> > Signed-off-by: Minchan Kim <[email protected]>
>>
>> This will lead to one process calling unmerge_ksm_pages() of another. There's a
>> (signal_pending(current)) test there, should it check also the other task,
>> analogically to task 3?
>
> Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else?

Dunno, it's nice to react to signals quickly, for any proces that gets them, no?

>> Then break_ksm() is fine as it is, as ksmd also calls it, right?
>
> I think break_ksm() cares only about mmap_sem protection, so we should
> be fine here.
>
>>
>> > ---
>> > mm/madvise.c | 4 ++++
>> > 1 file changed, 4 insertions(+)
>> >
>> > diff --git a/mm/madvise.c b/mm/madvise.c
>> > index e77c6c1fad34..f4fa962ee74d 100644
>> > --- a/mm/madvise.c
>> > +++ b/mm/madvise.c
>> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
>> > switch (behavior) {
>> > case MADV_COLD:
>> > case MADV_PAGEOUT:
>> > +#ifdef CONFIG_KSM
>> > + case MADV_MERGEABLE:
>> > + case MADV_UNMERGEABLE:
>> > +#endif
>> > return true;
>> > default:
>> > return false;
>> >
>>
>

2020-03-09 13:12:49

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
> On 3/6/20 2:41 PM, Oleksandr Natalenko wrote:
> > On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote:
> >> On 3/2/20 8:36 PM, Minchan Kim wrote:
> >> > From: Oleksandr Natalenko <[email protected]>
> >> >
> >> > It all began with the fact that KSM works only on memory that is marked
> >> > by madvise(). And the only way to get around that is to either:
> >> >
> >> > * use LD_PRELOAD; or
> >> > * patch the kernel with something like UKSM or PKSM.
> >> >
> >> > (i skip ptrace can of worms here intentionally)
> >> >
> >> > To overcome this restriction, lets employ a new remote madvise API. This
> >> > can be used by some small userspace helper daemon that will do auto-KSM
> >> > job for us.
> >> >
> >> > I think of two major consumers of remote KSM hints:
> >> >
> >> > * hosts, that run containers, especially similar ones and especially in
> >> > a trusted environment, sharing the same runtime like Node.js;
>
> Ah, I forgot to ask, given the discussion of races in patch 2 (Question 2),
> where android can stop the tasks to apply the madvise hints in a race-free
> manner, how does that work for remote KSM hints in your scenarios, especially
> the one above?

We have cgroup.freeze for that.

>
> >> >
> >> > * heavy applications, that can be run in multiple instances, not
> >> > limited to opensource ones like Firefox, but also those that cannot be
> >> > modified since they are binary-only and, maybe, statically linked.
> >> >
> >> > Speaking of statistics, more numbers can be found in the very first
> >> > submission, that is related to this one [1]. For my current setup with
> >> > two Firefox instances I get 100 to 200 MiB saved for the second instance
> >> > depending on the amount of tabs.
> >> >
> >> > 1 FF instance with 15 tabs:
> >> >
> >> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >> > 410
> >> >
> >> > 2 FF instances, second one has 12 tabs (all the tabs are different):
> >> >
> >> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >> > 592
> >> >
> >> > At the very moment I do not have specific numbers for containerised
> >> > workload, but those should be comparable in case the containers share
> >> > similar/same runtime.
> >> >
> >> > [1] https://lore.kernel.org/patchwork/patch/1012142/
> >> >
> >> > Reviewed-by: SeongJae Park <[email protected]>
> >> > Signed-off-by: Oleksandr Natalenko <[email protected]>
> >> > Signed-off-by: Minchan Kim <[email protected]>
> >>
> >> This will lead to one process calling unmerge_ksm_pages() of another. There's a
> >> (signal_pending(current)) test there, should it check also the other task,
> >> analogically to task 3?
> >
> > Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else?
>
> Dunno, it's nice to react to signals quickly, for any proces that gets them, no?

So, do you mean something like this?

===
diff --git a/mm/ksm.c b/mm/ksm.c
index 363ec8189561..b39c237cfcf4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
if (ksm_test_exit(vma->vm_mm))
break;
- if (signal_pending(current))
+ if (signal_pending(current) ||
+ signal_pending(rcu_dereference(vma->vm_mm->owner)))
err = -ERESTARTSYS;
else
err = break_ksm(vma, addr);
===

BTW, this won't work with !CONFIG_MEMCG, so probably task_struct should be
passed through instead. IIUC, this would also require amending struct
mm_slot in order to share the same code path with ksmd.

I'm not sure I've seen such a culprit anywhere else, so I'm in doubt
this would be a correct thing to do.

Ideas?

>
> >> Then break_ksm() is fine as it is, as ksmd also calls it, right?
> >
> > I think break_ksm() cares only about mmap_sem protection, so we should
> > be fine here.
> >
> >>
> >> > ---
> >> > mm/madvise.c | 4 ++++
> >> > 1 file changed, 4 insertions(+)
> >> >
> >> > diff --git a/mm/madvise.c b/mm/madvise.c
> >> > index e77c6c1fad34..f4fa962ee74d 100644
> >> > --- a/mm/madvise.c
> >> > +++ b/mm/madvise.c
> >> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
> >> > switch (behavior) {
> >> > case MADV_COLD:
> >> > case MADV_PAGEOUT:
> >> > +#ifdef CONFIG_KSM
> >> > + case MADV_MERGEABLE:
> >> > + case MADV_UNMERGEABLE:
> >> > +#endif
> >> > return true;
> >> > default:
> >> > return false;
> >> >
> >>
> >
>

--
Best regards,
Oleksandr Natalenko (post-factum)
Principal Software Maintenance Engineer

2020-03-09 15:09:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
> On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
[...]
> > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
>
> So, do you mean something like this?
>
> ===
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 363ec8189561..b39c237cfcf4 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
> for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
> if (ksm_test_exit(vma->vm_mm))
> break;
> - if (signal_pending(current))
> + if (signal_pending(current) ||
> + signal_pending(rcu_dereference(vma->vm_mm->owner)))
> err = -ERESTARTSYS;
> else
> err = break_ksm(vma, addr);
> ===

This is broken because mm might be attached to different tasks.
AFAIU this check is meant to allow quick backoff of the _calling_
process so that it doesn't waste time when the context is killed
already. I do not understand why should we care about any other context
here? What is the actual problem this would solve?

--
Michal Hocko
SUSE Labs

2020-03-09 15:21:08

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote:
> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
> [...]
> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
> >
> > So, do you mean something like this?
> >
> > ===
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 363ec8189561..b39c237cfcf4 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
> > for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
> > if (ksm_test_exit(vma->vm_mm))
> > break;
> > - if (signal_pending(current))
> > + if (signal_pending(current) ||
> > + signal_pending(rcu_dereference(vma->vm_mm->owner)))
> > err = -ERESTARTSYS;
> > else
> > err = break_ksm(vma, addr);
> > ===
>
> This is broken because mm might be attached to different tasks.
> AFAIU this check is meant to allow quick backoff of the _calling_
> process so that it doesn't waste time when the context is killed
> already. I do not understand why should we care about any other context
> here? What is the actual problem this would solve?

I agree with you, but still trying to understand what does Vlastimil mean
:).

>
> --
> Michal Hocko
> SUSE Labs
>

--
Best regards,
Oleksandr Natalenko (post-factum)
Principal Software Maintenance Engineer

2020-03-09 15:43:36

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On 3/9/20 4:19 PM, Oleksandr Natalenko wrote:
> On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote:
>> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
>> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
>> [...]
>> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
>> >
>> > So, do you mean something like this?
>> >
>> > ===
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 363ec8189561..b39c237cfcf4 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
>> > for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
>> > if (ksm_test_exit(vma->vm_mm))
>> > break;
>> > - if (signal_pending(current))
>> > + if (signal_pending(current) ||
>> > + signal_pending(rcu_dereference(vma->vm_mm->owner)))
>> > err = -ERESTARTSYS;
>> > else
>> > err = break_ksm(vma, addr);
>> > ===
>>
>> This is broken because mm might be attached to different tasks.
>> AFAIU this check is meant to allow quick backoff of the _calling_
>> process so that it doesn't waste time when the context is killed
>> already. I do not understand why should we care about any other context
>> here? What is the actual problem this would solve?
>
> I agree with you, but still trying to understand what does Vlastimil mean
> :).

Well you wondered if we should stop caring about current, and I said that
probably wouldn't be nice.
As for caring about the other task, patch 3/7 does that for
(MADV_COLD|MADV_PAGEOUT) so I just pointed out that the KSM case doesn't. AFAIU
if we don't check the signals, we might be blocking the killed task from exiting?

>>
>> --
>> Michal Hocko
>> SUSE Labs
>>
>

2020-03-09 16:06:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On Mon 09-03-20 16:42:43, Vlastimil Babka wrote:
> On 3/9/20 4:19 PM, Oleksandr Natalenko wrote:
> > On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote:
> >> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
> >> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
> >> [...]
> >> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
> >> >
> >> > So, do you mean something like this?
> >> >
> >> > ===
> >> > diff --git a/mm/ksm.c b/mm/ksm.c
> >> > index 363ec8189561..b39c237cfcf4 100644
> >> > --- a/mm/ksm.c
> >> > +++ b/mm/ksm.c
> >> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
> >> > for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
> >> > if (ksm_test_exit(vma->vm_mm))
> >> > break;
> >> > - if (signal_pending(current))
> >> > + if (signal_pending(current) ||
> >> > + signal_pending(rcu_dereference(vma->vm_mm->owner)))
> >> > err = -ERESTARTSYS;
> >> > else
> >> > err = break_ksm(vma, addr);
> >> > ===
> >>
> >> This is broken because mm might be attached to different tasks.
> >> AFAIU this check is meant to allow quick backoff of the _calling_
> >> process so that it doesn't waste time when the context is killed
> >> already. I do not understand why should we care about any other context
> >> here? What is the actual problem this would solve?
> >
> > I agree with you, but still trying to understand what does Vlastimil mean
> > :).
>
> Well you wondered if we should stop caring about current, and I said that
> probably wouldn't be nice.
> As for caring about the other task, patch 3/7 does that for
> (MADV_COLD|MADV_PAGEOUT) so I just pointed out that the KSM case doesn't. AFAIU
> if we don't check the signals, we might be blocking the killed task from exiting?

I would have to double check but I do not think this would be a problem
because the remote task should take mmget to prevent address space to
vanish under its feet. That should also rule out the exclusive mmap_sem
usage from the exit path.
--
Michal Hocko
SUSE Labs

2020-03-11 00:43:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

On Fri, Mar 06, 2020 at 12:14:19PM +0100, Vlastimil Babka wrote:
> On 3/2/20 8:36 PM, Minchan Kim wrote:
> > There is a demand[1] to support pid as well pidfd for process_madvise
> > to reduce unnecessary syscall to get pidfd if the user has control of
> > the target process(ie, they could guarantee the process is not gone
> > or pid is not reused).
> >
> > This patch aims for supporting both options like waitid(2). So, the
> > syscall is currently,
> >
> > int process_madvise(int which, pid_t pid, void *addr,
> > size_t length, int advise, unsigned long flag);
>
> This is again halfway between kernel and userspace description, so if we stick
> to userspace then it's:
>
> int process_madvise(idtype_t idtype, id_t id, void *addr,
> size_t length, int advice, unsigned long flags);

Yub.

>
>
> > @which is actually idtype_t for userspace libray and currently,
> > it supports P_PID and P_PIDFD.
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Cc: Christian Brauner <[email protected]>
> > Reviewed-by: Suren Baghdasaryan <[email protected]>
> > Suggested-by: Kirill Tkhai <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>

Thanks!

2020-05-08 18:38:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

On Tue, Mar 10, 2020 at 05:42:51PM -0700, Minchan Kim wrote:
> On Fri, Mar 06, 2020 at 12:14:19PM +0100, Vlastimil Babka wrote:
> > On 3/2/20 8:36 PM, Minchan Kim wrote:
> > > There is a demand[1] to support pid as well pidfd for process_madvise
> > > to reduce unnecessary syscall to get pidfd if the user has control of
> > > the target process(ie, they could guarantee the process is not gone
> > > or pid is not reused).
> > >
> > > This patch aims for supporting both options like waitid(2). So, the
> > > syscall is currently,
> > >
> > > int process_madvise(int which, pid_t pid, void *addr,
> > > size_t length, int advise, unsigned long flag);
> >
> > This is again halfway between kernel and userspace description, so if we stick
> > to userspace then it's:
> >
> > int process_madvise(idtype_t idtype, id_t id, void *addr,
> > size_t length, int advice, unsigned long flags);
>
> Yub.
>

Hi Andrew,

Per Vlastimil's request, I changed "which and advise" with "idtype and
advice" in function prototype of description.
Could you replace the part in the description? Code is never changed.

Thanks.

From f11cfd023746ae67b89f2d84d960706ba6c5c911 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Wed, 6 May 2020 13:54:40 +0000
Subject: [PATCH] mm/madvise: support both pid and pidfd for process_madvise

There is a demand[1] to support pid as well pidfd for process_madvise to
reduce unnecessary syscall to get pidfd if the user has control of the
target process(ie, they could guarantee the process is not gone or pid is
not reused).

This patch aims for supporting both options like waitid(2). So, the
syscall is currently,

int process_madvise(idtype_t idtype, id_t id, void *addr,
size_t length, int advice, unsigned long flags);

@which is actually idtype_t for userspace libray and currently, it
supports P_PID and P_PIDFD.

[1] https://lore.kernel.org/linux-mm/[email protected]/

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Suggested-by: Kirill Tkhai <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Daniel Colascione <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Dias <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Cc: Sandeep Patil <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Sonny Rao <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2020-05-08 23:08:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <[email protected]> wrote:

>
> ...
>
> Per Vlastimil's request, I changed "which and advise" with "idtype and
> advice" in function prototype of description.
> Could you replace the part in the description? Code is never changed.
>

Done, but...

>
> ...
>
> There is a demand[1] to support pid as well pidfd for process_madvise to
> reduce unnecessary syscall to get pidfd if the user has control of the
> target process(ie, they could guarantee the process is not gone or pid is
> not reused).
>
> This patch aims for supporting both options like waitid(2). So, the
> syscall is currently,
>
> int process_madvise(idtype_t idtype, id_t id, void *addr,
> size_t length, int advice, unsigned long flags);
>
> @which is actually idtype_t for userspace libray and currently, it
> supports P_PID and P_PIDFD.

What does "@which is actually idtype_t for userspace libray" mean? Can
you clarify and expand?

Also, does this userspace library exist? If so, where is it?

>
> ...
>

2020-05-09 12:51:29

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

On Fri, May 08, 2020 at 04:04:15PM -0700, Andrew Morton wrote:
> On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <[email protected]> wrote:
>
> >
> > ...
> >
> > Per Vlastimil's request, I changed "which and advise" with "idtype and
> > advice" in function prototype of description.
> > Could you replace the part in the description? Code is never changed.
> >
>
> Done, but...
>
> >
> > ...
> >
> > There is a demand[1] to support pid as well pidfd for process_madvise to
> > reduce unnecessary syscall to get pidfd if the user has control of the
> > target process(ie, they could guarantee the process is not gone or pid is
> > not reused).
> >
> > This patch aims for supporting both options like waitid(2). So, the
> > syscall is currently,
> >
> > int process_madvise(idtype_t idtype, id_t id, void *addr,
> > size_t length, int advice, unsigned long flags);
> >
> > @which is actually idtype_t for userspace libray and currently, it
> > supports P_PID and P_PIDFD.
>
> What does "@which is actually idtype_t for userspace libray" mean? Can
> you clarify and expand?

If I may clarify, the only case where we've supported both pidfd and pid
in the same system call is waitid() to avoid adding a dedicated system
call for waiting and because waitid() already had this (imho insane)
argument type switching. The idtype_t thing comes from waitid() and is
located int sys/wait.h and is defined as

"The type idtype_t is defined as an enumeration type whose possible
values include at least the following:

P_ALL
P_PID
P_PGID
"

int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
If idtype is P_PID, waitid() shall wait for the child with a process ID equal to (pid_t)id.
If idtype is P_PGID, waitid() shall wait for any child with a process group ID equal to (pid_t)id.
If idtype is P_ALL, waitid() shall wait for any children and id is ignored.

I'm personally not a fan of this idtype_t thing and think this should
just have been
> > int pidfd_madvise(int pidfd, void *addr,
> > size_t length, int advice, unsigned long flags);
and call it a day.

Also, if I may ask, why is the flag argument "unsigned long"?
That's pretty unorthodox. The expectation is that flag arguments are
not word-size dependent and should usually use "unsigned int". All new
system calls follow this pattern too.

The current syscall layout will mean that on 64 bit systems you have 64
flag bits and on 32 bit you have 32 flag bits, I think. That has just
recently led to some problems with the clone() syscall (fixed in [1]
which I'm sending Monday) which has the same weird word-size-dependent
flag argument layout. If a system does sign-extension and a userspace
api or glibc uses e.g. an int for the flag argument in the system call
wrapper - which is fairly common - you can get sign extended and then
you end up with garbage in the upper 32 bits of your system call.

>
> Also, does this userspace library exist? If so, where is it?

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/commit/?h=fixes&id=3f2c788a13143620c5471ac96ac4f033fc9ac3f3

Christian

2020-05-09 23:16:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

Hi Christian,

On Sat, May 09, 2020 at 02:48:17PM +0200, Christian Brauner wrote:
> On Fri, May 08, 2020 at 04:04:15PM -0700, Andrew Morton wrote:
> > On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <[email protected]> wrote:
> >
> > >
> > > ...
> > >
> > > Per Vlastimil's request, I changed "which and advise" with "idtype and
> > > advice" in function prototype of description.
> > > Could you replace the part in the description? Code is never changed.
> > >
> >
> > Done, but...
> >
> > >
> > > ...
> > >
> > > There is a demand[1] to support pid as well pidfd for process_madvise to
> > > reduce unnecessary syscall to get pidfd if the user has control of the
> > > target process(ie, they could guarantee the process is not gone or pid is
> > > not reused).
> > >
> > > This patch aims for supporting both options like waitid(2). So, the
> > > syscall is currently,
> > >
> > > int process_madvise(idtype_t idtype, id_t id, void *addr,
> > > size_t length, int advice, unsigned long flags);
> > >
> > > @which is actually idtype_t for userspace libray and currently, it
> > > supports P_PID and P_PIDFD.
> >
> > What does "@which is actually idtype_t for userspace libray" mean? Can
> > you clarify and expand?
>
> If I may clarify, the only case where we've supported both pidfd and pid
> in the same system call is waitid() to avoid adding a dedicated system
> call for waiting and because waitid() already had this (imho insane)
> argument type switching. The idtype_t thing comes from waitid() and is
> located int sys/wait.h and is defined as
>
> "The type idtype_t is defined as an enumeration type whose possible
> values include at least the following:
>
> P_ALL
> P_PID
> P_PGID
> "
>
> int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
> If idtype is P_PID, waitid() shall wait for the child with a process ID equal to (pid_t)id.
> If idtype is P_PGID, waitid() shall wait for any child with a process group ID equal to (pid_t)id.
> If idtype is P_ALL, waitid() shall wait for any children and id is ignored.
>
> I'm personally not a fan of this idtype_t thing and think this should
> just have been
> > > int pidfd_madvise(int pidfd, void *addr,
> > > size_t length, int advice, unsigned long flags);
> and call it a day.

That was the argument at that time, Daniel and I didn't want to have
pid along with pidfd even though Kirill strongly wanted to have it.
However you said " Overall, I don't particularly care how or if you
integrate pidfd here." at that time.

https://lore.kernel.org/linux-mm/20200113104256.5ujbplyec2sk4onn@wittgenstein/

I asked a question to Kirll at that time.

"
> Sounds like that you want to support both options for every upcoming API
> which deals with pid. I'm not sure how it's critical for process_madvise
> API this case. In general, we sacrifice some performance for the nicer one
> and later, once it's reported as hurdle for some workload, we could fix it
> via introducing new flag. What I don't like at this moment is to make
> syscall complicated with potential scenarios without real workload.

Yes, I suggest allowing both options for every new process api
"
https://lore.kernel.org/linux-mm/[email protected]/

You didn't give the opinion at that time, either(I expected you will
make some voice then). What I could do to proceed work was separate it
as different patch like this one to get more attention in future.
And now it works.

Let me clarify my side: I still don't like to introduce pid for new API
since we have pidfd. Since you just brought this issue again, I want to
hear *opinions* from others, again.

>
> Also, if I may ask, why is the flag argument "unsigned long"?
> That's pretty unorthodox. The expectation is that flag arguments are
> not word-size dependent and should usually use "unsigned int". All new
> system calls follow this pattern too.

Nothing special in this flag: Let me change it as "unsigned int".
I will send the change once we have an agreement on "pidfd" argument.

Thanks for the review, Christian!

2020-05-12 19:58:34

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] mm: support both pid and pidfd for process_madvise

On Sat, May 9, 2020 at 4:14 PM Minchan Kim <[email protected]> wrote:
>
> Hi Christian,
>
> On Sat, May 09, 2020 at 02:48:17PM +0200, Christian Brauner wrote:
> > On Fri, May 08, 2020 at 04:04:15PM -0700, Andrew Morton wrote:
> > > On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <[email protected]> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > Per Vlastimil's request, I changed "which and advise" with "idtype and
> > > > advice" in function prototype of description.
> > > > Could you replace the part in the description? Code is never changed.
> > > >
> > >
> > > Done, but...
> > >
> > > >
> > > > ...
> > > >
> > > > There is a demand[1] to support pid as well pidfd for process_madvise to
> > > > reduce unnecessary syscall to get pidfd if the user has control of the
> > > > target process(ie, they could guarantee the process is not gone or pid is
> > > > not reused).
> > > >
> > > > This patch aims for supporting both options like waitid(2). So, the
> > > > syscall is currently,
> > > >
> > > > int process_madvise(idtype_t idtype, id_t id, void *addr,
> > > > size_t length, int advice, unsigned long flags);
> > > >
> > > > @which is actually idtype_t for userspace libray and currently, it
> > > > supports P_PID and P_PIDFD.
> > >
> > > What does "@which is actually idtype_t for userspace libray" mean? Can
> > > you clarify and expand?
> >
> > If I may clarify, the only case where we've supported both pidfd and pid
> > in the same system call is waitid() to avoid adding a dedicated system
> > call for waiting and because waitid() already had this (imho insane)
> > argument type switching. The idtype_t thing comes from waitid() and is
> > located int sys/wait.h and is defined as
> >
> > "The type idtype_t is defined as an enumeration type whose possible
> > values include at least the following:
> >
> > P_ALL
> > P_PID
> > P_PGID
> > "
> >
> > int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
> > If idtype is P_PID, waitid() shall wait for the child with a process ID equal to (pid_t)id.
> > If idtype is P_PGID, waitid() shall wait for any child with a process group ID equal to (pid_t)id.
> > If idtype is P_ALL, waitid() shall wait for any children and id is ignored.
> >
> > I'm personally not a fan of this idtype_t thing and think this should
> > just have been
> > > > int pidfd_madvise(int pidfd, void *addr,
> > > > size_t length, int advice, unsigned long flags);
> > and call it a day.
>
> That was the argument at that time, Daniel and I didn't want to have
> pid along with pidfd even though Kirill strongly wanted to have it.
> However you said " Overall, I don't particularly care how or if you
> integrate pidfd here." at that time.
>
> https://lore.kernel.org/linux-mm/20200113104256.5ujbplyec2sk4onn@wittgenstein/
>
> I asked a question to Kirll at that time.
>
> "
> > Sounds like that you want to support both options for every upcoming API
> > which deals with pid. I'm not sure how it's critical for process_madvise
> > API this case. In general, we sacrifice some performance for the nicer one
> > and later, once it's reported as hurdle for some workload, we could fix it
> > via introducing new flag. What I don't like at this moment is to make
> > syscall complicated with potential scenarios without real workload.
>
> Yes, I suggest allowing both options for every new process api
> "
> https://lore.kernel.org/linux-mm/[email protected]/
>
> You didn't give the opinion at that time, either(I expected you will
> make some voice then). What I could do to proceed work was separate it
> as different patch like this one to get more attention in future.
> And now it works.
>
> Let me clarify my side: I still don't like to introduce pid for new API
> since we have pidfd. Since you just brought this issue again, I want to
> hear *opinions* from others, again.


IIRC Kirill's main complaint was that if we support only pidfds and
userspace has a pid of the process then it would have to convert that
pid into pidfd before calling process_madvise, which involves
additional syscall(s). The overhead would be more tangible if there
are multiple processes needing to be madvised.
I'm not sure how often such a need arises to madvise multiple
processes in a bulk like that and how critical is the overhead of
obtaining pidfd. With pid reuse possibility pid-based API will still
have the issue of possibly sending the request to a wrong process, so
this pidfd obtaining overhead arguably makes the usage more robust and
therefore is not a pure loss.

I don't have a real strong opinion against supporting pid in this
syscall but I think API maintainers should decide going forward
whether new APIs should support pid along with pidfd or switch to
pidfd only.
Thanks!

>
> >
> > Also, if I may ask, why is the flag argument "unsigned long"?
> > That's pretty unorthodox. The expectation is that flag arguments are
> > not word-size dependent and should usually use "unsigned int". All new
> > system calls follow this pattern too.
>
> Nothing special in this flag: Let me change it as "unsigned int".
> I will send the change once we have an agreement on "pidfd" argument.
>
> Thanks for the review, Christian!

2020-06-11 02:26:28

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v7 7/7] mm/madvise: allow KSM hints for remote API

On Mon, Mar 2, 2020 at 8:36 PM Minchan Kim <[email protected]> wrote:
> From: Oleksandr Natalenko <[email protected]>
>
> It all began with the fact that KSM works only on memory that is marked
> by madvise(). And the only way to get around that is to either:
[...]
> To overcome this restriction, lets employ a new remote madvise API. This
> can be used by some small userspace helper daemon that will do auto-KSM
> job for us.
>
> I think of two major consumers of remote KSM hints:
[...]
> * heavy applications, that can be run in multiple instances, not
> limited to opensource ones like Firefox, but also those that cannot be
> modified since they are binary-only and, maybe, statically linked.

Just as a note, since you're mentioning Firefox as a usecase: Memory
deduplication between browser renderers creates new side channels and
is a questionable idea from a security standpoint. Memory
deduplication is (mostly) fine if either all involved processes are
trusted or no involved processes contain secrets, but browsers usually
run tons of untrusted code while at the same time containing lots of
valuable secrets.