2021-02-02 05:35:35

by Suren Baghdasaryan

Subject: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

Initial version of process_madvise(2) manual page. Initial text was
extracted from [1], amended after fix [2] and more details added using
man pages of madvise(2) and process_vm_readv(2) as examples. It also
includes the changes to required permission proposed in [3].

[1] https://lore.kernel.org/patchwork/patch/1297933/
[2] https://lkml.org/lkml/2020/12/8/1282
[3] https://patchwork.kernel.org/project/selinux/patch/[email protected]/#23888311

Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
changes in v2:
- Changed description of MADV_COLD per Michal Hocko's suggestion
- Applied fixes suggested by Michael Kerrisk
changes in v3:
- Added Michal's Reviewed-by
- Applied additional fixes suggested by Michael Kerrisk

NAME
process_madvise - give advice about use of memory to a process

SYNOPSIS
#include <sys/uio.h>

ssize_t process_madvise(int pidfd,
const struct iovec *iovec,
unsigned long vlen,
int advice,
unsigned int flags);

DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges of another process or the calling
process. It provides the advice to the address ranges described by iovec
and vlen. The goal of such advice is to improve system or application
performance.

The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
specifies the process to which the advice is to be applied.

The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:

struct iovec {
void *iov_base; /* Starting address */
size_t iov_len; /* Number of bytes to transfer */
};

The iovec structure describes address ranges beginning at iov_base address
and with the size of iov_len bytes.

The vlen represents the number of elements in the iovec structure.

The advice argument is one of the values listed below.

Linux-specific advice values
The following Linux-specific advice values have no counterparts in the
POSIX-specified posix_madvise(3), and may or may not have counterparts
in the madvise(2) interface available on other implementations.

MADV_COLD (since Linux 5.4.1)
Deactive a given range of pages which will make them a more probable
reclaim target should there be a memory pressure. This is a
nondestructive operation. The advice might be ignored for some pages
in the range when it is not applicable.

MADV_PAGEOUT (since Linux 5.4.1)
Reclaim a given range of pages. This is done to free up memory occupied
by these pages. If a page is anonymous it will be swapped out. If a
page is file-backed and dirty it will be written back to the backing
storage. The advice might be ignored for some pages in the range when
it is not applicable.

The flags argument is reserved for future use; currently, this argument
must be specified as 0.

The value specified in the vlen argument must be less than or equal to
IOV_MAX (defined in <limits.h> or accessible via the call
sysconf(_SC_IOV_MAX)).

The vlen and iovec arguments are checked before applying any hints. If
the vlen is too big, or iovec is invalid, an error will be returned
immediately and no advice will be applied.

The hint might be applied to a part of iovec if one of its elements points
to an invalid memory region in the remote process. No further elements will
be processed beyond that point.

Permission to provide a hint to another process is governed by a ptrace
access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition,
the caller must have the CAP_SYS_ADMIN capability due to performance
implications of applying the hint.

RETURN VALUE
On success, process_madvise() returns the number of bytes advised. This
return value may be less than the total number of requested bytes, if an
error occurred after some iovec elements were already processed. The caller
should check the return value to determine whether a partial advice
occurred.

On error, -1 is returned and errno is set to indicate the error.

ERRORS
EBADF pidfd is not a valid PID file descriptor.
EFAULT The memory described by iovec is outside the accessible address
space of the process referred to by pidfd.
EINVAL flags is not 0.
EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
EINVAL vlen is too large.
ENOMEM Could not allocate memory for internal copies of the iovec
structures.
EPERM The caller does not have permission to access the address space of
the process pidfd.
ESRCH The target process does not exist (i.e., it has terminated and been
waited on).

VERSIONS
This system call first appeared in Linux 5.10. Support for this system
call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS
configuration option.

SEE ALSO
madvise(2), pidfd_open(2), process_vm_readv(2), process_vm_writev(2)

man2/process_madvise.2 | 223 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 223 insertions(+)
create mode 100644 man2/process_madvise.2

diff --git a/man2/process_madvise.2 b/man2/process_madvise.2
new file mode 100644
index 000000000..24ff7cb3d
--- /dev/null
+++ b/man2/process_madvise.2
@@ -0,0 +1,223 @@
+.\" Copyright (C) 2021 Suren Baghdasaryan <[email protected]>
+.\" and Copyright (C) 2021 Minchan Kim <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\" Commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
+.\"
+.TH PROCESS_MADVISE 2 2021-01-12 "Linux" "Linux Programmer's Manual"
+.SH NAME
+process_madvise \- give advice about use of memory to a process
+.SH SYNOPSIS
+.nf
+.B #include <sys/uio.h>
+.PP
+.BI "ssize_t process_madvise(int " pidfd ,
+.BI " const struct iovec *" iovec ,
+.BI " unsigned long " vlen ,
+.BI " int " advice ,
+.BI " unsigned int " flags ");"
+.fi
+.SH DESCRIPTION
+The
+.BR process_madvise ()
+system call is used to give advice or directions to the kernel about the
+address ranges of another process or the calling process.
+It provides the advice to the address ranges described by
+.I iovec
+and
+.IR vlen .
+The goal of such advice is to improve system or application performance.
+.PP
+The
+.I pidfd
+argument is a PID file descriptor (see
+.BR pidfd_open (2))
+that specifies the process to which the advice is to be applied.
+.PP
+The pointer
+.I iovec
+points to an array of
+.I iovec
+structures, defined in
+.IR <sys/uio.h>
+as:
+.PP
+.in +4n
+.EX
+struct iovec {
+ void *iov_base; /* Starting address */
+ size_t iov_len; /* Number of bytes to transfer */
+};
+.EE
+.in
+.PP
+The
+.I iovec
+structure describes address ranges beginning at
+.I iov_base
+address and with the size of
+.I iov_len
+bytes.
+.PP
+The
+.I vlen
+represents the number of elements in the
+.I iovec
+structure.
+.PP
+The
+.I advice
+argument is one of the values listed below.
+.\"
+.\" ======================================================================
+.\"
+.SS Linux-specific advice values
+The following Linux-specific
+.I advice
+values have no counterparts in the POSIX-specified
+.BR posix_madvise (3),
+and may or may not have counterparts in the
+.BR madvise (2)
+interface available on other implementations.
+.TP
+.BR MADV_COLD " (since Linux 5.4.1)"
+.\" commit 9c276cc65a58faf98be8e56962745ec99ab87636
+Deactive a given range of pages which will make them a more probable
+reclaim target should there be a memory pressure.
+This is a nondestructive operation.
+The advice might be ignored for some pages in the range when it is not
+applicable.
+.TP
+.BR MADV_PAGEOUT " (since Linux 5.4.1)"
+.\" commit 1a4e58cce84ee88129d5d49c064bd2852b481357
+Reclaim a given range of pages.
+This is done to free up memory occupied by these pages.
+If a page is anonymous it will be swapped out.
+If a page is file-backed and dirty it will be written back to the backing
+storage.
+The advice might be ignored for some pages in the range when it is not
+applicable.
+.PP
+The
+.I flags
+argument is reserved for future use; currently, this argument must be
+specified as 0.
+.PP
+The value specified in the
+.I vlen
+argument must be less than or equal to
+.BR IOV_MAX
+(defined in
+.I <limits.h>
+or accessible via the call
+.IR sysconf(_SC_IOV_MAX) ).
+.PP
+The
+.I vlen
+and
+.I iovec
+arguments are checked before applying any hints.
+If the
+.I vlen
+is too big, or
+.I iovec
+is invalid,
+an error will be returned immediately and no advice will be applied.
+.PP
+The hint might be applied to a part of
+.I iovec
+if one of its elements points to an invalid memory region in the
+remote process.
+No further elements will be processed beyond that point.
+.PP
+Permission to provide a hint to another process is governed by a
+ptrace access mode
+.B PTRACE_MODE_READ_REALCREDS
+check (see
+.BR ptrace (2));
+in addition, the caller must have the
+.B CAP_SYS_ADMIN
+capability due to performance implications of applying the hint.
+.SH RETURN VALUE
+On success, process_madvise() returns the number of bytes advised.
+This return value may be less than the total number of requested bytes,
+if an error occurred after some iovec elements were already processed.
+The caller should check the return value to determine whether a partial
+advice occurred.
+.PP
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+.I pidfd
+is not a valid PID file descriptor.
+.TP
+.B EFAULT
+The memory described by
+.I iovec
+is outside the accessible address space of the process referred to by
+.IR pidfd .
+.TP
+.B EINVAL
+.I flags
+is not 0.
+.TP
+.B EINVAL
+The sum of the
+.I iov_len
+values of
+.I iovec
+overflows a
+.I ssize_t
+value.
+.TP
+.B EINVAL
+.I vlen
+is too large.
+.TP
+.B ENOMEM
+Could not allocate memory for internal copies of the
+.I iovec
+structures.
+.TP
+.B EPERM
+The caller does not have permission to access the address space of the process
+.IR pidfd .
+.TP
+.B ESRCH
+The target process does not exist (i.e., it has terminated and been waited on).
+.SH VERSIONS
+This system call first appeared in Linux 5.10.
+.\" commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
+Support for this system call is optional,
+depending on the setting of the
+.B CONFIG_ADVISE_SYSCALLS
+configuration option.
+.SH SEE ALSO
+.BR madvise (2),
+.BR pidfd_open (2),
+.BR process_vm_readv (2),
+.BR process_vm_writev (2)
--
2.30.0.365.g02bc693789-goog
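
For readers who want to try the interface described in the page above, here is a minimal sketch of calling it through syscall(2), which is needed when the C library provides no wrapper. The helper names sys_pidfd_open(), sys_process_madvise() and advise_cold() are hypothetical, and the fallback numbers (434, 440, 20) are assumptions taken from the generic Linux syscall table and UAPI headers; they may differ on some architectures, so the definitions in your own headers should take precedence.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values, see the note above). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Thin wrapper: obtain a PID file descriptor for pid. */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int) syscall(SYS_pidfd_open, pid, flags);
}

/* Thin wrapper with the argument order shown in the SYNOPSIS. */
static ssize_t sys_process_madvise(int pidfd, const struct iovec *iov,
                                   size_t vlen, int advice,
                                   unsigned int flags)
{
    return (ssize_t) syscall(SYS_process_madvise, pidfd, iov, vlen,
                             advice, flags);
}

/*
 * Apply MADV_COLD to one range [addr, addr + len) of process pid.
 * Returns the number of bytes advised, or -1 with errno set.
 */
static ssize_t advise_cold(pid_t pid, void *addr, size_t len)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    ssize_t ret;
    int saved_errno;
    int pidfd = sys_pidfd_open(pid, 0);

    if (pidfd == -1)
        return -1;

    /* flags is reserved and must be 0. */
    ret = sys_process_madvise(pidfd, &iov, 1, MADV_COLD, 0);

    saved_errno = errno;        /* close() may overwrite errno */
    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

A caller would pass the target PID and a range that is mapped in that process; the call is subject to the permission checks described in the page, so EPERM is a normal outcome for an unprivileged caller, and ENOSYS indicates a kernel without the system call.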


Subject: Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

Hello Suren (and Minchan and Michal)

Thank you for the revisions!

I've applied this patch, and done a few light edits.

However, I have a question about undocumented pieces in *madvise(2)*,
as well as one other question. See below.

On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
> Initial version of process_madvise(2) manual page. Initial text was
> extracted from [1], amended after fix [2] and more details added using
> man pages of madvise(2) and process_vm_readv(2) as examples. It also
> includes the changes to required permission proposed in [3].
>
> [1] https://lore.kernel.org/patchwork/patch/1297933/
> [2] https://lkml.org/lkml/2020/12/8/1282
> [3] https://patchwork.kernel.org/project/selinux/patch/[email protected]/#23888311
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Reviewed-by: Michal Hocko <[email protected]>
> ---
> changes in v2:
> - Changed description of MADV_COLD per Michal Hocko's suggestion
> - Applied fixes suggested by Michael Kerrisk
> changes in v3:
> - Added Michal's Reviewed-by
> - Applied additional fixes suggested by Michael Kerrisk
>
> NAME
> process_madvise - give advice about use of memory to a process
>
> SYNOPSIS
> #include <sys/uio.h>
>
> ssize_t process_madvise(int pidfd,
> const struct iovec *iovec,
> unsigned long vlen,
> int advice,
> unsigned int flags);
>
> DESCRIPTION
> The process_madvise() system call is used to give advice or directions
> to the kernel about the address ranges of another process or the calling
> process. It provides the advice to the address ranges described by iovec
> and vlen. The goal of such advice is to improve system or application
> performance.
>
> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> specifies the process to which the advice is to be applied.
>
> The pointer iovec points to an array of iovec structures, defined in
> <sys/uio.h> as:
>
> struct iovec {
> void *iov_base; /* Starting address */
> size_t iov_len; /* Number of bytes to transfer */
> };
>
> The iovec structure describes address ranges beginning at iov_base address
> and with the size of iov_len bytes.
>
> The vlen represents the number of elements in the iovec structure.
>
> The advice argument is one of the values listed below.
>
> Linux-specific advice values
> The following Linux-specific advice values have no counterparts in the
> POSIX-specified posix_madvise(3), and may or may not have counterparts
> in the madvise(2) interface available on other implementations.
>
> MADV_COLD (since Linux 5.4.1)

I just noticed these version numbers now, and thought: they can't be
right (because the system call appeared only in v5.10). So I removed
them. But, of course in another sense the version numbers are (nearly)
right, since these advice values were added for madvise(2) in Linux 5.4.
However, they are not documented in the madvise(2) manual page. Is it
correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
meaning in madvise(2) (but just for the calling process, of course)?

> Deactive a given range of pages which will make them a more probable

I changed: s/Deactive/Deactivate/

> reclaim target should there be a memory pressure. This is a
> nondestructive operation. The advice might be ignored for some pages
> in the range when it is not applicable.
>
> MADV_PAGEOUT (since Linux 5.4.1)
> Reclaim a given range of pages. This is done to free up memory occupied
> by these pages. If a page is anonymous it will be swapped out. If a
> page is file-backed and dirty it will be written back to the backing
> storage. The advice might be ignored for some pages in the range when
> it is not applicable.

[...]

> The hint might be applied to a part of iovec if one of its elements points
> to an invalid memory region in the remote process. No further elements will
> be processed beyond that point.

Is the above scenario the one that leads to the partial advice case described in
RETURN VALUE? If yes, perhaps I should add some words to make that clearer.

You can see the light edits that I made in
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=e3ce016472a1b3ec5dffdeb23c98b9fef618a97b
and following that I restructured DESCRIPTION a little in
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=3aac0708a9acee5283e091461de6a8410bc921a6

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2021-02-03 00:56:14

by Suren Baghdasaryan

Subject: Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

Hi Michael,

On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> Hello Suren (and Minchan and Michal)
>
> Thank you for the revisions!
>
> I've applied this patch, and done a few light edits.

Thanks!

>
> However, I have a question about undocumented pieces in *madvise(2)*,
> as well as one other question. See below.
>
> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
> > Initial version of process_madvise(2) manual page. Initial text was
> > extracted from [1], amended after fix [2] and more details added using
> > man pages of madvise(2) and process_vm_readv(2) as examples. It also
> > includes the changes to required permission proposed in [3].
> >
> > [1] https://lore.kernel.org/patchwork/patch/1297933/
> > [2] https://lkml.org/lkml/2020/12/8/1282
> > [3] https://patchwork.kernel.org/project/selinux/patch/[email protected]/#23888311
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Reviewed-by: Michal Hocko <[email protected]>
> > ---
> > changes in v2:
> > - Changed description of MADV_COLD per Michal Hocko's suggestion
> > - Applied fixes suggested by Michael Kerrisk
> > changes in v3:
> > - Added Michal's Reviewed-by
> > - Applied additional fixes suggested by Michael Kerrisk
> >
> > NAME
> > process_madvise - give advice about use of memory to a process
> >
> > SYNOPSIS
> > #include <sys/uio.h>
> >
> > ssize_t process_madvise(int pidfd,
> > const struct iovec *iovec,
> > unsigned long vlen,
> > int advice,
> > unsigned int flags);
> >
> > DESCRIPTION
> > The process_madvise() system call is used to give advice or directions
> > to the kernel about the address ranges of another process or the calling
> > process. It provides the advice to the address ranges described by iovec
> > and vlen. The goal of such advice is to improve system or application
> > performance.
> >
> > The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> > specifies the process to which the advice is to be applied.
> >
> > The pointer iovec points to an array of iovec structures, defined in
> > <sys/uio.h> as:
> >
> > struct iovec {
> > void *iov_base; /* Starting address */
> > size_t iov_len; /* Number of bytes to transfer */
> > };
> >
> > The iovec structure describes address ranges beginning at iov_base address
> > and with the size of iov_len bytes.
> >
> > The vlen represents the number of elements in the iovec structure.
> >
> > The advice argument is one of the values listed below.
> >
> > Linux-specific advice values
> > The following Linux-specific advice values have no counterparts in the
> > POSIX-specified posix_madvise(3), and may or may not have counterparts
> > in the madvise(2) interface available on other implementations.
> >
> > MADV_COLD (since Linux 5.4.1)
>
> I just noticed these version numbers now, and thought: they can't be
> right (because the system call appeared only in v5.10). So I removed
> them. But, of course in another sense the version numbers are (nearly)
> right, since these advice values were added for madvise(2) in Linux 5.4.
> However, they are not documented in the madvise(2) manual page. Is it
> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
> meaning in madvise(2) (but just for the calling process, of course)?

Correct. They should be added in the madvise(2) man page as well IMHO.

>
> > Deactive a given range of pages which will make them a more probable
>
> I changed: s/Deactive/Deactivate/

thanks!

>
> > reclaim target should there be a memory pressure. This is a
> > nondestructive operation. The advice might be ignored for some pages
> > in the range when it is not applicable.
> >
> > MADV_PAGEOUT (since Linux 5.4.1)
> > Reclaim a given range of pages. This is done to free up memory occupied
> > by these pages. If a page is anonymous it will be swapped out. If a
> > page is file-backed and dirty it will be written back to the backing
> > storage. The advice might be ignored for some pages in the range when
> > it is not applicable.
>
> [...]
>
> > The hint might be applied to a part of iovec if one of its elements points
> > to an invalid memory region in the remote process. No further elements will
> > be processed beyond that point.
>
> Is the above scenario the one that leads to the partial advice case described in
> RETURN VALUE? If yes, perhaps I should add some words to make that clearer.

Correct. This describes the case when partial advice happens.

>
> You can see the light edits that I made in
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=e3ce016472a1b3ec5dffdeb23c98b9fef618a97b
> and following that I restructured DESCRIPTION a little in
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=3aac0708a9acee5283e091461de6a8410bc921a6

The edits LGTM.
Thanks,
Suren.

>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
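
The partial-advice case confirmed above can be detected by comparing the return value with the total length requested. The following is a minimal sketch, assuming the returned byte count covers a leading run of the iovec array (consistent with "No further elements will be processed beyond that point"). The helper names iov_total_len() and iov_resume_index() are hypothetical and not part of any API.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Total number of bytes requested across all iovec elements. */
static size_t iov_total_len(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;

    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/*
 * Map a byte count returned by process_madvise() back to a position
 * in the iovec array.
 *
 * Returns the index of the first element that was not completely
 * covered, or vlen if every element was covered. *offset receives the
 * number of bytes of that element that were covered (0 when advice
 * stopped exactly on an element boundary).
 */
static size_t iov_resume_index(const struct iovec *iov, size_t vlen,
                               size_t advised, size_t *offset)
{
    size_t i;

    for (i = 0; i < vlen; i++) {
        if (advised < iov[i].iov_len)
            break;                  /* advice stopped inside element i */
        advised -= iov[i].iov_len;  /* element i fully covered */
    }
    *offset = (i < vlen) ? advised : 0;
    return i;
}
```

A caller would treat a non-negative return value smaller than iov_total_len() as partial advice, and could use iov_resume_index() to report, skip, or retry the element at which processing stopped.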

Subject: Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

Hello Suren,

On 2/2/21 11:12 PM, Suren Baghdasaryan wrote:
> Hi Michael,
>
> On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>>
>> Hello Suren (and Minchan and Michal)
>>
>> Thank you for the revisions!
>>
>> I've applied this patch, and done a few light edits.
>
> Thanks!
>
>>
>> However, I have a question about undocumented pieces in *madvise(2)*,
>> as well as one other question. See below.
>>
>> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
>>> Initial version of process_madvise(2) manual page. Initial text was
>>> extracted from [1], amended after fix [2] and more details added using
>>> man pages of madvise(2) and process_vm_readv(2) as examples. It also
>>> includes the changes to required permission proposed in [3].
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/1297933/
>>> [2] https://lkml.org/lkml/2020/12/8/1282
>>> [3] https://patchwork.kernel.org/project/selinux/patch/[email protected]/#23888311
>>>
>>> Signed-off-by: Suren Baghdasaryan <[email protected]>
>>> Reviewed-by: Michal Hocko <[email protected]>
>>> ---
>>> changes in v2:
>>> - Changed description of MADV_COLD per Michal Hocko's suggestion
>>> - Applied fixes suggested by Michael Kerrisk
>>> changes in v3:
>>> - Added Michal's Reviewed-by
>>> - Applied additional fixes suggested by Michael Kerrisk
>>>
>>> NAME
>>> process_madvise - give advice about use of memory to a process
>>>
>>> SYNOPSIS
>>> #include <sys/uio.h>
>>>
>>> ssize_t process_madvise(int pidfd,
>>> const struct iovec *iovec,
>>> unsigned long vlen,
>>> int advice,
>>> unsigned int flags);
>>>
>>> DESCRIPTION
>>> The process_madvise() system call is used to give advice or directions
>>> to the kernel about the address ranges of another process or the calling
>>> process. It provides the advice to the address ranges described by iovec
>>> and vlen. The goal of such advice is to improve system or application
>>> performance.
>>>
>>> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
>>> specifies the process to which the advice is to be applied.
>>>
>>> The pointer iovec points to an array of iovec structures, defined in
>>> <sys/uio.h> as:
>>>
>>> struct iovec {
>>> void *iov_base; /* Starting address */
>>> size_t iov_len; /* Number of bytes to transfer */
>>> };
>>>
>>> The iovec structure describes address ranges beginning at iov_base address
>>> and with the size of iov_len bytes.
>>>
>>> The vlen represents the number of elements in the iovec structure.
>>>
>>> The advice argument is one of the values listed below.
>>>
>>> Linux-specific advice values
>>> The following Linux-specific advice values have no counterparts in the
>>> POSIX-specified posix_madvise(3), and may or may not have counterparts
>>> in the madvise(2) interface available on other implementations.
>>>
>>> MADV_COLD (since Linux 5.4.1)
>>
>> I just noticed these version numbers now, and thought: they can't be
>> right (because the system call appeared only in v5.10). So I removed
>> them. But, of course in another sense the version numbers are (nearly)
>> right, since these advice values were added for madvise(2) in Linux 5.4.
>> However, they are not documented in the madvise(2) manual page. Is it
>> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
>> meaning in madvise(2) (but just for the calling process, of course)?
>
> Correct. They should be added in the madvise(2) man page as well IMHO.

So, I decided to move the description of MADV_COLD and MADV_PAGEOUT
to madvise(2) and refer to that page from the process_madvise(2)
page. This avoids repeating the same information in two places.

>>> Deactive a given range of pages which will make them a more probable
>>
>> I changed: s/Deactive/Deactivate/
>
> thanks!
>
>>
>>> reclaim target should there be a memory pressure. This is a
>>> nondestructive operation. The advice might be ignored for some pages
>>> in the range when it is not applicable.
>>>
>>> MADV_PAGEOUT (since Linux 5.4.1)
>>> Reclaim a given range of pages. This is done to free up memory occupied
>>> by these pages. If a page is anonymous it will be swapped out. If a
>>> page is file-backed and dirty it will be written back to the backing
>>> storage. The advice might be ignored for some pages in the range when
>>> it is not applicable.
>>
>> [...]
>>
>>> The hint might be applied to a part of iovec if one of its elements points
>>> to an invalid memory region in the remote process. No further elements will
>>> be processed beyond that point.
>>
>> Is the above scenario the one that leads to the partial advice case described in
>> RETURN VALUE? If yes, perhaps I should add some words to make that clearer.
>
> Correct. This describes the case when partial advice happens.

Thanks. I added a few words to clarify this.


>> You can see the light edits that I made in
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=e3ce016472a1b3ec5dffdeb23c98b9fef618a97b
>> and following that I restructured DESCRIPTION a little in
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=3aac0708a9acee5283e091461de6a8410bc921a6
>
> The edits LGTM.

Thanks for checking them.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
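
Since MADV_COLD and MADV_PAGEOUT carry the same meaning in madvise(2) for the calling process, the advice values can be tried without a pidfd. The sketch below maps an anonymous region, dirties it, applies the advice, and checks that the contents are still readable afterwards, which illustrates the "nondestructive" wording. The helper name advise_own_region() is hypothetical, and the fallback values 20 and 21 are assumptions taken from the Linux UAPI headers for systems whose C library headers predate these constants.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Fallbacks for older headers (assumed values, see the note above). */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * Map len bytes of anonymous memory, fill them with a pattern, and
 * apply advice to the whole mapping.
 *
 * Returns the result of madvise() (0, or -1 with errno set), or -1 if
 * the mapping could not be created. *intact is set to 1 if the first
 * and last bytes still hold the pattern after the call, else to 0.
 */
static int advise_own_region(size_t len, int advice, int *intact)
{
    unsigned char *p;
    int ret, saved_errno;

    *intact = 0;
    if (len == 0) {
        errno = EINVAL;
        return -1;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xA5, len);           /* touch every page */

    ret = madvise(p, len, advice);
    saved_errno = errno;

    /* Reading faults the pages back in if they were paged out. */
    *intact = (p[0] == 0xA5 && p[len - 1] == 0xA5);

    munmap(p, len);
    errno = saved_errno;
    return ret;
}
```

On kernels that predate these advice values, madvise() is expected to fail with EINVAL; the region contents should be unaffected either way.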

2021-02-16 17:51:05

by Suren Baghdasaryan

Subject: Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

Hi Michael,

On Sat, Feb 13, 2021 at 2:04 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> Hello Suren,
>
> On 2/2/21 11:12 PM, Suren Baghdasaryan wrote:
> > Hi Michael,
> >
> > On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
> >>
> >> Hello Suren (and Minchan and Michal)
> >>
> >> Thank you for the revisions!
> >>
> >> I've applied this patch, and done a few light edits.
> >
> > Thanks!
> >
> >>
> >> However, I have a question about undocumented pieces in *madvise(2)*,
> >> as well as one other question. See below.
> >>
> >> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
> >>> Initial version of process_madvise(2) manual page. Initial text was
> >>> extracted from [1], amended after fix [2] and more details added using
> >>> man pages of madvise(2) and process_vm_readv(2) as examples. It also
> >>> includes the changes to required permission proposed in [3].
> >>>
> >>> [1] https://lore.kernel.org/patchwork/patch/1297933/
> >>> [2] https://lkml.org/lkml/2020/12/8/1282
> >>> [3] https://patchwork.kernel.org/project/selinux/patch/[email protected]/#23888311
> >>>
> >>> Signed-off-by: Suren Baghdasaryan <[email protected]>
> >>> Reviewed-by: Michal Hocko <[email protected]>
> >>> ---
> >>> changes in v2:
> >>> - Changed description of MADV_COLD per Michal Hocko's suggestion
> >>> - Applied fixes suggested by Michael Kerrisk
> >>> changes in v3:
> >>> - Added Michal's Reviewed-by
> >>> - Applied additional fixes suggested by Michael Kerrisk
> >>>
> >>> NAME
> >>> process_madvise - give advice about use of memory to a process
> >>>
> >>> SYNOPSIS
> >>> #include <sys/uio.h>
> >>>
> >>> ssize_t process_madvise(int pidfd,
> >>> const struct iovec *iovec,
> >>> unsigned long vlen,
> >>> int advice,
> >>> unsigned int flags);
> >>>
> >>> DESCRIPTION
> >>> The process_madvise() system call is used to give advice or directions
> >>> to the kernel about the address ranges of another process or the calling
> >>> process. It provides the advice to the address ranges described by iovec
> >>> and vlen. The goal of such advice is to improve system or application
> >>> performance.
> >>>
> >>> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> >>> specifies the process to which the advice is to be applied.
> >>>
> >>> The pointer iovec points to an array of iovec structures, defined in
> >>> <sys/uio.h> as:
> >>>
> >>> struct iovec {
> >>> void *iov_base; /* Starting address */
> >>> size_t iov_len; /* Number of bytes to transfer */
> >>> };
> >>>
> >>> The iovec structure describes address ranges beginning at iov_base address
> >>> and with the size of iov_len bytes.
> >>>
> >>> The vlen represents the number of elements in the iovec structure.
> >>>
> >>> The advice argument is one of the values listed below.
> >>>
> >>> Linux-specific advice values
> >>> The following Linux-specific advice values have no counterparts in the
> >>> POSIX-specified posix_madvise(3), and may or may not have counterparts
> >>> in the madvise(2) interface available on other implementations.
> >>>
> >>> MADV_COLD (since Linux 5.4.1)
> >>
> >> I just noticed these version numbers now, and thought: they can't be
> >> right (because the system call appeared only in v5.10). So I removed
> >> them. But, of course in another sense the version numbers are (nearly)
> >> right, since these advice values were added for madvise(2) in Linux 5.4.
> >> However, they are not documented in the madvise(2) manual page. Is it
> >> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
> >> meaning in madvise(2) (but just for the calling process, of course)?
> >
> > Correct. They should be added in the madvise(2) man page as well IMHO.
>
> So, I decided to move the description of MADV_COLD and MADV_PAGEOUT
> to madvise(2) and refer to that page from the process_madvise(2)
> page. This avoids repeating the same information in two places.

Sounds good.

>
> >>> Deactive a given range of pages which will make them a more probable
> >>
> >> I changed: s/Deactive/Deactivate/
> >
> > thanks!
> >
> >>
> >>> reclaim target should there be a memory pressure. This is a
> >>> nondestructive operation. The advice might be ignored for some pages
> >>> in the range when it is not applicable.
> >>>
> >>> MADV_PAGEOUT (since Linux 5.4.1)
> >>> Reclaim a given range of pages. This is done to free up memory occupied
> >>> by these pages. If a page is anonymous it will be swapped out. If a
> >>> page is file-backed and dirty it will be written back to the backing
> >>> storage. The advice might be ignored for some pages in the range when
> >>> it is not applicable.
> >>
> >> [...]
> >>
> >>> The hint might be applied to a part of iovec if one of its elements points
> >>> to an invalid memory region in the remote process. No further elements will
> >>> be processed beyond that point.
> >>
> >> Is the above scenario the one that leads to the partial advice case described in
> >> RETURN VALUE? If yes, perhaps I should add some words to make that clearer.
> >
> > Correct. This describes the case when partial advice happens.
>
> Thanks. I added a few words to clarify this.

Any link where I can see the final version?

>
>
> >> You can see the light edits that I made in
> >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=e3ce016472a1b3ec5dffdeb23c98b9fef618a97b
> >> and following that I restructured DESCRIPTION a little in
> >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=3aac0708a9acee5283e091461de6a8410bc921a6
> >
> > The edits LGTM.
>
> Thanks for checking them.
>
> Cheers,
>
> Michael
>

Thanks,
Suren.

>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
>

Subject: Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

Hello Suren,

>> Thanks. I added a few words to clarify this.
> Any link where I can see the final version?

Sure:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/process_madvise.2

Also rendered below.

Thanks,

Michael

NAME
process_madvise - give advice about use of memory to a process

SYNOPSIS
#include <sys/uio.h>

ssize_t process_madvise(int pidfd, const struct iovec *iovec,
size_t vlen, int advice,
unsigned int flags);

Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
The process_madvise() system call is used to give advice or
directions to the kernel about the address ranges of another process
or of the calling process. It provides the advice for the address
ranges described by iovec and vlen. The goal of such advice is to
improve system or application performance.

The pidfd argument is a PID file descriptor (see pidfd_open(2))
that specifies the process to which the advice is to be applied.

The pointer iovec points to an array of iovec structures, defined
in <sys/uio.h> as:

struct iovec {
void *iov_base; /* Starting address */
size_t iov_len; /* Length of region */
};

Each iovec structure describes an address range beginning at
iov_base with a size of iov_len bytes.

The vlen argument specifies the number of elements in the iovec
array. This value must be less than or equal to IOV_MAX (defined in
<limits.h> or accessible via the call sysconf(_SC_IOV_MAX)).
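
As an illustration of the IOV_MAX limit on vlen, here is a minimal sketch of a pre-check a caller might run before invoking the system call. The helper name vlen_within_iov_max() and the FALLBACK_IOV_MAX constant are hypothetical and not part of the manual page; the fallback of 16 assumes the POSIX minimum (_XOPEN_IOV_MAX) for the case where sysconf() cannot report a limit.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical fallback, assuming the POSIX minimum of 16 (_XOPEN_IOV_MAX). */
#define FALLBACK_IOV_MAX 16

/* Returns 1 if vlen does not exceed the runtime IOV_MAX limit, 0 otherwise. */
int vlen_within_iov_max(size_t vlen)
{
    long limit = sysconf(_SC_IOV_MAX);

    /* sysconf() returns -1 when the limit is indeterminate or unsupported. */
    if (limit < 0)
        limit = FALLBACK_IOV_MAX;

    return vlen <= (size_t)limit;
}
```

The kernel performs the same check itself and fails with EINVAL when vlen is too large, so a pre-check like this only helps a caller that wants to split a long list into batches up front.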

The advice argument is one of the following values:

MADV_COLD
See madvise(2).

MADV_PAGEOUT
See madvise(2).

The flags argument is reserved for future use; currently, this
argument must be specified as 0.

The vlen and iovec arguments are checked before applying any
advice. If vlen is too big, or iovec is invalid, then an error will
be returned immediately and no advice will be applied.

The advice might be applied to only a part of iovec if one of its
elements points to an invalid memory region in the remote process.
No further elements will be processed beyond that point. (See the
discussion regarding partial advice in RETURN VALUE.)

Permission to apply advice to another process is governed by a
ptrace access mode PTRACE_MODE_READ_REALCREDS check (see
ptrace(2)); in addition, because of the performance implications
of applying the advice, the caller must have the CAP_SYS_ADMIN
capability.

RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred after some iovec elements were already
processed. The caller should check the return value to determine
whether a partial advice occurred.

On error, -1 is returned and errno is set to indicate the error.
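
To make the partial-advice check concrete, the following sketch compares the return value against the total number of bytes requested. The helper names iov_requested_bytes() and advice_outcome() are hypothetical, and the sketch assumes the sum of the iov_len values does not overflow (the kernel rejects that case with EINVAL anyway).

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Sum of iov_len over the first vlen elements: the number of bytes requested. */
size_t iov_requested_bytes(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;

    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/*
 * Classify a process_madvise() return value:
 *   -1  the call failed (consult errno)
 *    0  every requested byte was advised
 *    1  partial advice: fewer bytes advised than requested
 */
int advice_outcome(ssize_t ret, const struct iovec *iov, size_t vlen)
{
    if (ret < 0)
        return -1;
    return (size_t)ret < iov_requested_bytes(iov, vlen) ? 1 : 0;
}
```

A caller that sees the partial case could walk the array, subtracting each iov_len from the returned count, to find the first element that was not processed.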

ERRORS
EBADF pidfd is not a valid PID file descriptor.

EFAULT The memory described by iovec is outside the accessible
address space of the process referred to by pidfd.

EINVAL flags is not 0.

EINVAL The sum of the iov_len values of iovec overflows an ssize_t
value.

EINVAL vlen is too large.

ENOMEM Could not allocate memory for internal copies of the iovec
structures.

EPERM The caller does not have permission to access the address
space of the process referred to by pidfd.

ESRCH The target process does not exist (i.e., it has terminated
and been waited on).
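
For callers that want to report these conditions, a small lookup along the following lines may be convenient. This is only a sketch: the function name process_madvise_error_hint() is hypothetical, and the strings paraphrase the list above. Note that EINVAL covers three distinct causes, which errno alone cannot tell apart.

```c
#include <errno.h>

/* Map an errno value from process_madvise() to a short hint (hypothetical helper). */
const char *process_madvise_error_hint(int err)
{
    switch (err) {
    case EBADF:
        return "pidfd is not a valid PID file descriptor";
    case EFAULT:
        return "iovec memory is outside the target's accessible address space";
    case EINVAL:
        return "flags is not 0, iov_len sum overflows ssize_t, or vlen is too large";
    case ENOMEM:
        return "could not allocate internal copies of the iovec structures";
    case EPERM:
        return "no permission to access the target's address space";
    case ESRCH:
        return "target process does not exist";
    default:
        return "error not listed in ERRORS";
    }
}
```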

VERSIONS
This system call first appeared in Linux 5.10. Support for this
system call is optional, depending on the setting of the
CONFIG_ADVISE_SYSCALLS configuration option.

CONFORMING TO
The process_madvise() system call is Linux-specific.

NOTES
Glibc does not provide a wrapper for this system call; call it
using syscall(2).
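
A minimal sketch of such a call through syscall(2) might look like the following. The wrapper name my_process_madvise() is hypothetical. The fallback definitions are assumptions for systems whose headers predate the call: 440 is the syscall number on most architectures, and 20 and 21 are the values of MADV_COLD and MADV_PAGEOUT in the Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed fallbacks for older headers; check your architecture's values. */
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* Hypothetical wrapper; the argument order mirrors the SYNOPSIS. */
ssize_t my_process_madvise(int pidfd, const struct iovec *iovec,
                           size_t vlen, int advice, unsigned int flags)
{
    return syscall(SYS_process_madvise, pidfd, iovec, vlen, advice, flags);
}
```

A typical caller would obtain pidfd from pidfd_open(2), fill an array of struct iovec with ranges in the target process, and pass MADV_COLD or MADV_PAGEOUT with flags set to 0.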

SEE ALSO
madvise(2), pidfd_open(2), process_vm_readv(2),
process_vm_writev(2)


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2021-02-18 19:26:19

by Suren Baghdasaryan

Subject: Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

On Wed, Feb 17, 2021 at 11:55 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> Hello Suren,
>
> >> Thanks. I added a few words to clarify this.
> > Any link where I can see the final version?
>
> Sure:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/process_madvise.2
>
> Also rendered below.

Looks great. Thanks for improving it, Michael!
