2022-03-09 05:35:12

by Charan Teja Kalla

[permalink] [raw]
Subject: [PATCH] mm: madvise: return correct bytes advised with process_madvise

The process_madvise() system call returns error even after processing
some VMA's passed in the 'struct iovec' vector list which leaves the
user confused to know where to restart the advise next. It is also
against this syscall man page[1] documentation where it mentions that
"return value may be less than the total number of requested bytes, if
an error occurred after some iovec elements were already processed.".

Consider a user passed 10 VMA's in the 'struct iovec' vector list of
which 9 are processed but one. Then it just returns the error caused on
that failed VMA despite the first 9 VMA's processed, leaving the user
confused about on which VMA it is failed. Returning the number of bytes
processed here can help the user to know which VMA it is failed on and
thus can retry/skip the advise on that VMA.

[1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.

Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API"
Signed-off-by: Charan Teja Kalla <[email protected]>
---
mm/madvise.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 38d0f51..d3b49b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,

while (iov_iter_count(&iter)) {
iovec = iov_iter_iovec(&iter);
+ /*
+ * Even when [start, end) passed to do_madvise covers
+ * some unmapped addresses, it continues processing with
+ * returning ENOMEM at the end. Thus consider the range
+ * as processed when do_madvise() returns ENOMEM.
+ * This makes process_madvise() never returns ENOMEM.
+ */
ret = do_madvise(mm, (unsigned long)iovec.iov_base,
iovec.iov_len, behavior);
- if (ret < 0)
+ if (ret < 0 && ret != -ENOMEM)
break;
iov_iter_advance(&iter, iovec.iov_len);
}

- if (ret == 0)
- ret = total_len - iov_iter_count(&iter);
+ ret = (total_len - iov_iter_count(&iter)) ? : ret;

release_mm:
mmput(mm);
--
2.7.4


2022-03-10 09:52:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: madvise: return correct bytes advised with process_madvise

On Wed, Mar 09, 2022 at 10:57:59AM +0530, Charan Teja Kalla wrote:
> The process_madvise() system call returns error even after processing
> some VMA's passed in the 'struct iovec' vector list which leaves the
> user confused to know where to restart the advise next. It is also
> against this syscall man page[1] documentation where it mentions that
> "return value may be less than the total number of requested bytes, if
> an error occurred after some iovec elements were already processed.".
>
> Consider a user passed 10 VMA's in the 'struct iovec' vector list of
> which 9 are processed but one. Then it just returns the error caused on
> that failed VMA despite the first 9 VMA's processed, leaving the user
> confused about on which VMA it is failed. Returning the number of bytes
> processed here can help the user to know which VMA it is failed on and
> thus can retry/skip the advise on that VMA.
>
> [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
>
> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API"
> Signed-off-by: Charan Teja Kalla <[email protected]>
> ---
> mm/madvise.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 38d0f51..d3b49b3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>
> while (iov_iter_count(&iter)) {
> iovec = iov_iter_iovec(&iter);
> + /*
> + * Even when [start, end) passed to do_madvise covers
> + * some unmapped addresses, it continues processing with
> + * returning ENOMEM at the end. Thus consider the range
> + * as processed when do_madvise() returns ENOMEM.
> + * This makes process_madvise() never returns ENOMEM.
> + */

Looks like that this patch has two things. first, returns processed
bytes instead of error in case of error. Second, keep working on
rest vmas on -ENOMEM due to unmapped hole.

First thing totally makes sense to me(that's exactly I wanted to
do but somehow missed) so it should go stable tree. However,
second stuff might be arguble so it would be great if you split
the patch.

> ret = do_madvise(mm, (unsigned long)iovec.iov_base,
> iovec.iov_len, behavior);
> - if (ret < 0)
> + if (ret < 0 && ret != -ENOMEM)
> break;
> iov_iter_advance(&iter, iovec.iov_len);
> }
>
> - if (ret == 0)
> - ret = total_len - iov_iter_count(&iter);
> + ret = (total_len - iov_iter_count(&iter)) ? : ret;
>
> release_mm:
> mmput(mm);
> --
> 2.7.4
>

2022-03-10 11:00:10

by Charan Teja Kalla

[permalink] [raw]
Subject: Re: [PATCH] mm: madvise: return correct bytes advised with process_madvise


Thanks Minchan for your comment!!
On 3/9/2022 10:17 PM, Minchan Kim wrote:
>> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>
>> while (iov_iter_count(&iter)) {
>> iovec = iov_iter_iovec(&iter);
>> + /*
>> + * Even when [start, end) passed to do_madvise covers
>> + * some unmapped addresses, it continues processing with
>> + * returning ENOMEM at the end. Thus consider the range
>> + * as processed when do_madvise() returns ENOMEM.
>> + * This makes process_madvise() never returns ENOMEM.
>> + */
> Looks like that this patch has two things. first, returns processed
> bytes instead of error in case of error. Second, keep working on
> rest vmas on -ENOMEM due to unmapped hole.
>
> First thing totally makes sense to me(that's exactly I wanted to
> do but somehow missed) so it should go stable tree. However,
> second stuff might be arguble so it would be great if you split
> the patch.

Sure, then will split the patch in V2.

>

2022-03-10 14:31:28

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: madvise: return correct bytes advised with process_madvise



> On Mar 9, 2022, at 8:47 AM, Minchan Kim <[email protected]> wrote:
>
> On Wed, Mar 09, 2022 at 10:57:59AM +0530, Charan Teja Kalla wrote:
>> The process_madvise() system call returns error even after processing
>> some VMA's passed in the 'struct iovec' vector list which leaves the
>> user confused to know where to restart the advise next. It is also
>> against this syscall man page[1] documentation where it mentions that
>> "return value may be less than the total number of requested bytes, if
>> an error occurred after some iovec elements were already processed.".
>>
>> Consider a user passed 10 VMA's in the 'struct iovec' vector list of
>> which 9 are processed but one. Then it just returns the error caused on
>> that failed VMA despite the first 9 VMA's processed, leaving the user
>> confused about on which VMA it is failed. Returning the number of bytes
>> processed here can help the user to know which VMA it is failed on and
>> thus can retry/skip the advise on that VMA.
>>
>> [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
>>
>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API"
>> Signed-off-by: Charan Teja Kalla <[email protected]>
>> ---
>> mm/madvise.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 38d0f51..d3b49b3 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>
>> while (iov_iter_count(&iter)) {
>> iovec = iov_iter_iovec(&iter);
>> + /*
>> + * Even when [start, end) passed to do_madvise covers
>> + * some unmapped addresses, it continues processing with
>> + * returning ENOMEM at the end. Thus consider the range
>> + * as processed when do_madvise() returns ENOMEM.
>> + * This makes process_madvise() never returns ENOMEM.
>> + */
>
> Looks like that this patch has two things. first, returns processed
> bytes instead of error in case of error. Second, keep working on
> rest vmas on -ENOMEM due to unmapped hole.
>
> First thing totally makes sense to me(that's exactly I wanted to
> do but somehow missed) so it should go stable tree. However,
> second stuff might be arguble so it would be great if you split
> the patch.

I fully understand and relate to the basic motivation of this
patch.

The ENOMEM that this patch checks for, IIUC, is the ENOMEM that is
returned on unmapped holes. Such ENOMEM does not appear, according to
the man page, to be a valid reason to return ENOMEM to userspace.
Presumably process_madvise() is expected to skip unmapped holes
and not to fail because of them.

Having said that, I do not think that the check that the patch does
is clean or clearly documented.

In addition, this patch (and some work on process_madvise()) raise
in my mind a couple of questions:

1. There are other errors that process_madvise might encounter
and can be propagated back to userspace, but are not
documented. For instance if can_madv_lru_vma() fails on
MADV_COLD, userspace will get EINVAL. EINVAL is not documented
as a valid error-code for such case in either madvise() and
process_madvise() man pages.

2. If an advice fails due to other reason, specifically
split_huge_page(), there might be partial success within a
VMA. I guess for now it is fine to ignore such failures
since there is no guarantee on the OS behavior following
most advices (excluding MADV_DONTNEED).

Regards,
Nadav

2022-03-10 18:04:07

by Charan Teja Kalla

[permalink] [raw]
Subject: Re: [PATCH] mm: madvise: return correct bytes advised with process_madvise

Thanks Amit for the inputs!!

On 3/10/2022 12:20 AM, Nadav Amit wrote:
> ---
> mm/madvise.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 38d0f51..d3b49b3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>
> while (iov_iter_count(&iter)) {
> iovec = iov_iter_iovec(&iter);
> + /*
> + * Even when [start, end) passed to do_madvise covers
> + * some unmapped addresses, it continues processing with
> + * returning ENOMEM at the end. Thus consider the range
> + * as processed when do_madvise() returns ENOMEM.
> + * This makes process_madvise() never returns ENOMEM.
> + */
>
> I fully understand and relate to the basic motivation of this
> patch.
>
> The ENOMEM that this patch checks for, IIUC, is the ENOMEM that is
> returned on unmapped holes. Such ENOMEM does not appear, according to
> the man page, to be a valid reason to return ENOMEM to userspace.
> Presumably process_madvise() is expected to skip unmapped holes
> and not to fail because of them>
True, that ENOMEM represents the VMA passed contains the unmapped holes.
Pasting the Documentation of do_madvise():
* -ENOMEM - addresses in the specified range are not currently
* mapped, or are outside the AS of the process.

Internally process_madvise() calls do_madvise() in a loop by passing the
vma it received in 'struct iovec'. And I too agree here that
process_madvise() is expected to process the unmapped holes.

> Having said that, I do not think that the check that the patch does
> is clean or clearly documented.

If it is about the Documentation, how about adding: "Since
process_madvise() is expected to process unmapped holes, never return
ENOMEM received from do_madvise() to user". If the code changes can be
made further cleaner, please suggest.

>
> In addition, this patch (and some work on process_madvise()) raise
> in my mind a couple of questions:
>
> 1. There are other errors that process_madvise might encounter
> and can be propagated back to userspace, but are not
> documented. For instance if can_madv_lru_vma() fails on
> MADV_COLD, userspace will get EINVAL. EINVAL is not documented
> as a valid error-code for such case in either madvise() and
> process_madvise() man pages.

I agree here with the man page documentations too and felt the same
while going through them. For the mentioned case too, in the madvise[1]
man page, EINVAL return type is only talked for MADV_DONTNEED and
MADV_REMOVE. It should also contains for MADV_PAGEOUT, MADV_COLD and as
well for MADV_FREE. The other missing return types, which I came across,
in process_madvise are:
EINVAL - return from process_madvise_behavior_valid().
EINTR - from mm_access()
EACCES - from mm_access()

Thanks,
Charan