Under error conditions, process_madvise() does not return the exact
number of bytes processed in an iovec element, so the user may repeat
the advice on vma ranges contained in that iovec element even though
those ranges have already been processed. This problem is partially
solved by commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
passed to process_madvise") for the ENOMEM return type. These patches
try to solve the problem for the other error return types.
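To make the cost of an inexact byte count concrete, here is a minimal
user-space sketch of how a caller might map the returned byte count back
to a position in its iovec array. The helper find_resume_point() and
struct resume_point are hypothetical names used only for illustration;
they are not part of any kernel or libc API. The mapping is only
meaningful if the returned count is exact.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper type, not part of any kernel or libc API. */
struct resume_point {
	size_t idx;	/* first iovec element that was not fully advised */
	size_t off;	/* bytes already advised inside that element */
};

/*
 * Map the byte count returned by process_madvise() back to a position
 * in the iovec array. Assumes 'advised' is exact, i.e. it also counts
 * the bytes advised in a partially processed element.
 */
struct resume_point find_resume_point(const struct iovec *vec, size_t vlen,
				      size_t advised)
{
	struct resume_point rp = { 0, 0 };

	assert(vec != NULL || vlen == 0);

	/* Consume every element that is fully covered by 'advised'. */
	while (rp.idx < vlen && advised >= vec[rp.idx].iov_len) {
		advised -= vec[rp.idx].iov_len;
		rp.idx++;
	}
	/* Past the end means everything was advised; nothing is left. */
	rp.off = rp.idx < vlen ? advised : 0;
	return rp;
}
```

With an exact count the caller can skip the failing part of one element
and continue; with an under-reported count it would start again from an
element that has already been advised.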
Starting this as a new discussion, as the background for these changes
comes from the patches below, which are already merged into Linus' tree:
1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5bd009c7c9a9e888077c07535dc0c70aeab242c3
2) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08095d6310a7ce43256b4251577bc66a25c6e1a6
and lore archives for the above changes:
1) V2: https://lore.kernel.org/linux-mm/[email protected]/
2) V1: https://lore.kernel.org/linux-mm/[email protected]/
Charan Teja Kalla (1):
Revert "mm: madvise: skip unmapped vma holes passed to
process_madvise"
Charan Teja Reddy (1):
mm: madvise: return exact bytes advised with process_madvise under
error
mm/madvise.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 88 insertions(+), 11 deletions(-)
--
2.7.4
This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
passed to process_madvise"), as process_madvise() fails to return the
exact number of processed bytes in other cases too. As an example: if
process_madvise() hits mlocked pages after processing some initial bytes
passed in [start, end), it just returns EINVAL even though some bytes
were processed. Making an exception only for ENOMEM therefore only
partially fixes the problem of returning the proper advised bytes.
Thus revert this patch and return the proper bytes advised, if there are
any, for all the error types in the following patch.
Signed-off-by: Charan Teja Kalla <[email protected]>
---
mm/madvise.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 39b712f..0d8fd17 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1433,16 +1433,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
while (iov_iter_count(&iter)) {
iovec = iov_iter_iovec(&iter);
- /*
- * do_madvise returns ENOMEM if unmapped holes are present
- * in the passed VMA. process_madvise() is expected to skip
- * unmapped holes passed to it in the 'struct iovec' list
- * and not fail because of them. Thus treat -ENOMEM return
- * from do_madvise as valid and continue processing.
- */
ret = do_madvise(mm, (unsigned long)iovec.iov_base,
iovec.iov_len, behavior);
- if (ret < 0 && ret != -ENOMEM)
+ if (ret < 0)
break;
iov_iter_advance(&iter, iovec.iov_len);
}
--
2.7.4
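The effect of the one-line change in the loop above can be sketched with
a small user-space model. This is an illustration only: the function
advise_vector_model() and its array inputs are made up for the example,
with results[i] standing in for what do_madvise() would return for
element i.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the process_madvise() loop. results[i] stands in for the
 * do_madvise() return value for element i (0 or a negative errno) and
 * lens[i] is that element's length. Both arrays are illustrative inputs.
 * skip_enomem == true models the behaviour being reverted.
 */
long advise_vector_model(const int *results, const size_t *lens, size_t n,
			 bool skip_enomem)
{
	size_t done = 0;
	int ret = 0;
	size_t i;

	assert((results != NULL && lens != NULL) || n == 0);

	for (i = 0; i < n; i++) {
		ret = results[i];
		if (ret < 0 && !(skip_enomem && ret == -ENOMEM))
			break;
		done += lens[i];
	}
	/* Like the syscall: report bytes consumed if any, else the status. */
	return done ? (long)done : ret;
}
```

In this model, an ENOMEM on the second of three elements yields the full
vector length with skip_enomem set, and only the first element's length
once the exception is removed.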
On Wed 23-03-22 20:54:09, Charan Teja Kalla wrote:
> This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
> passed to process_madvise"), as process_madvise() fails to return the
> exact number of processed bytes in other cases too. As an example: if
> process_madvise() hits mlocked pages after processing some initial bytes
> passed in [start, end), it just returns EINVAL even though some bytes
> were processed. Making an exception only for ENOMEM therefore only
> partially fixes the problem of returning the proper advised bytes.
>
> Thus revert this patch and return the proper bytes advised, if there are
> any, for all the error types in the following patch.
I do agree with the revert. I am not sure the above really is a proper
justification though. 08095d6310a7 replaced one (arguably) dubious
semantic with another one without a proper justification and the wider
consensus which I would expect from a patch which changes an existing
semantic. Not to mention it being marked for the stable tree.
But let's not nitpick on that now. Let's send this revert ASAP and use
some more time to discuss the semantic and whether any change is really
required.
> Signed-off-by: Charan Teja Kalla <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/madvise.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 39b712f..0d8fd17 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1433,16 +1433,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>
> while (iov_iter_count(&iter)) {
> iovec = iov_iter_iovec(&iter);
> - /*
> - * do_madvise returns ENOMEM if unmapped holes are present
> - * in the passed VMA. process_madvise() is expected to skip
> - * unmapped holes passed to it in the 'struct iovec' list
> - * and not fail because of them. Thus treat -ENOMEM return
> - * from do_madvise as valid and continue processing.
> - */
> ret = do_madvise(mm, (unsigned long)iovec.iov_base,
> iovec.iov_len, behavior);
> - if (ret < 0 && ret != -ENOMEM)
> + if (ret < 0)
> break;
> iov_iter_advance(&iter, iovec.iov_len);
> }
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
Thanks Michal.
On 3/24/2022 6:18 PM, Michal Hocko wrote:
> On Wed 23-03-22 20:54:09, Charan Teja Kalla wrote:
>> This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
>> passed to process_madvise"), as process_madvise() fails to return the
>> exact number of processed bytes in other cases too. As an example: if
>> process_madvise() hits mlocked pages after processing some initial bytes
>> passed in [start, end), it just returns EINVAL even though some bytes
>> were processed. Making an exception only for ENOMEM therefore only
>> partially fixes the problem of returning the proper advised bytes.
>>
>> Thus revert this patch and return the proper bytes advised, if there are
>> any, for all the error types in the following patch.
>
> I do agree with the revert. I am not sure the above really is a proper
> justification though. 08095d6310a7 replaced one (arguably) dubious
> semantic with another one without a proper justification and the wider
> consensus which I would expect from a patch which changes an existing
> semantic. Not to mention it being marked for the stable tree.
Thanks for pointing this out. Since 08095d6310a7 is marked for the
stable tree, I am doing the same for this change.
Cc: <[email protected]> # 5.10+
>
> But let's not nitpick on that now. Let's send this revert ASAP and use
> some more time to discuss the semantic and whether any change is really
> required.
>
>> Signed-off-by: Charan Teja Kalla <[email protected]>
>
> Acked-by: Michal Hocko <[email protected]>
>
Thanks for the quick ack.
>> ---
>> mm/madvise.c | 9 +--------
>> 1 file changed, 1 insertion(+), 8 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 39b712f..0d8fd17 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1433,16 +1433,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>
>> while (iov_iter_count(&iter)) {
>> iovec = iov_iter_iovec(&iter);
>> - /*
>> - * do_madvise returns ENOMEM if unmapped holes are present
>> - * in the passed VMA. process_madvise() is expected to skip
>> - * unmapped holes passed to it in the 'struct iovec' list
>> - * and not fail because of them. Thus treat -ENOMEM return
>> - * from do_madvise as valid and continue processing.
>> - */
>> ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>> iovec.iov_len, behavior);
>> - if (ret < 0 && ret != -ENOMEM)
>> + if (ret < 0)
>> break;
>> iov_iter_advance(&iter, iovec.iov_len);
>> }
>> --
>> 2.7.4
>
From: Charan Teja Reddy <[email protected]>
Commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
process_madvise") fixes the syscall to return the number of bytes that
were successfully advised before an error was hit while processing iovec
elements. But when the user passes unmapped ranges in an iovec, the
syscall ignores these holes, continues processing and returns ENOMEM at
the end, which is the same as the madvise() semantic. This is a problem
for vector processing, where the user may want to know exactly how many
bytes were processed in an iovec element to make better decisions in
user space. In the ENOMEM case, we processed all the bytes in an iovec
element but still returned an error, which leaves the user unsure
whether the advice failed or succeeded.
As an example, consider the following ranges passed by the user in
struct iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole)
and iovec3(ranges: vma4). The current implementation fully advises
iovec1 and iovec2 but reports only the iovec1 range as the number of
processed bytes. The user may then repeat the processing of iovec2,
which is already processed, and that call returns ENOMEM. The user may
then skip iovec2 and start processing from iovec3. Because of the wrong
processed-byte count, iovec2 is processed twice. This problem was solved
by commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes passed to
process_madvise"), where the syscall now reports iovec1 and iovec2 as
processed and the user may restart from iovec3. Some problems with that
patch are:
1) The user may want to be notified, by an ENOMEM return, that unmapped
address ranges were passed[1].
2) It didn't consider the case where partially advised bytes exist with
other error types too, e.g. EINVAL. Fixing only ENOMEM thus solves the
problem only partially[2].
3) Even if no vma is found in the passed iovec range, the range is still
considered processed instead of returning ENOMEM.
These can be fixed by giving process_madvise() its own semantics[3],
different from madvise(), where it has its own iterator and returns the
exact number of bytes it addressed. Now process_madvise() stops
iterating if it encounters a hole or an invalid vma and returns the
bytes processed so far in that iovec element. In the above example, it
first returns the processed bytes as the ranges of iovec1(vma1) and
iovec2(vma2, vma3), so that the user knows exactly that a hole/invalid
vma exists after vma3 in the passed iovec elements. The user can then
skip the hole/invalid vma in the next retry and start processing from
iovec3.
[1]https://lore.kernel.org/linux-mm/[email protected]/
[2]https://lore.kernel.org/linux-mm/[email protected]/
[3]https://lore.kernel.org/linux-mm/[email protected]/
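The stop-at-first-hole walk described above can be sketched in user
space as follows. This is a simplified model under stated assumptions,
not the kernel code: struct vma_model, its 'advisable' flag and
walk_range_model() are invented for the illustration, and the VMAs are
assumed to be a sorted, non-overlapping array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a VMA: a half-open range plus a flag. */
struct vma_model {
	unsigned long start;
	unsigned long end;
	int advisable;	/* 0 models a vma that rejects the advice (EINVAL) */
};

/*
 * Walk [start, end) over 'vmas' (sorted, non-overlapping) and stop at
 * the first hole or rejecting vma. On error, *partial holds the bytes
 * advised before the failure; on success it is 0, as in the patch.
 */
int walk_range_model(const struct vma_model *vmas, size_t n,
		     unsigned long start, unsigned long end, size_t *partial)
{
	size_t advised = 0;
	size_t i = 0;
	int error = 0;

	assert(partial != NULL);

	/* Find the first vma whose end lies above start. */
	while (i < n && vmas[i].end <= start)
		i++;

	while (start < end) {
		unsigned long stop;

		if (i == n || start < vmas[i].start) {
			error = -ENOMEM;	/* unmapped hole */
			break;
		}
		if (!vmas[i].advisable) {
			error = -EINVAL;	/* vma rejects the advice */
			break;
		}
		stop = vmas[i].end < end ? vmas[i].end : end;
		advised += stop - start;
		start = stop;
		i++;
	}
	*partial = error ? advised : 0;
	return error;
}
```

For the iovec2 case from the example (vma2, vma3, then a hole), the model
returns -ENOMEM with *partial equal to the combined size of vma2 and
vma3, which is what lets the caller skip only the hole.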
Signed-off-by: Charan Teja Reddy <[email protected]>
---
mm/madvise.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 87 insertions(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 0d8fd17..9169b16 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1381,6 +1381,89 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return do_madvise(current->mm, start, len_in, behavior);
}
+/*
+ * TODO: Add documentation for process_madvise()
+ */
+static int do_process_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+ int behavior, size_t *partial_bytes_advised)
+{
+ unsigned long end, tmp;
+ struct vm_area_struct *vma, *prev;
+ int error = -EINVAL;
+ size_t len;
+ size_t tmp_bytes_advised = 0;
+ struct blk_plug plug;
+
+ *partial_bytes_advised = 0;
+ /*
+ * TODO: Move these checks to a common function to be used by both
+ * madvise() and process_madvise().
+ */
+ start = untagged_addr(start);
+ if (!PAGE_ALIGNED(start))
+ return error;
+ len = PAGE_ALIGN(len_in);
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return error;
+
+ end = start + len;
+ if (end < start)
+ return error;
+
+ error = 0;
+ if (end == start)
+ return error;
+
+ mmap_read_lock(mm);
+
+ vma = find_vma_prev(mm, start, &prev);
+ if (vma && start > vma->vm_start)
+ prev = vma;
+
+ blk_start_plug(&plug);
+ for (;;) {
+ /*
+ * If it hits an unmapped address range in [start, end),
+ * stop processing and return ENOMEM.
+ */
+ if (!vma || start < vma->vm_start) {
+ error = -ENOMEM;
+ goto out;
+ }
+
+ tmp = vma->vm_end;
+ if (end < tmp)
+ tmp = end;
+
+ error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
+ if (error)
+ goto out;
+ tmp_bytes_advised += tmp - start;
+ start = tmp;
+ if (prev && start < prev->vm_end)
+ start = prev->vm_end;
+ if (start >= end)
+ goto out;
+ if (prev)
+ vma = prev->vm_next;
+ else
+ vma = find_vma(mm, start);
+ }
+out:
+ /*
+ * partial_bytes_advised may contain non-zero bytes indicating
+ * the number of bytes advised before failure. Holds zero in case
+ * of success.
+ */
+ *partial_bytes_advised = error ? tmp_bytes_advised : 0;
+ blk_finish_plug(&plug);
+ mmap_read_unlock(mm);
+
+ return error;
+}
+
SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
size_t, vlen, int, behavior, unsigned int, flags)
{
@@ -1391,6 +1474,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
struct task_struct *task;
struct mm_struct *mm;
size_t total_len;
+ size_t partial_bytes_advised = 0;
unsigned int f_flags;
if (flags != 0) {
@@ -1433,14 +1517,14 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
while (iov_iter_count(&iter)) {
iovec = iov_iter_iovec(&iter);
- ret = do_madvise(mm, (unsigned long)iovec.iov_base,
- iovec.iov_len, behavior);
+ ret = do_process_madvise(mm, (unsigned long)iovec.iov_base,
+ iovec.iov_len, behavior, &partial_bytes_advised);
if (ret < 0)
break;
iov_iter_advance(&iter, iovec.iov_len);
}
- ret = (total_len - iov_iter_count(&iter)) ? : ret;
+ ret = (total_len - iov_iter_count(&iter) + partial_bytes_advised) ? : ret;
release_mm:
mmput(mm);
--
2.7.4
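As a rough check of the final return-value computation in the last hunk,
the following user-space sketch combines per-element outcomes the way
the syscall does. The names struct elem_result and
syscall_return_model() are hypothetical, and the per-element error and
partial values are supplied by the caller rather than computed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative per-element outcome; not a kernel structure. */
struct elem_result {
	size_t len;	/* length of the iovec element */
	int error;	/* 0 or a negative errno for this element */
	size_t partial;	/* bytes advised before the error, 0 on success */
};

/*
 * Model of the value process_madvise() returns with this patch: the
 * bytes of fully advised elements plus the partial bytes of the failing
 * element, or the error if nothing at all was advised.
 */
long syscall_return_model(const struct elem_result *res, size_t n)
{
	size_t consumed = 0;	/* total_len - iov_iter_count(&iter) */
	size_t partial = 0;	/* starts at 0 so an empty vector is defined */
	int ret = 0;
	size_t i;

	assert(res != NULL || n == 0);

	for (i = 0; i < n; i++) {
		ret = res[i].error;
		partial = res[i].partial;
		if (ret < 0)
			break;
		consumed += res[i].len;
	}
	return (consumed + partial) ? (long)(consumed + partial) : ret;
}
```

With iovec1 fully advised and iovec2 failing after part of its range,
the model reports the iovec1 length plus the partial iovec2 bytes rather
than the error.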