This series contains two patches that fix two separate vma merge/split
issues in userfaultfd. The patchset is based on akpm/mm-hotfixes-unstable
with 2f628010799e reverted (patch 1 is meant to replace it, which seems to
be the plan we reached).
The major changes compared to the patches I attached to the reply:
- Fixed up patch 1 for the vma_prev() side effect pointed out by Liam. I
further simplified it to just bring back the two missing lines, so it is
even shorter.
- Added Fixes tags for both patches. I decided to copy stable for both
patches in this version, even though patch 2 is more or less tentative
(as I don't see anything going wrong besides the vma not being merged).
Patch 1 fixes a regression in 6.1 and later, caused by something we
overlooked when converting to the maple tree APIs. The plan is to use patch
1 to replace commit 2f628010799e ("mm: userfaultfd: avoid passing an
invalid range to vma_merge()") in the mm-hotfixes-unstable tree if
possible, so as to bring the uffd vma operations back in line with the rest
of the code.
Patch 2 fixes a long-standing issue where vmas can be left unmerged after a
uffd register or unregister even though they could be merged.
Many thanks to Lorenzo for noticing this issue via the assert movement
patch, for looking into the problem, and for providing a reproducer for
the unmerged vma issue [1].
Please have a look, thanks.
[1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e
Peter Xu (2):
mm/uffd: Fix vma operation where start addr cuts part of vma
mm/uffd: Allow vma to merge as much as possible
fs/userfaultfd.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
--
2.39.1
It seems vma merging in the uffd paths is broken for both register and
unregister: right now we can feed wrong parameters to vma_merge(). This
was found by a recent patch from Lorenzo Stoakes which moved the asserts
upwards in vma_merge():
https://lore.kernel.org/all/[email protected]/
The problem is that the current code base doesn't fix up "prev" for the
case where the "start" address falls within the "prev" vma section. In that
case "prev" should point to the current vma rather than the previous one
when it is fed to vma_merge().
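To illustrate the fixup, here is a minimal userspace sketch, not kernel
code. It assumes a sorted array of toy vmas; struct model_vma,
model_find_vma() and model_pick_prev() are hypothetical names made up for
this example, and the lookup only approximates what the iterator does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a vma: the half-open range [vm_start, vm_end). */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Hypothetical helper: index of the first vma that ends after addr,
 * or n if there is none. The array is assumed sorted, non-overlapping.
 */
size_t model_find_vma(const struct model_vma *vmas, size_t n,
		      unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].vm_end)
			return i;
	return n;
}

/*
 * Pick the "prev" to hand to a vma_merge()-like call for a range that
 * begins at start. Returns NULL when there is no predecessor.
 */
const struct model_vma *model_pick_prev(const struct model_vma *vmas,
					size_t n, unsigned long start)
{
	size_t i = model_find_vma(vmas, n, start);
	const struct model_vma *prev;

	/* The model assumes start lies before the end of some vma. */
	assert(i < n);

	/* Default: the vma that sits before the one found. */
	prev = (i > 0) ? &vmas[i - 1] : NULL;

	/* The fixup: start cuts into the found vma, so it becomes prev. */
	if (vmas[i].vm_start < start)
		prev = &vmas[i];

	return prev;
}
```

With vmas [0,10), [10,20), [20,30), a start of 10 keeps the first vma as
prev, while a start of 15 selects the second vma itself, because the range
begins in the middle of it.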
This patch will eliminate the report and make sure vma_merge() calls will
become legal again.
Note that the "Fixes: 29417d292bd0" tag below is there only to show where
the warning starts to trigger; the commit that actually needs fixing is
69dbe6daf104. Commit 29417d292bd0 merely helps us identify the issue, but
we may want to keep it in Fixes too, to make tracking easier for kernel
backporters.
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Reported-by: Mark Rutland <[email protected]>
Fixes: 29417d292bd0 ("mm/mmap/vma_merge: always check invariants")
Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
Closes: https://lore.kernel.org/all/[email protected]/
Cc: linux-stable <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0fd96d6e39ce..17c8c345dac4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1459,6 +1459,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vma_iter_set(&vmi, start);
prev = vma_prev(&vmi);
+ if (vma->vm_start < start)
+ prev = vma;
ret = 0;
for_each_vma_range(vmi, vma, end) {
@@ -1625,6 +1627,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
vma_iter_set(&vmi, start);
prev = vma_prev(&vmi);
+ if (vma->vm_start < start)
+ prev = vma;
+
ret = 0;
for_each_vma_range(vmi, vma, end) {
cond_resched();
--
2.39.1
We used to pass the wrong pgoff to vma_merge() when registering or
unregistering uffd regions. This caused incorrect vma merging behavior and
could leave mergeable vmas separate after the ioctls return.
For example, when we have:
vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
Then someone unregisters uffd on range (5-9), it should logically become:
vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
But with current code we'll have:
vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
This patch allows such merge to happen correctly before ioctl returns.
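The example above can be checked with a small userspace model, which is
only a sketch and not kernel code. It assumes 4 KiB pages
(MODEL_PAGE_SHIFT) and reduces the "can merge with next" decision to the
address and page-offset contiguity checks; struct model_vma,
model_pgoff_at() and model_can_merge_next() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed page size for the model: 4 KiB. */
#define MODEL_PAGE_SHIFT 12

/* Toy stand-in for a vma; addresses in bytes, vm_pgoff in pages. */
struct model_vma {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
	unsigned long vm_pgoff;	/* page offset that maps to vm_start */
};

/* Page offset of "start" inside "vma", computed the way the patch does. */
unsigned long model_pgoff_at(const struct model_vma *vma,
			     unsigned long start)
{
	assert(start >= vma->vm_start && start < vma->vm_end);
	return vma->vm_pgoff + ((start - vma->vm_start) >> MODEL_PAGE_SHIFT);
}

/*
 * Simplified contiguity check: [start, end) with page offset pgoff can
 * be merged into "next" only if it ends where next begins and its page
 * offsets run straight into next->vm_pgoff. Flags, files, policies and
 * so on are ignored in this model.
 */
bool model_can_merge_next(unsigned long start, unsigned long end,
			  unsigned long pgoff,
			  const struct model_vma *next)
{
	unsigned long pages = (end - start) >> MODEL_PAGE_SHIFT;

	return end == next->vm_start && pgoff + pages == next->vm_pgoff;
}
```

For vma1 covering pages 0-9 with vm_pgoff 0 and vma2 covering pages 10-19
with vm_pgoff 10, unregistering pages 5-9 with the old vma1->vm_pgoff gives
0 + 5 = 5, which does not match 10, so no merge happens. With the adjusted
pgoff of 5 the sum is 10 and the range can merge into vma2.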
This behavior seems to have existed since the first day of uffd. The pgoff
passed to vma_merge() is only used to decide whether vmas can be merged,
and what we did was always pass in a pgoff smaller than it should be, so
there should be no other side effect besides the vmas not being merged.
Let's still tentatively copy stable for this, even though I don't see
anything going wrong besides the vma being left split (which is mostly not
user visible).
Cc: Andrea Arcangeli <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: linux-stable <[email protected]>
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 17c8c345dac4..4e800bb7d2ab 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
bool basic_ioctls;
unsigned long start, end, vma_end;
struct vma_iterator vmi;
+ pgoff_t pgoff;
user_uffdio_register = (struct uffdio_register __user *) arg;
@@ -1484,8 +1485,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vma_end = min(end, vma->vm_end);
new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
+ pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
- vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+ vma->anon_vma, vma->vm_file, pgoff,
vma_policy(vma),
((struct vm_userfaultfd_ctx){ ctx }),
anon_vma_name(vma));
@@ -1565,6 +1567,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
unsigned long start, end, vma_end;
const void __user *buf = (void __user *)arg;
struct vma_iterator vmi;
+ pgoff_t pgoff;
ret = -EFAULT;
if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
@@ -1667,8 +1670,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
uffd_wp_range(vma, start, vma_end - start, false);
new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
+ pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
- vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+ vma->anon_vma, vma->vm_file, pgoff,
vma_policy(vma),
NULL_VM_UFFD_CTX, anon_vma_name(vma));
if (prev) {
--
2.39.1
On Wed, May 17, 2023 at 11:04:08AM -0400, Peter Xu wrote:
> We used to not pass in the pgoff correctly when register/unregister uffd
> regions, it caused incorrect behavior on vma merging and can cause
> mergeable vmas being separate after ioctls return.
>
> For example, when we have:
>
> vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
>
> Then someone unregisters uffd on range (5-9), it should logically become:
>
> vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
>
> But with current code we'll have:
>
> vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
>
> This patch allows such merge to happen correctly before ioctl returns.
>
> This behavior seems to have existed since the 1st day of uffd. Since pgoff
> for vma_merge() is only used to identify the possibility of vma merging,
> meanwhile here what we did was always passing in a pgoff smaller than what
> we should, so there should have no other side effect besides not merging
> it. Let's still tentatively copy stable for this, even though I don't see
> anything will go wrong besides vma being split (which is mostly not user
> visible).
>
Maybe a Reported-by me since I discovered the fragmentation was already
happening via the repro? :)
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Mike Rapoport (IBM) <[email protected]>
> Cc: linux-stable <[email protected]>
> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 17c8c345dac4..4e800bb7d2ab 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> bool basic_ioctls;
> unsigned long start, end, vma_end;
> struct vma_iterator vmi;
> + pgoff_t pgoff;
>
> user_uffdio_register = (struct uffdio_register __user *) arg;
>
> @@ -1484,8 +1485,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> vma_end = min(end, vma->vm_end);
>
> new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> - vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> + vma->anon_vma, vma->vm_file, pgoff,
> vma_policy(vma),
> ((struct vm_userfaultfd_ctx){ ctx }),
> anon_vma_name(vma));
> @@ -1565,6 +1567,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> unsigned long start, end, vma_end;
> const void __user *buf = (void __user *)arg;
> struct vma_iterator vmi;
> + pgoff_t pgoff;
>
> ret = -EFAULT;
> if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> @@ -1667,8 +1670,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> uffd_wp_range(vma, start, vma_end - start, false);
>
> new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> - vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> + vma->anon_vma, vma->vm_file, pgoff,
> vma_policy(vma),
> NULL_VM_UFFD_CTX, anon_vma_name(vma));
> if (prev) {
> --
> 2.39.1
>
Acked-by: Lorenzo Stoakes <[email protected]>
On Wed, May 17, 2023 at 11:04:07AM -0400, Peter Xu wrote:
> It seems vma merging with uffd paths is broken with either
> register/unregister, where right now we can feed wrong parameters to
> vma_merge() and it's found by recent patch which moved asserts upwards in
> vma_merge() by Lorenzo Stoakes:
>
> https://lore.kernel.org/all/[email protected]/
>
> The problem is in the current code base we didn't fixup "prev" for the case
> where "start" address can be within the "prev" vma section. In that case
> we should have "prev" points to the current vma rather than the previous
> one when feeding to vma_merge().
This doesn't seem quite correct, perhaps - "where start is contained within vma
but not clamped to its start. We need to convert this into case 4 which permits
subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA
will be clamped to the start."
>
> This patch will eliminate the report and make sure vma_merge() calls will
> become legal again.
>
> One thing to mention is that the "Fixes: 29417d292bd0" below is there only
> to help explain where the warning can start to trigger, the real commit to
> fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the
> issue, but unfortunately we may want to keep it in Fixes too just to ease
> kernel backporters for easier tracking.
>
> Cc: Lorenzo Stoakes <[email protected]>
> Cc: Mike Rapoport (IBM) <[email protected]>
> Cc: Liam R. Howlett <[email protected]>
> Reported-by: Mark Rutland <[email protected]>
> Fixes: 29417d292bd0 ("mm/mmap/vma_merge: always check invariants")
> Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
> Closes: https://lore.kernel.org/all/[email protected]/
> Cc: linux-stable <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0fd96d6e39ce..17c8c345dac4 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1459,6 +1459,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>
> vma_iter_set(&vmi, start);
> prev = vma_prev(&vmi);
> + if (vma->vm_start < start)
> + prev = vma;
>
> ret = 0;
> for_each_vma_range(vmi, vma, end) {
> @@ -1625,6 +1627,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>
> vma_iter_set(&vmi, start);
> prev = vma_prev(&vmi);
> + if (vma->vm_start < start)
> + prev = vma;
> +
> ret = 0;
> for_each_vma_range(vmi, vma, end) {
> cond_resched();
> --
> 2.39.1
>
Other than that looks good:-
Reviewed-by: Lorenzo Stoakes <[email protected]>
* Peter Xu <[email protected]> [230517 11:04]:
> It seems vma merging with uffd paths is broken with either
> register/unregister, where right now we can feed wrong parameters to
> vma_merge() and it's found by recent patch which moved asserts upwards in
> vma_merge() by Lorenzo Stoakes:
>
> https://lore.kernel.org/all/[email protected]/
>
> The problem is in the current code base we didn't fixup "prev" for the case
> where "start" address can be within the "prev" vma section. In that case
> we should have "prev" points to the current vma rather than the previous
> one when feeding to vma_merge().
>
> This patch will eliminate the report and make sure vma_merge() calls will
> become legal again.
>
> One thing to mention is that the "Fixes: 29417d292bd0" below is there only
> to help explain where the warning can start to trigger, the real commit to
> fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the
> issue, but unfortunately we may want to keep it in Fixes too just to ease
> kernel backporters for easier tracking.
>
> Cc: Lorenzo Stoakes <[email protected]>
> Cc: Mike Rapoport (IBM) <[email protected]>
> Cc: Liam R. Howlett <[email protected]>
> Reported-by: Mark Rutland <[email protected]>
> Fixes: 29417d292bd0 ("mm/mmap/vma_merge: always check invariants")
> Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
> Closes: https://lore.kernel.org/all/[email protected]/
> Cc: linux-stable <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
> ---
> fs/userfaultfd.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0fd96d6e39ce..17c8c345dac4 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1459,6 +1459,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>
> vma_iter_set(&vmi, start);
> prev = vma_prev(&vmi);
> + if (vma->vm_start < start)
> + prev = vma;
>
> ret = 0;
> for_each_vma_range(vmi, vma, end) {
> @@ -1625,6 +1627,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>
> vma_iter_set(&vmi, start);
> prev = vma_prev(&vmi);
> + if (vma->vm_start < start)
> + prev = vma;
> +
> ret = 0;
> for_each_vma_range(vmi, vma, end) {
> cond_resched();
> --
> 2.39.1
>
* Peter Xu <[email protected]> [230517 11:04]:
> We used to not pass in the pgoff correctly when register/unregister uffd
> regions, it caused incorrect behavior on vma merging and can cause
> mergeable vmas being separate after ioctls return.
>
> For example, when we have:
>
> vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
>
> Then someone unregisters uffd on range (5-9), it should logically become:
>
> vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
>
> But with current code we'll have:
>
> vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
>
> This patch allows such merge to happen correctly before ioctl returns.
>
> This behavior seems to have existed since the 1st day of uffd. Since pgoff
> for vma_merge() is only used to identify the possibility of vma merging,
> meanwhile here what we did was always passing in a pgoff smaller than what
> we should, so there should have no other side effect besides not merging
> it. Let's still tentatively copy stable for this, even though I don't see
> anything will go wrong besides vma being split (which is mostly not user
> visible).
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Mike Rapoport (IBM) <[email protected]>
> Cc: linux-stable <[email protected]>
> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
> ---
> fs/userfaultfd.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 17c8c345dac4..4e800bb7d2ab 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> bool basic_ioctls;
> unsigned long start, end, vma_end;
> struct vma_iterator vmi;
> + pgoff_t pgoff;
>
> user_uffdio_register = (struct uffdio_register __user *) arg;
>
> @@ -1484,8 +1485,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> vma_end = min(end, vma->vm_end);
>
> new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> - vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> + vma->anon_vma, vma->vm_file, pgoff,
> vma_policy(vma),
> ((struct vm_userfaultfd_ctx){ ctx }),
> anon_vma_name(vma));
> @@ -1565,6 +1567,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> unsigned long start, end, vma_end;
> const void __user *buf = (void __user *)arg;
> struct vma_iterator vmi;
> + pgoff_t pgoff;
>
> ret = -EFAULT;
> if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> @@ -1667,8 +1670,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> uffd_wp_range(vma, start, vma_end - start, false);
>
> new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> - vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> + vma->anon_vma, vma->vm_file, pgoff,
> vma_policy(vma),
> NULL_VM_UFFD_CTX, anon_vma_name(vma));
> if (prev) {
> --
> 2.39.1
>
On Wed, May 17, 2023 at 06:20:55PM +0100, Lorenzo Stoakes wrote:
> On Wed, May 17, 2023 at 11:04:07AM -0400, Peter Xu wrote:
> > It seems vma merging with uffd paths is broken with either
> > register/unregister, where right now we can feed wrong parameters to
> > vma_merge() and it's found by recent patch which moved asserts upwards in
> > vma_merge() by Lorenzo Stoakes:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > The problem is in the current code base we didn't fixup "prev" for the case
> > where "start" address can be within the "prev" vma section. In that case
> > we should have "prev" points to the current vma rather than the previous
> > one when feeding to vma_merge().
>
> This doesn't seem quite correct, perhaps - "where start is contained within vma
> but not clamped to its start. We need to convert this into case 4 which permits
> subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA
> will be clamped to the start."
I think it covers more than case 4 - it can also be case 0 where no merge
will happen?
>
> >
> > This patch will eliminate the report and make sure vma_merge() calls will
> > become legal again.
> >
> > One thing to mention is that the "Fixes: 29417d292bd0" below is there only
> > to help explain where the warning can start to trigger, the real commit to
> > fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the
> > issue, but unfortunately we may want to keep it in Fixes too just to ease
> > kernel backporters for easier tracking.
> >
> > Cc: Lorenzo Stoakes <[email protected]>
> > Cc: Mike Rapoport (IBM) <[email protected]>
> > Cc: Liam R. Howlett <[email protected]>
> > Reported-by: Mark Rutland <[email protected]>
> > Fixes: 29417d292bd0 ("mm/mmap/vma_merge: always check invariants")
> > Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
> > Closes: https://lore.kernel.org/all/[email protected]/
> > Cc: linux-stable <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/userfaultfd.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 0fd96d6e39ce..17c8c345dac4 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -1459,6 +1459,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >
> > vma_iter_set(&vmi, start);
> > prev = vma_prev(&vmi);
> > + if (vma->vm_start < start)
> > + prev = vma;
> >
> > ret = 0;
> > for_each_vma_range(vmi, vma, end) {
> > @@ -1625,6 +1627,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >
> > vma_iter_set(&vmi, start);
> > prev = vma_prev(&vmi);
> > + if (vma->vm_start < start)
> > + prev = vma;
> > +
> > ret = 0;
> > for_each_vma_range(vmi, vma, end) {
> > cond_resched();
> > --
> > 2.39.1
> >
>
> Other than that looks good:-
>
> Reviewed-by: Lorenzo Stoakes <[email protected]>
Thanks to both on the quick reviews!
--
Peter Xu
On Wed, May 17, 2023 at 02:37:41PM -0400, Peter Xu wrote:
> On Wed, May 17, 2023 at 06:20:55PM +0100, Lorenzo Stoakes wrote:
> > On Wed, May 17, 2023 at 11:04:07AM -0400, Peter Xu wrote:
> > > It seems vma merging with uffd paths is broken with either
> > > register/unregister, where right now we can feed wrong parameters to
> > > vma_merge() and it's found by recent patch which moved asserts upwards in
> > > vma_merge() by Lorenzo Stoakes:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > The problem is in the current code base we didn't fixup "prev" for the case
> > > where "start" address can be within the "prev" vma section. In that case
> > > we should have "prev" points to the current vma rather than the previous
> > > one when feeding to vma_merge().
> >
> > This doesn't seem quite correct, perhaps - "where start is contained within vma
> > but not clamped to its start. We need to convert this into case 4 which permits
> > subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA
> > will be clamped to the start."
>
> I think it covers more than case 4 - it can also be case 0 where no merge
> will happen?
Ugh please let's not call a case that doesn't merge by a number :P but sure of
course it might also not merge.
>
> >
> > >
> > > This patch will eliminate the report and make sure vma_merge() calls will
> > > become legal again.
> > >
> > > One thing to mention is that the "Fixes: 29417d292bd0" below is there only
> > > to help explain where the warning can start to trigger, the real commit to
> > > fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the
> > > issue, but unfortunately we may want to keep it in Fixes too just to ease
> > > kernel backporters for easier tracking.
> > >
> > > Cc: Lorenzo Stoakes <[email protected]>
> > > Cc: Mike Rapoport (IBM) <[email protected]>
> > > Cc: Liam R. Howlett <[email protected]>
> > > Reported-by: Mark Rutland <[email protected]>
> > > Fixes: 29417d292bd0 ("mm/mmap/vma_merge: always check invariants")
> > > Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
> > > Closes: https://lore.kernel.org/all/[email protected]/
> > > Cc: linux-stable <[email protected]>
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > fs/userfaultfd.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 0fd96d6e39ce..17c8c345dac4 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -1459,6 +1459,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > >
> > > vma_iter_set(&vmi, start);
> > > prev = vma_prev(&vmi);
> > > + if (vma->vm_start < start)
> > > + prev = vma;
> > >
> > > ret = 0;
> > > for_each_vma_range(vmi, vma, end) {
> > > @@ -1625,6 +1627,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > >
> > > vma_iter_set(&vmi, start);
> > > prev = vma_prev(&vmi);
> > > + if (vma->vm_start < start)
> > > + prev = vma;
> > > +
> > > ret = 0;
> > > for_each_vma_range(vmi, vma, end) {
> > > cond_resched();
> > > --
> > > 2.39.1
> > >
> >
> > Other than that looks good:-
> >
> > Reviewed-by: Lorenzo Stoakes <[email protected]>
>
> Thanks to both on the quick reviews!
No problem!
>
> --
> Peter Xu
>
On Wed, May 17, 2023 at 06:23:18PM +0100, Lorenzo Stoakes wrote:
> Maybe a Reported-by me since I discovered the fragmentation was already
> happening via the repro? :)
Sure! I'll add it when/if there's a repost. Thanks.
--
Peter Xu
On Wed, May 17, 2023 at 07:40:59PM +0100, Lorenzo Stoakes wrote:
> On Wed, May 17, 2023 at 02:37:41PM -0400, Peter Xu wrote:
> > On Wed, May 17, 2023 at 06:20:55PM +0100, Lorenzo Stoakes wrote:
> > > On Wed, May 17, 2023 at 11:04:07AM -0400, Peter Xu wrote:
> > > > It seems vma merging with uffd paths is broken with either
> > > > register/unregister, where right now we can feed wrong parameters to
> > > > vma_merge() and it's found by recent patch which moved asserts upwards in
> > > > vma_merge() by Lorenzo Stoakes:
> > > >
> > > > https://lore.kernel.org/all/[email protected]/
> > > >
> > > > The problem is in the current code base we didn't fixup "prev" for the case
> > > > where "start" address can be within the "prev" vma section. In that case
> > > > we should have "prev" points to the current vma rather than the previous
> > > > one when feeding to vma_merge().
> > >
> > > This doesn't seem quite correct, perhaps - "where start is contained within vma
> > > but not clamped to its start. We need to convert this into case 4 which permits
> > > subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA
> > > will be clamped to the start."
> >
> > I think it covers more than case 4 - it can also be case 0 where no merge
> > will happen?
>
> Ugh please let's not call a case that doesn't merge by a number :P but sure of
> course it might also not merge.
To me the original paragraph was still fine. But if you prefer your version
(I'm perfectly fine either way, if you'd like to spell out which cases
it'll trigger), it'll be:
It's possible that "start" is contained within vma but not clamped to its
start. We need to convert this into either "cannot merge" case or "can
merge" case 4 which permits subdivision of prev by assigning vma to
prev. As we loop, each subsequent VMA will be clamped to the start.
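As a rough illustration of the per-vma clamping, here is a userspace
sketch, not kernel code. The end clamp mirrors the
"vma_end = min(end, vma->vm_end)" line visible in the patch context; the
start clamp and the names struct model_vma, struct model_range and
model_clamp_ranges() are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a vma: the half-open range [vm_start, vm_end). */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* One clamped sub-range that would be handed to a merge/split step. */
struct model_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Walk the sorted, non-overlapping vmas that intersect [start, end) and
 * record the clamped piece for each one. "out" must have room for n
 * entries. Returns the number of pieces written.
 */
size_t model_clamp_ranges(const struct model_vma *vmas, size_t n,
			  unsigned long start, unsigned long end,
			  struct model_range *out)
{
	size_t count = 0;

	assert(start < end);

	for (size_t i = 0; i < n; i++) {
		const struct model_vma *vma = &vmas[i];

		/* Skip vmas entirely outside the requested range. */
		if (vma->vm_end <= start || vma->vm_start >= end)
			continue;

		/* Only the first piece can begin in the middle of a vma. */
		if (vma->vm_start > start)
			start = vma->vm_start;

		out[count].start = start;
		out[count].end = end < vma->vm_end ? end : vma->vm_end;

		/* The next piece begins where this one ended. */
		start = out[count].end;
		count++;
	}

	return count;
}
```

For vmas [0,10), [10,20), [20,30) and a request of [5,25), only the first
piece starts inside a vma; the later pieces begin exactly at their vma's
start.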
Does that look good to you?
Thanks,
--
Peter Xu
On Wed, May 17, 2023 at 02:54:39PM -0400, Peter Xu wrote:
> On Wed, May 17, 2023 at 07:40:59PM +0100, Lorenzo Stoakes wrote:
> > On Wed, May 17, 2023 at 02:37:41PM -0400, Peter Xu wrote:
> > > On Wed, May 17, 2023 at 06:20:55PM +0100, Lorenzo Stoakes wrote:
> > > > On Wed, May 17, 2023 at 11:04:07AM -0400, Peter Xu wrote:
> > > > > It seems vma merging with uffd paths is broken with either
> > > > > register/unregister, where right now we can feed wrong parameters to
> > > > > vma_merge() and it's found by recent patch which moved asserts upwards in
> > > > > vma_merge() by Lorenzo Stoakes:
> > > > >
> > > > > https://lore.kernel.org/all/[email protected]/
> > > > >
> > > > > The problem is in the current code base we didn't fixup "prev" for the case
> > > > > where "start" address can be within the "prev" vma section. In that case
> > > > > we should have "prev" points to the current vma rather than the previous
> > > > > one when feeding to vma_merge().
> > > >
> > > > This doesn't seem quite correct, perhaps - "where start is contained within vma
> > > > but not clamped to its start. We need to convert this into case 4 which permits
> > > > subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA
> > > > will be clamped to the start."
> > >
> > > I think it covers more than case 4 - it can also be case 0 where no merge
> > > will happen?
> >
> > Ugh please let's not call a case that doesn't merge by a number :P but sure of
> > course it might also not merge.
>
> To me the original paragraph was still fine. But if you prefer your version
> (which I'm perfectly fine either way if you'd like to spell out what cases
> it'll trigger), it'll be:
>
> It's possible that "start" is contained within vma but not clamped to its
> start. We need to convert this into either "cannot merge" case or "can
> merge" case 4 which permits subdivision of prev by assigning vma to
> prev. As we loop, each subsequent VMA will be clamped to the start.
>
> Does that look good to you?
>
Looks good to me, thanks for taking the time!
> Thanks,
>
> --
> Peter Xu
>