When debugging issues with a workload using SysV shmem, Michal Hocko has
come up with a reproducer that shows how a series of mprotect()
operations can result in an elevated shm_nattch and thus leak of the
resource.
The problem is caused by wrong assumptions in vma_merge() commit
714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in
mergeability test"). The shmem vmas have a vma_ops->close callback
that decrements shm_nattch, and we remove the vma without calling it.
vma_merge() has thus historically avoided merging vma's with
vma_ops->close and commit 714965ca8252 was supposed to keep it that way.
It relaxed the checks for vma_ops->close in can_vma_merge_after()
assuming that it is never called on a vma that would be a candidate for
removal. However, the vma_merge() code does also use the result of this
check in the decision to remove a different vma in the merge case 7.
A robust solution would be to refactor vma_merge() code in a way that
the vma_ops->close check is only done for vma's that are actually going
to be removed, and not as part of the preliminary checks. That would
both solve the existing bug, and also allow additional merges that the
checks currently prevent unnecessarily in some cases.
However to fix the existing bug first with a minimized risk, and for
easier stable backports, this patch only adds a vma_ops->close check to
the buggy case 7 specifically. All other cases of vma removal are
covered by the can_vma_merge_before() check that includes the test for
vma_ops->close.
The reproducer code, adapted from Michal Hocko's code:
int main(int argc, char *argv[]) {
int segment_id;
size_t segment_size = 20 * PAGE_SIZE;
char * sh_mem;
struct shmid_ds shmid_ds;
key_t key = 0x1234;
segment_id = shmget(key, segment_size,
IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
sh_mem = (char *)shmat(segment_id, NULL, 0);
mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE);
mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
shmdt(sh_mem);
shmctl(segment_id, IPC_STAT, &shmid_ds);
printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch);
if (shmctl(segment_id, IPC_RMID, 0))
printf("IPCRM failed %d\n", errno);
return (shmid_ds.shm_nattch) ? 1 : 0;
}
Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test")
Reported-by: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mmap.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index d89770eaab6b..a4238373ee9b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -954,10 +954,19 @@ static struct vm_area_struct
} else if (merge_prev) { /* case 2 */
if (curr) {
vma_start_write(curr);
- err = dup_anon_vma(prev, curr, &anon_dup);
if (end == curr->vm_end) { /* case 7 */
+ /*
+ * can_vma_merge_after() assumed we would not be
+ * removing prev vma, so it skipped the check
+ * for vm_ops->close, but we are removing curr
+ */
+ if (curr->vm_ops && curr->vm_ops->close)
+ err = -EINVAL;
+ else
+ err = dup_anon_vma(prev, curr, &anon_dup);
remove = curr;
} else { /* case 5 */
+ err = dup_anon_vma(prev, curr, &anon_dup);
adjust = curr;
adj_start = (end - curr->vm_start);
}
--
2.43.1
* Vlastimil Babka <[email protected]> [240222 11:56]:
> When debugging issues with a workload using SysV shmem, Michal Hocko has
> come up with a reproducer that shows how a series of mprotect()
> operations can result in an elevated shm_nattch and thus leak of the
> resource.
>
> The problem is caused by wrong assumptions in vma_merge() commit
> 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in
> mergeability test"). The shmem vmas have a vma_ops->close callback
> that decrements shm_nattch, and we remove the vma without calling it.
>
> vma_merge() has thus historically avoided merging vma's with
> vma_ops->close and commit 714965ca8252 was supposed to keep it that way.
> It relaxed the checks for vma_ops->close in can_vma_merge_after()
> assuming that it is never called on a vma that would be a candidate for
> removal. However, the vma_merge() code does also use the result of this
> check in the decision to remove a different vma in the merge case 7.
>
> A robust solution would be to refactor vma_merge() code in a way that
> the vma_ops->close check is only done for vma's that are actually going
> to be removed, and not as part of the preliminary checks. That would
> both solve the existing bug, and also allow additional merges that the
> checks currently prevent unnecessarily in some cases.
>
> However to fix the existing bug first with a minimized risk, and for
> easier stable backports, this patch only adds a vma_ops->close check to
> the buggy case 7 specifically. All other cases of vma removal are
> covered by the can_vma_merge_before() check that includes the test for
> vma_ops->close.
>
> The reproducer code, adapted from Michal Hocko's code:
>
> int main(int argc, char *argv[]) {
> int segment_id;
> size_t segment_size = 20 * PAGE_SIZE;
> char * sh_mem;
> struct shmid_ds shmid_ds;
>
> key_t key = 0x1234;
> segment_id = shmget(key, segment_size,
> IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
> sh_mem = (char *)shmat(segment_id, NULL, 0);
>
> mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE);
>
> mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
>
> mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
>
> shmdt(sh_mem);
>
> shmctl(segment_id, IPC_STAT, &shmid_ds);
> printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch);
>
> if (shmctl(segment_id, IPC_RMID, 0))
> printf("IPCRM failed %d\n", errno);
> return (shmid_ds.shm_nattch) ? 1 : 0;
> }
>
> Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test")
> Reported-by: Michal Hocko <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/mmap.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d89770eaab6b..a4238373ee9b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -954,10 +954,19 @@ static struct vm_area_struct
> } else if (merge_prev) { /* case 2 */
> if (curr) {
> vma_start_write(curr);
> - err = dup_anon_vma(prev, curr, &anon_dup);
> if (end == curr->vm_end) { /* case 7 */
> + /*
> + * can_vma_merge_after() assumed we would not be
> + * removing prev vma, so it skipped the check
> + * for vm_ops->close, but we are removing curr
> + */
> + if (curr->vm_ops && curr->vm_ops->close)
> + err = -EINVAL;
> + else
> + err = dup_anon_vma(prev, curr, &anon_dup);
> remove = curr;
This separates the check for potentially merging previous to a later
failure case. Would it be better to check:
if (curr && curr->vm_ops && curr->vm_ops->close)
and not set merge_prev = true, ie we cannot merge with the predecessor?
That way we would exit as merge_prev == false.
We would have the added benefit of not having to look at merge_prev &
merge_next case with this vm_ops->close in mind (case 1 and 6).. because
I'm pretty sure we can currently get to case 6 in this way:
merge_prev = true
check for merge_next.. can_vma_merge_before(next...);
is_mergeable_vma(next.... , true);
if (true && next->vm_ops && next->vm_ops->close) /* Fine for next.. */
Remove curr by case 6 without checking curr->vm_ops &&
curr->vm_ops->close
If I am correct, then are we blaming the right commit?
Perhaps we should just fail earlier when we find a curr with the close
ops?
> } else { /* case 5 */
> + err = dup_anon_vma(prev, curr, &anon_dup);
> adjust = curr;
> adj_start = (end - curr->vm_start);
> }
> --
> 2.43.1
>
* Liam R. Howlett <[email protected]> [240222 13:56]:
> * Vlastimil Babka <[email protected]> [240222 11:56]:
> > When debugging issues with a workload using SysV shmem, Michal Hocko has
> > come up with a reproducer that shows how a series of mprotect()
> > operations can result in an elevated shm_nattch and thus leak of the
> > resource.
> >
> > The problem is caused by wrong assumptions in vma_merge() commit
> > 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in
> > mergeability test"). The shmem vmas have a vma_ops->close callback
> > that decrements shm_nattch, and we remove the vma without calling it.
> >
> > vma_merge() has thus historically avoided merging vma's with
> > vma_ops->close and commit 714965ca8252 was supposed to keep it that way.
> > It relaxed the checks for vma_ops->close in can_vma_merge_after()
> > assuming that it is never called on a vma that would be a candidate for
> > removal. However, the vma_merge() code does also use the result of this
> > check in the decision to remove a different vma in the merge case 7.
> >
> > A robust solution would be to refactor vma_merge() code in a way that
> > the vma_ops->close check is only done for vma's that are actually going
> > to be removed, and not as part of the preliminary checks. That would
> > both solve the existing bug, and also allow additional merges that the
> > checks currently prevent unnecessarily in some cases.
> >
> > However to fix the existing bug first with a minimized risk, and for
> > easier stable backports, this patch only adds a vma_ops->close check to
> > the buggy case 7 specifically. All other cases of vma removal are
> > covered by the can_vma_merge_before() check that includes the test for
> > vma_ops->close.
> >
> > The reproducer code, adapted from Michal Hocko's code:
> >
> > int main(int argc, char *argv[]) {
> > int segment_id;
> > size_t segment_size = 20 * PAGE_SIZE;
> > char * sh_mem;
> > struct shmid_ds shmid_ds;
> >
> > key_t key = 0x1234;
> > segment_id = shmget(key, segment_size,
> > IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
> > sh_mem = (char *)shmat(segment_id, NULL, 0);
> >
> > mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE);
> >
> > mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
> >
> > mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
> >
> > shmdt(sh_mem);
> >
> > shmctl(segment_id, IPC_STAT, &shmid_ds);
> > printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch);
> >
> > if (shmctl(segment_id, IPC_RMID, 0))
> > printf("IPCRM failed %d\n", errno);
> > return (shmid_ds.shm_nattch) ? 1 : 0;
> > }
> >
> > Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test")
> > Reported-by: Michal Hocko <[email protected]>
> > Cc: <[email protected]>
> > Signed-off-by: Vlastimil Babka <[email protected]>
> > ---
> > mm/mmap.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d89770eaab6b..a4238373ee9b 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -954,10 +954,19 @@ static struct vm_area_struct
> > } else if (merge_prev) { /* case 2 */
> > if (curr) {
> > vma_start_write(curr);
> > - err = dup_anon_vma(prev, curr, &anon_dup);
> > if (end == curr->vm_end) { /* case 7 */
> > + /*
> > + * can_vma_merge_after() assumed we would not be
> > + * removing prev vma, so it skipped the check
> > + * for vm_ops->close, but we are removing curr
> > + */
> > + if (curr->vm_ops && curr->vm_ops->close)
> > + err = -EINVAL;
> > + else
> > + err = dup_anon_vma(prev, curr, &anon_dup);
> > remove = curr;
>
> This separates the check for potentially merging previous to a later
> failure case. Would it be better to check:
> if (curr && curr->vm_ops && curr->vm_ops->close)
>
> and not set merge_prev = true, ie we cannot merge with the predecessor?
>
> That way we would exit as merge_prev == false.
>
> We would have the added benefit of not having to look at merge_prev &
> merge_next case with this vm_ops->close in mind (case 1 and 6).. because
> I'm pretty sure we can currently get to case 6 in this way:
>
> merge_prev = true
> check for merge_next.. can_vma_merge_before(next...);
> is_mergeable_vma(next.... , true);
> if (true && next->vm_ops && next->vm_ops->close) /* Fine for next.. */
>
> Remove curr by case 6 without checking curr->vm_ops &&
> curr->vm_ops->close
>
> If I am correct, then are we blaming the right commit?
I am not correct.
The file check will ensure the same ops, so the file and ops must match.
As long as both are checked on one VMA then it will work as required.
>
> Perhaps we should just fail earlier when we find a curr with the close
> ops?
I'd rather fail earlier, but it's not a big deal.
>
> > } else { /* case 5 */
> > + err = dup_anon_vma(prev, curr, &anon_dup);
> > adjust = curr;
> > adj_start = (end - curr->vm_start);
> > }
> > --
> > 2.43.1
> >
On Thu, Feb 22, 2024 at 05:55:50PM +0100, Vlastimil Babka wrote:
> When debugging issues with a workload using SysV shmem, Michal Hocko has
> come up with a reproducer that shows how a series of mprotect()
> operations can result in an elevated shm_nattch and thus leak of the
> resource.
>
> The problem is caused by wrong assumptions in vma_merge() commit
> 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in
> mergeability test"). The shmem vmas have a vma_ops->close callback
> that decrements shm_nattch, and we remove the vma without calling it.
>
> vma_merge() has thus historically avoided merging vma's with
> vma_ops->close and commit 714965ca8252 was supposed to keep it that way.
> It relaxed the checks for vma_ops->close in can_vma_merge_after()
> assuming that it is never called on a vma that would be a candidate for
> removal. However, the vma_merge() code does also use the result of this
> check in the decision to remove a different vma in the merge case 7.
>
> A robust solution would be to refactor vma_merge() code in a way that
> the vma_ops->close check is only done for vma's that are actually going
> to be removed, and not as part of the preliminary checks. That would
> both solve the existing bug, and also allow additional merges that the
> checks currently prevent unnecessarily in some cases.
Let's do that pretty soon :) this is a bit of an ugly fix but
understandable to do it in this form to make it easier to backport (+
perhaps generate some CVEs? :)
>
> However to fix the existing bug first with a minimized risk, and for
> easier stable backports, this patch only adds a vma_ops->close check to
> the buggy case 7 specifically. All other cases of vma removal are
> covered by the can_vma_merge_before() check that includes the test for
> vma_ops->close.
I concur, all the other cases require merge_next which would have invoked
can_vma_merge_before() that calls is_mergeable_vma() with may_remove_vma
set to true hence performs the close check.
>
> The reproducer code, adapted from Michal Hocko's code:
>
> int main(int argc, char *argv[]) {
> int segment_id;
> size_t segment_size = 20 * PAGE_SIZE;
> char * sh_mem;
> struct shmid_ds shmid_ds;
>
> key_t key = 0x1234;
> segment_id = shmget(key, segment_size,
> IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
> sh_mem = (char *)shmat(segment_id, NULL, 0);
>
> mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE);
>
> mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
>
> mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
>
> shmdt(sh_mem);
>
> shmctl(segment_id, IPC_STAT, &shmid_ds);
> printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch);
>
> if (shmctl(segment_id, IPC_RMID, 0))
> printf("IPCRM failed %d\n", errno);
> return (shmid_ds.shm_nattch) ? 1 : 0;
> }
>
> Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test")
> Reported-by: Michal Hocko <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/mmap.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d89770eaab6b..a4238373ee9b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -954,10 +954,19 @@ static struct vm_area_struct
> } else if (merge_prev) { /* case 2 */
> if (curr) {
> vma_start_write(curr);
> - err = dup_anon_vma(prev, curr, &anon_dup);
> if (end == curr->vm_end) { /* case 7 */
> + /*
> + * can_vma_merge_after() assumed we would not be
> + * removing prev vma, so it skipped the check
> + * for vm_ops->close, but we are removing curr
> + */
> + if (curr->vm_ops && curr->vm_ops->close)
> + err = -EINVAL;
> + else
> + err = dup_anon_vma(prev, curr, &anon_dup);
> remove = curr;
> } else { /* case 5 */
> + err = dup_anon_vma(prev, curr, &anon_dup);
This (ironically) duplicates code, could we pull this out of the if/else
and put it afterwards like:
if (!err)
err = dup_anon_vma(prev, curr, &anon_dup);
> adjust = curr;
> adj_start = (end - curr->vm_start);
> }
> --
> 2.43.1
>
On 2/22/24 20:27, Liam R. Howlett wrote:
> * Liam R. Howlett <[email protected]> [240222 13:56]:
>> * Vlastimil Babka <[email protected]> [240222 11:56]:
>> This separates the check for potentially merging previous to a later
>> failure case. Would it be better to check:
>> if (curr && curr->vm_ops && curr->vm_ops->close)
>>
>> and not set merge_prev = true, ie we cannot merge with the predecessor?
Good suggestion, thanks!
>> That way we would exit as merge_prev == false.
>>
>> We would have the added benefit of not having to look at merge_prev &
>> merge_next case with this vm_ops->close in mind (case 1 and 6).. because
>> I'm pretty sure we can currently get to case 6 in this way:
>>
>> merge_prev = true
>> check for merge_next.. can_vma_merge_before(next...);
>> is_mergeable_vma(next.... , true);
>> if (true && next->vm_ops && next->vm_ops->close) /* Fine for next.. */
>>
>> Remove curr by case 6 without checking curr->vm_ops &&
>> curr->vm_ops->close
>>
>> If I am correct, then are we blaming the right commit?
It was bisected with no nondeterminism in the test, so yeah.
> I am not correct.
> The file check will ensure the same ops, so the file and ops must match.
> As long as both are checked on one VMA then it will work as required.
Right, otherwise we would have bigger issues even before the buggy commit,
we were never checking curr's vma_ops before.
>>
>> Perhaps we should just fail earlier when we find a curr with the close
>> ops?
>
> I'd rather fail earlier, but it's not a big deal.
Your suggestion will indeed result in a nicer and more obvious code, so will
do, thanks!
>>
>> > } else { /* case 5 */
>> > + err = dup_anon_vma(prev, curr, &anon_dup);
>> > adjust = curr;
>> > adj_start = (end - curr->vm_start);
>> > }
>> > --
>> > 2.43.1
>> >
On 2/22/24 22:19, Vlastimil Babka wrote:
> On 2/22/24 20:27, Liam R. Howlett wrote:
>> * Liam R. Howlett <[email protected]> [240222 13:56]:
>>> * Vlastimil Babka <[email protected]> [240222 11:56]:
>>> This separates the check for potentially merging previous to a later
>>> failure case. Would it be better to check:
>>> if (curr && curr->vm_ops && curr->vm_ops->close)
>>>
>>> and not set merge_prev = true, ie we cannot merge with the predecessor?
>
> Good suggestion, thanks!
Or actually as Lorenzo informed me, it would affect case 5 as well and we
don't want that. And special casing 5 vs 7 that early would be ugly again.
So I'll just do the code dedup Lorenzo suggested...
>>> That way we would exit as merge_prev == false.
>>>
>>> We would have the added benefit of not having to look at merge_prev &
>>> merge_next case with this vm_ops->close in mind (case 1 and 6).. because
>>> I'm pretty sure we can currently get to case 6 in this way:
>>>
>>> merge_prev = true
>>> check for merge_next.. can_vma_merge_before(next...);
>>> is_mergeable_vma(next.... , true);
>>> if (true && next->vm_ops && next->vm_ops->close) /* Fine for next.. */
>>>
>>> Remove curr by case 6 without checking curr->vm_ops &&
>>> curr->vm_ops->close
>>>
>>> If I am correct, then are we blaming the right commit?
>
> It was bisected with no nondeterminism in the test, so yeah.
>
>> I am not correct.
>> The file check will ensure the same ops, so the file and ops must match.
>> As long as both are checked on one VMA then it will work as required.
>
> Right, otherwise we would have bigger issues even before the buggy commit,
> we were never checking curr's vma_ops before.
>
>>>
>>> Perhaps we should just fail earlier when we find a curr with the close
>>> ops?
>>
>> I'd rather fail earlier, but it's not a big deal.
>
> Your suggestion will indeed result in a nicer and more obvious code, so will
> do, thanks!
>
>>>
>>> > } else { /* case 5 */
>>> > + err = dup_anon_vma(prev, curr, &anon_dup);
>>> > adjust = curr;
>>> > adj_start = (end - curr->vm_start);
>>> > }
>>> > --
>>> > 2.43.1
>>> >
>