Peter,
does b76b46902c2d ("mm/hugetlb: fix missing hugetlb_lock for resv
uncharge") really have any security implications? I fail to see any but
UFFD is not really my area so I might be missing something very easily.
On Mon 20-05-24 11:48:36, Greg KH wrote:
> Description
> ===========
>
> In the Linux kernel, the following vulnerability has been resolved:
>
> mm/hugetlb: fix missing hugetlb_lock for resv uncharge
>
> There is a recent report on UFFDIO_COPY over hugetlb:
>
> https://lore.kernel.org/all/[email protected]/
>
> 350: lockdep_assert_held(&hugetlb_lock);
>
> Should be an issue in hugetlb but triggered in an userfault context, where
> it goes into the unlikely path where two threads modifying the resv map
> together. Mike has a fix in that path for resv uncharge but it looks like
> the locking criteria was overlooked: hugetlb_cgroup_uncharge_folio_rsvd()
> will update the cgroup pointer, so it requires to be called with the lock
> held.
>
> The Linux kernel CVE team has assigned CVE-2024-36000 to this issue.
>
>
> Affected and fixed versions
> ===========================
>
> Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.1.91 with commit 4c806333efea
> Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.6.30 with commit f6c5d21db16a
> Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.8.9 with commit 538faabf31e9
> Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.9 with commit b76b46902c2d
> Issue introduced in 5.9.7 with commit f87004c0b2bd
>
> Please see https://www.kernel.org for a full list of currently supported
> kernel versions by the kernel community.
>
> Unaffected versions might change over time as fixes are backported to
> older supported kernel versions. The official CVE entry at
> https://cve.org/CVERecord/?id=CVE-2024-36000
> will be updated if fixes are backported, please check that for the most
> up to date information about this issue.
>
>
> Affected files
> ==============
>
> The file(s) affected by this issue are:
> mm/hugetlb.c
>
>
> Mitigation
> ==========
>
> The Linux kernel CVE team recommends that you update to the latest
> stable kernel version for this, and many other bugfixes. Individual
> changes are never tested alone, but rather are part of a larger kernel
> release. Cherry-picking individual commits is not recommended or
> supported by the Linux kernel community at all. If however, updating to
> the latest release is impossible, the individual changes to resolve this
> issue can be found at these commits:
> https://git.kernel.org/stable/c/4c806333efea1000a2a9620926f560ad2e1ca7cc
> https://git.kernel.org/stable/c/f6c5d21db16a0910152ec8aa9d5a7aed72694505
> https://git.kernel.org/stable/c/538faabf31e9c53d8c870d114846fda958a0de10
> https://git.kernel.org/stable/c/b76b46902c2d0395488c8412e1116c2486cdfcb2
--
Michal Hocko
SUSE Labs
On Mon, May 20, 2024 at 05:14:58PM +0200, Michal Hocko wrote:
> Peter,
Hi, Michal,
> does b76b46902c2d ("mm/hugetlb: fix missing hugetlb_lock for resv
> uncharge") really have any security implications? I fail to see any but
> UFFD is not really my area so I might be missing something very easily.
AFAIU that issue wasn't userfault specific, but a generic issue for hugetlb
- I believe it can also trigger in any other path that calls
alloc_hugetlb_folio(), and UFFDIO_COPY is one user of it.
I looked at it and provided a fix only because the report originated from
uffd, so Andrew normally points those to me, and since I looked anyway I
tried to fix it.
Here in general what I can see is that the lock is needed since this
commit:
commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8
Author: Aneesh Kumar K.V <[email protected]>
Date: Tue Jul 31 16:42:35 2012 -0700
hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
That commit mentioned that we rely on the lock to make sure all hugetlb
folios on the active list have a valid memcg. However I'm not sure
whether it's still required now (after all that's 2012..), e.g., I'm
looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
looks safe even for folios with an empty memcg, at least with the latest
code:
/*
* We can have pages in active list without any cgroup
* ie, hugepage with less than 3 pages. We can safely
* ignore those pages.
*/
if (!page_hcg || page_hcg != h_cg)
goto out;
In short, I don't know of any further security implications of this problem
besides the splat with LOCKDEP enabled. But I don't think I fully understand
the hugetlb reservation code, so please take that with a grain of salt.
E.g., right now we do the hugetlb_cgroup_uncharge_folio_rsvd(); could it
then happen that this folio is still used in the end and gets injected into
the pgtables (after all, alloc_hugetlb_folio() will still return this folio
to the caller successfully), and would that be a problem if this folio has
its _hugetlb_cgroup_rsvd==NULL? That looks like a separate question from
this specific problem, though..
Thanks,
>
> On Mon 20-05-24 11:48:36, Greg KH wrote:
> > Description
> > ===========
> >
> > In the Linux kernel, the following vulnerability has been resolved:
> >
> > mm/hugetlb: fix missing hugetlb_lock for resv uncharge
> >
> > There is a recent report on UFFDIO_COPY over hugetlb:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > 350: lockdep_assert_held(&hugetlb_lock);
> >
> > Should be an issue in hugetlb but triggered in an userfault context, where
> > it goes into the unlikely path where two threads modifying the resv map
> > together. Mike has a fix in that path for resv uncharge but it looks like
> > the locking criteria was overlooked: hugetlb_cgroup_uncharge_folio_rsvd()
> > will update the cgroup pointer, so it requires to be called with the lock
> > held.
> >
> > The Linux kernel CVE team has assigned CVE-2024-36000 to this issue.
> >
> >
> > Affected and fixed versions
> > ===========================
> >
> > Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.1.91 with commit 4c806333efea
> > Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.6.30 with commit f6c5d21db16a
> > Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.8.9 with commit 538faabf31e9
> > Issue introduced in 5.10 with commit 79aa925bf239 and fixed in 6.9 with commit b76b46902c2d
> > Issue introduced in 5.9.7 with commit f87004c0b2bd
> >
> > Please see https://www.kernel.org for a full list of currently supported
> > kernel versions by the kernel community.
> >
> > Unaffected versions might change over time as fixes are backported to
> > older supported kernel versions. The official CVE entry at
> > https://cve.org/CVERecord/?id=CVE-2024-36000
> > will be updated if fixes are backported, please check that for the most
> > up to date information about this issue.
> >
> >
> > Affected files
> > ==============
> >
> > The file(s) affected by this issue are:
> > mm/hugetlb.c
> >
> >
> > Mitigation
> > ==========
> >
> > The Linux kernel CVE team recommends that you update to the latest
> > stable kernel version for this, and many other bugfixes. Individual
> > changes are never tested alone, but rather are part of a larger kernel
> > release. Cherry-picking individual commits is not recommended or
> > supported by the Linux kernel community at all. If however, updating to
> > the latest release is impossible, the individual changes to resolve this
> > issue can be found at these commits:
> > https://git.kernel.org/stable/c/4c806333efea1000a2a9620926f560ad2e1ca7cc
> > https://git.kernel.org/stable/c/f6c5d21db16a0910152ec8aa9d5a7aed72694505
> > https://git.kernel.org/stable/c/538faabf31e9c53d8c870d114846fda958a0de10
> > https://git.kernel.org/stable/c/b76b46902c2d0395488c8412e1116c2486cdfcb2
>
> --
> Michal Hocko
> SUSE Labs
>
--
Peter Xu
Let me add Oscar,
On Tue 21-05-24 15:38:45, Peter Xu wrote:
> On Mon, May 20, 2024 at 05:14:58PM +0200, Michal Hocko wrote:
> > Peter,
>
> Hi, Michal,
>
> > does b76b46902c2d ("mm/hugetlb: fix missing hugetlb_lock for resv
> > uncharge") really have any security implications? I fail to see any but
> > UFFD is not really my area so I might be missing something very easily.
>
> AFAIU that issue wasn't userfault specific, but a generic issue for hugetlb
> - I believe it can also trigger in any other path that calls
> alloc_hugetlb_folio(), and UFFDIO_COPY is one user of it.
>
> I looked at it and provided a fix only because the report originated from
> uffd, so Andrew normally points those to me, and since I looked anyway I
> tried to fix it.
OK, I see. Thanks for the clarification.
> Here in general what I can see is that the lock is needed since this
> commit:
>
> commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Tue Jul 31 16:42:35 2012 -0700
>
> hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
>
> That commit mentioned that we rely on the lock to make sure all hugetlb
> folios on the active list have a valid memcg. However I'm not sure
> whether it's still required now (after all that's 2012..), e.g., I'm
> looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
> looks safe even for folios with an empty memcg, at least with the latest
> code:
>
> /*
> * We can have pages in active list without any cgroup
> * ie, hugepage with less than 3 pages. We can safely
> * ignore those pages.
> */
> if (!page_hcg || page_hcg != h_cg)
> goto out;
>
> In short, I don't know of any further security implications of this problem
> besides the splat with LOCKDEP enabled. But I don't think I fully understand
> the hugetlb reservation code, so please take that with a grain of salt.
> E.g., right now we do the hugetlb_cgroup_uncharge_folio_rsvd(); could it
> then happen that this folio is still used in the end and gets injected into
> the pgtables (after all, alloc_hugetlb_folio() will still return this folio
> to the caller successfully), and would that be a problem if this folio has
> its _hugetlb_cgroup_rsvd==NULL? That looks like a separate question from
> this specific problem, though..
Oscar, could you have a look please?
--
Michal Hocko
SUSE Labs
On Thu, May 23, 2024 at 09:30:59AM +0200, Michal Hocko wrote:
> Let me add Oscar,
Thanks
>
> On Tue 21-05-24 15:38:45, Peter Xu wrote:
> > That commit mentioned that we rely on the lock to make sure all hugetlb
> > folios on the active list have a valid memcg. However I'm not sure
> > whether it's still required now (after all that's 2012..), e.g., I'm
> > looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
> > looks safe even for folios with an empty memcg, at least with the latest
> > code:
> >
> > /*
> > * We can have pages in active list without any cgroup
> > * ie, hugepage with less than 3 pages. We can safely
> > * ignore those pages.
> > */
> > if (!page_hcg || page_hcg != h_cg)
> > goto out;
Ok, I had a look at hugetlb_cgroup implementation.
First time looking at that code, so bear with me.
I looked back at commit
commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8 (HEAD)
Author: Aneesh Kumar K.V <[email protected]>
Date: Tue Jul 31 16:42:35 2012 -0700
hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
to understand why the lock was needed.
On the theoretical part:
And we could have
CPU0 CPU1
dequeue_huge_page_vma
dequeue_huge_page_node
move_page_to_active_list
release_lock
hugetlb_cgroup_pre_destroy
for_each_page_in_active_list
<-- pages empty cgroups are skipped -->
hugetlb_cgroup_move_parent
move_page_to_parent
hugetlb_cgroup_commit_charge <-- too late
page[2].lru.next = (void *)h_cg;
So, the above page should have been moved to the parent, but since by the time
we were checking the active list this page did not have any cgroup attached to
it, it was skipped.
Notice I said theoretical because I noticed that
cgroup_call_pre_destroy()->hugetlb_cgroup_pre_destroy() is called from
cgroup_rmdir(). I am not sure under which circumstances cgroup_rmdir()
can succeed (does the cgroup refcount have to drop to 0?)
Now on to the present day.
After Peter's fix, we had:
CPU0 CPU1
dequeue_huge_page_vma
dequeue_hugetlb_folio_node_exact
move_page_to_active_list
hugetlb_cgroup_commit_charge
: folio->_hugetlb_cgroup = cgroup
hugetlb_cgroup_commit_charge_rsvd
: folio->_hugetlb_cgroup_rsvd = cgroup
release lock
hugetlb_cgroup_css_offline
take lock
for_each_page_in_active_list
hugetlb_cgroup_move_parent
: set folio->_hugetlb_cgroup
: _hugetlb_cgroup_rsvd is not set
: still contains the old value
release lock
__hugetlb_cgroup_uncharge_folio_rsvd
hugetlb_cgroup_from_folio_rsvd
...
folio->_hugetlb_cgroup_rsvd = NULL
So, as you can see, we charged a folio to the parent, and this folio still
contains the previous _hugetlb_cgroup_rsvd, when it should have been NULLed.
All this means that since no page_counter_uncharge() was called for
_hugetlb_cgroup_rsvd, the parent still holds the old page_counter charges for
the rsvd cgroup.
Then, if someone allocates a new cgroup under this parent before
hugetlb_cgroup_from_folio_rsvd() gets called, it will get the page_counters
from both _hugetlb_cgroup and _hugetlb_cgroup_rsvd.
And then page_counter_set_max() gets called for both counters.
What does it imply for that new cgroup to have those page_counters?
I do not really know.
Maybe it is not allowed to allocate a new page when it should be? Or it looks
like it has more memory reserved than it actually does?
--
Oscar Salvador
SUSE Labs
On Thu 23-05-24 12:33:37, Oscar Salvador wrote:
> On Thu, May 23, 2024 at 09:30:59AM +0200, Michal Hocko wrote:
> > Let me add Oscar,
>
> Thanks
>
> >
> > On Tue 21-05-24 15:38:45, Peter Xu wrote:
> > > That commit mentioned that we rely on the lock to make sure all hugetlb
> > > folios on the active list have a valid memcg. However I'm not sure
> > > whether it's still required now (after all that's 2012..), e.g., I'm
> > > looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
> > > looks safe even for folios with an empty memcg, at least with the latest
> > > code:
> > >
> > > /*
> > > * We can have pages in active list without any cgroup
> > > * ie, hugepage with less than 3 pages. We can safely
> > > * ignore those pages.
> > > */
> > > if (!page_hcg || page_hcg != h_cg)
> > > goto out;
>
> Ok, I had a look at hugetlb_cgroup implementation.
> First time looking at that code, so bear with me.
>
> I looked back at commit
>
> commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8 (HEAD)
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Tue Jul 31 16:42:35 2012 -0700
>
> hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
>
> to understand why the lock was needed.
>
> On the theoretical part:
>
> And we could have
>
> CPU0 CPU1
> dequeue_huge_page_vma
> dequeue_huge_page_node
> move_page_to_active_list
> release_lock
> hugetlb_cgroup_pre_destroy
> for_each_page_in_active_list
> <-- pages empty cgroups are skipped -->
> hugetlb_cgroup_move_parent
> move_page_to_parent
> hugetlb_cgroup_commit_charge <-- too late
> page[2].lru.next = (void *)h_cg;
>
> So, the above page should have been moved to the parent, but since by the time
> we were checking the active list this page did not have any cgroup attached to
> it, it was skipped.
>
> Notice I said theoretical because I noticed that
> cgroup_call_pre_destroy()->hugetlb_cgroup_pre_destroy() is called from
> cgroup_rmdir(). I am not sure under which circumstances cgroup_rmdir()
> can succeed (does the cgroup refcount have to drop to 0?)
Now, it just cannot have any tasks attached nor any subgroups. So is the
race actually possible?
--
Michal Hocko
SUSE Labs
On Thu, May 23, 2024 at 12:33:37PM +0200, Oscar Salvador wrote:
> On the theoretical part:
>
> And we could have
>
> CPU0 CPU1
> dequeue_huge_page_vma
> dequeue_huge_page_node
> move_page_to_active_list
> release_lock
> hugetlb_cgroup_pre_destroy
> for_each_page_in_active_list
> <-- pages empty cgroups are skipped -->
> hugetlb_cgroup_move_parent
> move_page_to_parent
> hugetlb_cgroup_commit_charge <-- too late
> page[2].lru.next = (void *)h_cg;
Would this happen with or without the patch? IIUC the patch didn't change
this path as far as hugetlb_cgroup_commit_charge() is concerned, and AFAIU
the lock is always held across the commit charge, with or without my patch:
spin_lock_irq(&hugetlb_lock);
folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
...
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
if (deferred_reserve) {
hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
h_cg, folio);
}
spin_unlock_irq(&hugetlb_lock);
What the previous patch changed, IMHO, is the path taken when the rare race
on the reservation happens first; I think Mike described that with a rich
comment, and it can race against a concurrent hugetlb_reserve_pages():
if (unlikely(map_chg > map_commit)) {
/*
* The page was added to the reservation map between
* vma_needs_reservation and vma_commit_reservation.
* This indicates a race with hugetlb_reserve_pages.
* Adjust for the subpool count incremented above AND
* in hugetlb_reserve_pages for the same page. Also,
* the reservation count added in hugetlb_reserve_pages
* no longer applies.
*/
long rsv_adjust;
rsv_adjust = hugepage_subpool_put_pages(spool, 1);
hugetlb_acct_memory(h, -rsv_adjust);
if (deferred_reserve) {
spin_lock_irq(&hugetlb_lock);
hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
pages_per_huge_page(h), folio);
spin_unlock_irq(&hugetlb_lock);
}
}
This should happen after the point of hugetlb_cgroup_commit_charge(), and
without the lock the problem is that we can have a concurrent accessor /
updater of the memcg.
However, after a 2nd look I don't even see the css offliner updating
_hugetlb_cgroup_rsvd at all here.. so I'm not sure whether a race could
happen. I mean, hugetlb_cgroup_move_parent() doesn't even touch
_hugetlb_cgroup_rsvd, which is the object that could race. It only
does:
set_hugetlb_cgroup(folio, parent);
which only deals with _hugetlb_cgroup.
It's pretty confusing to me here: doesn't it mean that when someone offlines
a child_cg we'll still leave the folio's _hugetlb_cgroup_rsvd pointing
to it, even though _hugetlb_cgroup starts pointing to the parent?... I was
expecting hugetlb_cgroup_move_parent() to also move the rsvd cg here.
The other thing is, when hugetlb_cgroup_move_parent() does the cg movement,
does it need to css_put() the ref on the child cg and css_tryget() on the
parent, just like what we do in __hugetlb_cgroup_charge_cgroup(), at least
for _hugetlb_cgroup?
I really don't know enough about these areas to tell; perhaps I missed
something. But maybe one of you has some idea.. In general, I think
besides LOCKDEP the lock is definitely needed to at least make sure things
like:
__set_hugetlb_cgroup(folio, NULL, rsvd);
page_counter_uncharge(),
to be an atomic op, otherwise two threads can see the old memcg
concurrently and maybe uncharge the counter twice. But again, currently I
don't know how that can be triggered if hugetlb_cgroup_move_parent() is
not even touching the resv cg..
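To spell out that "atomic op" point, here is a minimal userspace sketch (plain C
with pthreads; the struct and function names are made up for illustration only,
this is not the kernel code): without the lock, both threads can observe the
stale rsvd pointer and uncharge the counter twice, while taking the lock around
the read + clear + uncharge makes the sequence a single step, which is what the
hugetlb_lock provides in the real path.

#include <pthread.h>
#include <stdio.h>

struct cgroup { long counter; };          /* stands in for the page_counter      */
struct folio  { struct cgroup *rsvd; };   /* stands in for _hugetlb_cgroup_rsvd  */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Set to 0 to drop the lock and allow the double uncharge. */
#define USE_LOCK 1

static void uncharge_rsvd(struct folio *f)
{
#if USE_LOCK
	pthread_mutex_lock(&lock);
#endif
	struct cgroup *cg = f->rsvd;      /* "hugetlb_cgroup_from_folio_rsvd()"  */
	if (cg) {
		cg->counter -= 1;         /* "page_counter_uncharge()"           */
		f->rsvd = NULL;           /* "__set_hugetlb_cgroup(folio, NULL)" */
	}
#if USE_LOCK
	pthread_mutex_unlock(&lock);
#endif
}

static void *worker(void *arg)
{
	uncharge_rsvd(arg);
	return NULL;
}

int main(void)
{
	struct cgroup cg = { .counter = 1 };
	struct folio f = { .rsvd = &cg };
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, &f);
	pthread_create(&t2, NULL, worker, &f);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* With the lock the counter ends at 0; without it both threads may see
	 * the stale rsvd pointer and the counter can end at -1. */
	printf("counter = %ld\n", cg.counter);
	return 0;
}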
Thanks,
--
Peter Xu
On Thu, May 23, 2024 at 11:42:30AM -0400, Peter Xu wrote:
> I really don't know enough about these areas to tell; perhaps I missed
> something. But maybe one of you has some idea.. In general, I think
> besides LOCKDEP the lock is definitely needed to at least make sure things
> like:
>
> __set_hugetlb_cgroup(folio, NULL, rsvd);
I do not think this is a problem; you are only setting folio->_hugetlb_cgroup_rsvd
to the hugetlb cgroup, and no one else should fiddle with that folio.
> page_counter_uncharge(),
This, on the other hand, might be another story:
page_counter_uncharge
new = atomic_long_sub_return(nr_pages, &counter->usage)
propagate_protected_usage
The first atomic_long_sub_return is ok because it is atomic, so
whoever comes last will not see, e.g., a half-updated value.
But propagate_protected_usage() is a bit more convoluted, as it involves a bunch of
atomic operations and comparisons; if they are not serialized, the counters
will not be consistent, which means that any charge/uncharge operation that comes after
might not reflect reality.
So I guess we could end up with scenarios where cgroups would not get as many pages as
they should, or maybe more pages than they should.
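To make the "not serialized" part concrete, here is a deterministic userspace
sketch (made-up names, only loosely modelled on the page_counter bookkeeping,
not the actual kernel implementation): every step is fine on its own, but one
possible interleaving of two uncharges leaves the propagated "protected" value
out of sync with the usage counter.

#include <stdio.h>

static long usage = 10;             /* the atomically updated counter         */
static long min_limit = 5;          /* the configured protection ("min")      */
static long self_protected = 5;     /* last protected value we propagated up  */
static long parent_protected = 5;   /* aggregate kept by the parent           */

static long lmin(long a, long b) { return a < b ? a : b; }

/* First half of an uncharge: drop usage (atomic_long_sub_return-like). */
static long drop_usage(long n) { return usage -= n; }

/* Second half: propagate the derived protected value upwards. This is a
 * read-modify-write over two variables, so it needs serialization. */
static void propagate(long new_usage)
{
	long new_prot = lmin(new_usage, min_limit);
	long old_prot = self_protected;
	self_protected = new_prot;
	parent_protected += new_prot - old_prot;
}

int main(void)
{
	/* Two uncharges A (-6) and B (-4), with their halves interleaved as
	 * A1, B1, B2, A2 -- possible when the callers are not serialized. */
	long a_usage = drop_usage(6);   /* A1: usage 10 -> 4                      */
	long b_usage = drop_usage(4);   /* B1: usage  4 -> 0                      */
	propagate(b_usage);             /* B2: propagates protected = 0           */
	propagate(a_usage);             /* A2: propagates 4, based on stale usage */

	/* usage is 0, yet the parent still accounts 4 protected pages. */
	printf("usage=%ld self_protected=%ld parent_protected=%ld\n",
	       usage, self_protected, parent_protected);
	return 0;
}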
If this reasoning is accurate, I am leaning towards taking this as a security fix.
--
Oscar Salvador
SUSE Labs