Hello,
I noticed that recent upstream kernels don't account the xarray nodes
of the page cache to the allocating cgroup, like we used to do for the
radix tree nodes.
This results in broken isolation for cgrouped apps, allowing them to
escape their containment and harm other cgroups and the system with an
excessive build-up of nonresident information.
It also breaks thrashing/refault detection because the page cache
lives in a different domain than the xarray nodes, and so the shadow
shrinker can reclaim nonresident information way too early when there
isn't much cache in the root cgroup.
This appears to be the culprit:
commit a28334862993b5c6a8766f6963ee69048403817c
Author: Matthew Wilcox <[email protected]>
Date: Tue Dec 5 19:04:20 2017 -0500
page cache: Finish XArray conversion
With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().
Signed-off-by: Matthew Wilcox <[email protected]>
diff --git a/fs/inode.c b/fs/inode.c
index 42f6d25f32a5..9b808986d440 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -349,7 +349,7 @@ EXPORT_SYMBOL(inc_nlink);
static void __address_space_init_once(struct address_space *mapping)
{
- INIT_RADIX_TREE(&mapping->i_pages, GFP_ATOMIC | __GFP_ACCOUNT);
+ xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
spin_lock_init(&mapping->private_lock);
It fairly blatantly drops __GFP_ACCOUNT.
I'm not quite sure how to fix this, since the xarray code doesn't seem
to have per-tree gfp flags anymore like the radix tree did. We cannot
add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
xarray api doesn't seem to really support gfp flags, either (xas_nomem
does, but the optimistic internal allocations have fixed gfp flags).
On Thu, May 23, 2019 at 01:43:49PM -0400, Johannes Weiner wrote:
> I noticed that recent upstream kernels don't account the xarray nodes
> of the page cache to the allocating cgroup, like we used to do for the
> radix tree nodes.
>
> This results in broken isolation for cgrouped apps, allowing them to
> escape their containment and harm other cgroups and the system with an
> excessive build-up of nonresident information.
>
> It also breaks thrashing/refault detection because the page cache
> lives in a different domain than the xarray nodes, and so the shadow
> shrinker can reclaim nonresident information way too early when there
> isn't much cache in the root cgroup.
>
> I'm not quite sure how to fix this, since the xarray code doesn't seem
> to have per-tree gfp flags anymore like the radix tree did. We cannot
> add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
> xarray api doesn't seem to really support gfp flags, either (xas_nomem
> does, but the optimistic internal allocations have fixed gfp flags).
Would it be a problem to always add __GFP_ACCOUNT to the fixed flags?
I don't really understand cgroups.
On Thu, May 23, 2019 at 11:37 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, May 23, 2019 at 01:43:49PM -0400, Johannes Weiner wrote:
> > I noticed that recent upstream kernels don't account the xarray nodes
> > of the page cache to the allocating cgroup, like we used to do for the
> > radix tree nodes.
> >
> > This results in broken isolation for cgrouped apps, allowing them to
> > escape their containment and harm other cgroups and the system with an
> > excessive build-up of nonresident information.
> >
> > It also breaks thrashing/refault detection because the page cache
> > lives in a different domain than the xarray nodes, and so the shadow
> > shrinker can reclaim nonresident information way too early when there
> > isn't much cache in the root cgroup.
> >
> > I'm not quite sure how to fix this, since the xarray code doesn't seem
> > to have per-tree gfp flags anymore like the radix tree did. We cannot
> > add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
> > xarray api doesn't seem to really support gfp flags, either (xas_nomem
> > does, but the optimistic internal allocations have fixed gfp flags).
>
> Would it be a problem to always add __GFP_ACCOUNT to the fixed flags?
> I don't really understand cgroups.
Does xarray cache allocated nodes, something like radix tree's:
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
For the cached one, no __GFP_ACCOUNT flag.
Also some users of xarray may not want __GFP_ACCOUNT. That's the
reason we had __GFP_ACCOUNT for page cache instead of hard coding it
in radix tree.
Shakeel
On Thu, May 23, 2019 at 11:49:41AM -0700, Shakeel Butt wrote:
> On Thu, May 23, 2019 at 11:37 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Thu, May 23, 2019 at 01:43:49PM -0400, Johannes Weiner wrote:
> > > I noticed that recent upstream kernels don't account the xarray nodes
> > > of the page cache to the allocating cgroup, like we used to do for the
> > > radix tree nodes.
> > >
> > > This results in broken isolation for cgrouped apps, allowing them to
> > > escape their containment and harm other cgroups and the system with an
> > > excessive build-up of nonresident information.
> > >
> > > It also breaks thrashing/refault detection because the page cache
> > > lives in a different domain than the xarray nodes, and so the shadow
> > > shrinker can reclaim nonresident information way too early when there
> > > isn't much cache in the root cgroup.
> > >
> > > I'm not quite sure how to fix this, since the xarray code doesn't seem
> > > to have per-tree gfp flags anymore like the radix tree did. We cannot
> > > add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
> > > xarray api doesn't seem to really support gfp flags, either (xas_nomem
> > > does, but the optimistic internal allocations have fixed gfp flags).
> >
> > Would it be a problem to always add __GFP_ACCOUNT to the fixed flags?
> > I don't really understand cgroups.
>
> Does xarray cache allocated nodes, something like radix tree's:
>
> static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>
> For the cached one, no __GFP_ACCOUNT flag.
No. That was the point of the XArray conversion; no cached nodes.
> Also some users of xarray may not want __GFP_ACCOUNT. That's the
> reason we had __GFP_ACCOUNT for page cache instead of hard coding it
> in radix tree.
This is what I don't understand -- why would someone not want
__GFP_ACCOUNT? For a shared resource? But the page cache is a shared
resource. So what is a good example of a time when an allocation should
_not_ be accounted to the cgroup?
On Thu, May 23, 2019 at 03:21:17PM -0400, Johannes Weiner wrote:
> On Thu, May 23, 2019 at 12:00:32PM -0700, Matthew Wilcox wrote:
> > On Thu, May 23, 2019 at 11:49:41AM -0700, Shakeel Butt wrote:
> > > On Thu, May 23, 2019 at 11:37 AM Matthew Wilcox <[email protected]> wrote:
> > > > On Thu, May 23, 2019 at 01:43:49PM -0400, Johannes Weiner wrote:
> > > > > I noticed that recent upstream kernels don't account the xarray nodes
> > > > > of the page cache to the allocating cgroup, like we used to do for the
> > > > > radix tree nodes.
> > > > >
> > > > > This results in broken isolation for cgrouped apps, allowing them to
> > > > > escape their containment and harm other cgroups and the system with an
> > > > > excessive build-up of nonresident information.
> > > > >
> > > > > It also breaks thrashing/refault detection because the page cache
> > > > > lives in a different domain than the xarray nodes, and so the shadow
> > > > > shrinker can reclaim nonresident information way too early when there
> > > > > isn't much cache in the root cgroup.
> > > > >
> > > > > I'm not quite sure how to fix this, since the xarray code doesn't seem
> > > > > to have per-tree gfp flags anymore like the radix tree did. We cannot
> > > > > add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
> > > > > xarray api doesn't seem to really support gfp flags, either (xas_nomem
> > > > > does, but the optimistic internal allocations have fixed gfp flags).
> > > >
> > > > Would it be a problem to always add __GFP_ACCOUNT to the fixed flags?
> > > > I don't really understand cgroups.
> >
> > > Also some users of xarray may not want __GFP_ACCOUNT. That's the
> > > reason we had __GFP_ACCOUNT for page cache instead of hard coding it
> > > in radix tree.
> >
> > This is what I don't understand -- why would someone not want
> > __GFP_ACCOUNT? For a shared resource? But the page cache is a shared
> > resource. So what is a good example of a time when an allocation should
> > _not_ be accounted to the cgroup?
>
> We used to cgroup-account every slab charge to cgroups per default,
> until we changed it to a whitelist behavior:
>
> commit b2a209ffa605994cbe3c259c8584ba1576d3310c
> Author: Vladimir Davydov <[email protected]>
> Date: Thu Jan 14 15:18:05 2016 -0800
>
> Revert "kernfs: do not account ino_ida allocations to memcg"
>
> Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
> alloc_kmem_pages call) are accounted to memory cgroup automatically.
> Callers have to explicitly opt out if they don't want/need accounting
> for some reason. Such a design decision leads to several problems:
>
> - kmalloc users are highly sensitive to failures, many of them
> implicitly rely on the fact that kmalloc never fails, while memcg
> makes failures quite plausible.
Doesn't apply here. The allocation under spinlock is expected to fail,
and then we'll use xas_nomem() with the caller's specified GFP flags
which may or may not include __GFP_ACCOUNT.
> - A lot of objects are shared among different containers by design.
> Accounting such objects to one of containers is just unfair.
> Moreover, it might lead to pinning a dead memcg along with its kmem
> caches, which aren't tiny, which might result in noticeable increase
> in memory consumption for no apparent reason in the long run.
These objects are in the slab of radix_tree_nodes, and we'll already be
accounting page cache nodes to the cgroup, so accounting random XArray
nodes to the cgroups isn't going to make the problem worse.
> - There are tons of short-lived objects. Accounting them to memcg will
> only result in slight noise and won't change the overall picture, but
> we still have to pay accounting overhead.
XArray nodes are generally not short-lived objects.
On Thu, May 23, 2019 at 12:00:32PM -0700, Matthew Wilcox wrote:
> On Thu, May 23, 2019 at 11:49:41AM -0700, Shakeel Butt wrote:
> > On Thu, May 23, 2019 at 11:37 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Thu, May 23, 2019 at 01:43:49PM -0400, Johannes Weiner wrote:
> > > > I noticed that recent upstream kernels don't account the xarray nodes
> > > > of the page cache to the allocating cgroup, like we used to do for the
> > > > radix tree nodes.
> > > >
> > > > This results in broken isolation for cgrouped apps, allowing them to
> > > > escape their containment and harm other cgroups and the system with an
> > > > excessive build-up of nonresident information.
> > > >
> > > > It also breaks thrashing/refault detection because the page cache
> > > > lives in a different domain than the xarray nodes, and so the shadow
> > > > shrinker can reclaim nonresident information way too early when there
> > > > isn't much cache in the root cgroup.
> > > >
> > > > I'm not quite sure how to fix this, since the xarray code doesn't seem
> > > > to have per-tree gfp flags anymore like the radix tree did. We cannot
> > > > add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
> > > > xarray api doesn't seem to really support gfp flags, either (xas_nomem
> > > > does, but the optimistic internal allocations have fixed gfp flags).
> > >
> > > Would it be a problem to always add __GFP_ACCOUNT to the fixed flags?
> > > I don't really understand cgroups.
> >
> > Does xarray cache allocated nodes, something like radix tree's:
> >
> > static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
> >
> > For the cached one, no __GFP_ACCOUNT flag.
>
> No. That was the point of the XArray conversion; no cached nodes.
>
> > Also some users of xarray may not want __GFP_ACCOUNT. That's the
> > reason we had __GFP_ACCOUNT for page cache instead of hard coding it
> > in radix tree.
>
> This is what I don't understand -- why would someone not want
> __GFP_ACCOUNT? For a shared resource? But the page cache is a shared
> resource. So what is a good example of a time when an allocation should
> _not_ be accounted to the cgroup?
We used to cgroup-account every slab charge to cgroups per default,
until we changed it to a whitelist behavior:
commit b2a209ffa605994cbe3c259c8584ba1576d3310c
Author: Vladimir Davydov <[email protected]>
Date: Thu Jan 14 15:18:05 2016 -0800
Revert "kernfs: do not account ino_ida allocations to memcg"
Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
alloc_kmem_pages call) are accounted to memory cgroup automatically.
Callers have to explicitly opt out if they don't want/need accounting
for some reason. Such a design decision leads to several problems:
- kmalloc users are highly sensitive to failures, many of them
implicitly rely on the fact that kmalloc never fails, while memcg
makes failures quite plausible.
- A lot of objects are shared among different containers by design.
Accounting such objects to one of containers is just unfair.
Moreover, it might lead to pinning a dead memcg along with its kmem
caches, which aren't tiny, which might result in noticeable increase
in memory consumption for no apparent reason in the long run.
- There are tons of short-lived objects. Accounting them to memcg will
only result in slight noise and won't change the overall picture, but
we still have to pay accounting overhead.
For more info, see
- http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
- http://lkml.kernel.org/r/20151106090555.GK29259@esperanza
Therefore this patchset switches to the white list policy. Now kmalloc
users have to explicitly opt in by passing __GFP_ACCOUNT flag.
Currently, the list of accounted objects is quite limited and only
includes those allocations that (1) are known to be easily triggered
from userspace and (2) can fail gracefully (for the full list see patch
no. 6) and it still misses many object types. However, accounting only
those objects should be a satisfactory approximation of the behavior we
used to have for most sane workloads.
The arguments would be the same here. Additional allocation overhead,
memory allocated on behalf of a shared facility, long-lived objects
pinning random, unrelated cgroups indefinitely.
The page cache is a sufficiently big user whose size can be directly
attributed to workload behavior, and can be controlled / reclaimed
under memory pressure. That's why it's accounted.
The same isn't true for random drivers using xarray, ida etc. It
shouldn't be implicit in the xarray semantics.
On Thu, May 23, 2019 at 12:41:30PM -0700, Matthew Wilcox wrote:
> On Thu, May 23, 2019 at 03:21:17PM -0400, Johannes Weiner wrote:
> > On Thu, May 23, 2019 at 12:00:32PM -0700, Matthew Wilcox wrote:
> > > On Thu, May 23, 2019 at 11:49:41AM -0700, Shakeel Butt wrote:
> > > > On Thu, May 23, 2019 at 11:37 AM Matthew Wilcox <[email protected]> wrote:
> > > > > On Thu, May 23, 2019 at 01:43:49PM -0400, Johannes Weiner wrote:
> > > > > > I noticed that recent upstream kernels don't account the xarray nodes
> > > > > > of the page cache to the allocating cgroup, like we used to do for the
> > > > > > radix tree nodes.
> > > > > >
> > > > > > This results in broken isolation for cgrouped apps, allowing them to
> > > > > > escape their containment and harm other cgroups and the system with an
> > > > > > excessive build-up of nonresident information.
> > > > > >
> > > > > > It also breaks thrashing/refault detection because the page cache
> > > > > > lives in a different domain than the xarray nodes, and so the shadow
> > > > > > shrinker can reclaim nonresident information way too early when there
> > > > > > isn't much cache in the root cgroup.
> > > > > >
> > > > > > I'm not quite sure how to fix this, since the xarray code doesn't seem
> > > > > > to have per-tree gfp flags anymore like the radix tree did. We cannot
> > > > > > add SLAB_ACCOUNT to the radix_tree_node_cachep slab cache. And the
> > > > > > xarray api doesn't seem to really support gfp flags, either (xas_nomem
> > > > > > does, but the optimistic internal allocations have fixed gfp flags).
> > > > >
> > > > > Would it be a problem to always add __GFP_ACCOUNT to the fixed flags?
> > > > > I don't really understand cgroups.
> > >
> > > > Also some users of xarray may not want __GFP_ACCOUNT. That's the
> > > > reason we had __GFP_ACCOUNT for page cache instead of hard coding it
> > > > in radix tree.
> > >
> > > This is what I don't understand -- why would someone not want
> > > __GFP_ACCOUNT? For a shared resource? But the page cache is a shared
> > > resource. So what is a good example of a time when an allocation should
> > > _not_ be accounted to the cgroup?
> >
> > We used to cgroup-account every slab charge to cgroups per default,
> > until we changed it to a whitelist behavior:
> >
> > commit b2a209ffa605994cbe3c259c8584ba1576d3310c
> > Author: Vladimir Davydov <[email protected]>
> > Date: Thu Jan 14 15:18:05 2016 -0800
> >
> > Revert "kernfs: do not account ino_ida allocations to memcg"
> >
> > Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
> > alloc_kmem_pages call) are accounted to memory cgroup automatically.
> > Callers have to explicitly opt out if they don't want/need accounting
> > for some reason. Such a design decision leads to several problems:
> >
> > - kmalloc users are highly sensitive to failures, many of them
> > implicitly rely on the fact that kmalloc never fails, while memcg
> > makes failures quite plausible.
>
> Doesn't apply here. The allocation under spinlock is expected to fail,
> and then we'll use xas_nomem() with the caller's specified GFP flags
> which may or may not include __GFP_ACCOUNT.
>
> > - A lot of objects are shared among different containers by design.
> > Accounting such objects to one of containers is just unfair.
> > Moreover, it might lead to pinning a dead memcg along with its kmem
> > caches, which aren't tiny, which might result in noticeable increase
> > in memory consumption for no apparent reason in the long run.
>
> These objects are in the slab of radix_tree_nodes, and we'll already be
> accounting page cache nodes to the cgroup, so accounting random XArray
> nodes to the cgroups isn't going to make the problem worse.
There is no single radix_tree_nodes cache. When cgroup accounting is
requested, we clone per-cgroup instances of the slab cache each with
their own object slabs. The reclaimable page cache / shadow nodes do
not share slab pages with other radix tree users.
> > - There are tons of short-lived objects. Accounting them to memcg will
> > only result in slight noise and won't change the overall picture, but
> > we still have to pay accounting overhead.
>
> XArray nodes are generally not short-lived objects.
I'm not exactly sure what you're trying to argue.
My point is that we cannot have random drivers' internal data
structures charge to and pin cgroups indefinitely just because they
happen to do the modprobing or otherwise interact with the driver.
It makes no sense in terms of performance or cgroup semantics.
On Thu, May 23, 2019 at 03:59:33PM -0400, Johannes Weiner wrote:
> My point is that we cannot have random drivers' internal data
> structures charge to and pin cgroups indefinitely just because they
> happen to do the modprobing or otherwise interact with the driver.
>
> It makes no sense in terms of performance or cgroup semantics.
But according to Roman, you already have that problem with the page
cache.
https://lore.kernel.org/linux-mm/20190522222254.GA5700@castle/T/
So this argument doesn't make sense to me.
On Fri, May 24, 2019 at 09:11:46AM -0700, Matthew Wilcox wrote:
> On Thu, May 23, 2019 at 03:59:33PM -0400, Johannes Weiner wrote:
> > My point is that we cannot have random drivers' internal data
> > structures charge to and pin cgroups indefinitely just because they
> > happen to do the modprobing or otherwise interact with the driver.
> >
> > It makes no sense in terms of performance or cgroup semantics.
>
> But according to Roman, you already have that problem with the page
> cache.
> https://lore.kernel.org/linux-mm/20190522222254.GA5700@castle/T/
>
> So this argument doesn't make sense to me.
You haven't addressed the rest of the argument though: why every user
of the xarray, and data structures based on it, should incur the
performance cost of charging memory to a cgroup, even when we have no
interest in tracking those allocations on behalf of a cgroup.
Which brings me to repeating the semantics argument: it doesn't make
sense to charge e.g. driver memory, which is arguably a shared system
resource, to whoever cgroup happens to do the modprobe / ioctl etc.
Anyway, this seems like a fairly serious regression, and it would make
sense to find a self-contained, backportable fix instead of something
that has subtle implications for every user of the xarray / ida code.
On Fri, May 24, 2019 at 10:06 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 09:11:46AM -0700, Matthew Wilcox wrote:
> > On Thu, May 23, 2019 at 03:59:33PM -0400, Johannes Weiner wrote:
> > > My point is that we cannot have random drivers' internal data
> > > structures charge to and pin cgroups indefinitely just because they
> > > happen to do the modprobing or otherwise interact with the driver.
> > >
> > > It makes no sense in terms of performance or cgroup semantics.
> >
> > But according to Roman, you already have that problem with the page
> > cache.
> > https://lore.kernel.org/linux-mm/20190522222254.GA5700@castle/T/
> >
> > So this argument doesn't make sense to me.
>
> You haven't addressed the rest of the argument though: why every user
> of the xarray, and data structures based on it, should incur the
> performance cost of charging memory to a cgroup, even when we have no
> interest in tracking those allocations on behalf of a cgroup.
>
> Which brings me to repeating the semantics argument: it doesn't make
> sense to charge e.g. driver memory, which is arguably a shared system
> resource, to whoever cgroup happens to do the modprobe / ioctl etc.
>
> Anyway, this seems like a fairly serious regression, and it would make
> sense to find a self-contained, backportable fix instead of something
> that has subtle implications for every user of the xarray / ida code.
Adding to Johannes point, one concrete example of xarray we don't want
to charge is swapper_spaces. Swap is a system level resource. It does
not make any sense to charge the swap overhead to a job and also it
will have negative consequences like pinning zombies.
Shakeel