2022-02-28 07:51:08

by Vasily Averin

[permalink] [raw]
Subject: [PATCH RFC] net: memcg accounting for veth devices

The following one-liner, run inside a memcg-limited container, consumes
a huge amount of host memory and can trigger a global OOM.

for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done

The patch accounts for most of these allocations and can protect the host.
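In rough terms, GFP_KERNEL_ACCOUNT tells the allocator to charge the allocation to the calling task's memory cgroup, so the container's limit can reject it. A minimal userspace sketch of that idea (a toy model with made-up names, not kernel code):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of kmalloc accounting: a flag and a per-"container" charge
 * counter stand in for GFP_KERNEL_ACCOUNT and the task's memcg. */
#define GFP_KERNEL         0x0
#define GFP_KERNEL_ACCOUNT 0x1

struct toy_memcg {
	size_t charged;	/* bytes currently charged to this container */
	size_t limit;	/* container limit */
};

/* Allocate sz bytes; charge the container only if the ACCOUNT-style flag
 * is set, and fail the allocation when the charge would exceed the limit
 * (the containment behavior the patch relies on). */
static void *toy_kmalloc(struct toy_memcg *memcg, size_t sz, int flags)
{
	if (flags & GFP_KERNEL_ACCOUNT) {
		if (memcg->charged + sz > memcg->limit)
			return NULL;	/* memcg limit contains the allocation */
		memcg->charged += sz;
	}
	return malloc(sz);	/* unaccounted: the limit is never consulted */
}
```

With plain GFP_KERNEL the container limit is never consulted, which is the host-OOM scenario above; with GFP_KERNEL_ACCOUNT the limit contains the allocation.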
---[cut]---
It is not polished, and perhaps should be split.
Obviously it affects other kinds of netdevices too.
Unfortunately I'm not sure that I will have enough time to handle it properly,
so I decided to publish the current patch version as is.
OpenVz works around this by using a per-container limit on the number of
available netdevices, but upstream does not have any kind of
per-container configuration.
------

Signed-off-by: Vasily Averin <[email protected]>
---
drivers/net/veth.c | 2 +-
fs/kernfs/mount.c | 2 +-
fs/proc/proc_sysctl.c | 3 ++-
net/core/neighbour.c | 4 ++--
net/ipv4/devinet.c | 2 +-
net/ipv6/addrconf.c | 6 +++---
6 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 354a963075c5..6e0b4a9d0843 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1307,7 +1307,7 @@ static int veth_alloc_queues(struct net_device *dev)
struct veth_priv *priv = netdev_priv(dev);
int i;

- priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+ priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
if (!priv->rq)
return -ENOMEM;

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..2881aeeaa880 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,7 @@ void __init kernfs_init(void)
{
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
sizeof(struct kernfs_node),
- 0, SLAB_PANIC, NULL);
+ 0, SLAB_PANIC | SLAB_ACCOUNT, NULL);

/* Creates slab cache for kernfs inode attributes */
kernfs_iattrs_cache = kmem_cache_create("kernfs_iattrs_cache",
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..e20ce8198a44 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1333,7 +1333,8 @@ struct ctl_table_header *__register_sysctl_table(
nr_entries++;

header = kzalloc(sizeof(struct ctl_table_header) +
- sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
+ sizeof(struct ctl_node)*nr_entries,
+ GFP_KERNEL_ACCOUNT);
if (!header)
return NULL;

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ec0bf737b076..66a4445421f1 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1665,7 +1665,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
struct net *net = dev_net(dev);
const struct net_device_ops *ops = dev->netdev_ops;

- p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL);
+ p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL_ACCOUNT);
if (p) {
p->tbl = tbl;
refcount_set(&p->refcnt, 1);
@@ -3728,7 +3728,7 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
char *p_name;

- t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
+ t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
if (!t)
goto err;

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index fba2bffd65f7..47523fe5b891 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2566,7 +2566,7 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
struct devinet_sysctl_table *t;
char path[sizeof("net/ipv4/conf/") + IFNAMSIZ];

- t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL);
+ t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL_ACCOUNT);
if (!t)
goto out;

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f927c199a93c..9d903342bc41 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -358,7 +358,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
if (!idev->stats.icmpv6dev)
goto err_icmp;
idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
- GFP_KERNEL);
+ GFP_KERNEL_ACCOUNT);
if (!idev->stats.icmpv6msgdev)
goto err_icmpmsg;

@@ -382,7 +382,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
if (dev->mtu < IPV6_MIN_MTU)
return ERR_PTR(-EINVAL);

- ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
+ ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL_ACCOUNT);
if (!ndev)
return ERR_PTR(err);

@@ -7023,7 +7023,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
struct ctl_table *table;
char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];

- table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
+ table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL_ACCOUNT);
if (!table)
goto out;

--
2.25.1


2022-02-28 16:24:14

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> Following one-liner running inside memcg-limited container consumes
> huge number of host memory and can trigger global OOM.
>
> for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
>
> Patch accounts most part of these allocations and can protect host.
> ---[cut]---
> It is not polished, and perhaps should be splitted.
> obviously it affects other kind of netdevices too.
> Unfortunately I'm not sure that I will have enough time to handle it properly
> and decided to publish current patch version as is.
> OpenVz workaround it by using per-container limit for number of
> available netdevices, but upstream does not have any kind of
> per-container configuration.
> ------

Should this just be a new ucount limit on kernel/ucount.c and have veth
use something like inc_ucount(current_user_ns(), current_euid(), UCOUNT_VETH)?

This might be abusing ucounts though, not sure, Eric?
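For illustration, a userspace sketch of what such a UCOUNT_VETH-style cap might look like, loosely modeled on kernel/ucount.c (the UCOUNT_VETH type and the flat, single-namespace counter here are assumptions, not existing kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of a ucount-style per-user counter with a cap.
 * UCOUNT_VETH is a hypothetical new counter type, for illustration. */
enum { UCOUNT_VETH, UCOUNT_TYPES };

struct toy_ucounts {
	long count[UCOUNT_TYPES];
	long max[UCOUNT_TYPES];
};

/* Increment count[type] unless doing so would exceed max[type];
 * device creation fails when the limit is hit. */
static bool toy_inc_ucount(struct toy_ucounts *uc, int type)
{
	if (uc->count[type] + 1 > uc->max[type])
		return false;	/* over the veth limit: refuse creation */
	uc->count[type]++;
	return true;
}

/* Drop the count again on device destruction. */
static void toy_dec_ucount(struct toy_ucounts *uc, int type)
{
	uc->count[type]--;
}
```

The real ucounts walk the user-namespace hierarchy and use atomics; this flat counter only shows the limit-check shape.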

Luis
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---
> drivers/net/veth.c | 2 +-
> fs/kernfs/mount.c | 2 +-
> fs/proc/proc_sysctl.c | 3 ++-
> net/core/neighbour.c | 4 ++--
> net/ipv4/devinet.c | 2 +-
> net/ipv6/addrconf.c | 6 +++---
> 6 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 354a963075c5..6e0b4a9d0843 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1307,7 +1307,7 @@ static int veth_alloc_queues(struct net_device *dev)
> struct veth_priv *priv = netdev_priv(dev);
> int i;
> - priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
> + priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
> if (!priv->rq)
> return -ENOMEM;
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index cfa79715fc1a..2881aeeaa880 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -391,7 +391,7 @@ void __init kernfs_init(void)
> {
> kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
> sizeof(struct kernfs_node),
> - 0, SLAB_PANIC, NULL);
> + 0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
> /* Creates slab cache for kernfs inode attributes */
> kernfs_iattrs_cache = kmem_cache_create("kernfs_iattrs_cache",
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 7d9cfc730bd4..e20ce8198a44 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -1333,7 +1333,8 @@ struct ctl_table_header *__register_sysctl_table(
> nr_entries++;
> header = kzalloc(sizeof(struct ctl_table_header) +
> - sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
> + sizeof(struct ctl_node)*nr_entries,
> + GFP_KERNEL_ACCOUNT);
> if (!header)
> return NULL;
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index ec0bf737b076..66a4445421f1 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -1665,7 +1665,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
> struct net *net = dev_net(dev);
> const struct net_device_ops *ops = dev->netdev_ops;
> - p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL);
> + p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL_ACCOUNT);
> if (p) {
> p->tbl = tbl;
> refcount_set(&p->refcnt, 1);
> @@ -3728,7 +3728,7 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
> char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
> char *p_name;
> - t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
> + t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
> if (!t)
> goto err;
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index fba2bffd65f7..47523fe5b891 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -2566,7 +2566,7 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
> struct devinet_sysctl_table *t;
> char path[sizeof("net/ipv4/conf/") + IFNAMSIZ];
> - t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL);
> + t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL_ACCOUNT);
> if (!t)
> goto out;
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index f927c199a93c..9d903342bc41 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -358,7 +358,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
> if (!idev->stats.icmpv6dev)
> goto err_icmp;
> idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
> - GFP_KERNEL);
> + GFP_KERNEL_ACCOUNT);
> if (!idev->stats.icmpv6msgdev)
> goto err_icmpmsg;
> @@ -382,7 +382,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
> if (dev->mtu < IPV6_MIN_MTU)
> return ERR_PTR(-EINVAL);
> - ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
> + ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL_ACCOUNT);
> if (!ndev)
> return ERR_PTR(err);
> @@ -7023,7 +7023,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
> struct ctl_table *table;
> char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];
> - table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
> + table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL_ACCOUNT);
> if (!table)
> goto out;
> --
> 2.25.1
>

2022-03-01 19:09:59

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> > Following one-liner running inside memcg-limited container consumes
> > huge number of host memory and can trigger global OOM.
> >
> > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> >
> > Patch accounts most part of these allocations and can protect host.
> > ---[cut]---
> > It is not polished, and perhaps should be splitted.
> > obviously it affects other kind of netdevices too.
> > Unfortunately I'm not sure that I will have enough time to handle it
> properly
> > and decided to publish current patch version as is.
> > OpenVz workaround it by using per-container limit for number of
> > available netdevices, but upstream does not have any kind of
> > per-container configuration.
> > ------

> Should this just be a new ucount limit on kernel/ucount.c and have veth
> use something like inc_ucount(current_user_ns(), current_euid(),
> UCOUNT_VETH)?

> This might be abusing ucounts though, not sure, Eric?


For admins of systems running multiple workloads, there is no easy way
to set such limits for each workload. Some may genuinely need more veths
than others. From an admin's perspective it is preferred to have minimal
knobs to set, and if these objects are charged to memcg then the memcg
limits would limit them. There was a similar situation for inotify
instances, where the fs sysctl inotify/max_user_instances already limits
inotify instances, but we charged them to memcg so as not to worry about
setting such limits. See ac7b79fd190b ("inotify, memcg: account inotify
instances to kmemcg").

2022-03-01 19:47:14

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> > > Following one-liner running inside memcg-limited container consumes
> > > huge number of host memory and can trigger global OOM.
> > >
> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> > >
> > > Patch accounts most part of these allocations and can protect host.
> > > ---[cut]---
> > > It is not polished, and perhaps should be splitted.
> > > obviously it affects other kind of netdevices too.
> > > Unfortunately I'm not sure that I will have enough time to handle it
> > properly
> > > and decided to publish current patch version as is.
> > > OpenVz workaround it by using per-container limit for number of
> > > available netdevices, but upstream does not have any kind of
> > > per-container configuration.
> > > ------
>
> > Should this just be a new ucount limit on kernel/ucount.c and have veth
> > use something like inc_ucount(current_user_ns(), current_euid(),
> > UCOUNT_VETH)?
>
> > This might be abusing ucounts though, not sure, Eric?
>
>
> For admins of systems running multiple workloads, there is no easy way
> to set such limits for each workload.

That's why defaults would exist. Today's ulimits IMHO are insane and
some are arbitrarily large.

> Some may genuinely need more veth
> than others.

So why not make the default sensibly high, but not high enough to OOM a typical system?

But again, I'd like to hear whether a ulimit for veth is a misuse
of ulimits or whether it's the right place. If it's not, then perhaps the
driver can just have its own atomic max definition.
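A driver-local cap could be as simple as one atomic counter in veth. A userspace sketch using C11 atomics (the VETH_MAX_DEVICES default is invented here for illustration; it is not from the patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical driver-local limit on the number of veth devices. */
#define VETH_MAX_DEVICES 4096	/* assumed default, not from the patch */

static _Atomic long veth_count = 0;

/* Claim a device slot; refuse once the cap is reached.  The CAS loop
 * avoids a race where two creators both pass the limit check. */
static bool veth_try_get_slot(void)
{
	long old = atomic_load(&veth_count);

	do {
		if (old >= VETH_MAX_DEVICES)
			return false;	/* refuse to create another device */
	} while (!atomic_compare_exchange_weak(&veth_count, &old, old + 1));
	return true;
}

/* Release the slot when the device is destroyed. */
static void veth_put_slot(void)
{
	atomic_fetch_sub(&veth_count, 1);
}
```

Unlike memcg accounting, a cap like this is global rather than per container, which is part of why it may be the wrong tool here.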

> From admin's perspective it is preferred to have minimal
> knobs to set and if these objects are charged to memcg then the memcg
> limits would limit them. There was similar situation for inotify
> instances where fs sysctl inotify/max_user_instances already limits the
> inotify instances but we memcg charged them to not worry about setting
> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
> instances to kmemcg").

Yes, but we want sensible defaults out of the box. Figuring out what those
should be IMHO is work which needs to be done carefully.

IMHO today's ulimits are a bit over the top. This is slightly off
topic, but for instance play with:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make -j 8
echo 0 > /proc/sys/vm/oom_dump_tasks
i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000; sleep 1; let i=$i+1; done

If you see:

[ 217.798124] cgroup: fork rejected by pids controller in
/user.slice/user-1000.slice/session-1.scope

Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:

[Slice]
TasksMax=MAX_TASKS|infinity

Even though we have max_threads set to 61343, the ulimits have a
different limit set, and what this means is that the above can easily end
up creating over 1048576 (17 times max_threads) threads, all eagerly doing
nothing but exit, essentially allowing a sort of fork bomb on exit.
Your system may or may not fall to its knees.

Luis

2022-03-02 01:50:28

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
> Luis Chamberlain <[email protected]> writes:
>
> > On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
> >> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> >> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> >> > > Following one-liner running inside memcg-limited container consumes
> >> > > huge number of host memory and can trigger global OOM.
> >> > >
> >> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> >> > >
> >> > > Patch accounts most part of these allocations and can protect host.
> >> > > ---[cut]---
> >> > > It is not polished, and perhaps should be splitted.
> >> > > obviously it affects other kind of netdevices too.
> >> > > Unfortunately I'm not sure that I will have enough time to handle it
> >> > properly
> >> > > and decided to publish current patch version as is.
> >> > > OpenVz workaround it by using per-container limit for number of
> >> > > available netdevices, but upstream does not have any kind of
> >> > > per-container configuration.
> >> > > ------
> >>
> >> > Should this just be a new ucount limit on kernel/ucount.c and have veth
> >> > use something like inc_ucount(current_user_ns(), current_euid(),
> >> > UCOUNT_VETH)?
> >>
> >> > This might be abusing ucounts though, not sure, Eric?
> >>
> >>
> >> For admins of systems running multiple workloads, there is no easy way
> >> to set such limits for each workload.
> >
> > That's why defaults would exist. Today's ulimits IMHO are insane and
> > some are arbitrarily large.
>
> My perspective is that we have two basic kinds of limits.
>
> Limits to catch programs that go out of control hopefully before they
> bring down the entire system. This is the purpose I see of rlimits and
> ucounts. Such limits should be set by default so large that no one has
> to care unless their program is broken.
>
> Limits to contain programs and keep them from having a negative impact
> on other programs. Generally this is the role I see the cgroups
> playing. This limits must be much more tightly managed.
>
> The problem with veth that was reported was that the memory cgroup
> limits fails to contain veth's allocations and veth manages to affect
> process outside the memory cgroup where the veth ``lives''. The effect
> is an OOM but the problem is that it is affecting processes out of the
> memory control group.

Given that no upper bound was used in the commit log, it seems to have
presented a use case where both types of limits might need to be
considered, though.

> Part of the reason for the recent ucount work is so that ordinary users
> can create user namespaces and root in that user namespace won't be able
> to exceed the limits that were set when the user namespace was created
> by creating additional users.

Got it.

> Part of the reason for my ucount work is my frustration that cgroups
> would up something completely different than what was originally
> proposed and solve a rather problem set. Originally the proposal was
> that cgroups would be the user interface for the bean-counter patches.
> (Roughly counts like the ucounts are now). Except for maybe the pid
> controller you mention below cgroups look nothing like that today.
> So I went and I solved the original problem because it was still not
> solved.

I see...

> The network stack should already have packet limits to prevent a global
> OOM so I am a bit curious why those limits aren't preventing a global
> OOM in for the veth device.

No packets are used in the demo in the commit log; it is just creating
tons of veths that OOMs.

> I am not saying that the patch is correct (although from 10,000 feet the
> patch sounds like it is solving the reported problem).

From your description, it sounds like it is indeed the right approach
to correct these memory allocations so that cgroup memory limits are respected.

Outside of that, it still begs the question of whether ucounts can/should
be used to put a cap on something like root in a namespace creating tons
of veths.

> I am answering
> the question of how I understand limits to work.

It does!

> Luis does this explanation of how limits work help?

Yup thanks!

> >> From admin's perspective it is preferred to have minimal
> >> knobs to set and if these objects are charged to memcg then the memcg
> >> limits would limit them. There was similar situation for inotify
> >> instances where fs sysctl inotify/max_user_instances already limits the
> >> inotify instances but we memcg charged them to not worry about setting
> >> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
> >> instances to kmemcg").
> >
> > Yes but we want sensible defaults out of the box. What those should be
> > IMHO might be work which needs to be figured out well.
> >
> > IMHO today's ulimits are a bit over the top today. This is off slightly
> > off topic but for instance play with:
> >
> > git clone https://github.com/ColinIanKing/stress-ng
> > cd stress-ng
> > make -j 8
> > echo 0 > /proc/sys/vm/oom_dump_tasks
> > i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000; sleep 1; let i=$i+1; done
> >
> > If you see:
> >
> > [ 217.798124] cgroup: fork rejected by pids controller in
> > /user.slice/user-1000.slice/session-1.scope
> >
> > Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
> >
> > [Slice]
> > TasksMax=MAX_TASKS|infinity
> >
> > Even though we have max_threads set to 61343, things ulimits have a
> > different limit set, and what this means is the above can end up easily
> > creating over 1048576 (17 times max_threads) threads all eagerly doing
> > nothing to just exit, essentially allowing a sort of fork bomb on exit.
> > Your system may or not fall to its knees.
>
> What max_threads are you talking about here?

Sorry for not being clear: in the kernel this is exposed as max_threads,
which is initialized in kernel/fork.c:

[email protected] ~ # cat /proc/sys/kernel/threads-max
62157

> The global max_threads
> exposed in /proc/sys/kernel/threads-max? I don't see how you can get
> around that.

Yeah I was perplexed and I don't think it's just me.

> Especially since the count is not decremented until the
> process is reaped.
>
> Or is this the pids controller having a low limit and
> /proc/sys/kernel/threads-max having a higher limit?

Not sure; I used default Debian testing with the above change to
/usr/lib/systemd/system/user-.slice.d/10-defaults.conf setting
TasksMax=MAX_TASKS|infinity.

> I really have not looked at this pids controller.
>
> So I am not certain I understand your example here but I hope I have
> answered your question.

During experimentation with the above stress-ng test case, I saw tons
of threads just waiting to exit:

diff --git a/kernel/exit.c b/kernel/exit.c
index 80c4a67d2770..653ca7ebfb58 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -730,11 +730,24 @@ static void check_stack_usage(void)
static inline void check_stack_usage(void) {}
#endif

+/* Approx more than twice max_threads */
+#define MAX_EXIT_CONCURRENT (1<<17)
+static atomic_t exit_concurrent_max = ATOMIC_INIT(MAX_EXIT_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(exit_wq);
+
void __noreturn do_exit(long code)
{
struct task_struct *tsk = current;
int group_dead;

+ if (atomic_dec_if_positive(&exit_concurrent_max) < 0) {
+ pr_warn_ratelimited("exit: exit_concurrent_max (%u) close to 0 (max : %u), throttling...",
+ atomic_read(&exit_concurrent_max),
+ MAX_EXIT_CONCURRENT);
+ wait_event(exit_wq,
+ atomic_dec_if_positive(&exit_concurrent_max) >= 0);
+ }
+
/*
* We can get here from a kernel oops, sometimes with preemption off.
* Start by checking for critical errors.
@@ -881,6 +894,9 @@ void __noreturn do_exit(long code)

lockdep_free_task(tsk);
do_task_dead();
+
+ atomic_inc(&exit_concurrent_max);
+ wake_up(&exit_wq);
}
EXPORT_SYMBOL_GPL(do_exit);

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 4f5613dac227..980ffaba1ac5 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
+ if (atomic_long_read(&iter->ucount[type]) > max/16)
+ cond_resched();
if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
--
2.33.0

In my experimentation I saw that we can easily have the above trigger 131072
concurrent exits waiting, which is twice /proc/sys/kernel/threads-max,
and at this point no OOM happens. But in reality I also saw us hitting
even 1048576; anything above 131072 starts causing tons of issues
(depending on the kernel), like OOMs or it being hard to bail out of the
above shell loop.
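The throttle above hinges on atomic_dec_if_positive(), which decrements the counter only when the result stays non-negative and returns the would-be result either way (so a negative return means "no slot"). A userspace C11 model of those semantics (a sketch, not the kernel implementation):

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace model of the kernel's atomic_dec_if_positive(): decrement
 * the counter unless the result would go negative; return the value the
 * counter would have after the decrement.  A negative return therefore
 * signals "counter was already 0, nothing was decremented". */
static long dec_if_positive(_Atomic long *v)
{
	long c = atomic_load(v);
	long dec;

	do {
		dec = c - 1;
		if (dec < 0)
			return dec;	/* would go negative: leave counter alone */
	} while (!atomic_compare_exchange_weak(v, &c, dec));
	return dec;
}
```

In the do_exit() hack above, a negative return is exactly the throttle condition: the task then waits on exit_wq until an exiting task frees a slot.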

Luis

2022-03-02 05:36:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

Luis Chamberlain <[email protected]> writes:

> On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
>> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
>> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
>> > > Following one-liner running inside memcg-limited container consumes
>> > > huge number of host memory and can trigger global OOM.
>> > >
>> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
>> > >
>> > > Patch accounts most part of these allocations and can protect host.
>> > > ---[cut]---
>> > > It is not polished, and perhaps should be splitted.
>> > > obviously it affects other kind of netdevices too.
>> > > Unfortunately I'm not sure that I will have enough time to handle it
>> > properly
>> > > and decided to publish current patch version as is.
>> > > OpenVz workaround it by using per-container limit for number of
>> > > available netdevices, but upstream does not have any kind of
>> > > per-container configuration.
>> > > ------
>>
>> > Should this just be a new ucount limit on kernel/ucount.c and have veth
>> > use something like inc_ucount(current_user_ns(), current_euid(),
>> > UCOUNT_VETH)?
>>
>> > This might be abusing ucounts though, not sure, Eric?
>>
>>
>> For admins of systems running multiple workloads, there is no easy way
>> to set such limits for each workload.
>
> That's why defaults would exist. Today's ulimits IMHO are insane and
> some are arbitrarily large.

My perspective is that we have two basic kinds of limits.

Limits to catch programs that go out of control hopefully before they
bring down the entire system. This is the purpose I see of rlimits and
ucounts. Such limits should be set by default so large that no one has
to care unless their program is broken.

Limits to contain programs and keep them from having a negative impact
on other programs. Generally this is the role I see the cgroups
playing. These limits must be much more tightly managed.

The problem with veth that was reported was that the memory cgroup
limits fail to contain veth's allocations, and veth manages to affect
processes outside the memory cgroup where the veth ``lives''. The effect
is an OOM, but the problem is that it is affecting processes outside of
the memory control group.

Part of the reason for the recent ucount work is so that ordinary users
can create user namespaces and root in that user namespace won't be able
to exceed the limits that were set when the user namespace was created
by creating additional users.

Part of the reason for my ucount work is my frustration that cgroups
wound up as something completely different from what was originally
proposed, and solved a rather different problem set. Originally the
proposal was that cgroups would be the user interface for the
bean-counter patches (roughly counts like the ucounts are now). Except
for maybe the pid controller you mention below, cgroups look nothing
like that today. So I went and solved the original problem because it
was still not solved.

The network stack should already have packet limits to prevent a global
OOM, so I am a bit curious why those limits aren't preventing a global
OOM for the veth device.


I am not saying that the patch is correct (although from 10,000 feet the
patch sounds like it is solving the reported problem). I am answering
the question of how I understand limits to work.

Luis does this explanation of how limits work help?


>> From admin's perspective it is preferred to have minimal
>> knobs to set and if these objects are charged to memcg then the memcg
>> limits would limit them. There was similar situation for inotify
>> instances where fs sysctl inotify/max_user_instances already limits the
>> inotify instances but we memcg charged them to not worry about setting
>> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
>> instances to kmemcg").
>
> Yes but we want sensible defaults out of the box. What those should be
> IMHO might be work which needs to be figured out well.
>
> IMHO today's ulimits are a bit over the top today. This is off slightly
> off topic but for instance play with:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make -j 8
> echo 0 > /proc/sys/vm/oom_dump_tasks
> i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000; sleep 1; let i=$i+1; done
>
> If you see:
>
> [ 217.798124] cgroup: fork rejected by pids controller in
> /user.slice/user-1000.slice/session-1.scope
>
> Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
>
> [Slice]
> TasksMax=MAX_TASKS|infinity
>
> Even though we have max_threads set to 61343, things ulimits have a
> different limit set, and what this means is the above can end up easily
> creating over 1048576 (17 times max_threads) threads all eagerly doing
> nothing to just exit, essentially allowing a sort of fork bomb on exit.
> Your system may or not fall to its knees.

What max_threads are you talking about here? The global max_threads
exposed in /proc/sys/kernel/threads-max? I don't see how you can get
around that. Especially since the count is not decremented until the
process is reaped.

Or is this the pids controller having a low limit and
/proc/sys/kernel/threads-max having a higher limit?

I really have not looked at this pids controller.

So I am not certain I understand your example here but I hope I have
answered your question.

Eric

2022-03-02 12:47:18

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

On Tue, Mar 01, 2022 at 01:25:16PM -0800, Luis Chamberlain wrote:
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 4f5613dac227..980ffaba1ac5 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
> long max;
> tns = iter->ns;
> max = READ_ONCE(tns->ucount_max[type]);
> + if (atomic_long_read(&iter->ucount[type]) > max/16)
> + cond_resched();
> if (!atomic_long_inc_below(&iter->ucount[type], max))
> goto fail;

You can of course ignore this, it was just a hack to try to avoid
a soft lockup on the workqueues.

Luis

2022-03-02 13:51:46

by King, Colin

[permalink] [raw]
Subject: RE: [PATCH RFC] net: memcg accounting for veth devices


Just to note that stress-ng does attempt to set the maximum for all ulimits before it invokes a stressor, to try to stress the system as much as possible. It also changes the per-stressor OOM adjust setting to make stressors less OOMable; one can disable this with --no-oom-adjust, and one can force stressors to be restarted on an OOM with the --oomable option.


-----Original Message-----
From: Eric W. Biederman <[email protected]>
Sent: 01 March 2022 20:50
To: Luis Chamberlain <[email protected]>
Cc: Shakeel Butt <[email protected]>; Colin Ian King <[email protected]>; NeilBrown <[email protected]>; Vasily Averin <[email protected]>; Vlastimil Babka <[email protected]>; Hocko, Michal <[email protected]>; Roman Gushchin <[email protected]>; Linux MM <[email protected]>; [email protected]; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Tejun Heo <[email protected]>; Greg Kroah-Hartman <[email protected]>; Eric Dumazet <[email protected]>; Kees Cook <[email protected]>; Hideaki YOSHIFUJI <[email protected]>; David Ahern <[email protected]>; [email protected]; [email protected]
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

Luis Chamberlain <[email protected]> writes:

> On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
>> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
>> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
>> > > Following one-liner running inside memcg-limited container
>> > > consumes huge number of host memory and can trigger global OOM.
>> > >
>> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ;
>> > > done
>> > >
>> > > Patch accounts most part of these allocations and can protect host.
>> > > ---[cut]---
>> > > It is not polished, and perhaps should be splitted.
>> > > obviously it affects other kind of netdevices too.
>> > > Unfortunately I'm not sure that I will have enough time to handle
>> > > it
>> > properly
>> > > and decided to publish current patch version as is.
>> > > OpenVz workaround it by using per-container limit for number of
>> > > available netdevices, but upstream does not have any kind of
>> > > per-container configuration.
>> > > ------
>>
>> > Should this just be a new ucount limit on kernel/ucount.c and have
>> > veth use something like inc_ucount(current_user_ns(),
>> > current_euid(), UCOUNT_VETH)?
>>
>> > This might be abusing ucounts though, not sure, Eric?
>>
>>
>> For admins of systems running multiple workloads, there is no easy
>> way to set such limits for each workload.
>
> That's why defaults would exist. Today's ulimits IMHO are insane and
> some are arbitrarily large.

My perspective is that we have two basic kinds of limits.

Limits to catch programs that go out of control, hopefully before they bring down the entire system. This is the purpose I see for rlimits and ucounts. Such limits should be set so large by default that no one has to care unless their program is broken.

Limits to contain programs and keep them from having a negative impact on other programs. Generally this is the role I see the cgroups playing. These limits must be much more tightly managed.

The problem with veth that was reported was that the memory cgroup limits fail to contain veth's allocations, and veth manages to affect processes outside the memory cgroup where the veth "lives". The effect is an OOM, but the problem is that it is affecting processes outside the memory control group.

Part of the reason for the recent ucount work is so that ordinary users can create user namespaces, and root in that user namespace won't be able to exceed, by creating additional users, the limits that were set when the user namespace was created.

Part of the reason for my ucount work is my frustration that cgroups wound up as something completely different from what was originally proposed and solved a rather different problem set. Originally the proposal was that cgroups would be the user interface for the bean-counter patches.
(Roughly counts like the ucounts are now). Except for maybe the pid controller you mention below, cgroups look nothing like that today.
So I went and solved the original problem because it was still not solved.

The network stack should already have packet limits to prevent a global OOM, so I am a bit curious why those limits aren't preventing a global OOM for the veth device.


I am not saying that the patch is correct (although from 10,000 feet the patch sounds like it is solving the reported problem). I am answering the question of how I understand limits to work.

Luis does this explanation of how limits work help?


>> From admin's perspective it is preferred to have minimal knobs to set
>> and if these objects are charged to memcg then the memcg limits would
>> limit them. There was similar situation for inotify instances where
>> fs sysctl inotify/max_user_instances already limits the inotify
>> instances but we memcg charged them to not worry about setting such
>> limits. See ac7b79fd190b ("inotify, memcg: account inotify instances
>> to kmemcg").
>
> Yes, but we want sensible defaults out of the box. What those should be
> is, IMHO, work that still needs to be figured out.
>
> IMHO today's ulimits are a bit over the top. This is slightly off
> topic but for instance play with:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make -j 8
> echo 0 > /proc/sys/vm/oom_dump_tasks
> i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192
> --unshare-ops 10000; sleep 1; let i=$i+1; done
>
> If you see:
>
> [ 217.798124] cgroup: fork rejected by pids controller in
> /user.slice/user-1000.slice/session-1.scope
>
> Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
>
> [Slice]
> TasksMax=MAX_TASKS|infinity
>
> Even though we have max_threads set to 61343, the ulimits have a
> different limit set, and what this means is that the above can easily
> end up creating over 1048576 (17 times max_threads) threads, all
> eagerly doing nothing but exit, essentially allowing a sort of fork bomb on exit.
> Your system may or may not fall to its knees.



What max_threads are you talking about here? The global max_threads exposed in /proc/sys/kernel/threads-max? I don't see how you can get around that. Especially since the count is not decremented until the process is reaped.

Or is this the pids controller having a low limit and /proc/sys/kernel/threads-max having a higher limit?

I really have not looked at this pids controller.

So I am not certain I understand your example here but I hope I have answered your question.

Eric

2022-03-02 19:10:21

by Eric W. Biederman

Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

Luis Chamberlain <[email protected]> writes:

> On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
>> I really have not looked at this pids controller.
>>
>> So I am not certain I understand your example here but I hope I have
>> answered your question.
>
> During experimentation with the above stress-ng test case, I saw tons
> of thread just waiting to do exit:

You increment the count of concurrent threads after a no-return function
in do_exit. Since the increment is never reached, the count only ever goes
down, and eventually the warning prints.
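Eric's point can be demonstrated with a small userspace mock (the `simulated_`/`_mock` names are hypothetical, not kernel APIs): when the increment sits after a call that never returns, the counter can only drain.

```c
#include <stdatomic.h>

#define MAX_EXIT_CONCURRENT_MOCK 4

static _Atomic long exit_concurrent_mock = MAX_EXIT_CONCURRENT_MOCK;

/* Mock of the patched do_exit() flow: the decrement at entry always
 * runs, but in the patch the increment sits after do_task_dead(),
 * which never returns.  With increment_reachable == 0 (the real
 * situation) the counter only ever drains.  Returns the counter
 * value after the call. */
static long simulated_do_exit(int increment_reachable)
{
	atomic_fetch_sub(&exit_concurrent_mock, 1);
	if (increment_reachable)		/* dead code in the patch */
		atomic_fetch_add(&exit_concurrent_mock, 1);
	return atomic_load(&exit_concurrent_mock);
}
```

After MAX_EXIT_CONCURRENT_MOCK unbalanced exits the counter reaches zero and every further exit would hit the throttling path, matching the ratelimited warning described above.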

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 80c4a67d2770..653ca7ebfb58 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -730,11 +730,24 @@ static void check_stack_usage(void)
> static inline void check_stack_usage(void) {}
> #endif
>
> +/* Approx more than twice max_threads */
> +#define MAX_EXIT_CONCURRENT (1<<17)
> +static atomic_t exit_concurrent_max = ATOMIC_INIT(MAX_EXIT_CONCURRENT);
> +static DECLARE_WAIT_QUEUE_HEAD(exit_wq);
> +
> void __noreturn do_exit(long code)
> {
> struct task_struct *tsk = current;
> int group_dead;
>
> + if (atomic_dec_if_positive(&exit_concurrent_max) < 0) {
> + pr_warn_ratelimited("exit: exit_concurrent_max (%u) close to 0 (max : %u), throttling...",
> + atomic_read(&exit_concurrent_max),
> + MAX_EXIT_CONCURRENT);
> + wait_event(exit_wq,
> + atomic_dec_if_positive(&exit_concurrent_max) >= 0);
> + }
> +
> /*
> * We can get here from a kernel oops, sometimes with preemption off.
> * Start by checking for critical errors.
> @@ -881,6 +894,9 @@ void __noreturn do_exit(long code)
>
> lockdep_free_task(tsk);
> do_task_dead();

The function do_task_dead never returns.

> +
> + atomic_inc(&exit_concurrent_max);
> + wake_up(&exit_wq);
> }
> EXPORT_SYMBOL_GPL(do_exit);
>
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 4f5613dac227..980ffaba1ac5 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
> long max;
> tns = iter->ns;
> max = READ_ONCE(tns->ucount_max[type]);
> + if (atomic_long_read(&iter->ucount[type]) > max/16)
> + cond_resched();
> if (!atomic_long_inc_below(&iter->ucount[type], max))
> goto fail;
> }
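For reference, the atomic_long_inc_below() helper used in the hunk above behaves roughly like this userspace sketch (the `_mock` suffix marks it as an approximation, not the kernel implementation): the increment succeeds only while the current value is below the cap.

```c
#include <stdatomic.h>

/* Userspace approximation of kernel atomic_long_inc_below():
 * increment @v only if its current value is below @u; returns 1 on
 * success and 0 once the cap is reached.  Uses a CAS loop, as the
 * kernel helper does. */
static int atomic_long_inc_below_mock(_Atomic long *v, long u)
{
	long c = atomic_load(v);

	while (c < u) {
		if (atomic_compare_exchange_weak(v, &c, c + 1))
			return 1;
		/* c was reloaded by the failed CAS; recheck against the cap */
	}
	return 0;
}
```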

Eric

2022-03-03 00:26:45

by Luis Chamberlain

Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

On Wed, Mar 02, 2022 at 08:43:54AM -0600, Eric W. Biederman wrote:
> Luis Chamberlain <[email protected]> writes:
>
> > On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
> >> I really have not looked at this pids controller.
> >>
> >> So I am not certain I understand your example here but I hope I have
> >> answered your question.
> >
> > During experimentation with the above stress-ng test case, I saw tons
> > of thread just waiting to do exit:
>
> You increment the count of concurrent threads after a no return function
> in do_exit. Since the increment is never reached the count always goes
> down and eventually the warning prints.
>
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 80c4a67d2770..653ca7ebfb58 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -881,6 +894,9 @@ void __noreturn do_exit(long code)
> >
> > lockdep_free_task(tsk);
> > do_task_dead();
>
> The function do_task_dead never returns.
>
> > +
> > + atomic_inc(&exit_concurrent_max);
> > + wake_up(&exit_wq);
> > }
> > EXPORT_SYMBOL_GPL(do_exit);

Doh thanks!

Luis

2022-04-12 06:44:15

by Vasily Averin

Subject: problem with accounting of allocations called from __net_init hooks

On 3/1/22 21:09, Shakeel Butt wrote:
> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
>> On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
>> > Following one-liner running inside memcg-limited container consumes
>> > huge number of host memory and can trigger global OOM.
>> >
>> > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
>> >
>> > Patch accounts most part of these allocations and can protect host.
>> > ---[cut]---
>> > It is not polished, and perhaps should be splitted.
>> > obviously it affects other kind of netdevices too.
>> > Unfortunately I'm not sure that I will have enough time to handle it properly
>> > and decided to publish current patch version as is.
>> > OpenVz workaround it by using per-container limit for number of
>> > available netdevices, but upstream does not have any kind of
>> > per-container configuration.
>> > ------

I've noticed that __register_pernet_operations() executes the init hook of a
registered pernet_operations structure in all existing net namespaces.

Usually these hooks are called by a process related to the specified net namespace,
and all marked allocations are accounted to the related container:
i.e. objects related to the netns in container A are accounted to the memcg of container A,
objects allocated inside container B are accounted to the corresponding memcg B,
and so on.

However, __register_pernet_operations() calls the same hooks in a single context,
and as a result all marked allocations are accounted to one memcg.
It is quite a rare scenario; however, the current processing looks incorrect to me.

I expect we can take the memcg from 'struct net', because this structure is itself accounted.
Then we can use set_active_memcg() before the init hook execution.
However, I'm not sure it is fully correct.
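The set_active_memcg() pattern being proposed here is just a scoped save/restore of a per-task pointer; a minimal userspace mock (hypothetical `_mock` names, not the kernel types) looks like this:

```c
#include <stddef.h>

/* Minimal stand-in; the real struct mem_cgroup lives in the kernel. */
struct mem_cgroup_mock { int id; };

/* In the kernel the active memcg hangs off current (or a per-CPU slot
 * in interrupt context); a single pointer is enough for this sketch. */
static struct mem_cgroup_mock *active_memcg_mock;

/* Mock of set_active_memcg(): install @memcg and return the previous
 * value so the caller can restore it afterwards -- the same
 * save/restore bracket proposed around the init hook call. */
static struct mem_cgroup_mock *
set_active_memcg_mock(struct mem_cgroup_mock *memcg)
{
	struct mem_cgroup_mock *old = active_memcg_mock;

	active_memcg_mock = memcg;
	return old;
}
```

The caller wraps the hook in `old = set_active_memcg_mock(memcg); ... set_active_memcg_mock(old);`, so allocations made inside the bracket are charged to the installed memcg and the previous state is restored on exit.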

Could you please advise some better solution?

Thank you,
Vasily Averin

2022-04-18 08:20:03

by Vasily Averin

Subject: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes the init hook of a registered
pernet_operations structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNT-marked
allocations are accounted to the corresponding container/memcg.

However, __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts the active memcg for each net namespace so that
memory allocated inside ops_init() is accounted to the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
---
Dear Vlastimil, Roman,
I'm not sure that memcg is used correctly here,
is it perhaps some additional locking required?
---
net/core/net_namespace.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..171c6e0b2337 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg = NULL;
+#ifdef CONFIG_MEMCG
+ memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
+#endif
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.31.1

2022-04-22 20:40:00

by Shakeel Butt

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
>
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNTING marked
> allocation are accounted for corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---
> Dear Vlastimil, Roman,
> I'm not sure that memcg is used correctly here,
> is it perhaps some additional locking required?
> ---
> net/core/net_namespace.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a5b5bb99c644..171c6e0b2337 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -26,6 +26,7 @@
> #include <net/net_namespace.h>
> #include <net/netns/generic.h>
>
> +#include <linux/sched/mm.h>
> /*
> * Our network namespace constructor/destructor lists
> */
> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> * setup_net() and cleanup_net() are not possible.
> */
> for_each_net(net) {
> + struct mem_cgroup *old, *memcg = NULL;
> +#ifdef CONFIG_MEMCG
> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);

memcg from obj is unstable, so you need a reference on memcg. You can
introduce get_mem_cgroup_from_kmem() which works for both
MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
init_net) it should return NULL.

> +#endif
> + old = set_active_memcg(memcg);
> error = ops_init(ops, net);
> + set_active_memcg(old);
> if (error)
> goto out_undo;
> list_add_tail(&net->exit_list, &net_exit_list);
> --
> 2.31.1
>
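The instability Shakeel describes is why a plain pointer lookup is not enough: the memcg can be torn down concurrently, so the caller must take a reference with a tryget that may fail. A userspace sketch of that pattern (hypothetical `_mock` names, not the kernel implementation):

```c
#include <stdatomic.h>

/* Stand-ins for the kernel objects. */
struct css_mock { _Atomic long refcnt; };
struct memcg_mock { struct css_mock css; };

/* Mock of css_tryget(): take a reference only while the object is
 * still live (refcount > 0).  A memcg being torn down has a zero
 * refcount, so the caller must be prepared for failure and redo the
 * lookup -- which is exactly why mem_cgroup_from_obj() alone is not
 * enough. */
static int css_tryget_mock(struct css_mock *css)
{
	long v = atomic_load(&css->refcnt);

	while (v > 0) {
		if (atomic_compare_exchange_weak(&css->refcnt, &v, v + 1))
			return 1;
		/* v was reloaded by the failed CAS; recheck liveness */
	}
	return 0;
}
```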

2022-04-22 22:49:14

by Vasily Averin

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On 4/22/22 23:01, Vasily Averin wrote:
> On 4/21/22 18:56, Shakeel Butt wrote:
>> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
>>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>>> * setup_net() and cleanup_net() are not possible.
>>> */
>>> for_each_net(net) {
>>> + struct mem_cgroup *old, *memcg = NULL;
>>> +#ifdef CONFIG_MEMCG
>>> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
>>
>> memcg from obj is unstable, so you need a reference on memcg. You can
>> introduce get_mem_cgroup_from_kmem() which works for both
>> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
>> init_net) it should return NULL.
>
> Could you please elaborate with more details?
> It seems to me mem_cgroup_from_obj() does everything exactly as you say:
> - for slab objects it returns memcg taken from according slab->memcg_data
> - for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
> page_memcg_check() returns NULL
> - for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
> page_memcg_check() returns objcg->memcg
> - in another cases
> page_memcg_check() returns page->memcg_data,
> so for uncharged objects like init_net NULL should be returned.
>
> I can introduce an exported get_mem_cgroup_from_kmem(); however, it should only
> call mem_cgroup_from_obj(), perhaps under rcu_read_lock/unlock.

I think I finally got your point:
Do you mean I should use css_tryget(&memcg->css) on the found memcg,
like get_mem_cgroup_from_mm() does?

Thank you,
Vasily Averin

2022-04-22 22:57:18

by Vasily Averin

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On 4/21/22 18:56, Shakeel Butt wrote:
> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>> * setup_net() and cleanup_net() are not possible.
>> */
>> for_each_net(net) {
>> + struct mem_cgroup *old, *memcg = NULL;
>> +#ifdef CONFIG_MEMCG
>> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
>
> memcg from obj is unstable, so you need a reference on memcg. You can
> introduce get_mem_cgroup_from_kmem() which works for both
> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
> init_net) it should return NULL.

Could you please elaborate with more details?
It seems to me mem_cgroup_from_obj() does everything exactly as you say:
- for slab objects it returns the memcg taken from the corresponding slab->memcg_data
- for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
page_memcg_check() returns NULL
- for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
page_memcg_check() returns objcg->memcg
- in other cases
page_memcg_check() returns page->memcg_data,
so for uncharged objects like init_net NULL should be returned.

I can introduce an exported get_mem_cgroup_from_kmem(); however, it should only
call mem_cgroup_from_obj(), perhaps under rcu_read_lock/unlock.

Do you mean something like this?

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1768,4 +1768,14 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)

#endif /* CONFIG_MEMCG_KMEM */

+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_obj(p);
+ rcu_read_unlock();
+
+ return memcg;
+}
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..4003c47965c9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,14 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = get_mem_cgroup_from_kmem(net);
+ if (memcg == NULL)
+ memcg = root_mem_cgroup;
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);

2022-04-22 23:05:55

by Shakeel Butt

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On Fri, Apr 22, 2022 at 1:09 PM Vasily Averin <[email protected]> wrote:
>
> On 4/22/22 23:01, Vasily Averin wrote:
> > On 4/21/22 18:56, Shakeel Butt wrote:
> >> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
> >>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> >>> * setup_net() and cleanup_net() are not possible.
> >>> */
> >>> for_each_net(net) {
> >>> + struct mem_cgroup *old, *memcg = NULL;
> >>> +#ifdef CONFIG_MEMCG
> >>> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
> >>
> >> memcg from obj is unstable, so you need a reference on memcg. You can
> >> introduce get_mem_cgroup_from_kmem() which works for both
> >> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
> >> init_net) it should return NULL.
> >
> > Could you please elaborate with more details?
> > It seems to me mem_cgroup_from_obj() does everything exactly as you say:
> > - for slab objects it returns memcg taken from according slab->memcg_data
> > - for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
> > page_memcg_check() returns NULL
> > - for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
> > page_memcg_check() returns objcg->memcg
> > - in another cases
> > page_memcg_check() returns page->memcg_data,
> > so for uncharged objects like init_net NULL should be returned.
> >
> > I can introduce exported get_mem_cgroup_from_kmem(), however it should only
> > call mem_cgroup_from_obj(), perhaps under read_rcu_lock/unlock.
>
> I think I finally got your point:
> Do you mean I should use css_tryget(&memcg->css) for found memcg,
> like get_mem_cgroup_from_mm() does?

Yes.

2022-04-23 09:40:36

by Vasily Averin

Subject: [PATCH] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes the init hook of a registered
pernet_operations structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNT-marked
allocations are accounted to the corresponding container/memcg.

However, __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts the active memcg for each net namespace so that
memory allocated inside ops_init() is accounted to the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
---
v1: introduced get_mem_cgroup_from_kmem(), which takes a reference
on the found memcg, as suggested by Shakeel
---
include/linux/memcontrol.h | 11 +++++++++++
net/core/net_namespace.c | 9 +++++++++
2 files changed, 20 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..16b157852a8c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1768,4 +1768,15 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)

#endif /* CONFIG_MEMCG_KMEM */

+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..56608f56bed6 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,15 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = get_mem_cgroup_from_kmem(net);
+ if (!memcg)
+ memcg = root_mem_cgroup;
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ css_put(&memcg->css);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.31.1

2022-04-23 11:57:45

by kernel test robot

Subject: Re: [PATCH] net: set proper memcg for net_init hooks allocations

Hi Vasily,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.18-rc3 next-20220422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git c00c5e1d157bec0ef0b0b59aa5482eb8dc7e8e49
config: mips-buildonly-randconfig-r006-20220423 (https://download.01.org/0day-ci/archive/20220423/[email protected]/config)
compiler: mips64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/3b379e5391e36e13b9f36305aa6d233fb03d4e58
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
git checkout 3b379e5391e36e13b9f36305aa6d233fb03d4e58
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross W=1 O=build_dir ARCH=mips prepare

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from include/linux/swap.h:9,
from include/linux/suspend.h:5,
from arch/mips/kernel/asm-offsets.c:17:
include/linux/memcontrol.h: In function 'get_mem_cgroup_from_kmem':
>> include/linux/memcontrol.h:1773:28: error: implicit declaration of function 'css_tryget'; did you mean 'wb_tryget'? [-Werror=implicit-function-declaration]
1773 | } while (memcg && !css_tryget(&memcg->css));
| ^~~~~~~~~~
| wb_tryget
>> include/linux/memcontrol.h:1773:45: error: invalid use of undefined type 'struct mem_cgroup'
1773 | } while (memcg && !css_tryget(&memcg->css));
| ^~
arch/mips/kernel/asm-offsets.c: At top level:
arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
26 | void output_ptreg_defines(void)
| ^~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
78 | void output_task_defines(void)
| ^~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:92:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
92 | void output_thread_info_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:108:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
108 | void output_thread_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:179:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
179 | void output_mm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:240:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
240 | void output_sc_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:253:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
253 | void output_signal_defined(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:332:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
332 | void output_pm_defines(void)
| ^~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:120: arch/mips/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1194: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:219: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.


vim +1773 include/linux/memcontrol.h

1765
1766 static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
1767 {
1768 struct mem_cgroup *memcg;
1769
1770 rcu_read_lock();
1771 do {
1772 memcg = mem_cgroup_from_obj(p);
> 1773 } while (memcg && !css_tryget(&memcg->css));

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-04-23 12:51:23

by kernel test robot

Subject: Re: [PATCH] net: set proper memcg for net_init hooks allocations

Hi Vasily,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.18-rc3 next-20220422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git c00c5e1d157bec0ef0b0b59aa5482eb8dc7e8e49
config: riscv-randconfig-r042-20220422 (https://download.01.org/0day-ci/archive/20220423/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5bd87350a5ae429baf8f373cb226a57b62f87280)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install riscv cross compiling tool for clang build
# apt-get install binutils-riscv64-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/3b379e5391e36e13b9f36305aa6d233fb03d4e58
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
git checkout 3b379e5391e36e13b9f36305aa6d233fb03d4e58
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash drivers/gpu/drm/exynos/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from drivers/gpu/drm/exynos/exynos_drm_dma.c:15:
In file included from drivers/gpu/drm/exynos/exynos_drm_drv.h:16:
In file included from include/drm/drm_crtc.h:28:
In file included from include/linux/i2c.h:19:
In file included from include/linux/regulator/consumer.h:35:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
include/linux/memcontrol.h:1773:21: error: call to undeclared function 'css_tryget'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
} while (memcg && !css_tryget(&memcg->css));
^
include/linux/memcontrol.h:1773:38: error: incomplete definition of type 'struct mem_cgroup'
} while (memcg && !css_tryget(&memcg->css));
~~~~~^
include/linux/mm_types.h:31:8: note: forward declaration of 'struct mem_cgroup'
struct mem_cgroup;
^
>> drivers/gpu/drm/exynos/exynos_drm_dma.c:55:35: warning: implicit conversion from 'unsigned long long' to 'unsigned int' changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]
dma_set_max_seg_size(subdrv_dev, DMA_BIT_MASK(32));
~~~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~~~~
include/linux/dma-mapping.h:76:40: note: expanded from macro 'DMA_BIT_MASK'
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
^~~~~
1 warning and 2 errors generated.


vim +55 drivers/gpu/drm/exynos/exynos_drm_dma.c

67fbf3a3ef8443 Andrzej Hajda 2018-10-12 33
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 34 /*
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 35 * drm_iommu_attach_device- attach device to iommu mapping
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 36 *
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 37 * @drm_dev: DRM device
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 38 * @subdrv_dev: device to be attach
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 39 *
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 40 * This function should be called by sub drivers to attach it to iommu
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 41 * mapping.
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 42 */
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 43 static int drm_iommu_attach_device(struct drm_device *drm_dev,
07dc3678bacc2a Marek Szyprowski 2020-03-09 44 struct device *subdrv_dev, void **dma_priv)
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 45 {
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 46 struct exynos_drm_private *priv = drm_dev->dev_private;
b9c633882de460 Marek Szyprowski 2020-06-01 47 int ret = 0;
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 48
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 49 if (get_dma_ops(priv->dma_dev) != get_dma_ops(subdrv_dev)) {
6f83d20838c099 Inki Dae 2019-04-15 50 DRM_DEV_ERROR(subdrv_dev, "Device %s lacks support for IOMMU\n",
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 51 dev_name(subdrv_dev));
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 52 return -EINVAL;
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 53 }
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 54
ddfd4ab6bb0883 Marek Szyprowski 2020-07-07 @55 dma_set_max_seg_size(subdrv_dev, DMA_BIT_MASK(32));
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 56 if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
07dc3678bacc2a Marek Szyprowski 2020-03-09 57 /*
07dc3678bacc2a Marek Szyprowski 2020-03-09 58 * Keep the original DMA mapping of the sub-device and
07dc3678bacc2a Marek Szyprowski 2020-03-09 59 * restore it on Exynos DRM detach, otherwise the DMA
07dc3678bacc2a Marek Szyprowski 2020-03-09 60 * framework considers it as IOMMU-less during the next
07dc3678bacc2a Marek Szyprowski 2020-03-09 61 * probe (in case of deferred probe or modular build)
07dc3678bacc2a Marek Szyprowski 2020-03-09 62 */
07dc3678bacc2a Marek Szyprowski 2020-03-09 63 *dma_priv = to_dma_iommu_mapping(subdrv_dev);
07dc3678bacc2a Marek Szyprowski 2020-03-09 64 if (*dma_priv)
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 65 arm_iommu_detach_device(subdrv_dev);
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 66
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 67 ret = arm_iommu_attach_device(subdrv_dev, priv->mapping);
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 68 } else if (IS_ENABLED(CONFIG_IOMMU_DMA)) {
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 69 ret = iommu_attach_device(priv->mapping, subdrv_dev);
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 70 }
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 71
b9c633882de460 Marek Szyprowski 2020-06-01 72 return ret;
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 73 }
67fbf3a3ef8443 Andrzej Hajda 2018-10-12 74

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-04-23 16:11:22

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] net: set proper memcg for net_init hooks allocations

Hi Vasily,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.18-rc3 next-20220422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git c00c5e1d157bec0ef0b0b59aa5482eb8dc7e8e49
config: powerpc-buildonly-randconfig-r004-20220422 (https://download.01.org/0day-ci/archive/20220423/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5bd87350a5ae429baf8f373cb226a57b62f87280)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install powerpc cross compiling tool for clang build
# apt-get install binutils-powerpc-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/3b379e5391e36e13b9f36305aa6d233fb03d4e58
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
git checkout 3b379e5391e36e13b9f36305aa6d233fb03d4e58
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc prepare

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from arch/powerpc/kernel/asm-offsets.c:21:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
>> include/linux/memcontrol.h:1773:38: error: incomplete definition of type 'struct mem_cgroup'
} while (memcg && !css_tryget(&memcg->css));
~~~~~^
include/linux/mm_types.h:31:8: note: forward declaration of 'struct mem_cgroup'
struct mem_cgroup;
^
1 error generated.
make[2]: *** [scripts/Makefile.build:120: arch/powerpc/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1194: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:219: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.


vim +1773 include/linux/memcontrol.h

1765
1766 static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
1767 {
1768 struct mem_cgroup *memcg;
1769
1770 rcu_read_lock();
1771 do {
1772 memcg = mem_cgroup_from_obj(p);
> 1773 } while (memcg && !css_tryget(&memcg->css));
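For context, the two errors above are the classic failure mode of an inline helper placed outside the right config guard: in this configuration struct mem_cgroup is only forward-declared and css_tryget() is not visible, so any inline code touching members cannot compile. This is what the v2 patch later fixes by moving the helpers under CONFIG_MEMCG_KMEM with a stub on the #else side. A minimal userspace model of the pattern (all names illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Model: the full type and the real helper exist only when the feature
 * is configured in; other configs see just a forward declaration, so a
 * feature-off stub must never touch the (incomplete) type's members. */
#define CONFIG_FEATURE 1            /* set to 0 to model the broken config */

struct counted;                     /* forward declaration only */

#if CONFIG_FEATURE
struct counted { int refcnt; };     /* full definition */

static inline struct counted *get_counted(struct counted *c)
{
	if (c)
		c->refcnt++;        /* needs the complete type */
	return c;
}
#else
/* Stub for the "feature off" build: compiles against the incomplete type. */
static inline struct counted *get_counted(struct counted *c)
{
	(void)c;
	return NULL;
}
#endif
```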


2022-04-24 18:37:24

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg v2] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes init hook of registered
pernet_operation structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNTING marked
allocation are accounted for corresponding container/memcg.

However __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts active memcg for each net namespace and helps
to account memory allocated inside ops_init() into the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
---
v2: introduced get/put_net_memcg(),
new functions are moved under CONFIG_MEMCG_KMEM
to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
for the found memcg, suggested by Shakeel
---
include/linux/memcontrol.h | 35 +++++++++++++++++++++++++++++++++++
net/core/net_namespace.c | 7 +++++++
2 files changed, 42 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..5230d3c5585a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1714,6 +1714,33 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)

struct mem_cgroup *mem_cgroup_from_obj(void *p);

+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
+
+static inline struct mem_cgroup *get_net_memcg(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = get_mem_cgroup_from_kmem(p);
+
+ if (!memcg)
+ memcg = root_mem_cgroup;
+
+ return memcg;
+}
+static inline void put_net_memcg(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1766,6 +1793,14 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
return NULL;
}

+static inline struct mem_cgroup *get_net_memcg(void *p)
+{
+ return NULL;
+}
+
+static inline void put_net_memcg(struct mem_cgroup *memcg)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..bf88360b8377 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = get_net_memcg(net);
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ put_net_memcg(memcg);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.31.1
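One detail worth flagging in the v2 patch above: put_net_memcg() does an unconditional css_put(&memcg->css), but get_net_memcg() can return root_mem_cgroup, which is still NULL while __register_pernet_operations() runs during early boot. That is plausibly the NULL dereference the test robot reports next, and it is what v3 fixes by switching to the NULL-checking mem_cgroup_put(). A tiny userspace model of the difference (names illustrative, not the kernel functions):

```c
#include <assert.h>
#include <stddef.h>

struct css { int refcnt; };
struct memcg { struct css css; };

static void css_put_model(struct css *css)
{
	css->refcnt--;              /* no NULL check, like css_put() */
}

/* v2-style put: dereferences memcg unconditionally */
static void put_net_memcg_v2(struct memcg *memcg)
{
	css_put_model(&memcg->css); /* memcg == NULL -> NULL dereference */
}

/* v3-style put, mirroring mem_cgroup_put(): tolerates NULL */
static void mem_cgroup_put_model(struct memcg *memcg)
{
	if (memcg)
		css_put_model(&memcg->css);
}
```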

2022-04-25 08:35:14

by kernel test robot

[permalink] [raw]
Subject: [net] 3b379e5391: BUG:kernel_NULL_pointer_dereference,address



Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 3b379e5391e36e13b9f36305aa6d233fb03d4e58 ("[PATCH] net: set proper memcg for net_init hooks allocations")
url: https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git c00c5e1d157bec0ef0b0b59aa5482eb8dc7e8e49
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------+------------+------------+
| | c00c5e1d15 | 3b379e5391 |
+---------------------------------------------+------------+------------+
| boot_successes | 9 | 0 |
| boot_failures | 0 | 32 |
| BUG:kernel_NULL_pointer_dereference,address | 0 | 32 |
| Oops:#[##] | 0 | 32 |
| EIP:__register_pernet_operations | 0 | 32 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 32 |
+---------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>



[ 1.054816][ T0] BUG: kernel NULL pointer dereference, address: 0000002c
[ 1.055472][ T0] #PF: supervisor read access in kernel mode
[ 1.056034][ T0] #PF: error_code(0x0000) - not-present page
[ 1.056650][ T0] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 1.056795][ T0] Oops: 0000 [#1] SMP PTI
[ 1.056795][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.18.0-rc3-00191-g3b379e5391e3 #1
[ 1.056795][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1.056795][ T0] EIP: __register_pernet_operations+0x169/0x340
[ 1.056795][ T0] Code: 1e d4 8b 40 08 a8 03 0f 85 44 01 00 00 64 ff 00 64 ff 0d d4 06 1e d4 e9 1d ff ff ff 8d 74 26 00 90 8b 45 e0 89 b8 0c 0f 00 00 <f6> 43 2c 01 0f 85 68 ff ff ff 64 ff 05 d4 06 1e d4 8b 43 08 a8 03
[ 1.056795][ T0] EAX: d3cf4740 EBX: 00000000 ECX: 00000000 EDX: 00000cc0
[ 1.056795][ T0] ESI: d4331340 EDI: 00000000 EBP: d3cedf58 ESP: d3cedf34
[ 1.056795][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210246
[ 1.056795][ T0] CR0: 80050033 CR2: 0000002c CR3: 141f8000 CR4: 000406b0
[ 1.056795][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 1.056795][ T0] DR6: fffe0ff0 DR7: 00000400
[ 1.056795][ T0] Call Trace:
[ 1.056795][ T0] ? setup_net+0x44/0x300
[ 1.056795][ T0] register_pernet_operations+0x5c/0xc0
[ 1.056795][ T0] register_pernet_subsys+0x21/0x40
[ 1.056795][ T0] net_ns_init+0xb1/0xf1
[ 1.056795][ T0] start_kernel+0x403/0x46d
[ 1.056795][ T0] i386_start_kernel+0x48/0x4a
[ 1.056795][ T0] startup_32_smp+0x161/0x164
[ 1.056795][ T0] Modules linked in:
[ 1.056795][ T0] CR2: 000000000000002c
[ 1.056795][ T0] ---[ end trace 0000000000000000 ]---
[ 1.056795][ T0] EIP: __register_pernet_operations+0x169/0x340
[ 1.056795][ T0] Code: 1e d4 8b 40 08 a8 03 0f 85 44 01 00 00 64 ff 00 64 ff 0d d4 06 1e d4 e9 1d ff ff ff 8d 74 26 00 90 8b 45 e0 89 b8 0c 0f 00 00 <f6> 43 2c 01 0f 85 68 ff ff ff 64 ff 05 d4 06 1e d4 8b 43 08 a8 03
[ 1.056795][ T0] EAX: d3cf4740 EBX: 00000000 ECX: 00000000 EDX: 00000cc0
[ 1.056795][ T0] ESI: d4331340 EDI: 00000000 EBP: d3cedf58 ESP: d3cedf34
[ 1.056795][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210246
[ 1.056795][ T0] CR0: 80050033 CR2: 0000002c CR3: 141f8000 CR4: 000406b0
[ 1.056795][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 1.056795][ T0] DR6: fffe0ff0 DR7: 00000400
[ 1.056795][ T0] Kernel panic - not syncing: Fatal exception



To reproduce:

# build kernel
cd linux
cp config-5.18.0-rc3-00191-g3b379e5391e3 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.






Attachments:
(No filename) (4.90 kB)
config-5.18.0-rc3-00191-g3b379e5391e3 (144.02 kB)
job-script (4.59 kB)
dmesg.xz (5.67 kB)

2022-04-25 15:51:48

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg v3] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes init hook of registered
pernet_operation structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNTING marked
allocation are accounted for corresponding container/memcg.

However __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts active memcg for each net namespace and helps
to account memory allocated inside ops_init() into the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
---
v3: put_net_memcg() replaced by an already existing mem_cgroup_put()
It checks memcg before accessing it, this is required for
__register_pernet_operations() called before memcg initialization.
Additionally fixed leading whitespaces in non-memcg_kmem version
of mem_cgroup_from_obj().

v2: introduced get/put_net_memcg(),
new functions are moved under CONFIG_MEMCG_KMEM
to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
for the found memcg, suggested by Shakeel
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++++++++-
net/core/net_namespace.c | 7 +++++++
2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..cfb68a3f7015 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1714,6 +1714,29 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)

struct mem_cgroup *mem_cgroup_from_obj(void *p);

+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
+
+static inline struct mem_cgroup *get_net_memcg(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = get_mem_cgroup_from_kmem(p);
+
+ if (!memcg)
+ memcg = root_mem_cgroup;
+
+ return memcg;
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1763,9 +1786,13 @@ static inline void memcg_put_cache_ids(void)

static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
- return NULL;
+ return NULL;
}

+static inline struct mem_cgroup *get_net_memcg(void *p)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG_KMEM */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..3093b4d5b2b9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = get_net_memcg(net);
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.31.1

2022-04-26 08:35:41

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v3] net: set proper memcg for net_init hooks allocations

On 4/26/22 05:50, Roman Gushchin wrote:
> On Mon, Apr 25, 2022 at 01:56:02PM +0300, Vasily Averin wrote:
>> +
>> +static inline struct mem_cgroup *get_net_memcg(void *p)
>> +{
>> + struct mem_cgroup *memcg;
>> +
>> + memcg = get_mem_cgroup_from_kmem(p);
>> +
>> + if (!memcg)
>> + memcg = root_mem_cgroup;
>> +
>> + return memcg;
>> +}
>
> I'm not a fan of this helper: it has nothing to do with the networking,
> actually it's a wrapper of get_mem_cgroup_from_kmem() replacing NULL
> with root_mem_cgroup.
>
> Overall the handling of root_mem_cgroup is very messy, I don't blame
> this patch. But I wonder if it's better to simply move this code
> to the call site without introducing a new function?

Unfortunately root_mem_cgroup is defined under CONFIG_MEMCG,
so we cannot use it outside without ifdefs.

> Alternatively, you can introduce something like:
> struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
> {
> return memcg ? memcg : root_mem_cgroup;
> }

Thank you for the hint, this looks much better.
Vasily Averin

2022-04-26 08:45:01

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH memcg v3] net: set proper memcg for net_init hooks allocations

On Mon, Apr 25, 2022 at 01:56:02PM +0300, Vasily Averin wrote:

Hello, Vasily!

> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNTING marked
> allocation are accounted for corresponding container/memcg.

__GFP_ACCOUNT

>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---
> v3: put_net_memcg() replaced by an already existing mem_cgroup_put()
> It checks memcg before accessing it, this is required for
> __register_pernet_operations() called before memcg initialization.
> Additionally fixed leading whitespaces in non-memcg_kmem version
> of mem_cgroup_from_obj().
>
> v2: introduced get/put_net_memcg(),
> new functions are moved under CONFIG_MEMCG_KMEM
> to fix compilation issues reported by Intel's kernel test robot
>
> v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
> for the found memcg, suggested by Shakeel
> ---
> include/linux/memcontrol.h | 29 ++++++++++++++++++++++++++++-
> net/core/net_namespace.c | 7 +++++++
> 2 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0abbd685703b..cfb68a3f7015 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1714,6 +1714,29 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
>
> struct mem_cgroup *mem_cgroup_from_obj(void *p);
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
> +{
> + struct mem_cgroup *memcg;
> +
> + rcu_read_lock();
> + do {
> + memcg = mem_cgroup_from_obj(p);
> + } while (memcg && !css_tryget(&memcg->css));
> + rcu_read_unlock();
> + return memcg;
> +}

Please, rename it to get_mem_cgroup_from_obj() for consistency.

> +
> +static inline struct mem_cgroup *get_net_memcg(void *p)
> +{
> + struct mem_cgroup *memcg;
> +
> + memcg = get_mem_cgroup_from_kmem(p);
> +
> + if (!memcg)
> + memcg = root_mem_cgroup;
> +
> + return memcg;
> +}

I'm not a fan of this helper: it has nothing to do with the networking,
actually it's a wrapper of get_mem_cgroup_from_kmem() replacing NULL
with root_mem_cgroup.

Overall the handling of root_mem_cgroup is very messy, I don't blame
this patch. But I wonder if it's better to simply move this code
to the call site without introducing a new function?

Alternatively, you can introduce something like:
struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
{
return memcg ? memcg : root_mem_cgroup;
}

> #else
> static inline bool mem_cgroup_kmem_disabled(void)
> {
> @@ -1763,9 +1786,13 @@ static inline void memcg_put_cache_ids(void)
>
> static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
> {
> - return NULL;
> + return NULL;
> }
>
> +static inline struct mem_cgroup *get_net_memcg(void *p)
> +{
> + return NULL;
> +}
> #endif /* CONFIG_MEMCG_KMEM */
>
> #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a5b5bb99c644..3093b4d5b2b9 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -26,6 +26,7 @@
> #include <net/net_namespace.h>
> #include <net/netns/generic.h>
>
> +#include <linux/sched/mm.h>
> /*
> * Our network namespace constructor/destructor lists
> */
> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> * setup_net() and cleanup_net() are not possible.
> */
> for_each_net(net) {
> + struct mem_cgroup *old, *memcg;
> +
> + memcg = get_net_memcg(net);
> + old = set_active_memcg(memcg);
> error = ops_init(ops, net);
> + set_active_memcg(old);
> + mem_cgroup_put(memcg);
> if (error)
> goto out_undo;
> list_add_tail(&net->exit_list, &net_exit_list);
> --
> 2.31.1
>

2022-04-26 17:34:14

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes init hook of registered
pernet_operation structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNT-marked
allocations are accounted to the corresponding container/memcg.

However __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts active memcg for each net namespace and helps
to account memory allocated inside ops_init() into the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
---
v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
get_net_memcg() replaced by mem_cgroup_or_root(), suggested by Roman.

v3: put_net_memcg() replaced by an already existing mem_cgroup_put()
It checks memcg before accessing it, this is required for
__register_pernet_operations() called before memcg initialization.
Additionally fixed leading whitespaces in non-memcg_kmem version
of mem_cgroup_from_obj().

v2: introduced get/put_net_memcg(),
new functions are moved under CONFIG_MEMCG_KMEM
to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
for the found memcg, suggested by Shakeel
---
include/linux/memcontrol.h | 27 ++++++++++++++++++++++++++-
net/core/net_namespace.c | 7 +++++++
2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..6dd4ed7d3b6f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1714,6 +1714,22 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)

struct mem_cgroup *mem_cgroup_from_obj(void *p);

+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
+
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return memcg ? memcg : root_mem_cgroup;
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1763,9 +1779,18 @@ static inline void memcg_put_cache_ids(void)

static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
- return NULL;
+ return NULL;
}

+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ return NULL;
+}
+
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG_KMEM */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..240f3db77dec 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.31.1

2022-04-27 09:40:35

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On Mon, Apr 25, 2022 at 11:43 PM Vasily Averin <[email protected]> wrote:
>
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNT-marked
> allocations are accounted to the corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>

Acked-by: Shakeel Butt <[email protected]>

[...]
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> +{
> + struct mem_cgroup *memcg;
> +

Do we need memcg_kmem_enabled() check here or maybe
mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
param.

> + rcu_read_lock();
> + do {
> + memcg = mem_cgroup_from_obj(p);
> + } while (memcg && !css_tryget(&memcg->css));
> + rcu_read_unlock();
> + return memcg;
> +}

2022-04-27 10:43:33

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On Tue, Apr 26, 2022 at 09:43:43AM +0300, Vasily Averin wrote:
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNT-marked
> allocations are accounted to the corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---
> v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
> get_net_memcg() replaced by mem_cgroup_or_root(), suggested by Roman.
>
> v3: put_net_memcg() replaced by an already existing mem_cgroup_put()
> It checks memcg before accessing it, this is required for
> __register_pernet_operations() called before memcg initialization.
> Additionally fixed leading whitespaces in non-memcg_kmem version
> of mem_cgroup_from_obj().
>
> v2: introduced get/put_net_memcg(),
> new functions are moved under CONFIG_MEMCG_KMEM
> to fix compilation issues reported by Intel's kernel test robot
>
> v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
> for the found memcg, suggested by Shakeel
> ---
> include/linux/memcontrol.h | 27 ++++++++++++++++++++++++++-
> net/core/net_namespace.c | 7 +++++++
> 2 files changed, 33 insertions(+), 1 deletion(-)

Acked-by: Roman Gushchin <[email protected]>

Thanks!

2022-04-27 12:48:48

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
> [...]
> >
> > +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> > +{
> > + struct mem_cgroup *memcg;
> > +
>
> Do we need memcg_kmem_enabled() check here or maybe
> mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
> mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
> param.

I reckon such a guard is on the charge side and readers should treat
NULL and root_mem_cgroup equally. Or is there a case when these two are
different?

(I can see it's different semantics when stored in current->active_memcg
(and active_memcg() getter) but for such "outer" callers like here it
seems equal.)

Regards,
Michal

2022-04-27 15:33:17

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On Wed, Apr 27, 2022 at 5:22 AM Michal Koutný <[email protected]> wrote:
>
> On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
> > [...]
> > >
> > > +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> > > +{
> > > + struct mem_cgroup *memcg;
> > > +
> >
> > Do we need memcg_kmem_enabled() check here or maybe
> > mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
> > mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
> > param.
>
> I reckon such a guard is on the charge side and readers should treat
> NULL and root_mem_cgroup equally. Or is there a case when these two are
> different?
>
> (I can see it's different semantics when stored in current->active_memcg
> (and active_memcg() getter) but for such "outer" callers like here it
> seems equal.)

I was more thinking about possible shortcut optimization and unrelated
to this patch.

Vasily, can you please add documentation for get_mem_cgroup_from_obj()
similar to get_mem_cgroup_from_mm()? Also for mem_cgroup_or_root().
Please note that root_mem_cgroup can be NULL during early boot.
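A documented version of mem_cgroup_or_root(), in the direction Shakeel asks for, might look like the sketch below. It is a standalone userspace model (stand-in types, not the kernel's), and the comment wording is only a suggestion, not the final patch:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct mem_cgroup; only the control flow matters here. */
struct mem_cgroup_model { int refcnt; };

/* NULL until "boot" finishes, as root_mem_cgroup is in the kernel. */
static struct mem_cgroup_model *root_mem_cgroup_model;

/*
 * mem_cgroup_or_root - substitute the root memcg for a NULL one
 * @memcg: memcg looked up by the caller, possibly NULL
 *
 * Returns @memcg if set, otherwise the root memcg.  Note that the
 * root memcg itself is NULL until the memcg subsystem has been
 * initialized, so early-boot callers can still receive NULL and any
 * subsequent put must tolerate it (mem_cgroup_put() does).
 */
static struct mem_cgroup_model *
mem_cgroup_or_root(struct mem_cgroup_model *memcg)
{
	return memcg ? memcg : root_mem_cgroup_model;
}
```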

2022-04-27 22:52:14

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On 4/27/22 18:06, Shakeel Butt wrote:
> On Wed, Apr 27, 2022 at 5:22 AM Michal Koutný <[email protected]> wrote:
>>
>> On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
>>> [...]
>>>>
>>>> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
>>>> +{
>>>> + struct mem_cgroup *memcg;
>>>> +
>>>
>>> Do we need memcg_kmem_enabled() check here or maybe
>>> mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
>>> mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
>>> param.

Shakeel, unfortunately I'm not ready to answer this question right now.
I had not even noticed that memcg_kmem_enabled() and mem_cgroup_disabled()
differ in nature.
If you have no objections, I'm going to keep this place as is and investigate
this question later.

>> I reckon such a guard is on the charge side and readers should treat
>> NULL and root_mem_cgroup equally. Or is there a case when these two are
>> different?
>>
>> (I can see it's different semantics when stored in current->active_memcg
>> (and active_memcg() getter) but for such "outer" callers like here it
>> seems equal.)

Dear Michal,
I may have misunderstood your point of view, so let me explain my vision
in more detail.
I do not think that NULL and root_mem_cgroup are equal here:
- we have enabled cgroups and a well-defined root_mem_cgroup,
- this function is called from inside a memcg-limited container,
- we tried to get the memcg from net, but without success,
and as a result got NULL from mem_cgroup_from_obj()
(frankly speaking, I do not think this situation is really possible).
If we keep memcg = NULL, then current's memcg will not be masked, and
net_init's allocations will be accounted to current's memcg.
So we need to set active_memcg to root_mem_cgroup; it helps to avoid
incorrect accounting.

> I was more thinking about possible shortcut optimization and unrelated
> to this patch.
>
> Vasily, can you please add documentation for get_mem_cgroup_from_obj()
> similar to get_mem_cgroup_from_mm()? Also for mem_cgroup_or_root().
> Please note that root_mem_cgroup can be NULL during early boot.

Ok, thank you for the remark, I'll improve it in the next patch version.

Thank you,
Vasily Averin

2022-04-28 00:30:32

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On Thu, Apr 28, 2022 at 01:16:53AM +0300, Vasily Averin wrote:
> On 4/27/22 18:06, Shakeel Butt wrote:
> > On Wed, Apr 27, 2022 at 5:22 AM Michal Koutný <[email protected]> wrote:
> >>
> >> On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
> >>> [...]
> >>>>
> >>>> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> >>>> +{
> >>>> + struct mem_cgroup *memcg;
> >>>> +
> >>>
> >>> Do we need memcg_kmem_enabled() check here or maybe
> >>> mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
> >>> mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
> >>> param.
>
> Shakeel, unfortunately I'm not ready to answer this question right now.
> I even did not noticed that memcg_kmem_enabled() and mem_cgroup_disabled()
> have a different nature.
> If you have no objections I'm going to keep this place as is and investigate
> this question later.
>
> >> I reckon such a guard is on the charge side and readers should treat
> >> NULL and root_mem_group equally. Or is there a case when these two are
> >> different?
> >>
> >> (I can see it's different semantics when stored in current->active_memcg
> >> (and active_memcg() getter) but for such "outer" callers like here it
> >> seems equal.)
>
> Dear Michal,
> I may have misunderstood your point of view, so let me explain my vision
> in more detail.
> I do not think that NULL and root_mem_cgroup are equal here:
> - we have enabled cgroups and well-defined root_mem_cgroup,
> - this function is called from inside memcg-limited container,
> - we tried to get memcg from net, but without success,
> and as result got NULL from mem_cgroup_from_obj()
> (frankly speaking I do not think this situation is really possible)
> If we keep memcg = NULL, then current's memcg will not be masked and
> net_init's allocations will be accounted to current's memcg.
> So we need to set active_memcg to root_mem_cgroup, it helps to avoid
> incorrect accounting.

It's way out of the scope of this patch, but I think we need to stop
using NULL as a root_mem_cgroup/system-scope indicator. The remaining use
cases will be things like the end of a cgroup iteration, active memcg not
set, the parent of the root memcg, etc.
We can point root_mem_cgroup at a statically allocated structure
on both CONFIG_MEMCG and !CONFIG_MEMCG.
Does it sound reasonable or I'm missing some important points?

Thanks!

2022-04-28 08:12:13

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On 4/28/22 01:36, Roman Gushchin wrote:
> We can point root_mem_cgroup at a statically allocated structure
> on both CONFIG_MEMCG and !CONFIG_MEMCG.
> Does it sound reasonable or I'm missing some important points?

I expect embedded developers would strongly disagree.
Only a pointer should be acceptable to them.

Thank you,
Vasily Averin

2022-04-28 08:32:07

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On Wed, Apr 27, 2022 at 3:43 PM Vasily Averin <[email protected]> wrote:
>
> On 4/27/22 18:06, Shakeel Butt wrote:
> > On Wed, Apr 27, 2022 at 5:22 AM Michal Koutný <[email protected]> wrote:
> >>
> >> On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
> >>> [...]
> >>>>
> >>>> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> >>>> +{
> >>>> + struct mem_cgroup *memcg;
> >>>> +
> >>>
> >>> Do we need memcg_kmem_enabled() check here or maybe
> >>> mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
> >>> mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
> >>> param.
>
> Shakeel, unfortunately I'm not ready to answer this question right now.
> I even did not noticed that memcg_kmem_enabled() and mem_cgroup_disabled()
> have a different nature.
> If you have no objections I'm going to keep this place as is and investigate
> this question later.
>

Patch is good as is. Just add the documentation to the functions in
the next version and you can keep the ACKs.

2022-05-02 22:42:39

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg v5] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes the init hook of a registered
pernet_operations structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNT-marked
allocations are accounted to the corresponding container/memcg.

However, __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts the active memcg for each net namespace, so that
memory allocated inside ops_init() is accounted to the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Shakeel Butt <[email protected]>

---
v5: documented get_mem_cgroup_from_obj() and for mem_cgroup_or_root()
functions, asked by Shakeel.

v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
get_net_memcg() renamed to mem_cgroup_or_root(), suggested by Roman.

v3: put_net_memcg() replaced by an already existing mem_cgroup_put()
It checks memcg before accessing it, this is required for
__register_pernet_operations() called before memcg initialization.
Additionally fixed leading whitespaces in non-memcg_kmem version
of mem_cgroup_from_obj().

v2: introduced get/put_net_memcg(),
new functions are moved under CONFIG_MEMCG_KMEM
to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
for the found memcg, suggested by Shakeel
---
include/linux/memcontrol.h | 47 +++++++++++++++++++++++++++++++++++++-
net/core/net_namespace.c | 7 ++++++
2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..6405f9b8f5a8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1714,6 +1714,42 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)

struct mem_cgroup *mem_cgroup_from_obj(void *p);

+/**
+ * get_mem_cgroup_from_obj - get a memcg associated with passed kernel object.
+ * @p: pointer to object from which memcg should be extracted. It can be NULL.
+ *
+ * Retrieves the memory cgroup to which the memory of the pointed-to kernel
+ * object is accounted. If a memcg is found, its reference is taken.
+ * Returns NULL if the object is uncharged, if a proper memcg cannot be
+ * found, or if mem_cgroup is disabled.
+ *
+ * Return: valid memcg pointer with taken reference or NULL.
+ */
+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
+
+/**
+ * mem_cgroup_or_root - always returns a pointer to a valid memory cgroup.
+ * @memcg: pointer to a valid memory cgroup or NULL.
+ *
+ * If passed argument is not NULL, returns it without any additional checks
+ * and changes. Otherwise, root_mem_cgroup is returned.
+ *
+ * NOTE: root_mem_cgroup can be NULL during early boot.
+ */
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return memcg ? memcg : root_mem_cgroup;
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1763,9 +1799,18 @@ static inline void memcg_put_cache_ids(void)

static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
- return NULL;
+ return NULL;
}

+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ return NULL;
+}
+
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG_KMEM */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..240f3db77dec 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.31.1

2022-05-02 23:17:15

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations


> On May 1, 2022, at 6:44 AM, Vasily Averin <[email protected]> wrote:
>
> On 4/28/22 01:47, Shakeel Butt wrote:
>>> On Wed, Apr 27, 2022 at 3:43 PM Vasily Averin <[email protected]> wrote:
>>>
>>> On 4/27/22 18:06, Shakeel Butt wrote:
>>>> On Wed, Apr 27, 2022 at 5:22 AM Michal Koutný <[email protected]> wrote:
>>>>>
>>>>> On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
>>>>>> [...]
>>>>>>>
>>>>>>> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
>>>>>>> +{
>>>>>>> + struct mem_cgroup *memcg;
>>>>>>> +
>>>>>>
>>>>>> Do we need memcg_kmem_enabled() check here or maybe
>>>>>> mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
>>>>>> mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
>>>>>> param.
>>>
>>> Shakeel, unfortunately I'm not ready to answer this question right now.
>>> I even did not noticed that memcg_kmem_enabled() and mem_cgroup_disabled()
>>> have a different nature.
>>> If you have no objections I'm going to keep this place as is and investigate
>>> this question later.
>>>
>>
>> Patch is good as is. Just add the documentation to the functions in
>> the next version and you can keep the ACKs.
>
> I noticed that the kernel already has a function get_mem_cgroup_from_objcg(),
> the name of which is very similar to my new function get_mem_cgroup_from_obj().
> Maybe it's better to rename my function to get_mem_cgroup_from_ptr()?

I don't think it's a problem: objcg is a widely used abbreviation and, in my opinion, is different enough from obj. I'd keep it for consistency with mem_cgroup_from_obj().

Thanks!

2022-05-03 00:25:25

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations

On 4/28/22 01:47, Shakeel Butt wrote:
> On Wed, Apr 27, 2022 at 3:43 PM Vasily Averin <[email protected]> wrote:
>>
>> On 4/27/22 18:06, Shakeel Butt wrote:
>>> On Wed, Apr 27, 2022 at 5:22 AM Michal Koutný <[email protected]> wrote:
>>>>
>>>> On Tue, Apr 26, 2022 at 10:23:32PM -0700, Shakeel Butt <[email protected]> wrote:
>>>>> [...]
>>>>>>
>>>>>> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
>>>>>> +{
>>>>>> + struct mem_cgroup *memcg;
>>>>>> +
>>>>>
>>>>> Do we need memcg_kmem_enabled() check here or maybe
>>>>> mem_cgroup_from_obj() should be doing memcg_kmem_enabled() instead of
>>>>> mem_cgroup_disabled() as we can have "cgroup.memory=nokmem" boot
>>>>> param.
>>
>> Shakeel, unfortunately I'm not ready to answer this question right now.
>> I even did not noticed that memcg_kmem_enabled() and mem_cgroup_disabled()
>> have a different nature.
>> If you have no objections I'm going to keep this place as is and investigate
>> this question later.
>>
>
> Patch is good as is. Just add the documentation to the functions in
> the next version and you can keep the ACKs.

I noticed that the kernel already has a function, get_mem_cgroup_from_objcg(),
whose name is very similar to my new function get_mem_cgroup_from_obj().
Maybe it would be better to rename my function to get_mem_cgroup_from_ptr()?

Thank you,
Vasily Averin

2022-05-30 13:48:18

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v5] net: set proper memcg for net_init hooks allocations

Dear Andrew,
could you please pick up this patch?

Thank you,
Vasily Averin

On 5/2/22 03:10, Vasily Averin wrote:
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNT marked
> allocation are accounted for corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Acked-by: Shakeel Butt <[email protected]>
>
> ---
> v5: documented get_mem_cgroup_from_obj() and for mem_cgroup_or_root()
> functions, asked by Shakeel.
>
> v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
> get_net_memcg() renamed to mem_cgroup_or_root(), suggested by Roman.
>
> v3: put_net_memcg() replaced by an alreay existing mem_cgroup_put()
> It checks memcg before accessing it, this is required for
> __register_pernet_operations() called before memcg initialization.
> Additionally fixed leading whitespaces in non-memcg_kmem version
> of mem_cgroup_from_obj().
>
> v2: introduced get/put_net_memcg(),
> new functions are moved under CONFIG_MEMCG_KMEM
> to fix compilation issues reported by Intel's kernel test robot
>
> v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
> for the found memcg, suggested by Shakeel
> ---
> include/linux/memcontrol.h | 47 +++++++++++++++++++++++++++++++++++++-
> net/core/net_namespace.c | 7 ++++++
> 2 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0abbd685703b..6405f9b8f5a8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1714,6 +1714,42 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
>
> struct mem_cgroup *mem_cgroup_from_obj(void *p);
>
> +/**
> + * get_mem_cgroup_from_obj - get a memcg associated with passed kernel object.
> + * @p: pointer to object from which memcg should be extracted. It can be NULL.
> + *
> + * Retrieves the memory group into which the memory of the pointed kernel
> + * object is accounted. If memcg is found, its reference is taken.
> + * If a passed kernel object is uncharged, or if proper memcg cannot be found,
> + * as well as if mem_cgroup is disabled, NULL is returned.
> + *
> + * Return: valid memcg pointer with taken reference or NULL.
> + */
> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> +{
> + struct mem_cgroup *memcg;
> +
> + rcu_read_lock();
> + do {
> + memcg = mem_cgroup_from_obj(p);
> + } while (memcg && !css_tryget(&memcg->css));
> + rcu_read_unlock();
> + return memcg;
> +}
> +
> +/**
> + * mem_cgroup_or_root - always returns a pointer to a valid memory cgroup.
> + * @memcg: pointer to a valid memory cgroup or NULL.
> + *
> + * If passed argument is not NULL, returns it without any additional checks
> + * and changes. Otherwise, root_mem_cgroup is returned.
> + *
> + * NOTE: root_mem_cgroup can be NULL during early boot.
> + */
> +static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
> +{
> + return memcg ? memcg : root_mem_cgroup;
> +}
> #else
> static inline bool mem_cgroup_kmem_disabled(void)
> {
> @@ -1763,9 +1799,18 @@ static inline void memcg_put_cache_ids(void)
>
> static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
> {
> - return NULL;
> + return NULL;
> }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> +{
> + return NULL;
> +}
> +
> +static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
> +{
> + return NULL;
> +}
> #endif /* CONFIG_MEMCG_KMEM */
>
> #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a5b5bb99c644..240f3db77dec 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -26,6 +26,7 @@
> #include <net/net_namespace.h>
> #include <net/netns/generic.h>
>
> +#include <linux/sched/mm.h>
> /*
> * Our network namespace constructor/destructor lists
> */
> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> * setup_net() and cleanup_net() are not possible.
> */
> for_each_net(net) {
> + struct mem_cgroup *old, *memcg;
> +
> + memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
> + old = set_active_memcg(memcg);
> error = ops_init(ops, net);
> + set_active_memcg(old);
> + mem_cgroup_put(memcg);
> if (error)
> goto out_undo;
> list_add_tail(&net->exit_list, &net_exit_list);


2022-06-06 04:47:14

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

__register_pernet_operations() executes the init hook of a registered
pernet_operations structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNT-marked
allocations are accounted to the corresponding container/memcg.

However, __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts the active memcg for each net namespace, so that
memory allocated inside ops_init() is accounted to the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
---
v6: rebased onto current upstream (v5.18-11267-gb00ed48bb0a7)

v5: documented get_mem_cgroup_from_obj() and for mem_cgroup_or_root()
functions, asked by Shakeel.

v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
get_net_memcg() renamed to mem_cgroup_or_root(), suggested by Roman.

v3: put_net_memcg() replaced by an already existing mem_cgroup_put()
It checks memcg before accessing it, this is required for
__register_pernet_operations() called before memcg initialization.
Additionally fixed leading whitespaces in non-memcg_kmem version
of mem_cgroup_from_obj().

v2: introduced get/put_net_memcg(),
new functions are moved under CONFIG_MEMCG_KMEM
to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
for the found memcg, suggested by Shakeel
---
include/linux/memcontrol.h | 47 +++++++++++++++++++++++++++++++++++++-
net/core/net_namespace.c | 7 ++++++
2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ecead1042b9..dad16b484cd5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1755,6 +1755,42 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
rcu_read_unlock();
}

+/**
+ * get_mem_cgroup_from_obj - get a memcg associated with passed kernel object.
+ * @p: pointer to object from which memcg should be extracted. It can be NULL.
+ *
+ * Retrieves the memory cgroup to which the memory of the pointed-to kernel
+ * object is accounted. If a memcg is found, its reference is taken.
+ * Returns NULL if the object is uncharged, if a proper memcg cannot be
+ * found, or if mem_cgroup is disabled.
+ *
+ * Return: valid memcg pointer with taken reference or NULL.
+ */
+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
+
+/**
+ * mem_cgroup_or_root - always returns a pointer to a valid memory cgroup.
+ * @memcg: pointer to a valid memory cgroup or NULL.
+ *
+ * If passed argument is not NULL, returns it without any additional checks
+ * and changes. Otherwise, root_mem_cgroup is returned.
+ *
+ * NOTE: root_mem_cgroup can be NULL during early boot.
+ */
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return memcg ? memcg : root_mem_cgroup;
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1798,7 +1834,7 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg)

static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
- return NULL;
+ return NULL;
}

static inline void count_objcg_event(struct obj_cgroup *objcg,
@@ -1806,6 +1842,15 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
{
}

+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ return NULL;
+}
+
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG_KMEM */

#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 0ec2f5906a27..6b9f19122ec1 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -18,6 +18,7 @@
#include <linux/user_namespace.h>
#include <linux/net_namespace.h>
#include <linux/sched/task.h>
+#include <linux/sched/mm.h>
#include <linux/uidgid.h>
#include <linux/cookie.h>

@@ -1143,7 +1144,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.36.1

2022-06-06 14:02:20

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On Fri, Jun 03, 2022 at 07:19:43AM +0300, Vasily Averin wrote:
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNT marked
> allocation are accounted for corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Acked-by: Shakeel Butt <[email protected]>
> ---
...
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 9ecead1042b9..dad16b484cd5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1755,6 +1755,42 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> rcu_read_unlock();
> }
>
> +/**
> + * get_mem_cgroup_from_obj - get a memcg associated with passed kernel object.
> + * @p: pointer to object from which memcg should be extracted. It can be NULL.
> + *
> + * Retrieves the memory group into which the memory of the pointed kernel
> + * object is accounted. If memcg is found, its reference is taken.
> + * If a passed kernel object is uncharged, or if proper memcg cannot be found,
> + * as well as if mem_cgroup is disabled, NULL is returned.
> + *
> + * Return: valid memcg pointer with taken reference or NULL.
> + */
> +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
> +{
> + struct mem_cgroup *memcg;
> +
> + rcu_read_lock();
> + do {
> + memcg = mem_cgroup_from_obj(p);
> + } while (memcg && !css_tryget(&memcg->css));
> + rcu_read_unlock();
> + return memcg;
> +}
...
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 0ec2f5906a27..6b9f19122ec1 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -18,6 +18,7 @@
> #include <linux/user_namespace.h>
> #include <linux/net_namespace.h>
> #include <linux/sched/task.h>
> +#include <linux/sched/mm.h>
> #include <linux/uidgid.h>
> #include <linux/cookie.h>
>
> @@ -1143,7 +1144,13 @@ static int __register_pernet_operations(struct list_head *list,
> * setup_net() and cleanup_net() are not possible.
> */
> for_each_net(net) {
> + struct mem_cgroup *old, *memcg;
> +
> + memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
> + old = set_active_memcg(memcg);
> error = ops_init(ops, net);
> + set_active_memcg(old);
> + mem_cgroup_put(memcg);
> if (error)
> goto out_undo;
> list_add_tail(&net->exit_list, &net_exit_list);
> --
> 2.36.1

This triggers a few boot warnings like the following.

virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)
WARNING: CPU: 87 PID: 3170 at arch/arm64/mm/physaddr.c:12 __virt_to_phys
CPU: 87 PID: 3170 Comm: modprobe Tainted: G B W 5.19.0-rc1-next-20220606 #138
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __virt_to_phys
lr : __virt_to_phys
sp : ffff800051cc76b0
x29: ffff800051cc76b0 x28: ffffd8efb5ba6ab8 x27: ffffd8efb5ba6b2c
x26: ffffd8efb1bccb20 x25: ffffd8efbaaf8200 x24: ffff800051cc77f0
x23: ffffd8efb744a000 x22: ffffd8efbb1bc000 x21: 0000600000000000
x20: 0000d8efe2d2fe00 x19: ffffd8efe2d2fe00 x18: 0000000000000443
x17: 0000000000000000 x16: 0000000000000002 x15: ffffd8efb9db2000
x14: 0000000000000001 x13: 0000000000000000 x12: ffff6806c88f8986
x11: 1fffe806c88f8985 x10: ffff6806c88f8985 x9 : dfff800000000000
x8 : ffff4036447c4c2b x7 : 0000000000000001 x6 : ffff6806c88f8985
x5 : ffff4036447c4c28 x4 : ffff6806c88f8986 x3 : ffffd8efb34b3850
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff400335f99a80
Call trace:
__virt_to_phys
mem_cgroup_from_obj
__register_pernet_operations
register_pernet_operations
register_pernet_subsys
nfnetlink_init [nfnetlink]
load_module
__do_sys_finit_module
__arm64_sys_finit_module
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
irq event stamp: 0
hardirqs last enabled at (0): 0x0
hardirqs last disabled at (0): copy_process
softirqs last enabled at (0): copy_process
softirqs last disabled at (0): 0x0

virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)
WARNING: CPU: 156 PID: 3176 at arch/arm64/mm/physaddr.c:12 __virt_to_phys
CPU: 156 PID: 3176 Comm: modprobe Tainted: G B W 5.19.0-rc1-next-20220606 #138
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __virt_to_phys
lr : __virt_to_phys
sp : ffff800051b376e0
x29: ffff800051b376e0 x28: ffffd8efb5ba6ab8 x27: ffffd8efb5ba6b2c
x26: ffffd8efb286e910 x25: ffffd8efbaaf8200 x24: ffff800051b37820
x23: ffffd8efb744a000 x22: ffffd8efbb1bc000 x21: 0000600000000000
x20: 0000d8efe2d2fe00 x19: ffffd8efe2d2fe00 x18: 00000000000001cb
x17: 0000000000000000 x16: 0000000000000002 x15: ffffd8efb9db2000
x14: 0000000000000001 x13: 0000000000000000 x12: ffff6806c8a03f86
x8 : ffff40364501fc2b x7 : 0000000000000001 x6 : ffff6806c8a03f85
x5 : ffff40364501fc28 x4 : ffff6806c8a03f86 x3 : ffffd8efb34b3850
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff40033376b4c0
Call trace:
__virt_to_phys
mem_cgroup_from_obj
__register_pernet_operations
register_pernet_operations
register_pernet_subsys
nf_tables_module_init [nf_tables]
do_one_initcall
do_init_module
load_module
__do_sys_finit_module
__arm64_sys_finit_module
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
irq event stamp: 0
hardirqs last enabled at (0): 0x0
hardirqs last disabled at (0): copy_process
softirqs last enabled at (0): copy_process
softirqs last disabled at (0): 0x0

2022-06-06 17:54:11

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On 6/6/22 16:49, Qian Cai wrote:
> This triggers a few boot warnings like those.
>
> virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)
> WARNING: CPU: 87 PID: 3170 at arch/arm64/mm/physaddr.c:12 __virt_to_phys

Thank you for reporting the problem,
Could you please provide me your config file via private email?

Thank you,
Vasily Averin

2022-06-07 03:12:56

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On Mon, Jun 06, 2022 at 08:37:26PM +0300, Vasily Averin wrote:
> On 6/6/22 16:49, Qian Cai wrote:
> > This triggers a few boot warnings like those.
> >
> > virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)
> > WARNING: CPU: 87 PID: 3170 at arch/arm64/mm/physaddr.c:12 __virt_to_phys
>
> Thank you for reporting the problem,
> Could you please provide me your config file via private email?

$ make ARCH=arm64 defconfig debug.config

2022-06-07 08:53:36

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On Mon, Jun 6, 2022 at 11:45 AM Vasily Averin <[email protected]> wrote:
>
[...]
>
> As far as I understand this report means that 'init_net' have incorrect
> virtual address on arm64.

So, the two call stacks show that the addresses belong to kernel
modules (nfnetlink and nf_tables), whose underlying memory is allocated
through vmalloc(), and virt_to_page() does not work on vmalloc()
addresses.

>
> Roman, Shakeel, I need your help
>
> Should we perhaps verify kaddr via virt_addr_valid() before using virt_to_page()
> If so, where it should be checked?

I think a virt_addr_valid() check in mem_cgroup_from_obj() should work,
but I think it is expensive on the arm64 platform. A cheaper and slightly
hacky way to avoid such addresses is to use is_vmalloc_addr() directly.
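For illustration, a sketch of what such a guard might look like (kernel-style pseudocode, not the actual upstream change; whether the check belongs in mem_cgroup_from_obj() or in its callers is exactly the open question here):

```
/* Hypothetical guard; a sketch, not the actual fix. */
struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
	struct folio *folio;

	if (mem_cgroup_disabled())
		return NULL;

	/*
	 * vmalloc()/module addresses have no linear mapping, so
	 * virt_to_folio() must not be called on them.
	 */
	if (unlikely(is_vmalloc_addr(p)))
		return NULL;

	folio = virt_to_folio(p);
	/* ... existing folio -> memcg lookup ... */
}
```

With such a guard, a struct net embedded in module memory would simply yield NULL, and mem_cgroup_or_root() would fall back to the root memcg instead of hitting the __virt_to_phys() warning.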

2022-06-07 09:53:33

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On 6/6/22 16:49, Qian Cai wrote:
> On Fri, Jun 03, 2022 at 07:19:43AM +0300, Vasily Averin wrote:

> This triggers a few boot warnings like those.
>
> virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)
> WARNING: CPU: 87 PID: 3170 at arch/arm64/mm/physaddr.c:12 __virt_to_phys
...
> Call trace:
> __virt_to_phys
> mem_cgroup_from_obj
> __register_pernet_operations

@@ -1143,7 +1144,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net)); <<<< Here
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
...
+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p); <<<<
+ } while (memcg && !css_tryget(&memcg->css));
...
struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
struct folio *folio;

if (mem_cgroup_disabled())
return NULL;

folio = virt_to_folio(p); <<<< here
...
static inline struct folio *virt_to_folio(const void *x)
{
struct page *page = virt_to_page(x); <<< here

... (arm64)
#define virt_to_page(x) pfn_to_page(virt_to_pfn(x))
...
#define virt_to_pfn(x) __phys_to_pfn(__virt_to_phys((unsigned long)(x)))
...
phys_addr_t __virt_to_phys(unsigned long x)
{
WARN(!__is_lm_address(__tag_reset(x)),
"virt_to_phys used for non-linear address: %pK (%pS)\n",

from arch/x86/include/asm/page.h:
* virt_to_page(kaddr) returns a valid pointer if and only if
* virt_addr_valid(kaddr) returns true.
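For reference, the arm64 predicate that triggers the WARN above looks
roughly like this (paraphrased sketch of arch/arm64/include/asm/memory.h;
the exact form differs between kernel versions):

```c
/* Paraphrased sketch, not verbatim kernel source: an address is a
 * linear-map ("lm") address only if it falls inside the linear-map
 * window.  Kernel-image symbols such as init_net live in the kimage
 * VA range instead, so the test fails for them and __virt_to_phys()
 * emits the warning seen above. */
#define __is_lm_address(addr) \
	(((u64)(addr) - PAGE_OFFSET) < (PAGE_END - PAGE_OFFSET))
```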

As far as I understand this report means that 'init_net' has an incorrect
virtual address on arm64.

Roman, Shakeel, I need your help

Should we perhaps verify kaddr via virt_addr_valid() before using virt_to_page()?
If so, where should it be checked?
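For example, an untested sketch of such a check at the top of
mem_cgroup_from_obj():

```c
/* Untested sketch: reject addresses that virt_to_page() cannot
 * handle before walking to the folio.  Whether returning NULL is
 * the right fallback here is exactly the open question. */
struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
	if (mem_cgroup_disabled())
		return NULL;

	if (!virt_addr_valid(p))
		return NULL;

	/* ... virt_to_folio(p) and the rest as before ... */
}
```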

Thank you,
Vasily Averin

2022-06-08 03:13:21

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On 6/7/22 08:58, Shakeel Butt wrote:
> On Mon, Jun 6, 2022 at 11:45 AM Vasily Averin <[email protected]> wrote:
>>
> [...]
>>
>> As far as I understand this report means that 'init_net' has an incorrect
>> virtual address on arm64.
>
> So, the two call stacks tell us that the addresses belong to kernel
> modules (nfnetlink and nf_tables) whose underlying memory is allocated
> through vmalloc(), and virt_to_page() does not work on vmalloc()
> addresses.

However, in both of these cases the chain get_mem_cgroup_from_obj() -> mem_cgroup_from_obj() ->
virt_to_folio() -> virt_to_page() -> virt_to_pfn() -> __virt_to_phys()
handles the address of a struct net taken from for_each_net().
The only net namespace that exists at this stage is init_net,
and the dmesg output confirms this:
"virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)"

>> Roman, Shakeel, I need your help
>>
>> Should we perhaps verify kaddr via virt_addr_valid() before using virt_to_page()?
>> If so, where should it be checked?
>
> I think a virt_addr_valid() check in mem_cgroup_from_obj() should work,
> but I think it is expensive on the arm64 platform. A cheaper and a
> bit hacky way to avoid such addresses is to use is_vmalloc_addr()
> directly.

I do not understand why you think the processed address is vmalloc-specific.
As far as I understand, it is the valid address of a static variable, and for some reason
arm64 does not consider such addresses valid linear-map virtual addresses.

Thank you,
Vasily Averin

2022-06-08 03:18:43

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg v6] net: set proper memcg for net_init hooks allocations

On Tue, Jun 7, 2022 at 5:37 AM Vasily Averin <[email protected]> wrote:
>
> On 6/7/22 08:58, Shakeel Butt wrote:
> > On Mon, Jun 6, 2022 at 11:45 AM Vasily Averin <[email protected]> wrote:
> >>
> > [...]
> >>
> >> As far as I understand this report means that 'init_net' has an incorrect
> >> virtual address on arm64.
> >
> > So, the two call stacks tell us that the addresses belong to kernel
> > modules (nfnetlink and nf_tables) whose underlying memory is allocated
> > through vmalloc(), and virt_to_page() does not work on vmalloc()
> > addresses.
>
> However, in both of these cases the chain get_mem_cgroup_from_obj() -> mem_cgroup_from_obj() ->
> virt_to_folio() -> virt_to_page() -> virt_to_pfn() -> __virt_to_phys()
> handles the address of a struct net taken from for_each_net().
> The only net namespace that exists at this stage is init_net,
> and the dmesg output confirms this:
> "virt_to_phys used for non-linear address: ffffd8efe2d2fe00 (init_net)"
>
> >> Roman, Shakeel, I need your help
> >>
> >> Should we perhaps verify kaddr via virt_addr_valid() before using virt_to_page()?
> >> If so, where should it be checked?
> >
> > I think a virt_addr_valid() check in mem_cgroup_from_obj() should work,
> > but I think it is expensive on the arm64 platform. A cheaper and a
> > bit hacky way to avoid such addresses is to use is_vmalloc_addr()
> > directly.
>
> I do not understand why you think the processed address is vmalloc-specific.
> As far as I understand, it is the valid address of a static variable, and for some reason
> arm64 does not consider such addresses valid linear-map virtual addresses.
>

Indeed you are right: we are using the addresses of net namespaces,
and the report already contains the address ffffd8efe2d2fe00, which is
init_net.

I don't know the right way to handle such addresses on arm64.
BTW, there is a separate report on this issue and the arm maintainers are
also CCed. Why not ask this question on that report?