[permalink] [raw]

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
>
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNTING marked
> allocation are accounted for corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---
> Dear Vlastimil, Roman,
> I'm not sure that memcg is used correctly here,
> is it perhaps some additional locking required?
> ---
> net/core/net_namespace.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a5b5bb99c644..171c6e0b2337 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -26,6 +26,7 @@
> #include <net/net_namespace.h>
> #include <net/netns/generic.h>
>
> +#include <linux/sched/mm.h>
> /*
> * Our network namespace constructor/destructor lists
> */
> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> * setup_net() and cleanup_net() are not possible.
> */
> for_each_net(net) {
> + struct mem_cgroup *old, *memcg = NULL;
> +#ifdef CONFIG_MEMCG
> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);

memcg from obj is unstable, so you need a reference on memcg. You can
introduce get_mem_cgroup_from_kmem() which works for both
MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
init_net) it should return NULL.

> +#endif
> + old = set_active_memcg(memcg);
> error = ops_init(ops, net);
> + set_active_memcg(old);
> if (error)
> goto out_undo;
> list_add_tail(&net->exit_list, &net_exit_list);
> --
> 2.31.1
>

2022-04-22 22:49:14

by Vasily Averin

[permalink] [raw]

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On 4/22/22 23:01, Vasily Averin wrote:
> On 4/21/22 18:56, Shakeel Butt wrote:
>> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
>>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>>> * setup_net() and cleanup_net() are not possible.
>>> */
>>> for_each_net(net) {
>>> + struct mem_cgroup *old, *memcg = NULL;
>>> +#ifdef CONFIG_MEMCG
>>> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
>>
>> memcg from obj is unstable, so you need a reference on memcg. You can
>> introduce get_mem_cgroup_from_kmem() which works for both
>> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
>> init_net) it should return NULL.
>
> Could you please elaborate with more details?
> It seems to me mem_cgroup_from_obj() does everything exactly as you say:
> - for slab objects it returns memcg taken from according slab->memcg_data
> - for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
> page_memcg_check() returns NULL
> - for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
> page_memcg_check() returns objcg->memcg
> - in another cases
> page_memcg_check() returns page->memcg_data,
> so for uncharged objects like init_net NULL should be returned.
>
> I can introduce exported get_mem_cgroup_from_kmem(), however it should only
> call mem_cgroup_from_obj(), perhaps under read_rcu_lock/unlock.

I think I finally got your point:
Do you mean I should use css_tryget(&memcg->css) for found memcg,
like get_mem_cgroup_from_mm() does?

Thank you,
Vasily Averin

2022-04-22 22:57:18

by Vasily Averin

[permalink] [raw]

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On 4/21/22 18:56, Shakeel Butt wrote:
> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>> * setup_net() and cleanup_net() are not possible.
>> */
>> for_each_net(net) {
>> + struct mem_cgroup *old, *memcg = NULL;
>> +#ifdef CONFIG_MEMCG
>> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
>
> memcg from obj is unstable, so you need a reference on memcg. You can
> introduce get_mem_cgroup_from_kmem() which works for both
> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
> init_net) it should return NULL.

Could you please elaborate with more details?
It seems to me mem_cgroup_from_obj() does everything exactly as you say:
- for slab objects it returns memcg taken from according slab->memcg_data
- for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
page_memcg_check() returns NULL
- for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
page_memcg_check() returns objcg->memcg
- in another cases
page_memcg_check() returns page->memcg_data,
so for uncharged objects like init_net NULL should be returned.

I can introduce exported get_mem_cgroup_from_kmem(), however it should only
call mem_cgroup_from_obj(), perhaps under read_rcu_lock/unlock.

Do you mean something like this?

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1768,4 +1768,14 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)

#endif /* CONFIG_MEMCG_KMEM */

+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_obj(p);
+ rcu_read_unlock();
+
+ return memcg;
+}
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..4003c47965c9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+#include <linux/sched/mm.h>
/*
* Our network namespace constructor/destructor lists
*/
@@ -1147,7 +1148,14 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = get_mem_cgroup_from_kmem(net);
+ if (memcg == NULL)
+ memcg = root_mem_cgroup;
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);

2022-04-22 23:05:55

by Shakeel Butt

[permalink] [raw]

Subject: Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations

On Fri, Apr 22, 2022 at 1:09 PM Vasily Averin <[email protected]> wrote:
>
> On 4/22/22 23:01, Vasily Averin wrote:
> > On 4/21/22 18:56, Shakeel Butt wrote:
> >> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <[email protected]> wrote:
> >>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> >>> * setup_net() and cleanup_net() are not possible.
> >>> */
> >>> for_each_net(net) {
> >>> + struct mem_cgroup *old, *memcg = NULL;
> >>> +#ifdef CONFIG_MEMCG
> >>> + memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
> >>
> >> memcg from obj is unstable, so you need a reference on memcg. You can
> >> introduce get_mem_cgroup_from_kmem() which works for both
> >> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
> >> init_net) it should return NULL.
> >
> > Could you please elaborate with more details?
> > It seems to me mem_cgroup_from_obj() does everything exactly as you say:
> > - for slab objects it returns memcg taken from according slab->memcg_data
> > - for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
> > page_memcg_check() returns NULL
> > - for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
> > page_memcg_check() returns objcg->memcg
> > - in another cases
> > page_memcg_check() returns page->memcg_data,
> > so for uncharged objects like init_net NULL should be returned.
> >
> > I can introduce exported get_mem_cgroup_from_kmem(), however it should only
> > call mem_cgroup_from_obj(), perhaps under read_rcu_lock/unlock.
>
> I think I finally got your point:
> Do you mean I should use css_tryget(&memcg->css) for found memcg,
> like get_mem_cgroup_from_mm() does?

Yes.

2022-04-23 09:40:36

by Vasily Averin

[permalink] [raw]

2022-04-24 18:37:24

by Vasily Averin

[permalink] [raw]

Subject: [PATCH memcg v2] net: set proper memcg for net_init hooks allocations

2022-04-25 08:35:14

by kernel test robot

[permalink] [raw]

Subject: [net] 3b379e5391: BUG:kernel_NULL_pointer_dereference,address

Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 3b379e5391e36e13b9f36305aa6d233fb03d4e58 ("[PATCH] net: set proper memcg for net_init hooks allocations")
url: https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/net-set-proper-memcg-for-net_init-hooks-allocations/20220423-160759
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git c00c5e1d157bec0ef0b0b59aa5482eb8dc7e8e49
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

+---------------------------------------------+------------+------------+
| | c00c5e1d15 | 3b379e5391 |
+---------------------------------------------+------------+------------+
| boot_successes | 9 | 0 |
| boot_failures | 0 | 32 |
| BUG:kernel_NULL_pointer_dereference,address | 0 | 32 |
| Oops:#[##] | 0 | 32 |
| EIP:__register_pernet_operations | 0 | 32 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 32 |
+---------------------------------------------+------------+------------+

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

[ 1.054816][ T0] BUG: kernel NULL pointer dereference, address: 0000002c
[ 1.055472][ T0] #PF: supervisor read access in kernel mode
[ 1.056034][ T0] #PF: error_code(0x0000) - not-present page
[ 1.056650][ T0] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 1.056795][ T0] Oops: 0000 [#1] SMP PTI
[ 1.056795][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.18.0-rc3-00191-g3b379e5391e3 #1
[ 1.056795][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1.056795][ T0] EIP: __register_pernet_operations+0x169/0x340
[ 1.056795][ T0] Code: 1e d4 8b 40 08 a8 03 0f 85 44 01 00 00 64 ff 00 64 ff 0d d4 06 1e d4 e9 1d ff ff ff 8d 74 26 00 90 8b 45 e0 89 b8 0c 0f 00 00 <
f6> 43 2c 01 0f 85 68 ff ff ff 64 ff 05 d4 06 1e d4 8b 43 08 a8 03
[ 1.056795][ T0] EAX: d3cf4740 EBX: 00000000 ECX: 00000000 EDX: 00000cc0
[ 1.056795][ T0] ESI: d4331340 EDI: 00000000 EBP: d3cedf58 ESP: d3cedf34
[ 1.056795][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210246
[ 1.056795][ T0] CR0: 80050033 CR2: 0000002c CR3: 141f8000 CR4: 000406b0
[ 1.056795][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 1.056795][ T0] DR6: fffe0ff0 DR7: 00000400
[ 1.056795][ T0] Call Trace:
[ 1.056795][ T0] ? setup_net+0x44/0x300
[ 1.056795][ T0] register_pernet_operations+0x5c/0xc0
[ 1.056795][ T0] register_pernet_subsys+0x21/0x40
[ 1.056795][ T0] net_ns_init+0xb1/0xf1
[ 1.056795][ T0] start_kernel+0x403/0x46d
[ 1.056795][ T0] i386_start_kernel+0x48/0x4a
[ 1.056795][ T0] startup_32_smp+0x161/0x164
[ 1.056795][ T0] Modules linked in:
[ 1.056795][ T0] CR2: 000000000000002c
[ 1.056795][ T0] ---[ end trace 0000000000000000 ]---
[ 1.056795][ T0] EIP: __register_pernet_operations+0x169/0x340
[ 1.056795][ T0] Code: 1e d4 8b 40 08 a8 03 0f 85 44 01 00 00 64 ff 00 64 ff 0d d4 06 1e d4 e9 1d ff ff ff 8d 74 26 00 90 8b 45 e0 89 b8 0c 0f 00 00 <f6> 43 2c 01 0f 85 68 ff ff ff 64 ff 05 d4 06 1e d4 8b 43 08 a8 03
[ 1.056795][ T0] EAX: d3cf4740 EBX: 00000000 ECX: 00000000 EDX: 00000cc0
[ 1.056795][ T0] ESI: d4331340 EDI: 00000000 EBP: d3cedf58 ESP: d3cedf34
[ 1.056795][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210246
[ 1.056795][ T0] CR0: 80050033 CR2: 0000002c CR3: 141f8000 CR4: 000406b0
[ 1.056795][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 1.056795][ T0] DR6: fffe0ff0 DR7: 00000400
[ 1.056795][ T0] Kernel panic - not syncing: Fatal exception

To reproduce:

# build kernel
cd linux
cp config-5.18.0-rc3-00191-g3b379e5391e3 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

--
0-DAY CI Kernel Test Service
https://01.org/lkp

Attachments:

(No filename) (4.90 kB)
config-5.18.0-rc3-00191-g3b379e5391e3 (144.02 kB)
job-script (4.59 kB)
dmesg.xz (5.67 kB)
Download all attachments

2022-04-25 15:51:48

by Vasily Averin

[permalink] [raw]

__register_pernet_operations() executes init hook of registered
pernet_operation structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNT marked
allocation are accounted for corresponding container/memcg.

However __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts active memcg for each net namespace and helps
to account memory allocated inside ops_init() into the proper memcg.

Signed-off-by: Vasily Averin <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
---
v6: re-based to current upstream (v5.18-11267-gb00ed48bb0a7)

v5: documented get_mem_cgroup_from_obj() and for mem_cgroup_or_root()
functions, asked by Shakeel.

v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
get_net_memcg() renamed to mem_cgroup_or_root(), suggested by Roman.

v3: put_net_memcg() replaced by an alreay existing mem_cgroup_put()
It checks memcg before accessing it, this is required for
__register_pernet_operations() called before memcg initialization.
Additionally fixed leading whitespaces in non-memcg_kmem version
of mem_cgroup_from_obj().

v2: introduced get/put_net_memcg(),
new functions are moved under CONFIG_MEMCG_KMEM
to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
for the found memcg, suggested by Shakeel
---
include/linux/memcontrol.h | 47 +++++++++++++++++++++++++++++++++++++-
net/core/net_namespace.c | 7 ++++++
2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ecead1042b9..dad16b484cd5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1755,6 +1755,42 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
rcu_read_unlock();
}

+/**
+ * get_mem_cgroup_from_obj - get a memcg associated with passed kernel object.
+ * @p: pointer to object from which memcg should be extracted. It can be NULL.
+ *
+ * Retrieves the memory group into which the memory of the pointed kernel
+ * object is accounted. If memcg is found, its reference is taken.
+ * If a passed kernel object is uncharged, or if proper memcg cannot be found,
+ * as well as if mem_cgroup is disabled, NULL is returned.
+ *
+ * Return: valid memcg pointer with taken reference or NULL.
+ */
+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = mem_cgroup_from_obj(p);
+ } while (memcg && !css_tryget(&memcg->css));
+ rcu_read_unlock();
+ return memcg;
+}
+
+/**
+ * mem_cgroup_or_root - always returns a pointer to a valid memory cgroup.
+ * @memcg: pointer to a valid memory cgroup or NULL.
+ *
+ * If passed argument is not NULL, returns it without any additional checks
+ * and changes. Otherwise, root_mem_cgroup is returned.
+ *
+ * NOTE: root_mem_cgroup can be NULL during early boot.
+ */
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return memcg ? memcg : root_mem_cgroup;
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1798,7 +1834,7 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg)

static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
- return NULL;
+ return NULL;
}

static inline void count_objcg_event(struct obj_cgroup *objcg,
@@ -1806,6 +1842,15 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
{
}

+static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p)
+{
+ return NULL;
+}
+
+static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG_KMEM */

#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 0ec2f5906a27..6b9f19122ec1 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -18,6 +18,7 @@
#include <linux/user_namespace.h>
#include <linux/net_namespace.h>
#include <linux/sched/task.h>
+#include <linux/sched/mm.h>
#include <linux/uidgid.h>
#include <linux/cookie.h>

@@ -1143,7 +1144,13 @@ static int __register_pernet_operations(struct list_head *list,
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
+ struct mem_cgroup *old, *memcg;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
+ old = set_active_memcg(memcg);
error = ops_init(ops, net);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
--
2.36.1

2022-06-06 14:02:20

by Qian Cai

[permalink] [raw]

On 27.09.22 11:54, Vlastimil Babka wrote:
> On 9/18/22 11:28, Anatoly Pugachev wrote:
>> On Fri, Jun 03, 2022 at 07:19:43AM +0300, Vasily Averin wrote:
>>> __register_pernet_operations() executes init hook of registered
>>> pernet_operation structure in all existing net namespaces.
>> [...]
>> I'm unable to boot my sparc64 VM anymore (5.19 still boots, 6.0-rc1 does not),
>> bisected up to this patch,
>>
>> mator@ttip:~/linux-2.6$ git bisect bad
>> 1d0403d20f6c281cb3d14c5f1db5317caeec48e9 is the first bad commit
>> commit 1d0403d20f6c281cb3d14c5f1db5317caeec48e9
>> [...]
>
> #regzbot introduced: 1d0403d20f6c

Thx for getting this regression tracked using regzbot. FWIW, that went
sideways (as your already noticed and mentioned on IRC), as that made
regzbot treat *your* mail as the report of the regressions. In cases
like this you need "#regzbot ^introduced 1d0403d20f6c" (since recently
"#regzbot introduced 1d0403d20f6c ^" works, too), as then regzbot will
consider the *parent* mail the report (and then regzbot will look out
for patches that link to them using a Link: tag).

No worries, I did the same mistake a few time already :-D I send a mail
with that command now, so let's resolve this subthread by marking it invalid

#regzbot invalid: mis-used regzbot command, now properly tracked

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.