2022-07-31 15:54:43

by Vasily Averin

[permalink] [raw]
Subject: [PATCH 0/3] enable memcg accounting for kernfs objects

This patch set enables memcg accounting for kernfs-related objects.

Originally it was a part of patch set
"memcg: accounting for objects allocated by mkdir cgroup"
https://lore.kernel.org/all/[email protected]/

The patches have received approval from several developers,
however the respected Michal Hocko pointed out that, if necessary,
cgroup consumption can be restricted via the cgroup.max.descendants
limit without additional accounting of allocated memory.
I still disagree with him; I think that memory limits work better.
However, I could not give any new substantial arguments, so the discussion
stalled and the patches were frozen in limbo until better times.

However, these 3 patches affect not only cgroups, and I hope
to get help from the kernfs maintainers.

Kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant share of all allocated memory.
I am aware of the following cases, but I am sure there are many others.

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
structures.

3) Shakeel Butt reports that Google has workloads which create 100s
of subcontainers and they have observed high system overhead
without memcg accounting of kernfs.

My experiments with an LXC container on a Fedora node show that
a new kernfs node usually creates a few other objects:

Allocs    Alloc   Allocation
number    size
--------------------------------------------
  1    +   128    (__kernfs_new_node+0x4d)    kernfs node
  1    +    88    (__kernfs_iattrs+0x57)      kernfs iattrs
  1    +    96    (simple_xattr_alloc+0x28)   simple_xattr(*)
  1         32    (simple_xattr_set+0x59)
  1          8    (__kernfs_new_node+0x30)

'+' -- to be accounted

(*) the simple_xattr in this scenario was allocated directly during
kernfs creation, for the selinux label. Even here it consumes a noticeable
part of the newly allocated objects.
However, please keep in mind that xattrs can also be allocated later,
via setxattr system calls; their size is controlled by userspace
and can reach 64Kb per call. kernfs objects live in memory,
so it is important to account them.
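
For reference, the whole series boils down to changes of the following
shape (a sketch only -- see the individual patches for the exact hunks):
the two kernfs slab caches are created with SLAB_ACCOUNT and the
simple_xattr allocation switches to GFP_KERNEL_ACCOUNT, so that all of
the objects above are charged to the allocating memcg:

/* fs/kernfs/mount.c: mark both kernfs caches as accounted */
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
				      sizeof(struct kernfs_node), 0,
				      SLAB_PANIC | SLAB_ACCOUNT, NULL);
kernfs_iattrs_cache = kmem_cache_create("kernfs_iattrs_cache",
				      sizeof(struct kernfs_iattrs), 0,
				      SLAB_PANIC | SLAB_ACCOUNT, NULL);

/* fs/xattr.c, simple_xattr_alloc(): charge the value buffer as well */
new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);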

Originally the patches were split up to simplify their review,
however if required I can merge them together.

Vasily Averin (3):
memcg: enable accounting for kernfs nodes
memcg: enable accounting for kernfs iattrs
memcg: enable accounting for struct simple_xattr

fs/kernfs/mount.c | 6 ++++--
fs/xattr.c | 2 +-
2 files changed, 5 insertions(+), 3 deletions(-)

--
2.25.1



2022-08-09 17:56:33

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH 0/3] enable memcg accounting for kernfs objects

On Tue, Aug 09, 2022 at 07:31:31AM -1000, Tejun Heo <[email protected]> wrote:
> I'm not quite sure whether following the usual "charge it to the allocating
> task's cgroup" is the best way to go about it. I wonder whether it'd be
> better to attach it to the new cgroup's nearest ancestor with memcg enabled.

See also
https://lore.kernel.org/r/YnBLge4ZQNbbxufc@blackbook/
and
https://lore.kernel.org/r/[email protected]/

Michal

2022-08-09 17:57:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/3] enable memcg accounting for kernfs objects

(cc'ing Johannes)

Hello,

On Sun, Jul 31, 2022 at 06:37:15PM +0300, Vasily Averin wrote:
> 1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
> was allocated for 80+ kernfs nodes.
>
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
> structures.
>
> 3) Shakeel Butt reports that Google has workloads which create 100s
> of subcontainers and they have observed high system overhead
> without memcg accounting of kernfs.

So, I don't have anything against accounting kernfs objects in general but,
for cgroups, because cgroups are what determines what gets charged where,
I'm not quite sure whether following the usual "charge it to the allocating
task's cgroup" is the best way to go about it. I wonder whether it'd be
better to attach it to the new cgroup's nearest ancestor with memcg enabled.
Johannes, Michal, what do you guys think?

Thanks.

--
tejun

2022-08-09 17:58:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/3] enable memcg accounting for kernfs objects

Hello,

On Tue, Aug 09, 2022 at 07:49:34PM +0200, Michal Koutný wrote:
> On Tue, Aug 09, 2022 at 07:31:31AM -1000, Tejun Heo <[email protected]> wrote:
> > I'm not quite sure whether following the usual "charge it to the allocating
> > task's cgroup" is the best way to go about it. I wonder whether it'd be
> > better to attach it to the new cgroup's nearest ancestor with memcg enabled.
>
> See also
> https://lore.kernel.org/r/YnBLge4ZQNbbxufc@blackbook/
> and
> https://lore.kernel.org/r/[email protected]/

Ah, thanks. Vasily, can you please include some summary of the discussions
and the rationales for the path taken in the commit message?

Thanks.

--
tejun

2022-08-09 21:48:07

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 0/3] enable memcg accounting for kernfs objects

On Tue, Aug 09, 2022 at 07:31:31AM -1000, Tejun Heo wrote:
> (cc'ing Johannes)
>
> Hello,
>
> On Sun, Jul 31, 2022 at 06:37:15PM +0300, Vasily Averin wrote:
> > 1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
> > was allocated for 80+ kernfs nodes.
> >
> > 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
> > structures.
> >
> > 3) Shakeel Butt reports that Google has workloads which create 100s
> > of subcontainers and they have observed high system overhead
> > without memcg accounting of kernfs.
>
> So, I don't have anything against accounting kernfs objects in general but,
> for cgroups, because cgroups are what determines what gets charged where,
> I'm not quite sure whether following the usual "charge it to the allocating
> task's cgroup" is the best way to go about it. I wonder whether it'd be
> better to attach it to the new cgroup's nearest ancestor with memcg enabled.

I also like this approach better, however Michal had some arguments against it.
I don't think there is a huge practical difference, so I'm ok with either
approach.

Thanks!

2022-08-11 03:30:29

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH 0/3] enable memcg accounting for kernfs objects

On 8/9/22 20:56, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 09, 2022 at 07:49:34PM +0200, Michal Koutný wrote:
>> On Tue, Aug 09, 2022 at 07:31:31AM -1000, Tejun Heo <[email protected]> wrote:
>>> I'm not quite sure whether following the usual "charge it to the allocating
>>> task's cgroup" is the best way to go about it. I wonder whether it'd be
>>> better to attach it to the new cgroup's nearest ancestor with memcg enabled.
>>
>> See also
>> https://lore.kernel.org/r/YnBLge4ZQNbbxufc@blackbook/
>> and
>> https://lore.kernel.org/r/[email protected]/
>
> Ah, thanks. Vasily, can you please include some summary of the discussions
> and the rationales for the path taken in the commit message?

Dear Tejun,
thank you for the feedback, I'll do it in the next patch set iteration.

However, I noticed another problem in the neighborhood and plan to
add new patches to the current patch set. One of the new patches is quite simple;
the second one, however, is quite complex and requires some discussion.

So I'm still thinking about how best to solve these issues.

Thank you,
Vasily Averin

2022-08-11 05:14:58

by Vasily Averin

[permalink] [raw]
Subject: [RFC PATCH] kernfs: enable per-inode limits for all xattr types

Currently it's possible to create a huge number of xattrs per inode,
and I would like to add a USER-like restriction to all xattr types.

I've prepared an RFC patch and would like to discuss it.

This patch moves the counter calculations into simple_xattr_set(),
under the simple_xattrs spinlock. This allows replacing the atomic
counters currently used for USER xattrs with plain ints.

To keep the current behaviour for USER xattrs I keep the current limits
and counters and add separate limits and counters for all other
xattr types. However, I would like to merge these limits and counters
together, because it simplifies the code even more.
Could someone please clarify if this is acceptable?

Signed-off-by: Vasily Averin <[email protected]>
---
fs/kernfs/inode.c | 67 ++-----------------------------------
fs/kernfs/kernfs-internal.h | 2 --
fs/xattr.c | 56 +++++++++++++++++++------------
include/linux/kernfs.h | 2 ++
include/linux/xattr.h | 11 ++++--
mm/shmem.c | 29 +++++++---------
6 files changed, 61 insertions(+), 106 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..7cfdda41fc89 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -47,8 +47,6 @@ static struct kernfs_iattrs *__kernfs_iattrs(struct kernfs_node *kn, int alloc)
kn->iattr->ia_ctime = kn->iattr->ia_atime;

simple_xattrs_init(&kn->iattr->xattrs);
- atomic_set(&kn->iattr->nr_user_xattrs, 0);
- atomic_set(&kn->iattr->user_xattr_size, 0);
out_unlock:
ret = kn->iattr;
mutex_unlock(&iattr_mutex);
@@ -314,7 +312,7 @@ int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
if (!attrs)
return -ENOMEM;

- return simple_xattr_set(&attrs->xattrs, name, value, size, flags, NULL);
+ return simple_xattr_set(&attrs->xattrs, name, value, size, flags);
}

static int kernfs_vfs_xattr_get(const struct xattr_handler *handler,
@@ -339,61 +337,6 @@ static int kernfs_vfs_xattr_set(const struct xattr_handler *handler,
return kernfs_xattr_set(kn, name, value, size, flags);
}

-static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
- const char *full_name,
- struct simple_xattrs *xattrs,
- const void *value, size_t size, int flags)
-{
- atomic_t *sz = &kn->iattr->user_xattr_size;
- atomic_t *nr = &kn->iattr->nr_user_xattrs;
- ssize_t removed_size;
- int ret;
-
- if (atomic_inc_return(nr) > KERNFS_MAX_USER_XATTRS) {
- ret = -ENOSPC;
- goto dec_count_out;
- }
-
- if (atomic_add_return(size, sz) > KERNFS_USER_XATTR_SIZE_LIMIT) {
- ret = -ENOSPC;
- goto dec_size_out;
- }
-
- ret = simple_xattr_set(xattrs, full_name, value, size, flags,
- &removed_size);
-
- if (!ret && removed_size >= 0)
- size = removed_size;
- else if (!ret)
- return 0;
-dec_size_out:
- atomic_sub(size, sz);
-dec_count_out:
- atomic_dec(nr);
- return ret;
-}
-
-static int kernfs_vfs_user_xattr_rm(struct kernfs_node *kn,
- const char *full_name,
- struct simple_xattrs *xattrs,
- const void *value, size_t size, int flags)
-{
- atomic_t *sz = &kn->iattr->user_xattr_size;
- atomic_t *nr = &kn->iattr->nr_user_xattrs;
- ssize_t removed_size;
- int ret;
-
- ret = simple_xattr_set(xattrs, full_name, value, size, flags,
- &removed_size);
-
- if (removed_size >= 0) {
- atomic_sub(removed_size, sz);
- atomic_dec(nr);
- }
-
- return ret;
-}
-
static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,
struct user_namespace *mnt_userns,
struct dentry *unused, struct inode *inode,
@@ -411,13 +354,7 @@ static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,
if (!attrs)
return -ENOMEM;

- if (value)
- return kernfs_vfs_user_xattr_add(kn, full_name, &attrs->xattrs,
- value, size, flags);
- else
- return kernfs_vfs_user_xattr_rm(kn, full_name, &attrs->xattrs,
- value, size, flags);
-
+ return simple_xattr_set(&attrs->xattrs, full_name, value, size, flags);
}

static const struct xattr_handler kernfs_trusted_xattr_handler = {
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index eeaa779b929c..a2b89bd48c9d 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -27,8 +27,6 @@ struct kernfs_iattrs {
struct timespec64 ia_ctime;

struct simple_xattrs xattrs;
- atomic_t nr_user_xattrs;
- atomic_t user_xattr_size;
};

struct kernfs_root {
diff --git a/fs/xattr.c b/fs/xattr.c
index b4875514a3ee..de4a2efc7fa4 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1037,6 +1037,11 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
return ret;
}

+static bool xattr_is_user(const char *name)
+{
+ return !strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN);
+}
+
/**
* simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
* @xattrs: target simple_xattr list
@@ -1053,16 +1058,17 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
* Returns 0 on success, -errno on failure.
*/
int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
- const void *value, size_t size, int flags,
- ssize_t *removed_size)
+ const void *value, size_t size, int flags)
{
struct simple_xattr *xattr;
struct simple_xattr *new_xattr = NULL;
+ bool is_user_xattr = false;
+ int *sz = &xattrs->xattr_size;
+ int *nr = &xattrs->nr_xattrs;
+ int sz_limit = KERNFS_XATTR_SIZE_LIMIT;
+ int nr_limit = KERNFS_MAX_XATTRS;
int err = 0;

- if (removed_size)
- *removed_size = -1;
-
/* value == NULL means remove */
if (value) {
new_xattr = simple_xattr_alloc(value, size);
@@ -1076,6 +1082,14 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
}
}

+ is_user_xattr = xattr_is_user(name);
+ if (is_user_xattr) {
+ sz = &xattrs->user_xattr_size;
+ nr = &xattrs->nr_user_xattrs;
+ sz_limit = KERNFS_USER_XATTR_SIZE_LIMIT;
+ nr_limit = KERNFS_MAX_USER_XATTRS;
+ }
+
spin_lock(&xattrs->lock);
list_for_each_entry(xattr, &xattrs->head, list) {
if (!strcmp(name, xattr->name)) {
@@ -1083,13 +1097,19 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
xattr = new_xattr;
err = -EEXIST;
} else if (new_xattr) {
- list_replace(&xattr->list, &new_xattr->list);
- if (removed_size)
- *removed_size = xattr->size;
+ int delta = new_xattr->size - xattr->size;
+
+ if (*sz + delta > sz_limit) {
+ xattr = new_xattr;
+ err = -ENOSPC;
+ } else {
+ *sz += delta;
+ list_replace(&xattr->list, &new_xattr->list);
+ }
} else {
+ *sz -= xattr->size;
+ (*nr)--;
list_del(&xattr->list);
- if (removed_size)
- *removed_size = xattr->size;
}
goto out;
}
@@ -1097,7 +1117,12 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
if (flags & XATTR_REPLACE) {
xattr = new_xattr;
err = -ENODATA;
+ } else if ((*sz + new_xattr->size > sz_limit) || (*nr == nr_limit)) {
+ xattr = new_xattr;
+ err = -ENOSPC;
} else {
+ *sz += new_xattr->size;
+ (*nr)++;
list_add(&new_xattr->list, &xattrs->head);
xattr = NULL;
}
@@ -1172,14 +1197,3 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,

return err ? err : size - remaining_size;
}
-
-/*
- * Adds an extended attribute to the list
- */
-void simple_xattr_list_add(struct simple_xattrs *xattrs,
- struct simple_xattr *new_xattr)
-{
- spin_lock(&xattrs->lock);
- list_add(&new_xattr->list, &xattrs->head);
- spin_unlock(&xattrs->lock);
-}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index e2ae15a6225e..1972beb0d7b9 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -44,6 +44,8 @@ enum kernfs_node_type {
#define KERNFS_FLAG_MASK ~KERNFS_TYPE_MASK
#define KERNFS_MAX_USER_XATTRS 128
#define KERNFS_USER_XATTR_SIZE_LIMIT (128 << 10)
+#define KERNFS_MAX_XATTRS 128
+#define KERNFS_XATTR_SIZE_LIMIT (128 << 10)

enum kernfs_node_flag {
KERNFS_ACTIVATED = 0x0010,
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index 4c379d23ec6e..c6b9258958d5 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -81,6 +81,10 @@ static inline const char *xattr_prefix(const struct xattr_handler *handler)

struct simple_xattrs {
struct list_head head;
+ int nr_xattrs;
+ int nr_user_xattrs;
+ int xattr_size;
+ int user_xattr_size;
spinlock_t lock;
};

@@ -98,6 +102,10 @@ static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
{
INIT_LIST_HEAD(&xattrs->head);
spin_lock_init(&xattrs->lock);
+ xattrs->nr_xattrs = 0;
+ xattrs->nr_user_xattrs = 0;
+ xattrs->xattr_size = 0;
+ xattrs->user_xattr_size = 0;
}

/*
@@ -117,8 +125,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size);
int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
void *buffer, size_t size);
int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
- const void *value, size_t size, int flags,
- ssize_t *removed_size);
+ const void *value, size_t size, int flags);
ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs, char *buffer,
size_t size);
void simple_xattr_list_add(struct simple_xattrs *xattrs,
diff --git a/mm/shmem.c b/mm/shmem.c
index 66eed363e5c2..0215c16a2643 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3155,30 +3155,27 @@ static int shmem_initxattrs(struct inode *inode,
struct shmem_inode_info *info = SHMEM_I(inode);
const struct xattr *xattr;
struct simple_xattr *new_xattr;
+ char *name;
size_t len;
+ int ret = 0;

for (xattr = xattr_array; xattr->name != NULL; xattr++) {
- new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len);
- if (!new_xattr)
- return -ENOMEM;
-
len = strlen(xattr->name) + 1;
- new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
- GFP_KERNEL_ACCOUNT);
- if (!new_xattr->name) {
- kvfree(new_xattr);
+ name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, GFP_KERNEL);
+ if (!name)
return -ENOMEM;
- }

- memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
- XATTR_SECURITY_PREFIX_LEN);
- memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
- xattr->name, len);
+ memcpy(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN);
+ memcpy(name + XATTR_SECURITY_PREFIX_LEN, xattr->name, len);

- simple_xattr_list_add(&info->xattrs, new_xattr);
+ ret = simple_xattr_set(&info->xattrs, name, xattr->value,
+ xattr->value_len, XATTR_CREATE);
+ kfree(name);
+ if (ret)
+ break;
}

- return 0;
+ return ret;
}

static int shmem_xattr_handler_get(const struct xattr_handler *handler,
@@ -3200,7 +3197,7 @@ static int shmem_xattr_handler_set(const struct xattr_handler *handler,
struct shmem_inode_info *info = SHMEM_I(inode);

name = xattr_full_name(handler, name);
- return simple_xattr_set(&info->xattrs, name, value, size, flags, NULL);
+ return simple_xattr_set(&info->xattrs, name, value, size, flags);
}

static const struct xattr_handler shmem_security_xattr_handler = {
--
2.25.1

2022-08-12 10:25:43

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: enable per-inode limits for all xattr types

On Thu, Aug 11, 2022 at 07:58:46AM +0300, Vasily Averin wrote:
> Currently it's possible to create a huge number of xattrs per inode,
> and I would like to add a USER-like restriction to all xattr types.
>
> I've prepared an RFC patch and would like to discuss it.
>
> This patch moves the counter calculations into simple_xattr_set(),
> under the simple_xattrs spinlock. This allows replacing the atomic
> counters currently used for USER xattrs with plain ints.
>
> To keep the current behaviour for USER xattrs I keep the current limits
> and counters and add separate limits and counters for all other
> xattr types. However, I would like to merge these limits and counters
> together, because it simplifies the code even more.
> Could someone please clarify if this is acceptable?
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---

Hey Vasily,

Fyi, this patch doesn't seem to apply cleanly to anything from v5.17 up
until current master. It's a bit unfortunate that it's the middle of the
merge window so ideally you'd resend once the merge window closes on top
of v5.20-rc1.

A few questions below.

> fs/kernfs/inode.c | 67 ++-----------------------------------
> fs/kernfs/kernfs-internal.h | 2 --
> fs/xattr.c | 56 +++++++++++++++++++------------
> include/linux/kernfs.h | 2 ++
> include/linux/xattr.h | 11 ++++--
> mm/shmem.c | 29 +++++++---------
> 6 files changed, 61 insertions(+), 106 deletions(-)
>
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 3d783d80f5da..7cfdda41fc89 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -47,8 +47,6 @@ static struct kernfs_iattrs *__kernfs_iattrs(struct kernfs_node *kn, int alloc)
> kn->iattr->ia_ctime = kn->iattr->ia_atime;
>
> simple_xattrs_init(&kn->iattr->xattrs);
> - atomic_set(&kn->iattr->nr_user_xattrs, 0);
> - atomic_set(&kn->iattr->user_xattr_size, 0);
> out_unlock:
> ret = kn->iattr;
> mutex_unlock(&iattr_mutex);
> @@ -314,7 +312,7 @@ int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
> if (!attrs)
> return -ENOMEM;
>
> - return simple_xattr_set(&attrs->xattrs, name, value, size, flags, NULL);
> + return simple_xattr_set(&attrs->xattrs, name, value, size, flags);
> }
>
> static int kernfs_vfs_xattr_get(const struct xattr_handler *handler,
> @@ -339,61 +337,6 @@ static int kernfs_vfs_xattr_set(const struct xattr_handler *handler,
> return kernfs_xattr_set(kn, name, value, size, flags);
> }
>
> -static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
> - const char *full_name,
> - struct simple_xattrs *xattrs,
> - const void *value, size_t size, int flags)
> -{
> - atomic_t *sz = &kn->iattr->user_xattr_size;
> - atomic_t *nr = &kn->iattr->nr_user_xattrs;
> - ssize_t removed_size;
> - int ret;
> -
> - if (atomic_inc_return(nr) > KERNFS_MAX_USER_XATTRS) {
> - ret = -ENOSPC;
> - goto dec_count_out;
> - }
> -
> - if (atomic_add_return(size, sz) > KERNFS_USER_XATTR_SIZE_LIMIT) {
> - ret = -ENOSPC;
> - goto dec_size_out;
> - }
> -
> - ret = simple_xattr_set(xattrs, full_name, value, size, flags,
> - &removed_size);
> -
> - if (!ret && removed_size >= 0)
> - size = removed_size;
> - else if (!ret)
> - return 0;
> -dec_size_out:
> - atomic_sub(size, sz);
> -dec_count_out:
> - atomic_dec(nr);
> - return ret;
> -}
> -
> -static int kernfs_vfs_user_xattr_rm(struct kernfs_node *kn,
> - const char *full_name,
> - struct simple_xattrs *xattrs,
> - const void *value, size_t size, int flags)
> -{
> - atomic_t *sz = &kn->iattr->user_xattr_size;
> - atomic_t *nr = &kn->iattr->nr_user_xattrs;
> - ssize_t removed_size;
> - int ret;
> -
> - ret = simple_xattr_set(xattrs, full_name, value, size, flags,
> - &removed_size);
> -
> - if (removed_size >= 0) {
> - atomic_sub(removed_size, sz);
> - atomic_dec(nr);
> - }
> -
> - return ret;
> -}
> -
> static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,
> struct user_namespace *mnt_userns,
> struct dentry *unused, struct inode *inode,
> @@ -411,13 +354,7 @@ static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,
> if (!attrs)
> return -ENOMEM;
>
> - if (value)
> - return kernfs_vfs_user_xattr_add(kn, full_name, &attrs->xattrs,
> - value, size, flags);
> - else
> - return kernfs_vfs_user_xattr_rm(kn, full_name, &attrs->xattrs,
> - value, size, flags);
> -
> + return simple_xattr_set(&attrs->xattrs, full_name, value, size, flags);
> }
>
> static const struct xattr_handler kernfs_trusted_xattr_handler = {
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index eeaa779b929c..a2b89bd48c9d 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -27,8 +27,6 @@ struct kernfs_iattrs {
> struct timespec64 ia_ctime;
>
> struct simple_xattrs xattrs;
> - atomic_t nr_user_xattrs;
> - atomic_t user_xattr_size;
> };
>
> struct kernfs_root {
> diff --git a/fs/xattr.c b/fs/xattr.c
> index b4875514a3ee..de4a2efc7fa4 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -1037,6 +1037,11 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> return ret;
> }
>
> +static bool xattr_is_user(const char *name)
> +{
> + return !strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN);
> +}
> +
> /**
> * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> * @xattrs: target simple_xattr list
> @@ -1053,16 +1058,17 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> * Returns 0 on success, -errno on failure.
> */
> int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> - const void *value, size_t size, int flags,
> - ssize_t *removed_size)
> + const void *value, size_t size, int flags)
> {
> struct simple_xattr *xattr;
> struct simple_xattr *new_xattr = NULL;
> + bool is_user_xattr = false;
> + int *sz = &xattrs->xattr_size;
> + int *nr = &xattrs->nr_xattrs;
> + int sz_limit = KERNFS_XATTR_SIZE_LIMIT;
> + int nr_limit = KERNFS_MAX_XATTRS;
> int err = 0;
>
> - if (removed_size)
> - *removed_size = -1;
> -
> /* value == NULL means remove */
> if (value) {
> new_xattr = simple_xattr_alloc(value, size);
> @@ -1076,6 +1082,14 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> }
> }
>
> + is_user_xattr = xattr_is_user(name);
> + if (is_user_xattr) {
> + sz = &xattrs->user_xattr_size;
> + nr = &xattrs->nr_user_xattrs;
> + sz_limit = KERNFS_USER_XATTR_SIZE_LIMIT;
> + nr_limit = KERNFS_MAX_USER_XATTRS;
> + }
> +
> spin_lock(&xattrs->lock);
> list_for_each_entry(xattr, &xattrs->head, list) {
> if (!strcmp(name, xattr->name)) {
> @@ -1083,13 +1097,19 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> xattr = new_xattr;
> err = -EEXIST;
> } else if (new_xattr) {
> - list_replace(&xattr->list, &new_xattr->list);
> - if (removed_size)
> - *removed_size = xattr->size;
> + int delta = new_xattr->size - xattr->size;
> +
> + if (*sz + delta > sz_limit) {
> + xattr = new_xattr;
> + err = -ENOSPC;
> + } else {
> + *sz += delta;
> + list_replace(&xattr->list, &new_xattr->list);
> + }
> } else {
> + *sz -= xattr->size;
> + (*nr)--;
> list_del(&xattr->list);
> - if (removed_size)
> - *removed_size = xattr->size;
> }
> goto out;
> }
> @@ -1096,7 +1117,12 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> if (flags & XATTR_REPLACE) {
> xattr = new_xattr;
> err = -ENODATA;
> + } else if ((*sz + new_xattr->size > sz_limit) || (*nr == nr_limit)) {
> + xattr = new_xattr;
> + err = -ENOSPC;
> } else {
> + *sz += new_xattr->size;
> + (*nr)++;
> list_add(&new_xattr->list, &xattrs->head);
> xattr = NULL;
> }
> @@ -1172,14 +1197,3 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>
> return err ? err : size - remaining_size;
> }
> -
> -/*
> - * Adds an extended attribute to the list
> - */
> -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> - struct simple_xattr *new_xattr)

You should also remove the function from the header.

> -{
> - spin_lock(&xattrs->lock);
> - list_add(&new_xattr->list, &xattrs->head);
> - spin_unlock(&xattrs->lock);
> -}
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index e2ae15a6225e..1972beb0d7b9 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -44,6 +44,8 @@ enum kernfs_node_type {
> #define KERNFS_FLAG_MASK ~KERNFS_TYPE_MASK
> #define KERNFS_MAX_USER_XATTRS 128
> #define KERNFS_USER_XATTR_SIZE_LIMIT (128 << 10)
> +#define KERNFS_MAX_XATTRS 128
> +#define KERNFS_XATTR_SIZE_LIMIT (128 << 10)

So iiuc, for tmpfs this is effectively a per-inode limit of 128 xattrs
and 128 user xattrs; nr_inodes * 256. I honestly have no idea if there
are legitimate use-cases to want more. But there's at least a remote
chance that this might break someone.

Apart from

> Currently it's possible to create a huge number of xattrs per inode,

what exactly is this limit protecting against? In other words, the
commit message misses the motivation for the patch.
(The original thread it was spun out of was so scattered I honestly
don't have time to dig through it currently. So it'd be great if this
patch expanded on that a bit.)

I'd also prefer to see a summary of what filesystems are affected by
this change. Afaict, the patchset doesn't change anything for kernfs
users such as cgroup{1,2} so it should only be tmpfs and potential
future users of the simple_xattr_* api.

Sidenote: While looking at this I found out that cgroup{1,2} supports
e.g. security.capability to be set on say cpu.stat or similar files.
That seems very strange to me.

2022-08-13 04:22:41

by Vasily Averin

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: enable per-inode limits for all xattr types

On 8/12/22 13:20, Christian Brauner wrote:
> So iiuc, for tmpfs this is effectively a per-inode limit of 128 xattrs
> and 128 user xattrs; nr_inodes * 256. I honestly have no idea if there
> are legitimate use-cases to want more. But there's at least a remote
> chance that this might break someone.
>
> Apart from
>
>> Currently it's possible to create a huge number of xattrs per inode,
>
> what exactly is this limit protecting against? In other words, the
> commit message misses the motivation for the patch.

This should prevent the softlockup and hung_task_panic caused by the slow
search in xattrs->list in simple_xattr_set() and _get().

Adding a new xattr checks all entries already present in the list,
so execution time depends linearly on the number of such entries.

To avoid this problem I decided to somehow limit the number of entries in the list.
As an alternative, Tejun advised switching this list to something like an rb-tree;
now I think he is right.
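
For context, the presence check is currently a linear scan over the
per-inode list under the xattrs spinlock (quoting the current
fs/xattr.c code):

	list_for_each_entry(xattr, &xattrs->head, list) {
		if (!strcmp(name, xattr->name)) {
			/* found: replace, remove, or fail with -EEXIST */
			...
		}
	}

Every setxattr/getxattr therefore pays an O(n) cost in the number of
existing entries.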

> I'd also prefer to see a summary of what filesystems are affected by
> this change. Afaict, the patchset doesn't change anything for kernfs
> users such as cgroup{1,2} so it should only be tmpfs and potential
> future users of the simple_xattr_* api.

This affects all file systems using the simple_xattr_* API: sysfs and tmpfs.

Now I'll try to follow Tejun's advice and switch the xattrs list to an rb-tree.
Unfortunately, I doubt that I will have enough time to finish before the merge
window closes.

Thank you,
Vasily Averin

2022-08-16 09:31:47

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH 0/3] enable memcg accounting for kernfs objects

On 8/11/22 06:19, Vasily Averin wrote:
> On 8/9/22 20:56, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Aug 09, 2022 at 07:49:34PM +0200, Michal Koutný wrote:
>>> On Tue, Aug 09, 2022 at 07:31:31AM -1000, Tejun Heo <[email protected]> wrote:
>>>> I'm not quite sure whether following the usual "charge it to the allocating
>>>> task's cgroup" is the best way to go about it. I wonder whether it'd be
>>>> better to attach it to the new cgroup's nearest ancestor with memcg enabled.
>>>
>>> See also
>>> https://lore.kernel.org/r/YnBLge4ZQNbbxufc@blackbook/
>>> and
>>> https://lore.kernel.org/r/[email protected]/
>>
>> Ah, thanks. Vasily, can you please include some summary of the discussions
>> and the rationales for the path taken in the commit message?
>
> Dear Tejun,
> thank you for the feedback, I'll do it in the next patch set iteration.
>
> However, I noticed another problem in the neighborhood and plan to
> add new patches to the current patch set. One of the new patches is quite simple;
> the second one, however, is quite complex and requires some discussion.

Summing up a private discussion with Tejun, Michal and Roman:
I'm going to create a few new patches:

1) adjust active memcg for objects allocated during creation of a new cgroup
This patch will take the memcg from the parent cgroup and use it for accounting
all objects allocated during creation of the new cgroup.
For that it moves the set_active_memcg() calls from mem_cgroup_css_alloc()
to cgroup_mkdir() and creates the missing infrastructure.
This makes it possible to predict which memcg will be used for object accounting,
and should simplify debugging of possible problems and corner cases.

2) memcg: enable kernfs accounting: nodes and iattr
Already discussed and approved patches.
These objects consume a significant part of memory in various scenarios,
including new cgroup creation and new net device creation.

3) adjust active memcg for simple_xattr accounting
sysfs and tmpfs are in-memory file systems;
for extended attributes they use the simple_xattr infrastructure.
The patch forces sys_set[f]xattr calls to account the xattr object
to a predictable memcg: for kernfs the memcg will be taken from the kernfs node,
for shmem -- from the shmem_info.
As in case 1), this makes it clear which memcg will be used
for object accounting and simplifies debugging of possible troubles.

4) memcg: enable accounting for simple_xattr: names and values
This patch enables accounting for the objects described in the previous patch.

5) simple_xattrs: replace the list with an rb-tree
This significantly reduces the search time for existing entries.

Additionally, Roman Gushchin is preparing the patch
"`put`ting the kernfs_node reference earlier in the cgroup removal process"

Thank you,
Vasily Averin

2022-08-17 08:06:28

by Vasily Averin

[permalink] [raw]
Subject: [RFC PATCH] memcg: adjust memcg for new cgroup allocations

This is the first of the patches announced here:
https://lore.kernel.org/all/[email protected]/
---
Usually accounted allocations use the memory cgroup of the current process.
For cgroups this looks incorrect: a memory cgroup can be created
both from the host and from inside a container. At the same time
it is intuitively expected that both operations should lead
to the same result.
This issue was addressed by Roman Gushchin in commit 3e38e0aaca9e
("mm: memcg: charge memcg percpu memory to the parent cgroup"),
which adjusted the memcg for allocations called inside mem_cgroup_css_alloc().

However, now we want to enable accounting for some other cgroup-related
resources allocated from cgroup_mkdir(). We would like to guarantee that
all newly accounted allocations will be charged to the same memory cgroup.

This patch moves the memcg adjustment from mem_cgroup_css_alloc() to its
callers: cgroup_mkdir() and cgroup_apply_control(). In turn, this
requires a new proxy function, mem_cgroup_from_cgroup().

Signed-off-by: Vasily Averin <[email protected]>
---
include/linux/memcontrol.h | 18 ++++++++++++++++++
kernel/cgroup/cgroup.c | 22 ++++++++++++++++------
mm/memcontrol.c | 4 +---
3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..342426d1edbf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1813,6 +1813,19 @@ static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
{
return memcg ? memcg : root_mem_cgroup;
}
+
+static inline struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgroup)
+{
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg = NULL;
+
+ css = cgroup_get_e_css(cgroup, &memory_cgrp_subsys);
+
+ if (css)
+ memcg = container_of(css, struct mem_cgroup, css);
+
+ return mem_cgroup_or_root(memcg);
+}
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1878,6 +1891,11 @@ static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg)
{
return NULL;
}
+
+static inline struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgroup)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG_KMEM */

#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index ffaccd6373f1..544d93a8878f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3247,13 +3247,16 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
*/
static int cgroup_apply_control(struct cgroup *cgrp)
{
+ struct mem_cgroup *memcg, *old_memcg;
int ret;

cgroup_propagate_control(cgrp);

+ memcg = mem_cgroup_from_cgroup(cgrp);
+ old_memcg = set_active_memcg(memcg);
ret = cgroup_apply_control_enable(cgrp);
if (ret)
- return ret;
+ goto out_memcg;

/*
* At this point, cgroup_e_css_by_mask() results reflect the new csses
@@ -3261,10 +3264,11 @@ static int cgroup_apply_control(struct cgroup *cgrp)
* css associations of all tasks in the subtree.
*/
ret = cgroup_update_dfl_csses(cgrp);
- if (ret)
- return ret;

- return 0;
+out_memcg:
+ set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
+ return ret;
}

/**
@@ -5532,6 +5536,7 @@ static bool cgroup_check_hierarchy_limits(struct cgroup *parent)
int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
{
struct cgroup *parent, *cgrp;
+ struct mem_cgroup *memcg, *old_memcg;
int ret;

/* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
@@ -5547,10 +5552,12 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
goto out_unlock;
}

+ memcg = mem_cgroup_from_cgroup(parent);
+ old_memcg = set_active_memcg(memcg);
cgrp = cgroup_create(parent, name, mode);
if (IS_ERR(cgrp)) {
ret = PTR_ERR(cgrp);
- goto out_unlock;
+ goto out_memcg;
}

/*
@@ -5577,10 +5584,13 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
kernfs_activate(cgrp->kn);

ret = 0;
- goto out_unlock;
+ goto out_memcg;

out_destroy:
cgroup_destroy_locked(cgrp);
+out_memcg:
+ set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
out_unlock:
cgroup_kn_unlock(parent_kn);
return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b69979c9ced5..e170c64e66e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,11 +5239,9 @@ static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct mem_cgroup *parent = mem_cgroup_from_css(parent_css);
- struct mem_cgroup *memcg, *old_memcg;
+ struct mem_cgroup *memcg;

- old_memcg = set_active_memcg(parent);
memcg = mem_cgroup_alloc();
- set_active_memcg(old_memcg);
if (IS_ERR(memcg))
return ERR_CAST(memcg);

--
2.34.1

2022-08-17 09:43:31

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg for new cgroup allocations

Hello.

On Wed, Aug 17, 2022 at 10:42:40AM +0300, Vasily Averin <[email protected]> wrote:
> However, now we want to enable accounting for some other cgroup-related
> resources allocated from cgroup_mkdir(). We would like to guarantee that
> all newly accounted allocations will be charged to the same memory cgroup.

Here's my point -- the change in the referenced patch applied to memory
controller hierarchies. This extension applies to any hierarchy that can
create groups, namely, a hierarchy without the memory controller too. There
mem_cgroup_from_cgroup falls back to the root memcg (on a different
hierarchy).

If the purpose is to prevent unlimited creation of cgroup objects, the
root memcg is in principle unlimited, so it's just for accounting.

But I understand the purpose is to have everything under one roof,
unless the object lifetime is not bound to that owning memcg. Should
memory-less hierarchies be treated specially?

> +static inline struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgroup)
[...]
> + css = cgroup_get_e_css(cgroup, &memory_cgrp_subsys);
> +
> + if (css)
> + memcg = container_of(css, struct mem_cgroup, css);

Nit: mem_cgroup_from_css
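
I.e., something like this (an untested sketch):

static inline struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgroup)
{
	struct cgroup_subsys_state *css;

	css = cgroup_get_e_css(cgroup, &memory_cgrp_subsys);
	return mem_cgroup_or_root(mem_cgroup_from_css(css));
}

mem_cgroup_from_css() already handles a NULL css, so the open-coded
container_of() and the local memcg variable can go away.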

Regards,
Michal



2022-08-17 15:32:21

by Vasily Averin

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg for new cgroup allocations

On 8/17/22 12:17, Michal Koutný wrote:
>> +static inline struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgroup)
> [...]
>> + css = cgroup_get_e_css(cgroup, &memory_cgrp_subsys);
>> +
>> + if (css)
>> + memcg = container_of(css, struct mem_cgroup, css);
> Nit: mem_cgroup_from_css

Yes, my fault, you are right.
It was planned initially, but then I lost it somehow.

Thank you very much!
Vasily Averin

2022-08-17 16:46:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg for new cgroup allocations

Hello,

On Wed, Aug 17, 2022 at 11:17:28AM +0200, Michal Koutný wrote:
> On Wed, Aug 17, 2022 at 10:42:40AM +0300, Vasily Averin <[email protected]> wrote:
> > However, now we want to enable accounting for some other cgroup-related
> > resources called from cgroup_mkdir. We would like to guarantee that
> > all new accounted allocation will be charged to the same memory cgroup.
>
> Here's my point -- the change in the referenced patch applied to memory
> controller hierarchies. This extension applies to any hierarchy that can
> create groups, namely, a hierarchy without the memory controller too. There
> mem_cgroup_from_cgroup falls back to the root memcg (on a different
> hierarchy).
>
> If the purpose is to prevent unlimited creation of cgroup objects, the
> root memcg is in principle unlimited, so it's just for accounting.
>
> But I understand the purpose is to have everything under one roof,
> unless the object lifetime is not bound to that owning memcg. Should
> memory-less hierarchies be treated specially?

At least from my POV, as long as cgroup1 is not being regressed, we want to
make decisions which make the best long term sense. We surely can
accommodate cgroup1 as long as the added complexity is minimal but the bar
is pretty high there. cgroup1 has been in maintenance mode for years now and
even the basic delegation model isn't well established in cgroup1, so if we
end up accounting everything in the root cgroup for most of cgroup1
hierarchies, that sounds fine to me.

Thanks.

--
tejun

2022-08-18 09:18:20

by Vasily Averin

[permalink] [raw]
Subject: [RFC PATCH] simple_xattr: switch from list to rb_tree

The patch was announced here:
https://lore.kernel.org/all/[email protected]/
"5) simple_xattrs: replace list to rb-tree
This significantly reduces the search time for existing entries."

It has been compiled but not tested yet.
---
Currently simple_xattr uses a list to store existing entries.
If the list grows, the presence check may be slow and potentially
lead to problems. A red-black tree should work more efficiently
in this situation.

This patch replaces the list with an rb_tree and switches the
simple_xattr_* calls to using it.

Signed-off-by: Vasily Averin <[email protected]>
---
fs/xattr.c | 109 ++++++++++++++++++++++++++++++++----------
include/linux/xattr.h | 13 +++--
2 files changed, 92 insertions(+), 30 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 6401703707f2..672f2214fcfd 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1021,6 +1021,60 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
return new_xattr;
}

+static struct simple_xattr *simple_xattr_rb_search(struct rb_root *root,
+ const char* name)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new)
+ {
+ struct simple_xattr *xattr;
+ int result;
+
+ xattr = container_of(*new, struct simple_xattr, node);
+ result = strcmp(xattr->name, name);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return xattr;
+ }
+ return NULL;
+}
+
+static bool simple_xattr_rb_insert(struct rb_root *root,
+ struct simple_xattr *new_xattr)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct simple_xattr *xattr;
+ int result;
+
+ xattr = container_of(*new, struct simple_xattr, node);
+ result = strcmp(xattr->name, new_xattr->name);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return false;
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&new_xattr->node, parent, new);
+ rb_insert_color(&new_xattr->node, root);
+
+ return true;
+}
+
/*
* xattr GET operation for in-memory/pseudo filesystems
*/
@@ -1031,10 +1085,8 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
int ret = -ENODATA;

spin_lock(&xattrs->lock);
- list_for_each_entry(xattr, &xattrs->head, list) {
- if (strcmp(name, xattr->name))
- continue;
-
+ xattr = simple_xattr_rb_search(&xattrs->rb_root, name);
+ if (xattr) {
ret = xattr->size;
if (buffer) {
if (size < xattr->size)
@@ -1042,7 +1094,6 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
else
memcpy(buffer, xattr->value, xattr->size);
}
- break;
}
spin_unlock(&xattrs->lock);
return ret;
@@ -1067,7 +1118,7 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
const void *value, size_t size, int flags,
ssize_t *removed_size)
{
- struct simple_xattr *xattr;
+ struct simple_xattr *xattr = NULL;
struct simple_xattr *new_xattr = NULL;
int err = 0;

@@ -1088,31 +1139,36 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
}

spin_lock(&xattrs->lock);
- list_for_each_entry(xattr, &xattrs->head, list) {
- if (!strcmp(name, xattr->name)) {
- if (flags & XATTR_CREATE) {
- xattr = new_xattr;
- err = -EEXIST;
- } else if (new_xattr) {
- list_replace(&xattr->list, &new_xattr->list);
+ if ((flags & XATTR_CREATE) && new_xattr) {
+ /* create new */
+ if (!simple_xattr_rb_insert(&xattrs->rb_root, new_xattr)) {
+ /* already exist */
+ xattr = new_xattr;
+ err = -EEXIST;
+ }
+ } else {
+ /* replace or remove */
+ xattr = simple_xattr_rb_search(&xattrs->rb_root, name);
+ if (xattr) {
+ /* found */
+ if (!new_xattr) {
+ /* remove existing */
+ rb_erase(&xattr->node, &xattrs->rb_root);
if (removed_size)
*removed_size = xattr->size;
} else {
- list_del(&xattr->list);
+ /* replace existing */
+ rb_replace_node(&xattr->node, &new_xattr->node,
+ &xattrs->rb_root);
if (removed_size)
*removed_size = xattr->size;
}
- goto out;
+ } else if (new_xattr) {
+ /* not found, incorrect replace */
+ xattr = new_xattr;
+ err = -ENODATA;
}
}
- if (flags & XATTR_REPLACE) {
- xattr = new_xattr;
- err = -ENODATA;
- } else {
- list_add(&new_xattr->list, &xattrs->head);
- xattr = NULL;
- }
-out:
spin_unlock(&xattrs->lock);
if (xattr) {
kfree(xattr->name);
@@ -1149,6 +1205,7 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
{
bool trusted = capable(CAP_SYS_ADMIN);
struct simple_xattr *xattr;
+ struct rb_node *node;
ssize_t remaining_size = size;
int err = 0;

@@ -1170,7 +1227,9 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
#endif

spin_lock(&xattrs->lock);
- list_for_each_entry(xattr, &xattrs->head, list) {
+ for (node = rb_first(&xattrs->rb_root); node; node = rb_next(node)) {
+ xattr = container_of(node, struct simple_xattr, node);
+
/* skip "trusted." attributes for unprivileged callers */
if (!trusted && xattr_is_trusted(xattr->name))
continue;
@@ -1191,6 +1250,6 @@ void simple_xattr_list_add(struct simple_xattrs *xattrs,
struct simple_xattr *new_xattr)
{
spin_lock(&xattrs->lock);
- list_add(&new_xattr->list, &xattrs->head);
+ simple_xattr_rb_insert(&xattrs->rb_root, new_xattr);
spin_unlock(&xattrs->lock);
}
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index 979a9d3e5bfb..bbe81cfb7a4d 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -80,12 +80,12 @@ static inline const char *xattr_prefix(const struct xattr_handler *handler)
}

struct simple_xattrs {
- struct list_head head;
+ struct rb_root rb_root;
spinlock_t lock;
};

struct simple_xattr {
- struct list_head list;
+ struct rb_node node;
char *name;
size_t size;
char value[];
@@ -96,7 +96,7 @@ struct simple_xattr {
*/
static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
{
- INIT_LIST_HEAD(&xattrs->head);
+ xattrs->rb_root = RB_ROOT;
spin_lock_init(&xattrs->lock);
}

@@ -105,9 +105,12 @@ static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
*/
static inline void simple_xattrs_free(struct simple_xattrs *xattrs)
{
- struct simple_xattr *xattr, *node;
+ struct simple_xattr *xattr;
+ struct rb_node *node;

- list_for_each_entry_safe(xattr, node, &xattrs->head, list) {
+ while ((node = rb_first(&xattrs->rb_root))) {
+ rb_erase(node, &xattrs->rb_root);
+ xattr = container_of(node, struct simple_xattr, node);
kfree(xattr->name);
kvfree(xattr);
}
--
2.34.1

2022-08-18 09:33:44

by Vasily Averin

[permalink] [raw]
Subject: [RFC PATCH] memcg: adjust memcg used to charge for new simple_xattrs objects

The patch was announced here:
https://lore.kernel.org/all/[email protected]/
"3) adjust active memcg for simple_xattr accounting
sysfs and tmpfs are in-memory file system,
for extended attributes they uses simple_xattr infrastructure.
The patch forces sys_set[f]xattr calls to account xattr object
to predictable memcg: for kernfs memcg will be taken from kernfs node,
for shmem -- from shmem_info.
Like 1) case, this allows to understand which memcg should be used
for object accounting and simplify debugging of possible troubles."

It was compiled but was not tested yet.
---
sys_set[f]xattr uses the simple_xattr infrastructure to create a new
extended attribute for in-memory file systems like sysfs and tmpfs.
The number and size of the allocated objects are controlled by user space,
they always live in memory, and their lifetime is indefinitely long.
Therefore this memory should be properly accounted.

By default new memory is accounted to the memcg of the creator process.
As a result, neighboring xattrs of the same inode can be charged to
different memcgs. This looks unexpected and makes the investigation
of memcg accounting issues hard.

This patch adjusts the memcg used for such allocations. For kernfs
it takes the memcg from the kernfs node, for shmem -- from the shmem_info.
This allows charging all inode-specific objects to the same
memory cgroup.

Signed-off-by: Vasily Averin <[email protected]>
---
fs/kernfs/inode.c | 31 ++++++++++++++++++++++++-------
mm/shmem.c | 10 +++++++++-
2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..975532b32e7c 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/xattr.h>
#include <linux/security.h>
+#include <linux/memcontrol.h>

#include "kernfs-internal.h"

@@ -335,8 +336,16 @@ static int kernfs_vfs_xattr_set(const struct xattr_handler *handler,
{
const char *name = xattr_full_name(handler, suffix);
struct kernfs_node *kn = inode->i_private;
+ struct mem_cgroup *old, *memcg;
+ int ret;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(kn));
+ old = set_active_memcg(memcg);

- return kernfs_xattr_set(kn, name, value, size, flags);
+ ret = kernfs_xattr_set(kn, name, value, size, flags);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
+ return ret;
}

static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
@@ -403,21 +412,29 @@ static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,
const char *full_name = xattr_full_name(handler, suffix);
struct kernfs_node *kn = inode->i_private;
struct kernfs_iattrs *attrs;
+ struct mem_cgroup *old, *memcg;
+ int ret = -ENOMEM;

if (!(kernfs_root(kn)->flags & KERNFS_ROOT_SUPPORT_USER_XATTR))
return -EOPNOTSUPP;

+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(kn));
+ old = set_active_memcg(memcg);
+
attrs = kernfs_iattrs(kn);
if (!attrs)
- return -ENOMEM;
+ goto out;

if (value)
- return kernfs_vfs_user_xattr_add(kn, full_name, &attrs->xattrs,
- value, size, flags);
- else
- return kernfs_vfs_user_xattr_rm(kn, full_name, &attrs->xattrs,
+ ret = kernfs_vfs_user_xattr_add(kn, full_name, &attrs->xattrs,
value, size, flags);
-
+ else
+ ret = kernfs_vfs_user_xattr_rm(kn, full_name, &attrs->xattrs,
+ value, size, flags);
+out:
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
+ return ret;
}

static const struct xattr_handler kernfs_trusted_xattr_handler = {
diff --git a/mm/shmem.c b/mm/shmem.c
index 5783f11351bb..291c15774340 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3255,9 +3255,17 @@ static int shmem_xattr_handler_set(const struct xattr_handler *handler,
size_t size, int flags)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct mem_cgroup *old, *memcg;
+ int ret;
+
+ memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(info));
+ old = set_active_memcg(memcg);

name = xattr_full_name(handler, name);
- return simple_xattr_set(&info->xattrs, name, value, size, flags, NULL);
+ ret = simple_xattr_set(&info->xattrs, name, value, size, flags, NULL);
+ set_active_memcg(old);
+ mem_cgroup_put(memcg);
+ return ret;
}

static const struct xattr_handler shmem_security_xattr_handler = {
--
2.34.1

2022-08-18 13:01:10

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg used to charge for new simple_xattrs objects

On Thu, Aug 18, 2022 at 12:10:45PM +0300, Vasily Averin <[email protected]> wrote:
> sys_set[f]xattr uses the simple_xattr infrastructure to create a new
> extended attribute for in-memory file systems like sysfs and tmpfs.
> The number and size of the allocated objects are controlled by user space,
> they always live in memory, and their lifetime is indefinitely long.
> Therefore this memory should be properly accounted.
>
> By default new memory is accounted to the memcg of the creator process.

despite the objects not being bound to this process's lifetime.

(I think this was the main argument for this approach and should be in
the commit message then.)

> As a result, neighboring xattrs of the same inode can be charged to
> different memcgs. This looks unexpected and makes the investigation
> of memcg accounting issues hard.
>
> This patch adjusts the memcg used for such allocations. For kernfs
> it takes the memcg from the kernfs node, for shmem -- from the shmem_info.
> This allows charging all inode-specific objects to the same
> memory cgroup.

IIUC you intend to inherit the association from shmem_inode_info (i.e.
whoever created the inode). shmem_inode_cachep has SLAB_ACCOUNT, so it's valid.

Thanks,
Michal

2022-08-18 13:24:39

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] simple_xattr: switch from list to rb_tree

On Thu, Aug 18, 2022 at 12:12:30PM +0300, Vasily Averin wrote:
> The patch was announced here:
> https://lore.kernel.org/all/[email protected]/
> "5) simple_xattrs: replace the list with an rb-tree
> This significantly reduces the search time for existing entries."
>
> It has been compiled but not tested yet.
> ---
> Currently simple_xattr uses a list to store existing entries.
> If the list grows, the presence check may be slow and potentially
> lead to problems. A red-black tree should work more efficiently
> in this situation.
>
> This patch replaces the list with an rb_tree and switches the
> simple_xattr_* calls to using it.
>
> Signed-off-by: Vasily Averin <[email protected]>
> ---

I think the background for the performance issues in the commit message
would be helpful and I have a few comments. Also, trying to test whether the
lockups are gone due to the rbtree switch would be +1.

This will likely conflict with some acl/xattr changes I have lined up so
if we decide to proceed I wouldn't mind dealing with this series if
there are no objections.

> fs/xattr.c | 109 ++++++++++++++++++++++++++++++++----------
> include/linux/xattr.h | 13 +++--
> 2 files changed, 92 insertions(+), 30 deletions(-)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 6401703707f2..672f2214fcfd 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -1021,6 +1021,60 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> return new_xattr;
> }
>
> +static struct simple_xattr *simple_xattr_rb_search(struct rb_root *root,
> + const char* name)
> +{
> + struct rb_node **new = &(root->rb_node), *parent = NULL;

I'd suggest to not name this "new" but rather just "cur" or "node".

> +
> + /* Figure out where to put new node */
> + while (*new)
> + {

nit: that "{" should be on the same line as the while

> + struct simple_xattr *xattr;
> + int result;
> +
> + xattr = container_of(*new, struct simple_xattr, node);
> + result = strcmp(xattr->name, name);
> +
> + parent = *new;

That variable and assignment seems unnecessary?

> + if (result < 0)
> + new = &((*new)->rb_left);
> + else if (result > 0)
> + new = &((*new)->rb_right);
> + else
> + return xattr;
> + }
> + return NULL;
> +}
> +
> +static bool simple_xattr_rb_insert(struct rb_root *root,
> + struct simple_xattr *new_xattr)
> +{
> + struct rb_node **new = &(root->rb_node), *parent = NULL;
> +
> + /* Figure out where to put new node */
> + while (*new) {
> + struct simple_xattr *xattr;
> + int result;
> +
> + xattr = container_of(*new, struct simple_xattr, node);
> + result = strcmp(xattr->name, new_xattr->name);
> +
> + parent = *new;
> + if (result < 0)
> + new = &((*new)->rb_left);
> + else if (result > 0)
> + new = &((*new)->rb_right);
> + else
> + return false;
> + }
> +
> + /* Add new node and rebalance tree. */
> + rb_link_node(&new_xattr->node, parent, new);
> + rb_insert_color(&new_xattr->node, root);
> +
> + return true;
> +}
> +
> /*
> * xattr GET operation for in-memory/pseudo filesystems
> */
> @@ -1031,10 +1085,8 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> int ret = -ENODATA;
>
> spin_lock(&xattrs->lock);
> - list_for_each_entry(xattr, &xattrs->head, list) {
> - if (strcmp(name, xattr->name))
> - continue;
> -
> + xattr = simple_xattr_rb_search(&xattrs->rb_root, name);
> + if (xattr) {
> ret = xattr->size;
> if (buffer) {
> if (size < xattr->size)
> @@ -1042,7 +1094,6 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> else
> memcpy(buffer, xattr->value, xattr->size);
> }
> - break;
> }
> spin_unlock(&xattrs->lock);
> return ret;
> @@ -1067,7 +1118,7 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> const void *value, size_t size, int flags,
> ssize_t *removed_size)
> {
> - struct simple_xattr *xattr;
> + struct simple_xattr *xattr = NULL;
> struct simple_xattr *new_xattr = NULL;
> int err = 0;
>
> @@ -1088,31 +1139,36 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> }
>
> spin_lock(&xattrs->lock);
> - list_for_each_entry(xattr, &xattrs->head, list) {
> - if (!strcmp(name, xattr->name)) {
> - if (flags & XATTR_CREATE) {
> - xattr = new_xattr;
> - err = -EEXIST;
> - } else if (new_xattr) {
> - list_replace(&xattr->list, &new_xattr->list);
> + if ((flags & XATTR_CREATE) && new_xattr) {
> + /* create new */
> + if (!simple_xattr_rb_insert(&xattrs->rb_root, new_xattr)) {
> + /* already exist */
> + xattr = new_xattr;
> + err = -EEXIST;
> + }
> + } else {
> + /* replace or remove */
> + xattr = simple_xattr_rb_search(&xattrs->rb_root, name);
> + if (xattr) {
> + /* found */
> + if (!new_xattr) {
> + /* remove existing */
> + rb_erase(&xattr->node, &xattrs->rb_root);
> if (removed_size)
> *removed_size = xattr->size;
> } else {
> - list_del(&xattr->list);
> + /* replace existing */
> + rb_replace_node(&xattr->node, &new_xattr->node,
> + &xattrs->rb_root);
> if (removed_size)
> *removed_size = xattr->size;
> }
> - goto out;
> + } else if (new_xattr) {
> + /* not found, incorrect replace */
> + xattr = new_xattr;
> + err = -ENODATA;
> }
> }
> - if (flags & XATTR_REPLACE) {

I think keeping this rather close to the original code might be nicer.
I find the code more difficult to follow afterwards. So how about
(COMPLETELY UNTESTED) something like:

@@ -1077,30 +1139,40 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
}

spin_lock(&xattrs->lock);
- list_for_each_entry(xattr, &xattrs->head, list) {
- if (!strcmp(name, xattr->name)) {
- if (flags & XATTR_CREATE) {
- xattr = new_xattr;
- err = -EEXIST;
- } else if (new_xattr) {
- list_replace(&xattr->list, &new_xattr->list);
- if (removed_size)
- *removed_size = xattr->size;
- } else {
- list_del(&xattr->list);
- if (removed_size)
- *removed_size = xattr->size;
- }
- goto out;
+ /* Find any matching xattr by name. */
+ xattr = simple_xattr_rb_search(&xattrs->rb_root, name);
+ if (xattr) {
+ if (flags & XATTR_CREATE) {
+ /* Creating request but the xattr already existed. */
+ xattr = new_xattr;
+ err = -EEXIST;
+ } else if (new_xattr) {
+ /* Replace the existing xattr. */
+ rb_replace_node(&xattr->node, &new_xattr->node,
+ &xattrs->rb_root);
+ if (removed_size)
+ *removed_size = xattr->size;
+ } else {
+ /* No new xattr specified so wipe the existing xattr. */
+ rb_erase(&xattr->node, &xattrs->rb_root);
+ if (removed_size)
+ *removed_size = xattr->size;
}
+ goto out;
}
+
if (flags & XATTR_REPLACE) {
+ /* There's no matching xattr so fail on replace. */
xattr = new_xattr;
err = -ENODATA;
} else {
- list_add(&new_xattr->list, &xattrs->head);
- xattr = NULL;
+ /*
+ * We're holding the lock and verified that there's no
+ * pre-existing xattr so this should always succeed.
+ */
+		WARN_ON(!simple_xattr_rb_insert(&xattrs->rb_root, new_xattr));
}
+
out:
spin_unlock(&xattrs->lock);
if (xattr) {


> - xattr = new_xattr;
> - err = -ENODATA;
> - } else {
> - list_add(&new_xattr->list, &xattrs->head);
> - xattr = NULL;
> - }
> -out:
> spin_unlock(&xattrs->lock);
> if (xattr) {
> kfree(xattr->name);
> @@ -1149,6 +1205,7 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> {
> bool trusted = capable(CAP_SYS_ADMIN);
> struct simple_xattr *xattr;
> + struct rb_node *node;
> ssize_t remaining_size = size;
> int err = 0;
>
> @@ -1170,7 +1227,9 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> #endif
>
> spin_lock(&xattrs->lock);
> - list_for_each_entry(xattr, &xattrs->head, list) {
> + for (node = rb_first(&xattrs->rb_root); node; node = rb_next(node)) {
> + xattr = container_of(node, struct simple_xattr, node);
> +
> /* skip "trusted." attributes for unprivileged callers */
> if (!trusted && xattr_is_trusted(xattr->name))
> continue;
> @@ -1191,6 +1250,6 @@ void simple_xattr_list_add(struct simple_xattrs *xattrs,
> struct simple_xattr *new_xattr)
> {
> spin_lock(&xattrs->lock);
> - list_add(&new_xattr->list, &xattrs->head);
> + simple_xattr_rb_insert(&xattrs->rb_root, new_xattr);
> spin_unlock(&xattrs->lock);
> }
> diff --git a/include/linux/xattr.h b/include/linux/xattr.h
> index 979a9d3e5bfb..bbe81cfb7a4d 100644
> --- a/include/linux/xattr.h
> +++ b/include/linux/xattr.h
> @@ -80,12 +80,12 @@ static inline const char *xattr_prefix(const struct xattr_handler *handler)
> }
>
> struct simple_xattrs {
> - struct list_head head;
> + struct rb_root rb_root;
> spinlock_t lock;
> };
>
> struct simple_xattr {
> - struct list_head list;
> + struct rb_node node;
> char *name;
> size_t size;
> char value[];
> @@ -96,7 +96,7 @@ struct simple_xattr {
> */
> static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
> {
> - INIT_LIST_HEAD(&xattrs->head);
> + xattrs->rb_root = RB_ROOT;
> spin_lock_init(&xattrs->lock);
> }
>
> @@ -105,9 +105,12 @@ static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
> */
> static inline void simple_xattrs_free(struct simple_xattrs *xattrs)
> {
> - struct simple_xattr *xattr, *node;
> + struct simple_xattr *xattr;
> + struct rb_node *node;
>
> - list_for_each_entry_safe(xattr, node, &xattrs->head, list) {
> + while ((node = rb_first(&xattrs->rb_root))) {
> + rb_erase(node, &xattrs->rb_root);
> + xattr = container_of(node, struct simple_xattr, node);
> kfree(xattr->name);
> kvfree(xattr);
> }
> --
> 2.34.1
>
>

2022-08-23 09:07:40

by Vasily Averin

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg used to charge for new simple_xattrs objects

On 8/18/22 15:27, Michal Koutný wrote:
> On Thu, Aug 18, 2022 at 12:10:45PM +0300, Vasily Averin <[email protected]> wrote:
>> sys_set[f]xattr uses the simple_xattr infrastructure to create a new
>> extended attribute for in-memory file systems like sysfs and tmpfs.
>> The number and size of the allocated objects are controlled by user
>> space, they always live in memory, and their lifetime is indefinite.
>> Therefore this memory should be properly accounted.
>>
>> By default new memory is accounted to the memcg of the creator process.
>
> despite the objects not being bound to that process's lifetime.
>
> (I think this was the main argument for this approach and should be in
> the commit message then.)

Thank you for the remark, I'll update the patch description in the next version.

>> As a result, neighboring xattrs of the same inode can be charged to
>> different memcgs. This looks unexpected and complicates the
>> investigation of memcg accounting issues.
>>
>> This patch adjusts the memcg used for such allocations. For kernfs
>> it takes the memcg from the kernfs node, for shmem -- from shmem_info.
>> This allows charging all inode-specific objects to the same
>> memory cgroup.
>
> IIUC you intend to inherit association from shmem_inode_info (i.e.
> whoever created the inode). shmem_inode_cachep has SLAB_ACCOUNT, so it's valid.

Yes, you are right, I'll clarify this in the next patch version.
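
For reference, the pattern I have in mind is the usual remote-charging
one (a minimal, untested sketch; get_mem_cgroup_from_inode() is a
hypothetical helper here, while set_active_memcg() and
GFP_KERNEL_ACCOUNT are existing kernel interfaces):

	struct mem_cgroup *memcg, *old_memcg;
	struct simple_xattr *new_xattr;

	/* Charge to the memcg that owns the inode, not to the caller. */
	memcg = get_mem_cgroup_from_inode(inode);	/* hypothetical */
	old_memcg = set_active_memcg(memcg);

	/*
	 * __GFP_ACCOUNT (via GFP_KERNEL_ACCOUNT) makes kvmalloc charge
	 * the allocation to the active memcg set above.
	 */
	new_xattr = kvmalloc(sizeof(*new_xattr) + size, GFP_KERNEL_ACCOUNT);

	set_active_memcg(old_memcg);
	mem_cgroup_put(memcg);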

Thank you,
Vasily Averin

2022-08-23 15:04:47

by Vasily Averin

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg for new cgroup allocations

On 8/17/22 19:41, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 17, 2022 at 11:17:28AM +0200, Michal Koutný wrote:
>> On Wed, Aug 17, 2022 at 10:42:40AM +0300, Vasily Averin <[email protected]> wrote:
>>> However, now we want to enable accounting for some other cgroup-related
>>> resources called from cgroup_mkdir. We would like to guarantee that
>>> all new accounted allocation will be charged to the same memory cgroup.
>>
>> Here's my point -- the change in the referenced patch applied to memory
>> controller hierarchies. This extension applies to any hierarchy that can
>> create groups, namely, a hierarchy without memory controller too. There
>> mem_cgroup_from_cgroup falls back to the root memcg (on a different
>> hierarchy).

My goal was only to properly account kernfs and simple_xattr entries;
however, I missed that it does not work in the cgroup1 case.

>> If the purpose is to prevent unlimited creation of cgroup objects, the
>> root memcg is by principle unlimited, so it's just for accounting.

No, the goal is not to prevent unlimited creation of cgroup objects.
As Michal Hocko pointed out, that can be done via the cgroup.max.descendants limit.

>> But I understand the purpose is to have everything under one roof,
>> unless the object lifetime is not bound to that owning memcg. Should
>> memory-less hierarchies be treated specially?
>
> At least from my POV, as long as cgroup1 is not being regressed, we want to
> make decisions which make the best long term sense. We surely can
> accommodate cgroup1 as long as the added complexity is minimal but the bar
> is pretty high there. cgroup1 has been in maintenance mode for years now and
> even the basic delegation model isn't well established in cgroup1, so if we
> end up accounting everything in the root cgroup for most of cgroup1
> hierarchies, that sounds fine to me.

I would like to properly handle the cgroup1 case too. To do that we can
enable accounting for new 'struct cgroup' objects and bind them to the
memcg of the creator task.
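
Something along these lines (an illustrative, untested sketch rather
than the actual cgroup_create() hunk; the real allocation sizes the
struct differently):

	struct cgroup *cgrp;

	/*
	 * GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) charges the
	 * allocation to the memcg of the task doing the mkdir, on both
	 * cgroup1 and cgroup2 hierarchies.
	 */
	cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL_ACCOUNT);
	if (!cgrp)
		return ERR_PTR(-ENOMEM);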

Thank you,
Vasily Averin

2022-08-23 15:32:29

by Vasily Averin

[permalink] [raw]
Subject: Re: [RFC PATCH] simple_xattr: switch from list to rb_tree

On 8/18/22 16:19, Christian Brauner wrote:
> On Thu, Aug 18, 2022 at 12:12:30PM +0300, Vasily Averin wrote:
>> The patch was announced here:
>> https://lore.kernel.org/all/[email protected]/
>> "5) simple_xattrs: replace list to rb-tree
>> This significantly reduces the search time for existing entries."
>>
>> It was compiled but has not been tested yet.
>> ---
>> Currently simple_xattr uses a list to store existing entries.
>> If the list grows, the presence check may be slow and potentially
>> lead to problems. Red-black tree should work more efficiently
>> in this situation.
>>
>> This patch replaces list to rb_tree and switches simple_xattr_* calls
>> to its using.
>>
>> Signed-off-by: Vasily Averin <[email protected]>
>> ---
>
> I think the background for the performance issues in the commit message
> would be helpful and I have a few comments. Also, trying to test whether the
> lockups are gone due to the rbtree switch would be +1.
>
> This will likely conflict with some acl/xattr changes I have lined up so
> if we decide to proceed I wouldn't mind dealing with this series if
> there are no objections.

I would be very grateful if you picked up this issue.
Unfortunately I do not have enough time to process it properly.

I agree with all your remarks; however, I would like to comment on the
following one.

> I think keeping this rather close to the original code might be nicer.
> I find the code more difficult to follow afterwards. So how about
> (COMPLETELY UNTESTED) something like:

I had this idea too; however, it has one disadvantage in the rb-tree
scenario: in the most typical case, when adding a new entry, we walk
the tree twice -- first in simple_xattr_rb_search() and then in
simple_xattr_rb_insert(). In my version of the patch we walk the
rb-tree only once.

However, now I think we can save the closest neighbour found during
the "search" stage and reuse it in the "insert" stage. This should be
safe because both functions are called under the same spinlock.
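
Untested sketch of that combined lookup (simple_xattr_rb_find() is a
hypothetical name; the caller must hold xattrs->lock across both the
find and the insert):

static struct simple_xattr *simple_xattr_rb_find(struct rb_root *root,
						 const char *name,
						 struct rb_node **parent,
						 struct rb_node ***link)
{
	struct rb_node **node = &root->rb_node;

	*parent = NULL;
	while (*node) {
		struct simple_xattr *xattr;
		int result;

		xattr = container_of(*node, struct simple_xattr, node);
		result = strcmp(xattr->name, name);

		*parent = *node;
		if (result < 0)
			node = &(*node)->rb_left;
		else if (result > 0)
			node = &(*node)->rb_right;
		else
			return xattr;	/* found; *link is not set */
	}
	/* Not found: remember where the new node would be linked. */
	*link = node;
	return NULL;
}

On the insert path the saved position is then reused directly:

	struct rb_node *parent;
	struct rb_node **link;

	xattr = simple_xattr_rb_find(&xattrs->rb_root, name, &parent, &link);
	if (!xattr) {
		rb_link_node(&new_xattr->node, parent, link);
		rb_insert_color(&new_xattr->node, &xattrs->rb_root);
	}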

> @@ -1077,30 +1139,40 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> }
>
> spin_lock(&xattrs->lock);
> - list_for_each_entry(xattr, &xattrs->head, list) {
> - if (!strcmp(name, xattr->name)) {
> - if (flags & XATTR_CREATE) {
> - xattr = new_xattr;
> - err = -EEXIST;
> - } else if (new_xattr) {
> - list_replace(&xattr->list, &new_xattr->list);
> - if (removed_size)
> - *removed_size = xattr->size;
> - } else {
> - list_del(&xattr->list);
> - if (removed_size)
> - *removed_size = xattr->size;
> - }
> - goto out;
> + /* Find any matching xattr by name. */
> + xattr = simple_xattr_rb_search(&xattrs->rb_root, name);
> + if (xattr) {
> + if (flags & XATTR_CREATE) {
> + /* Creating request but the xattr already existed. */
> + xattr = new_xattr;
> + err = -EEXIST;
> + } else if (new_xattr) {
> + /* Replace the existing xattr. */
> + rb_replace_node(&xattr->node, &new_xattr->node,
> + &xattrs->rb_root);
> + if (removed_size)
> + *removed_size = xattr->size;
> + } else {
> + /* No new xattr specified so wipe the existing xattr. */
> + rb_erase(&xattr->node, &xattrs->rb_root);
> + if (removed_size)
> + *removed_size = xattr->size;
> }
> + goto out;
> }
> +
> if (flags & XATTR_REPLACE) {
> + /* There's no matching xattr so fail on replace. */
> xattr = new_xattr;
> err = -ENODATA;
> } else {
> - list_add(&new_xattr->list, &xattrs->head);
> - xattr = NULL;
> + /*
> + * We're holding the lock and verified that there's no
> + * pre-existing xattr so this should always succeed.
> + */
> +		WARN_ON(!simple_xattr_rb_insert(&xattrs->rb_root, new_xattr));
> }
> +
> out:
> spin_unlock(&xattrs->lock);
> if (xattr) {
>
>
>> - xattr = new_xattr;
>> - err = -ENODATA;
>> - } else {
>> - list_add(&new_xattr->list, &xattrs->head);
>> - xattr = NULL;
>> - }
>> -out:
>> spin_unlock(&xattrs->lock);
>> if (xattr) {
>> kfree(xattr->name);

2022-08-23 19:28:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH] memcg: adjust memcg for new cgroup allocations

Hello,

On Tue, Aug 23, 2022 at 03:04:31PM +0300, Vasily Averin wrote:
> I would like to properly handle cgroup1 case too.
> To do it we can enable accounting for new 'struct cgroup' objects,
> and bind them to memcg of creator task.

I'm not sure it'd be a good idea to introduce two different behaviors for
handling the same thing. I'd just leave cgroup1 as-is.

Thanks.

--
tejun