2018-02-21 03:03:45

by Shakeel Butt

Subject: [PATCH v2 0/3] Directed kmem charging

This patchset introduces memcg variants of the memory allocation
functions, which let the caller explicitly pass the memcg to charge for
kmem allocations. Currently, for __GFP_ACCOUNT allocation requests, the
kernel extracts the memcg of the current task and charges it for the
kmem allocation. This series adds kmem allocation functions where the
caller can pass a pointer to a remote memcg, which is then charged for
the allocation instead of the caller's memcg. The caller must hold a
reference to the remote memcg.
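
As a rough usage sketch (the cache and memcg names below are
hypothetical; only the *_memcg() functions come from this series), a
producer that wants an allocation charged to a listener's memcg would do
something like:

        /* caller already holds a reference on 'listener_memcg' */
        obj = kmem_cache_alloc_memcg(foo_cachep, GFP_KERNEL, listener_memcg);
        if (!obj)
                return -ENOMEM;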

Fixed the build for SLOB in v2.

Shakeel Butt (3):
mm: memcg: plumbing memcg for kmem cache allocations
mm: memcg: plumbing memcg for kmalloc allocations
fs: fsnotify: account fsnotify metadata to kmemcg

fs/notify/dnotify/dnotify.c | 5 +-
fs/notify/fanotify/fanotify.c | 12 ++-
fs/notify/fanotify/fanotify.h | 3 +-
fs/notify/fanotify/fanotify_user.c | 7 +-
fs/notify/group.c | 4 +
fs/notify/inotify/inotify_fsnotify.c | 2 +-
fs/notify/inotify/inotify_user.c | 5 +-
fs/notify/mark.c | 6 +-
include/linux/fsnotify_backend.h | 12 ++-
include/linux/memcontrol.h | 13 ++-
include/linux/slab.h | 86 +++++++++++++++-
mm/memcontrol.c | 29 ++++--
mm/page_alloc.c | 2 +-
mm/slab.c | 107 ++++++++++++++++----
mm/slab.h | 6 +-
mm/slab_common.c | 41 +++++++-
mm/slob.c | 13 +++
mm/slub.c | 140 ++++++++++++++++++++++-----
18 files changed, 415 insertions(+), 78 deletions(-)

--
2.16.1.291.g4437f3f132-goog



2018-02-21 03:03:14

by Shakeel Butt

Subject: [PATCH v2 3/3] fs: fsnotify: account fsnotify metadata to kmemcg

If there is no listener, or only a slow one, the events generated for
huge or unlimited queues can consume a lot of memory. This can cause
system-level memory pressure or OOMs. So it is better to account the
fsnotify kmem caches to the memcg of the listener.

There are seven fsnotify kmem caches, and among them the allocations
from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of syscalls from the
listener. So SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted, as
they are small compared to the notification marks or events, and it is
unclear whom to account the connector to since it is shared by all
events attached to the inode.

The allocations from the event caches happen in the context of the
event producer. For such caches we need to remote-charge the
allocations to the listener's memcg. Thus we save the memcg reference
in the fsnotify_group structure of the listener.

This patch also rearranges the members of fsnotify_group, filling
existing holes, so that the structure size stays the same (at least
for 64-bit builds) despite the additional member.
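
Condensed from the diff below, the lifecycle of the saved reference is
roughly (fanotify shown; inotify is analogous):

        /* fanotify_init()/inotify_init(): runs in the listener's context */
        group->memcg = get_mem_cgroup_from_mm(current->mm);

        /* event allocation: may run in the producer's context */
        event = kmem_cache_alloc_memcg(fanotify_event_cachep, GFP_KERNEL,
                                       group->memcg);

        /* group teardown */
        if (group->memcg)
                mem_cgroup_put(group->memcg);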

Signed-off-by: Shakeel Butt <[email protected]>
---
Changelog since v1:
- no more charging fsnotify_mark_connector objects
- Fixed the build for SLOB

fs/notify/dnotify/dnotify.c | 5 +++--
fs/notify/fanotify/fanotify.c | 12 +++++++-----
fs/notify/fanotify/fanotify.h | 3 ++-
fs/notify/fanotify/fanotify_user.c | 7 +++++--
fs/notify/group.c | 4 ++++
fs/notify/inotify/inotify_fsnotify.c | 2 +-
fs/notify/inotify/inotify_user.c | 5 ++++-
fs/notify/mark.c | 6 ++++--
include/linux/fsnotify_backend.h | 12 ++++++++----
include/linux/memcontrol.h | 7 +++++++
mm/memcontrol.c | 2 +-
11 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 63a1ca4b9dee..eb5c41284649 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)

static int __init dnotify_init(void)
{
- dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
- dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+ dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
+ SLAB_PANIC|SLAB_ACCOUNT);
+ dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);

dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 6702a6a0bbb5..0d9493ebc7cd 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -140,22 +140,24 @@ static bool fanotify_should_send_event(struct fsnotify_mark *inode_mark,
}

struct fanotify_event_info *fanotify_alloc_event(struct inode *inode, u32 mask,
- const struct path *path)
+ const struct path *path,
+ struct mem_cgroup *memcg)
{
struct fanotify_event_info *event;

if (fanotify_is_perm_event(mask)) {
struct fanotify_perm_event_info *pevent;

- pevent = kmem_cache_alloc(fanotify_perm_event_cachep,
- GFP_KERNEL);
+ pevent = kmem_cache_alloc_memcg(fanotify_perm_event_cachep,
+ GFP_KERNEL, memcg);
if (!pevent)
return NULL;
event = &pevent->fae;
pevent->response = 0;
goto init;
}
- event = kmem_cache_alloc(fanotify_event_cachep, GFP_KERNEL);
+ event = kmem_cache_alloc_memcg(fanotify_event_cachep, GFP_KERNEL,
+ memcg);
if (!event)
return NULL;
init: __maybe_unused
@@ -210,7 +212,7 @@ static int fanotify_handle_event(struct fsnotify_group *group,
return 0;
}

- event = fanotify_alloc_event(inode, mask, data);
+ event = fanotify_alloc_event(inode, mask, data, group->memcg);
ret = -ENOMEM;
if (unlikely(!event))
goto finish;
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 256d9d1ddea9..51b797896c87 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -53,4 +53,5 @@ static inline struct fanotify_event_info *FANOTIFY_E(struct fsnotify_event *fse)
}

struct fanotify_event_info *fanotify_alloc_event(struct inode *inode, u32 mask,
- const struct path *path);
+ const struct path *path,
+ struct mem_cgroup *memcg);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index ef08d64c84b8..29c9b3e57a29 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -16,6 +16,7 @@
#include <linux/uaccess.h>
#include <linux/compat.h>
#include <linux/sched/signal.h>
+#include <linux/memcontrol.h>

#include <asm/ioctls.h>

@@ -756,8 +757,9 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)

group->fanotify_data.user = user;
atomic_inc(&user->fanotify_listeners);
+ group->memcg = get_mem_cgroup_from_mm(current->mm);

- oevent = fanotify_alloc_event(NULL, FS_Q_OVERFLOW, NULL);
+ oevent = fanotify_alloc_event(NULL, FS_Q_OVERFLOW, NULL, group->memcg);
if (unlikely(!oevent)) {
fd = -ENOMEM;
goto out_destroy_group;
@@ -951,7 +953,8 @@ COMPAT_SYSCALL_DEFINE6(fanotify_mark,
*/
static int __init fanotify_user_setup(void)
{
- fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
+ fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
+ SLAB_PANIC|SLAB_ACCOUNT);
fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
fanotify_perm_event_cachep =
diff --git a/fs/notify/group.c b/fs/notify/group.c
index b7a4b6a69efa..3e56459f4773 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -22,6 +22,7 @@
#include <linux/srcu.h>
#include <linux/rculist.h>
#include <linux/wait.h>
+#include <linux/memcontrol.h>

#include <linux/fsnotify_backend.h>
#include "fsnotify.h"
@@ -36,6 +37,9 @@ static void fsnotify_final_destroy_group(struct fsnotify_group *group)
if (group->ops->free_group_priv)
group->ops->free_group_priv(group);

+ if (group->memcg)
+ mem_cgroup_put(group->memcg);
+
kfree(group);
}

diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 8b73332735ba..ed8e7b5f3981 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -98,7 +98,7 @@ int inotify_handle_event(struct fsnotify_group *group,
i_mark = container_of(inode_mark, struct inotify_inode_mark,
fsn_mark);

- event = kmalloc(alloc_len, GFP_KERNEL);
+ event = kmalloc_memcg(alloc_len, GFP_KERNEL, group->memcg);
if (unlikely(!event))
return -ENOMEM;

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 5c29bf16814f..e80f4656799f 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -38,6 +38,7 @@
#include <linux/uaccess.h>
#include <linux/poll.h>
#include <linux/wait.h>
+#include <linux/memcontrol.h>

#include "inotify.h"
#include "../fdinfo.h"
@@ -618,6 +619,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
oevent->name_len = 0;

group->max_events = max_events;
+ group->memcg = get_mem_cgroup_from_mm(current->mm);

spin_lock_init(&group->inotify_data.idr_lock);
idr_init(&group->inotify_data.idr);
@@ -785,7 +787,8 @@ static int __init inotify_user_setup(void)

BUG_ON(hweight32(ALL_INOTIFY_BITS) != 21);

- inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
+ inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark,
+ SLAB_PANIC|SLAB_ACCOUNT);

inotify_max_queued_events = 16384;
init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index e9191b416434..c0014d0c3783 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -432,7 +432,8 @@ int fsnotify_compare_groups(struct fsnotify_group *a, struct fsnotify_group *b)
static int fsnotify_attach_connector_to_object(
struct fsnotify_mark_connector __rcu **connp,
struct inode *inode,
- struct vfsmount *mnt)
+ struct vfsmount *mnt,
+ struct fsnotify_group *group)
{
struct fsnotify_mark_connector *conn;

@@ -517,7 +518,8 @@ static int fsnotify_add_mark_list(struct fsnotify_mark *mark,
conn = fsnotify_grab_connector(connp);
if (!conn) {
spin_unlock(&mark->lock);
- err = fsnotify_attach_connector_to_object(connp, inode, mnt);
+ err = fsnotify_attach_connector_to_object(connp, inode, mnt,
+ mark->group);
if (err)
return err;
goto restart;
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 067d52e95f02..e4428e383215 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -84,6 +84,8 @@ struct fsnotify_event_private_data;
struct fsnotify_fname;
struct fsnotify_iter_info;

+struct mem_cgroup;
+
/*
* Each group much define these ops. The fsnotify infrastructure will call
* these operations for each relevant group.
@@ -129,6 +131,8 @@ struct fsnotify_event {
* everything will be cleaned up.
*/
struct fsnotify_group {
+ const struct fsnotify_ops *ops; /* how this group handles things */
+
/*
* How the refcnt is used is up to each group. When the refcnt hits 0
* fsnotify will clean up all of the resources associated with this group.
@@ -139,8 +143,6 @@ struct fsnotify_group {
*/
refcount_t refcnt; /* things with interest in this group */

- const struct fsnotify_ops *ops; /* how this group handles things */
-
/* needed to send notification to userspace */
spinlock_t notification_lock; /* protect the notification_list */
struct list_head notification_list; /* list of event_holder this group needs to send to userspace */
@@ -162,6 +164,8 @@ struct fsnotify_group {
atomic_t num_marks; /* 1 for each mark and 1 for not being
* past the point of no return when freeing
* a group */
+ atomic_t user_waits; /* Number of tasks waiting for user
+ * response */
struct list_head marks_list; /* all inode marks for this group */

struct fasync_struct *fsn_fa; /* async notification */
@@ -169,8 +173,8 @@ struct fsnotify_group {
struct fsnotify_event *overflow_event; /* Event we queue when the
* notification list is too
* full */
- atomic_t user_waits; /* Number of tasks waiting for user
- * response */
+
+ struct mem_cgroup *memcg; /* memcg to charge allocations */

/* groups can define private fields here or use the void *private */
union {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9dec8a5c0ca2..ee4b6b9d6813 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -352,6 +352,8 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}

+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
css_put(&memcg->css);
@@ -809,6 +811,11 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}

+static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0dcd6ab6cc94..3a72394510a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -678,7 +678,7 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
}
EXPORT_SYMBOL(mem_cgroup_from_task);

-static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;

--
2.16.1.291.g4437f3f132-goog


2018-02-21 03:03:31

by Shakeel Butt

Subject: [PATCH v2 1/3] mm: memcg: plumbing memcg for kmem cache allocations

This patch introduces memcg variants of the kmem cache allocation
functions. Currently, for __GFP_ACCOUNT allocations, the kernel replaces
the root kmem cache with the memcg-specific kmem cache in order to
charge those allocations to the memcg; however, the memcg to charge is
extracted from the current task_struct. With these variants the memcg
can be provided explicitly by the caller instead of being deduced from
the current task.

These functions are useful for use cases where the allocations should be
charged to a memcg different from the caller's. One concrete use case is
the allocation of fsnotify event objects, which should be charged to the
listener instead of the producer.

To call these functions, the caller must hold a reference to the memcg.
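
A brief usage sketch (cache, node and memcg names are placeholders; the
*_memcg() variants are the ones added by this patch):

        /* charge 'target_memcg' instead of the current task's memcg */
        obj = kmem_cache_alloc_memcg(my_cachep, GFP_KERNEL, target_memcg);

        /* NUMA-aware variant with the same charging semantics */
        obj = kmem_cache_alloc_node_memcg(my_cachep, GFP_KERNEL, nid,
                                          target_memcg);

        /* passing NULL keeps the existing behaviour: the current task's
         * memcg is charged for __GFP_ACCOUNT/SLAB_ACCOUNT allocations */
        obj = kmem_cache_alloc_memcg(my_cachep, GFP_KERNEL, NULL);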

Signed-off-by: Shakeel Butt <[email protected]>
---
Changelog since v1:
- Fixed build for SLOB

include/linux/memcontrol.h | 3 +-
include/linux/slab.h | 41 ++++++++++++++++++++
mm/memcontrol.c | 18 +++++++--
mm/slab.c | 78 +++++++++++++++++++++++++++++++++-----
mm/slab.h | 6 +--
mm/slob.c | 7 ++++
mm/slub.c | 77 ++++++++++++++++++++++++++++++-------
7 files changed, 199 insertions(+), 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c79cdf9f8138..48eaf19859e9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1174,7 +1174,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
}
#endif

-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+ struct mem_cgroup *memcg);
void memcg_kmem_put_cache(struct kmem_cache *cachep);
int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
struct mem_cgroup *memcg);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 231abc8976c5..24355bc9e655 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -353,6 +353,8 @@ static __always_inline int kmalloc_index(size_t size)

void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
+void *kmem_cache_alloc_memcg(struct kmem_cache *, gfp_t flags,
+ struct mem_cgroup *memcg) __assume_slab_alignment __malloc;
void kmem_cache_free(struct kmem_cache *, void *);

/*
@@ -377,6 +379,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment __malloc;
void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node) __assume_slab_alignment __malloc;
+void *kmem_cache_alloc_node_memcg(struct kmem_cache *, gfp_t flags, int node,
+ struct mem_cgroup *memcg) __assume_slab_alignment __malloc;
#else
static __always_inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
@@ -387,15 +391,26 @@ static __always_inline void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t f
{
return kmem_cache_alloc(s, flags);
}
+
+static __always_inline void *kmem_cache_alloc_node_memcg(struct kmem_cache *s,
+ gfp_t flags, int node, struct mem_cgroup *memcg)
+{
+ return kmem_cache_alloc_memcg(s, flags, memcg);
+}
#endif

#ifdef CONFIG_TRACING
extern void *kmem_cache_alloc_trace(struct kmem_cache *, gfp_t, size_t) __assume_slab_alignment __malloc;
+extern void *kmem_cache_alloc_memcg_trace(struct kmem_cache *, gfp_t, size_t,
+ struct mem_cgroup *memcg) __assume_slab_alignment __malloc;

#ifdef CONFIG_NUMA
extern void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
gfp_t gfpflags,
int node, size_t size) __assume_slab_alignment __malloc;
+extern void *kmem_cache_alloc_node_memcg_trace(struct kmem_cache *s,
+ gfp_t gfpflags, int node, size_t size,
+ struct mem_cgroup *memcg) __assume_slab_alignment __malloc;
#else
static __always_inline void *
kmem_cache_alloc_node_trace(struct kmem_cache *s,
@@ -404,6 +419,13 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
{
return kmem_cache_alloc_trace(s, gfpflags, size);
}
+
+static __always_inline void *
+kmem_cache_alloc_node_memcg_trace(struct kmem_cache *s, gfp_t gfpflags,
+ int node, size_t size, struct mem_cgroup *memcg)
+{
+ return kmem_cache_alloc_memcg_trace(s, gfpflags, size, memcg);
+}
#endif /* CONFIG_NUMA */

#else /* CONFIG_TRACING */
@@ -416,6 +438,15 @@ static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s,
return ret;
}

+static __always_inline void *kmem_cache_alloc_memcg_trace(struct kmem_cache *s,
+ gfp_t flags, size_t size, struct mem_cgroup *memcg)
+{
+ void *ret = kmem_cache_alloc_memcg(s, flags, memcg);
+
+ kasan_kmalloc(s, ret, size, flags);
+ return ret;
+}
+
static __always_inline void *
kmem_cache_alloc_node_trace(struct kmem_cache *s,
gfp_t gfpflags,
@@ -426,6 +457,16 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
+
+static __always_inline void *
+kmem_cache_alloc_node_memcg_trace(struct kmem_cache *s, gfp_t gfpflags,
+ int node, size_t size, struct mem_cgroup *memcg)
+{
+ void *ret = kmem_cache_alloc_node_memcg(s, gfpflags, node, memcg);
+
+ kasan_kmalloc(s, ret, size, gfpflags);
+ return ret;
+}
#endif /* CONFIG_TRACING */

extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fffe502a2c7f..bd37e855e277 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -701,6 +701,15 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return memcg;
}

+static struct mem_cgroup *get_mem_cgroup(struct mem_cgroup *memcg)
+{
+ rcu_read_lock();
+ if (!css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+ return memcg;
+}
+
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -2246,9 +2255,9 @@ static inline bool memcg_kmem_bypass(void)
* done with it, memcg_kmem_put_cache() must be called to release the
* reference.
*/
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+ struct mem_cgroup *memcg)
{
- struct mem_cgroup *memcg;
struct kmem_cache *memcg_cachep;
int kmemcg_id;

@@ -2260,7 +2269,10 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
if (current->memcg_kmem_skip_account)
return cachep;

- memcg = get_mem_cgroup_from_mm(current->mm);
+ if (memcg)
+ memcg = get_mem_cgroup(memcg);
+ if (!memcg)
+ memcg = get_mem_cgroup_from_mm(current->mm);
kmemcg_id = READ_ONCE(memcg->kmemcg_id);
if (kmemcg_id < 0)
goto out;
diff --git a/mm/slab.c b/mm/slab.c
index 324446621b3e..3daeda62bd0c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3276,14 +3276,14 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,

static __always_inline void *
slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
- unsigned long caller)
+ struct mem_cgroup *memcg, unsigned long caller)
{
unsigned long save_flags;
void *ptr;
int slab_node = numa_mem_id();

flags &= gfp_allowed_mask;
- cachep = slab_pre_alloc_hook(cachep, flags);
+ cachep = slab_pre_alloc_hook(cachep, flags, memcg);
if (unlikely(!cachep))
return NULL;

@@ -3356,13 +3356,14 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
#endif /* CONFIG_NUMA */

static __always_inline void *
-slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
+slab_alloc(struct kmem_cache *cachep, gfp_t flags, struct mem_cgroup *memcg,
+ unsigned long caller)
{
unsigned long save_flags;
void *objp;

flags &= gfp_allowed_mask;
- cachep = slab_pre_alloc_hook(cachep, flags);
+ cachep = slab_pre_alloc_hook(cachep, flags, memcg);
if (unlikely(!cachep))
return NULL;

@@ -3536,7 +3537,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
*/
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
- void *ret = slab_alloc(cachep, flags, _RET_IP_);
+ void *ret = slab_alloc(cachep, flags, NULL, _RET_IP_);

kasan_slab_alloc(cachep, ret, flags);
trace_kmem_cache_alloc(_RET_IP_, ret,
@@ -3546,6 +3547,19 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
}
EXPORT_SYMBOL(kmem_cache_alloc);

+void *kmem_cache_alloc_memcg(struct kmem_cache *cachep, gfp_t flags,
+ struct mem_cgroup *memcg)
+{
+ void *ret = slab_alloc(cachep, flags, memcg, _RET_IP_);
+
+ kasan_slab_alloc(cachep, ret, flags);
+ trace_kmem_cache_alloc(_RET_IP_, ret,
+ cachep->object_size, cachep->size, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_memcg);
+
static __always_inline void
cache_alloc_debugcheck_after_bulk(struct kmem_cache *s, gfp_t flags,
size_t size, void **p, unsigned long caller)
@@ -3561,7 +3575,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
{
size_t i;

- s = slab_pre_alloc_hook(s, flags);
+ s = slab_pre_alloc_hook(s, flags, NULL);
if (!s)
return 0;

@@ -3602,7 +3616,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
void *ret;

- ret = slab_alloc(cachep, flags, _RET_IP_);
+ ret = slab_alloc(cachep, flags, NULL, _RET_IP_);

kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc(_RET_IP_, ret,
@@ -3610,6 +3624,21 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
+
+void *
+kmem_cache_alloc_memcg_trace(struct kmem_cache *cachep, gfp_t flags,
+ size_t size, struct mem_cgroup *memcg)
+{
+ void *ret;
+
+ ret = slab_alloc(cachep, flags, memcg, _RET_IP_);
+
+ kasan_kmalloc(cachep, ret, size, flags);
+ trace_kmalloc(_RET_IP_, ret,
+ size, cachep->size, flags);
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_memcg_trace);
#endif

#ifdef CONFIG_NUMA
@@ -3626,7 +3655,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
*/
void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
- void *ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
+ void *ret = slab_alloc_node(cachep, flags, nodeid, NULL, _RET_IP_);

kasan_slab_alloc(cachep, ret, flags);
trace_kmem_cache_alloc_node(_RET_IP_, ret,
@@ -3637,6 +3666,20 @@ void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
}
EXPORT_SYMBOL(kmem_cache_alloc_node);

+void *kmem_cache_alloc_node_memcg(struct kmem_cache *cachep, gfp_t flags,
+ int nodeid, struct mem_cgroup *memcg)
+{
+ void *ret = slab_alloc_node(cachep, flags, nodeid, memcg, _RET_IP_);
+
+ kasan_slab_alloc(cachep, ret, flags);
+ trace_kmem_cache_alloc_node(_RET_IP_, ret,
+ cachep->object_size, cachep->size,
+ flags, nodeid);
+
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node_memcg);
+
#ifdef CONFIG_TRACING
void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
gfp_t flags,
@@ -3645,7 +3688,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
{
void *ret;

- ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
+ ret = slab_alloc_node(cachep, flags, nodeid, NULL, _RET_IP_);

kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc_node(_RET_IP_, ret,
@@ -3654,6 +3697,21 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
+
+void *kmem_cache_alloc_node_memcg_trace(struct kmem_cache *cachep, gfp_t flags,
+ int nodeid, size_t size, struct mem_cgroup *memcg)
+{
+ void *ret;
+
+ ret = slab_alloc_node(cachep, flags, nodeid, memcg, _RET_IP_);
+
+ kasan_kmalloc(cachep, ret, size, flags);
+ trace_kmalloc_node(_RET_IP_, ret,
+ size, cachep->size,
+ flags, nodeid);
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node_memcg_trace);
#endif

static __always_inline void *
@@ -3700,7 +3758,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
- ret = slab_alloc(cachep, flags, caller);
+ ret = slab_alloc(cachep, flags, NULL, caller);

kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc(caller, ret,
diff --git a/mm/slab.h b/mm/slab.h
index 51813236e773..77b02583ee2c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -410,7 +410,7 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
}

static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
- gfp_t flags)
+ gfp_t flags, struct mem_cgroup *memcg)
{
flags &= gfp_allowed_mask;

@@ -423,8 +423,8 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
return NULL;

if (memcg_kmem_enabled() &&
- ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
- return memcg_kmem_get_cache(s);
+ ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT) || memcg))
+ return memcg_kmem_get_cache(s, memcg);

return s;
}
diff --git a/mm/slob.c b/mm/slob.c
index 623e8a5c46ce..49cdd24424b0 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -568,6 +568,13 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
}
EXPORT_SYMBOL(kmem_cache_alloc);

+void *kmem_cache_alloc_memcg(struct kmem_cache *cachep, gfp_t flags,
+ struct mem_cgroup *memcg)
+{
+ return kmem_cache_alloc(cachep, flags);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_memcg);
+
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t gfp, int node)
{
diff --git a/mm/slub.c b/mm/slub.c
index e381728a3751..061cfbc7c3d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2641,14 +2641,15 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* Otherwise we can simply pick the next object from the lockless free list.
*/
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
- gfp_t gfpflags, int node, unsigned long addr)
+ gfp_t gfpflags, int node, struct mem_cgroup *memcg,
+ unsigned long addr)
{
void *object;
struct kmem_cache_cpu *c;
struct page *page;
unsigned long tid;

- s = slab_pre_alloc_hook(s, gfpflags);
+ s = slab_pre_alloc_hook(s, gfpflags, memcg);
if (!s)
return NULL;
redo:
@@ -2727,15 +2728,15 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
return object;
}

-static __always_inline void *slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, unsigned long addr)
+static __always_inline void *slab_alloc(struct kmem_cache *s, gfp_t gfpflags,
+ struct mem_cgroup *memcg, unsigned long addr)
{
- return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
+ return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, memcg, addr);
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, NULL, _RET_IP_);

trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
s->size, gfpflags);
@@ -2744,21 +2745,44 @@ void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
}
EXPORT_SYMBOL(kmem_cache_alloc);

+void *kmem_cache_alloc_memcg(struct kmem_cache *s, gfp_t gfpflags,
+ struct mem_cgroup *memcg)
+{
+ void *ret = slab_alloc(s, gfpflags, memcg, _RET_IP_);
+
+ trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
+ s->size, gfpflags);
+
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_memcg);
+
#ifdef CONFIG_TRACING
void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, NULL, _RET_IP_);
trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
+
+void *kmem_cache_alloc_memcg_trace(struct kmem_cache *s, gfp_t gfpflags,
+ size_t size, struct mem_cgroup *memcg)
+{
+ void *ret = slab_alloc(s, gfpflags, memcg, _RET_IP_);
+
+ trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
+ kasan_kmalloc(s, ret, size, gfpflags);
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_memcg_trace);
#endif

#ifdef CONFIG_NUMA
void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
- void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_);
+ void *ret = slab_alloc_node(s, gfpflags, node, NULL, _RET_IP_);

trace_kmem_cache_alloc_node(_RET_IP_, ret,
s->object_size, s->size, gfpflags, node);
@@ -2767,12 +2791,24 @@ void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
}
EXPORT_SYMBOL(kmem_cache_alloc_node);

+void *kmem_cache_alloc_node_memcg(struct kmem_cache *s, gfp_t gfpflags,
+ int node, struct mem_cgroup *memcg)
+{
+ void *ret = slab_alloc_node(s, gfpflags, node, memcg, _RET_IP_);
+
+ trace_kmem_cache_alloc_node(_RET_IP_, ret,
+ s->object_size, s->size, gfpflags, node);
+
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node_memcg);
+
#ifdef CONFIG_TRACING
void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
gfp_t gfpflags,
int node, size_t size)
{
- void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_);
+ void *ret = slab_alloc_node(s, gfpflags, node, NULL, _RET_IP_);

trace_kmalloc_node(_RET_IP_, ret,
size, s->size, gfpflags, node);
@@ -2781,6 +2817,19 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
+
+void *kmem_cache_alloc_node_memcg_trace(struct kmem_cache *s, gfp_t gfpflags,
+ int node, size_t size, struct mem_cgroup *memcg)
+{
+ void *ret = slab_alloc_node(s, gfpflags, node, memcg, _RET_IP_);
+
+ trace_kmalloc_node(_RET_IP_, ret,
+ size, s->size, gfpflags, node);
+
+ kasan_kmalloc(s, ret, size, gfpflags);
+ return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node_memcg_trace);
#endif
#endif

@@ -3109,7 +3158,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
int i;

/* memcg and kmem_cache debug support */
- s = slab_pre_alloc_hook(s, flags);
+ s = slab_pre_alloc_hook(s, flags, NULL);
if (unlikely(!s))
return false;
/*
@@ -3755,7 +3804,7 @@ void *__kmalloc(size_t size, gfp_t flags)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, flags, _RET_IP_);
+ ret = slab_alloc(s, flags, NULL, _RET_IP_);

trace_kmalloc(_RET_IP_, ret, size, s->size, flags);

@@ -3800,7 +3849,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, flags, node, _RET_IP_);
+ ret = slab_alloc_node(s, flags, node, NULL, _RET_IP_);

trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node);

@@ -4305,7 +4354,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, gfpflags, caller);
+ ret = slab_alloc(s, gfpflags, NULL, caller);

/* Honor the call site pointer we received. */
trace_kmalloc(caller, ret, size, s->size, gfpflags);
@@ -4335,7 +4384,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, gfpflags, node, caller);
+ ret = slab_alloc_node(s, gfpflags, node, NULL, caller);

/* Honor the call site pointer we received. */
trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);
--
2.16.1.291.g4437f3f132-goog


2018-02-21 03:04:09

by Shakeel Butt

Subject: [PATCH v2 2/3] mm: memcg: plumbing memcg for kmalloc allocations

This patch introduces memcg variants of the kmalloc allocation
functions. kmalloc allocations are served from the kmem caches unless
the requested size is larger than KMALLOC_MAX_CACHE_SIZE, in which case
the kmem caches are bypassed and the request goes directly to the page
allocator. Currently, for __GFP_ACCOUNT kmalloc allocations, the memcg
of the current task is charged. The new variants allow callers to
explicitly provide the memcg to charge.
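
A brief usage sketch (buffer sizes and the memcg pointer are
placeholders; only the *_memcg() functions come from this patch):

        /* size <= KMALLOC_MAX_CACHE_SIZE: served from a kmem cache and
         * charged to 'memcg' */
        buf = kmalloc_memcg(count, GFP_KERNEL, memcg);

        /* larger sizes bypass the kmem caches; kmalloc_order_memcg()
         * charges the pages to 'memcg' via memcg_kmem_charge() */
        big = kmalloc_memcg(nr_pages * PAGE_SIZE, GFP_KERNEL, memcg);

        /* NUMA variant */
        buf = __kmalloc_node_memcg(count, GFP_KERNEL, nid, memcg);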

Signed-off-by: Shakeel Butt <[email protected]>
---
Changelog since v1:
- Fixed build for SLOB

include/linux/memcontrol.h | 3 +-
include/linux/slab.h | 45 +++++++++++++++++++++++---
mm/memcontrol.c | 9 ++++--
mm/page_alloc.c | 2 +-
mm/slab.c | 31 +++++++++++++-----
mm/slab_common.c | 41 +++++++++++++++++++++++-
mm/slob.c | 6 ++++
mm/slub.c | 65 +++++++++++++++++++++++++++++++-------
8 files changed, 172 insertions(+), 30 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 48eaf19859e9..9dec8a5c0ca2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1179,7 +1179,8 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
void memcg_kmem_put_cache(struct kmem_cache *cachep);
int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
struct mem_cgroup *memcg);
-int memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
+int memcg_kmem_charge(struct page *page, gfp_t gfp, int order,
+ struct mem_cgroup *memcg);
void memcg_kmem_uncharge(struct page *page, int order);

#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 24355bc9e655..9df5d6279b38 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -352,6 +352,8 @@ static __always_inline int kmalloc_index(size_t size)
#endif /* !CONFIG_SLOB */

void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
+void *__kmalloc_memcg(size_t size, gfp_t flags,
+ struct mem_cgroup *memcg) __assume_kmalloc_alignment __malloc;
void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
void *kmem_cache_alloc_memcg(struct kmem_cache *, gfp_t flags,
struct mem_cgroup *memcg) __assume_slab_alignment __malloc;
@@ -378,6 +380,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)

#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment __malloc;
+void *__kmalloc_node_memcg(size_t size, gfp_t flags, int node,
+ struct mem_cgroup *memcg) __assume_kmalloc_alignment __malloc;
void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node) __assume_slab_alignment __malloc;
void *kmem_cache_alloc_node_memcg(struct kmem_cache *, gfp_t flags, int node,
struct mem_cgroup *memcg) __assume_slab_alignment __malloc;
@@ -387,6 +391,12 @@ static __always_inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
return __kmalloc(size, flags);
}

+static __always_inline void *__kmalloc_node_memcg(size_t size, gfp_t flags,
+ int node, struct mem_cgroup *memcg)
+{
+ return __kmalloc_memcg(size, flags, memcg);
+}
+
static __always_inline void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node)
{
return kmem_cache_alloc(s, flags);
@@ -470,15 +480,26 @@ kmem_cache_alloc_node_memcg_trace(struct kmem_cache *s, gfp_t gfpflags,
#endif /* CONFIG_TRACING */

extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
+extern void *kmalloc_order_memcg(size_t size, gfp_t flags, unsigned int order,
+ struct mem_cgroup *memcg) __assume_page_alignment __malloc;

#ifdef CONFIG_TRACING
extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
+extern void *kmalloc_order_memcg_trace(size_t size, gfp_t flags,
+ unsigned int order,
+ struct mem_cgroup *memcg) __assume_page_alignment __malloc;
#else
static __always_inline void *
kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
{
return kmalloc_order(size, flags, order);
}
+static __always_inline void *
+kmalloc_order_memcg_trace(size_t size, gfp_t flags, unsigned int order,
+ struct mem_cgroup *memcg)
+{
+ return kmalloc_order_memcg(size, flags, order, memcg);
+}
#endif

static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
@@ -487,6 +508,14 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
return kmalloc_order_trace(size, flags, order);
}

+static __always_inline void *kmalloc_large_memcg(size_t size, gfp_t flags,
+ struct mem_cgroup *memcg)
+{
+ unsigned int order = get_order(size);
+
+ return kmalloc_order_memcg_trace(size, flags, order, memcg);
+}
+
/**
* kmalloc - allocate memory
* @size: how many bytes of memory are required.
@@ -538,11 +567,12 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
* for general use, and so are not documented here. For a full list of
* potential flags, always refer to linux/gfp.h.
*/
-static __always_inline void *kmalloc(size_t size, gfp_t flags)
+static __always_inline void *
+kmalloc_memcg(size_t size, gfp_t flags, struct mem_cgroup *memcg)
{
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large(size, flags);
+ return kmalloc_large_memcg(size, flags, memcg);
#ifndef CONFIG_SLOB
if (!(flags & GFP_DMA)) {
int index = kmalloc_index(size);
@@ -550,12 +580,17 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
if (!index)
return ZERO_SIZE_PTR;

- return kmem_cache_alloc_trace(kmalloc_caches[index],
- flags, size);
+ return kmem_cache_alloc_memcg_trace(
+ kmalloc_caches[index], flags, size, memcg);
}
#endif
}
- return __kmalloc(size, flags);
+ return __kmalloc_memcg(size, flags, memcg);
+}
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+ return kmalloc_memcg(size, flags, NULL);
}

/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd37e855e277..0dcd6ab6cc94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2348,15 +2348,18 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
*
* Returns 0 on success, an error code on failure.
*/
-int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+int memcg_kmem_charge(struct page *page, gfp_t gfp, int order,
+ struct mem_cgroup *memcg)
{
- struct mem_cgroup *memcg;
int ret = 0;

if (memcg_kmem_bypass())
return 0;

- memcg = get_mem_cgroup_from_mm(current->mm);
+ if (memcg)
+ memcg = get_mem_cgroup(memcg);
+ if (!memcg)
+ memcg = get_mem_cgroup_from_mm(current->mm);
if (!mem_cgroup_is_root(memcg)) {
ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
if (!ret)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2b42f603b1a..d65d58045893 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4261,7 +4261,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,

out:
if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
- unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+ unlikely(memcg_kmem_charge(page, gfp_mask, order, NULL) != 0)) {
__free_pages(page, order);
page = NULL;
}
diff --git a/mm/slab.c b/mm/slab.c
index 3daeda62bd0c..4282f5a84dcd 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3715,7 +3715,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_memcg_trace);
#endif

static __always_inline void *
-__do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller)
+__do_kmalloc_node(size_t size, gfp_t flags, int node, struct mem_cgroup *memcg,
+ unsigned long caller)
{
struct kmem_cache *cachep;
void *ret;
@@ -3723,7 +3724,8 @@ __do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller)
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
- ret = kmem_cache_alloc_node_trace(cachep, flags, node, size);
+ ret = kmem_cache_alloc_node_memcg_trace(cachep, flags, node, size,
+ memcg);
kasan_kmalloc(cachep, ret, size, flags);

return ret;
@@ -3731,14 +3733,21 @@ __do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller)

void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
- return __do_kmalloc_node(size, flags, node, _RET_IP_);
+ return __do_kmalloc_node(size, flags, node, NULL, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc_node);

+void *__kmalloc_node_memcg(size_t size, gfp_t flags, int node,
+ struct mem_cgroup *memcg)
+{
+ return __do_kmalloc_node(size, flags, node, memcg, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc_node_memcg);
+
void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
int node, unsigned long caller)
{
- return __do_kmalloc_node(size, flags, node, caller);
+ return __do_kmalloc_node(size, flags, node, NULL, caller);
}
EXPORT_SYMBOL(__kmalloc_node_track_caller);
#endif /* CONFIG_NUMA */
@@ -3750,7 +3759,7 @@ EXPORT_SYMBOL(__kmalloc_node_track_caller);
* @caller: function caller for debug tracking of the caller
*/
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
- unsigned long caller)
+ struct mem_cgroup *memcg, unsigned long caller)
{
struct kmem_cache *cachep;
void *ret;
@@ -3758,7 +3767,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
- ret = slab_alloc(cachep, flags, NULL, caller);
+ ret = slab_alloc(cachep, flags, memcg, caller);

kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc(caller, ret,
@@ -3769,13 +3778,19 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,

void *__kmalloc(size_t size, gfp_t flags)
{
- return __do_kmalloc(size, flags, _RET_IP_);
+ return __do_kmalloc(size, flags, NULL, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc);

+void *__kmalloc_memcg(size_t size, gfp_t flags, struct mem_cgroup *memcg)
+{
+ return __do_kmalloc(size, flags, memcg, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc_memcg);
+
void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
{
- return __do_kmalloc(size, flags, caller);
+ return __do_kmalloc(size, flags, NULL, caller);
}
EXPORT_SYMBOL(__kmalloc_track_caller);

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 10f127b2de7c..49aea3b0725d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1155,20 +1155,49 @@ void __init create_kmalloc_caches(slab_flags_t flags)
* directly to the page allocator. We use __GFP_COMP, because we will need to
* know the allocation order to free the pages properly in kfree.
*/
-void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
+static __always_inline void *__kmalloc_order_memcg(size_t size, gfp_t flags,
+ unsigned int order,
+ struct mem_cgroup *memcg)
{
void *ret;
struct page *page;

flags |= __GFP_COMP;
+
+ /*
+ * Do explicit targeted memcg charging instead of
+ * __alloc_pages_nodemask charging current memcg.
+ */
+ if (memcg && (flags & __GFP_ACCOUNT))
+ flags &= ~__GFP_ACCOUNT;
+
page = alloc_pages(flags, order);
+
+ if (memcg && page && memcg_kmem_enabled() &&
+ memcg_kmem_charge(page, flags, order, memcg)) {
+ __free_pages(page, order);
+ page = NULL;
+ }
+
ret = page ? page_address(page) : NULL;
kmemleak_alloc(ret, size, 1, flags);
kasan_kmalloc_large(ret, size, flags);
return ret;
}
+
+void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
+{
+ return __kmalloc_order_memcg(size, flags, order, NULL);
+}
EXPORT_SYMBOL(kmalloc_order);

+void *kmalloc_order_memcg(size_t size, gfp_t flags, unsigned int order,
+ struct mem_cgroup *memcg)
+{
+ return __kmalloc_order_memcg(size, flags, order, memcg);
+}
+EXPORT_SYMBOL(kmalloc_order_memcg);
+
#ifdef CONFIG_TRACING
void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
{
@@ -1177,6 +1206,16 @@ void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
return ret;
}
EXPORT_SYMBOL(kmalloc_order_trace);
+
+void *kmalloc_order_memcg_trace(size_t size, gfp_t flags, unsigned int order,
+ struct mem_cgroup *memcg)
+{
+ void *ret = kmalloc_order_memcg(size, flags, order, memcg);
+
+ trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << order, flags);
+ return ret;
+}
+EXPORT_SYMBOL(kmalloc_order_memcg_trace);
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
diff --git a/mm/slob.c b/mm/slob.c
index 49cdd24424b0..696baf517bda 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -470,6 +470,12 @@ void *__kmalloc(size_t size, gfp_t gfp)
}
EXPORT_SYMBOL(__kmalloc);

+void *__kmalloc_memcg(size_t size, gfp_t gfp, struct mem_cgroup *memcg)
+{
+ return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc_memcg);
+
void *__kmalloc_track_caller(size_t size, gfp_t gfp, unsigned long caller)
{
return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, caller);
diff --git a/mm/slub.c b/mm/slub.c
index 061cfbc7c3d7..5b119f4fb6bc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3791,13 +3791,14 @@ static int __init setup_slub_min_objects(char *str)

__setup("slub_min_objects=", setup_slub_min_objects);

-void *__kmalloc(size_t size, gfp_t flags)
+static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
+ struct mem_cgroup *memcg, unsigned long caller)
{
struct kmem_cache *s;
void *ret;

if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
- return kmalloc_large(size, flags);
+ return kmalloc_large_memcg(size, flags, memcg);

s = kmalloc_slab(size, flags);

@@ -3806,22 +3807,50 @@ void *__kmalloc(size_t size, gfp_t flags)

ret = slab_alloc(s, flags, NULL, _RET_IP_);

- trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
+ trace_kmalloc(caller, ret, size, s->size, flags);

kasan_kmalloc(s, ret, size, flags);

return ret;
}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+ return __do_kmalloc(size, flags, NULL, _RET_IP_);
+}
EXPORT_SYMBOL(__kmalloc);

+void *__kmalloc_memcg(size_t size, gfp_t flags, struct mem_cgroup *memcg)
+{
+ return __do_kmalloc(size, flags, memcg, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc_memcg);
+
#ifdef CONFIG_NUMA
-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
+static void *kmalloc_large_node(size_t size, gfp_t flags, int node,
+ struct mem_cgroup *memcg)
{
struct page *page;
void *ptr = NULL;
+ unsigned int order = get_order(size);

flags |= __GFP_COMP;
- page = alloc_pages_node(node, flags, get_order(size));
+
+ /*
+ * Do explicit targeted memcg charging instead of
+ * __alloc_pages_nodemask charging current memcg.
+ */
+ if (memcg && (flags & __GFP_ACCOUNT))
+ flags &= ~__GFP_ACCOUNT;
+
+ page = alloc_pages_node(node, flags, order);
+
+ if (memcg && page && memcg_kmem_enabled() &&
+ memcg_kmem_charge(page, flags, order, memcg)) {
+ __free_pages(page, order);
+ page = NULL;
+ }
+
if (page)
ptr = page_address(page);

@@ -3829,15 +3858,17 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
return ptr;
}

-void *__kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline void *
+__do_kmalloc_node_memcg(size_t size, gfp_t flags, int node,
+ struct mem_cgroup *memcg, unsigned long caller)
{
struct kmem_cache *s;
void *ret;

if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
- ret = kmalloc_large_node(size, flags, node);
+ ret = kmalloc_large_node(size, flags, node, memcg);

- trace_kmalloc_node(_RET_IP_, ret,
+ trace_kmalloc_node(caller, ret,
size, PAGE_SIZE << get_order(size),
flags, node);

@@ -3849,15 +3880,27 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, flags, node, NULL, _RET_IP_);
+ ret = slab_alloc_node(s, flags, node, memcg, caller);

- trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node);
+ trace_kmalloc_node(caller, ret, size, s->size, flags, node);

kasan_kmalloc(s, ret, size, flags);

return ret;
}
+
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ return __do_kmalloc_node_memcg(size, flags, node, NULL, _RET_IP_);
+}
EXPORT_SYMBOL(__kmalloc_node);
+
+void *__kmalloc_node_memcg(size_t size, gfp_t flags, int node,
+ struct mem_cgroup *memcg)
+{
+ return __do_kmalloc_node_memcg(size, flags, node, memcg, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc_node_memcg);
#endif

#ifdef CONFIG_HARDENED_USERCOPY
@@ -4370,7 +4413,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
void *ret;

if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
- ret = kmalloc_large_node(size, gfpflags, node);
+ ret = kmalloc_large_node(size, gfpflags, node, NULL);

trace_kmalloc_node(caller, ret,
size, PAGE_SIZE << get_order(size),
--
2.16.1.291.g4437f3f132-goog


Subject: Re: [PATCH v2 0/3] Directed kmem charging

Another way to solve this is to switch the user context right?

Isnt it possible to avoid these patches if do the allocation in another
task context instead?

Are there really any other use cases beyond fsnotify?


The charging of the memory works on a per page level but the allocation
occur from the same page for multiple tasks that may be running on a
system. So how relevant is this for other small objects?

Seems that if you do a large amount of allocations for the same purpose
your chance of accounting it to the right memcg increases. But this is a
game of chance.




Subject: Re: [PATCH v2 3/3] fs: fsnotify: account fsnotify metadata to kmemcg

On Tue, 20 Feb 2018, Shakeel Butt wrote:

> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6702a6a0bbb5..0d9493ebc7cd 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> if (fanotify_is_perm_event(mask)) {
> struct fanotify_perm_event_info *pevent;
>
> - pevent = kmem_cache_alloc(fanotify_perm_event_cachep,
> - GFP_KERNEL);
> + pevent = kmem_cache_alloc_memcg(fanotify_perm_event_cachep,
> + GFP_KERNEL, memcg);
> if (!pevent)

#1

> index 8b73332735ba..ed8e7b5f3981 100644
> --- a/fs/notify/inotify/inotify_fsnotify.c
> +++ b/fs/notify/inotify/inotify_fsnotify.c
> @@ -98,7 +98,7 @@ int inotify_handle_event(struct fsnotify_group *group,
> i_mark = container_of(inode_mark, struct inotify_inode_mark,
> fsn_mark);
>
> - event = kmalloc(alloc_len, GFP_KERNEL);
> + event = kmalloc_memcg(alloc_len, GFP_KERNEL, group->memcg);
> if (unlikely(!event))
> return -ENOMEM;

#2


So we have all this churn for those two allocations which are basically
the same code for two different notification schemes.

Could you store the task that is requesting the fsnotify action instead of
the memcg? Then do the allocation in the context of that task. That
reduces the modifications to fsnotify.


2018-02-21 19:09:17

by Shakeel Butt

Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter <[email protected]> wrote:
> Another way to solve this is to switch the user context right?
>
> Isnt it possible to avoid these patches if do the allocation in another
> task context instead?
>

Sorry, can you please explain what you mean by 'switch the user
context'. Is there any example in kernel which does something similar?

Another way is by adding a field 'remote_memcg_to_charge' in
task_struct and set it before the allocation and in memcontrol.c,
first check if current->remote_memcg_to_charge is set otherwise use
the memcg of current. Also if we provide a wrapper to do that for the
user, there will be a lot less plumbing.
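
A rough sketch of that alternative (nothing below is implemented by this
series; 'remote_memcg_to_charge' and the wrapper are hypothetical):

        /* wrapper sets the field around an ordinary allocation */
        current->remote_memcg_to_charge = memcg;
        p = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
        current->remote_memcg_to_charge = NULL;

        /* and memcontrol.c would prefer it over the current task's memcg */
        if (current->remote_memcg_to_charge)
                memcg = get_mem_cgroup(current->remote_memcg_to_charge);
        else
                memcg = get_mem_cgroup_from_mm(current->mm);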

Please let me know if you prefer this approach.


> Are there really any other use cases beyond fsnotify?
>

Another use case I have in mind and plan to upstream is to bind a
filesystem mount with a memcg. So, all the file pages (or anon pages
for shmem) and kmem (like inodes and dentry) will be charged to that
memcg.

>
> The charging of the memory works on a per page level but the allocation
> occur from the same page for multiple tasks that may be running on a
> system. So how relevant is this for other small objects?
>
> Seems that if you do a large amount of allocations for the same purpose
> your chance of accounting it to the right memcg increases. But this is a
> game of chance.
>
>
>

Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed, 21 Feb 2018, Shakeel Butt wrote:

> On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter <[email protected]> wrote:
> > Another way to solve this is to switch the user context right?
> >
> > Isnt it possible to avoid these patches if do the allocation in another
> > task context instead?
> >
>
> Sorry, can you please explain what you mean by 'switch the user
> context'. Is there any example in kernel which does something similar?

See include/linux/task_work.h. One use case is in mntput_no_expire() in
linux/fs/namespace.c

> > Are there really any other use cases beyond fsnotify?
> >
>
> Another use case I have in mind and plan to upstream is to bind a
> filesystem mount with a memcg. So, all the file pages (or anon pages
> for shmem) and kmem (like inodes and dentry) will be charged to that
> memcg.

The mount logic already uses task_work.h. That may be the approach to
expand there.

2018-02-21 20:07:45

by Shakeel Butt

Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed, Feb 21, 2018 at 9:57 AM, Christopher Lameter <[email protected]> wrote:
> On Wed, 21 Feb 2018, Shakeel Butt wrote:
>
>> On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter <[email protected]> wrote:
>> > Another way to solve this is to switch the user context right?
>> >
>> > Isnt it possible to avoid these patches if do the allocation in another
>> > task context instead?
>> >
>>
>> Sorry, can you please explain what you mean by 'switch the user
>> context'. Is there any example in kernel which does something similar?
>
> See include/linux/task_work.h. One use case is in mntput_no_expire() in
> linux/fs/namespace.c
>

From what I understand, using task_work would require the
fanotify/inotify event handlers to allocate memory asynchronously. IMHO
the code will be much more complex if we go through that route.

> Another way is by adding a field 'remote_memcg_to_charge' in
> task_struct and set it before the allocation and in memcontrol.c,
> first check if current->remote_memcg_to_charge is set otherwise use
> the memcg of current. Also if we provide a wrapper to do that for the
> user, there will be a lot less plumbing.
>
> Please let me know if you prefer this approach.
>

What do you think of the above approach? I think the amount and
complexity of code will be much less.

>> > Are there really any other use cases beyond fsnotify?
>> >
>>
>> Another use case I have in mind and plan to upstream is to bind a
>> filesystem mount with a memcg. So, all the file pages (or anon pages
>> for shmem) and kmem (like inodes and dentry) will be charged to that
>> memcg.
>
> The mount logic already uses task_work.h. That may be the approach to
> expand there.

The task_work approach requires that the job already be running at the
time of the mount operation. Usually the mount operations are done
either by the admin or by the control task starting the job, as part of
setting up the environment. So there might not be any such process
running at the time of the mount operation.

2018-02-21 20:57:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed, 21 Feb 2018 09:18:35 -0800 Shakeel Butt <[email protected]> wrote:

> On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter <[email protected]> wrote:
> > Another way to solve this is to switch the user context right?
> >
> > Isnt it possible to avoid these patches if do the allocation in another
> > task context instead?
> >
>
> Sorry, can you please explain what you mean by 'switch the user
> context'. Is there any example in kernel which does something similar?
>
> Another way is by adding a field 'remote_memcg_to_charge' in
> task_struct and set it before the allocation and in memcontrol.c,
> first check if current->remote_memcg_to_charge is set otherwise use
> the memcg of current. Also if we provide a wrapper to do that for the
> user, there will be a lot less plumbing.
>
> Please let me know if you prefer this approach.

That would be a lot simpler. Passing function arguments via
task_struct is a bit dirty but is sometimes sooo effective. You
should've seen how much mess task_struct.journal_info avoided! And
reclaim_state.

And one always wonders whether we should do a local save/restore before
modifying the task_struct field, so it nests.
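
(Illustration only, reusing the hypothetical field name from above: a
nesting save/restore would be something like

        struct mem_cgroup *old = current->remote_memcg_to_charge;

        current->remote_memcg_to_charge = memcg;
        /* ... allocations charged to 'memcg' ... */
        current->remote_memcg_to_charge = old; /* restore any outer value */

so that an inner user does not clobber an outer one.)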

What do others think?


Maybe we can rename task_struct.reclaim_state to `struct task_mm_state
*task_mm_state", add remote_memcg_to_charge to struct task_mm_state and
avoid bloating the task_struct?


2018-02-22 13:47:33

by Jan Kara

Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed 21-02-18 12:54:26, Andrew Morton wrote:
> On Wed, 21 Feb 2018 09:18:35 -0800 Shakeel Butt <[email protected]> wrote:
>
> > On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter <[email protected]> wrote:
> > > Another way to solve this is to switch the user context right?
> > >
> > > Isnt it possible to avoid these patches if do the allocation in another
> > > task context instead?
> > >
> >
> > Sorry, can you please explain what you mean by 'switch the user
> > context'. Is there any example in kernel which does something similar?
> >
> > Another way is by adding a field 'remote_memcg_to_charge' in
> > task_struct and set it before the allocation and in memcontrol.c,
> > first check if current->remote_memcg_to_charge is set otherwise use
> > the memcg of current. Also if we provide a wrapper to do that for the
> > user, there will be a lot less plumbing.
> >
> > Please let me know if you prefer this approach.
>
> That would be a lot simpler. Passing function arguments via
> task_struct is a bit dirty but is sometimes sooo effective. You
> should've seen how much mess task_struct.journal_info avoided! And
> reclaim_state.

Agreed, although from time to time people try to be too creative e.g. with
journal_info and surprising bugs come out of that :).

> And one always wonders whether we should do a local save/restore before
> modifying the task_struct field, so it nests.
>
> What do others think?

Sounds nice to me.

> Maybe we can rename task_struct.reclaim_state to 'struct task_mm_state
> *task_mm_state', add remote_memcg_to_charge to struct task_mm_state and
> avoid bloating the task_struct?

Yeah, even better, but then we really need to make sure these things stack
properly.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-02-22 13:50:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] fs: fsnotify: account fsnotify metadata to kmemcg

On Tue 20-02-18 19:01:01, Shakeel Butt wrote:
> A lot of memory can be consumed by the events generated for the huge or
> unlimited queues if there is either no or slow listener. This can cause
> system level memory pressure or OOMs. So, it's better to account the
> fsnotify kmem caches to the memcg of the listener.

How much memory are we talking about here?

> There are seven fsnotify kmem caches and among them allocations from
> dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> inotify_inode_mark_cachep happens in the context of syscall from the
> listener. So, SLAB_ACCOUNT is enough for these caches.
>
> The objects from fsnotify_mark_connector_cachep are not accounted as
> they are small compared to the notification mark or events and it is
> unclear whom to account connector to since it is shared by all events
> attached to the inode.
>
> The allocations from the event caches happen in the context of the event
> producer. For such caches we will need to remote charge the allocations
> to the listener's memcg. Thus we save the memcg reference in the
> fsnotify_group structure of the listener.

Is it typical that the listener lives in a different memcg and if yes
then cannot this cause one memcg to OOM/DoS the one with the listener?

> This patch has also moved the members of fsnotify_group to keep the
> size same, at least for 64 bit build, even with additional member by
> filling the holes.
>
> Signed-off-by: Shakeel Butt <[email protected]>
[...]
--
Michal Hocko
SUSE Labs

2018-02-22 13:54:46

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed 21-02-18 11:57:47, Christopher Lameter wrote:
> On Wed, 21 Feb 2018, Shakeel Butt wrote:
>
> > On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter <[email protected]> wrote:
> > > Another way to solve this is to switch the user context right?
> > >
> > > Isnt it possible to avoid these patches if do the allocation in another
> > > task context instead?
> > >
> >
> > Sorry, can you please explain what you mean by 'switch the user
> > context'. Is there any example in kernel which does something similar?
>
> See include/linux/task_work.h. One use case is in mntput_no_expire() in
> linux/fs/namespace.c
>
> > > Are there really any other use cases beyond fsnotify?
> > >
> >
> > Another use case I have in mind and plan to upstream is to bind a
> > filesystem mount with a memcg. So, all the file pages (or anon pages
> > for shmem) and kmem (like inodes and dentry) will be charged to that
> > memcg.
>
> The mount logic already uses task_work.h. That may be the approach to
> expand there.

I don't see how task work can be used here. Firstly I don't know of a case
where task work would be used for something else than the current task -
and that is substantial because otherwise you have to deal with lots of
problems like races with task exit, when work gets executed (normally it
gets executed once task exits to userspace) etc. Or do you mean that you'd
queue task work for current task and then somehow magically switch memcg
there? In that case this magic switching isn't clear to me...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-02-22 14:50:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] fs: fsnotify: account fsnotify metadata to kmemcg

On Thu 22-02-18 14:49:44, Michal Hocko wrote:
> On Tue 20-02-18 19:01:01, Shakeel Butt wrote:
> > A lot of memory can be consumed by the events generated for the huge or
> > unlimited queues if there is either no or slow listener. This can cause
> > system level memory pressure or OOMs. So, it's better to account the
> > fsnotify kmem caches to the memcg of the listener.
>
> How much memory are we talking about here?

32 bytes per event (on 64-bit) which is small but the number of events is
not limited in any way (if the creator uses a special flag and has
CAP_SYS_ADMIN). In the thread [1] a guy from Alibaba wanted this feature so
among cloud people there is apparently some demand to have a way to limit
memory usage of such application...

> > There are seven fsnotify kmem caches and among them allocations from
> > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > inotify_inode_mark_cachep happens in the context of syscall from the
> > listener. So, SLAB_ACCOUNT is enough for these caches.
> >
> > The objects from fsnotify_mark_connector_cachep are not accounted as
> > they are small compared to the notification mark or events and it is
> > unclear whom to account connector to since it is shared by all events
> > attached to the inode.
> >
> > The allocations from the event caches happen in the context of the event
> > producer. For such caches we will need to remote charge the allocations
> > to the listener's memcg. Thus we save the memcg reference in the
> > fsnotify_group structure of the listener.
>
> Is it typical that the listener lives in a different memcg and if yes
> then cannot this cause one memcg to OOM/DoS the one with the listener?

We have been through these discussions already in [1] back in November :).
I can understand the wish to limit memory usage of an application using
unlimited fanotify queues. And yes, it may mean that it will be easier for
an attacker to get it oom-killed (currently the malicious app would drive
the whole system oom which will presumably take a bit more effort as there
is more memory to consume). But then I expect this is what admin prefers
when he limits memory usage of fanotify listener.

I cannot tell how common it is for producer and listener to be in different
memcgs. From Alibaba request it seems it happens...

Honza

[1] https://lkml.org/lkml/2017/10/27/523
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-02-22 16:46:00

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] fs: fsnotify: account fsnotify metadata to kmemcg

On Thu, Feb 22, 2018 at 6:48 AM, Jan Kara <[email protected]> wrote:
> On Thu 22-02-18 14:49:44, Michal Hocko wrote:
>> On Tue 20-02-18 19:01:01, Shakeel Butt wrote:
>> > A lot of memory can be consumed by the events generated for the huge or
>> > unlimited queues if there is either no or slow listener. This can cause
>> > system level memory pressure or OOMs. So, it's better to account the
>> > fsnotify kmem caches to the memcg of the listener.
>>
>> How much memory are we talking about here?
>
> 32 bytes per event (on 64-bit) which is small but the number of events is
> not limited in any way (if the creator uses a special flag and has
> CAP_SYS_ADMIN). In the thread [1] a guy from Alibaba wanted this feature so
> among cloud people there is apparently some demand to have a way to limit
> memory usage of such application...

Yes, I'm the guy from Alibaba :-) We did run into such an issue
occasionally, so I proposed the patch to account fsnotify kmem to
memcg, although we later fixed the bug in the user space applications.
However, such accounting still sounds useful to me.

>
>> > There are seven fsnotify kmem caches and among them allocations from
>> > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
>> > inotify_inode_mark_cachep happens in the context of syscall from the
>> > listener. So, SLAB_ACCOUNT is enough for these caches.
>> >
>> > The objects from fsnotify_mark_connector_cachep are not accounted as
>> > they are small compared to the notification mark or events and it is
>> > unclear whom to account connector to since it is shared by all events
>> > attached to the inode.
>> >
>> > The allocations from the event caches happen in the context of the event
>> > producer. For such caches we will need to remote charge the allocations
>> > to the listener's memcg. Thus we save the memcg reference in the
>> > fsnotify_group structure of the listener.
>>
>> Is it typical that the listener lives in a different memcg and if yes
>> then cannot this cause one memcg to OOM/DoS the one with the listener?
>
> We have been through these discussions already in [1] back in November :).
> I can understand the wish to limit memory usage of an application using
> unlimited fanotify queues. And yes, it may mean that it will be easier for
> an attacker to get it oom-killed (currently the malicious app would drive
> the whole system oom which will presumably take a bit more effort as there
> is more memory to consume). But then I expect this is what admin prefers
> when he limits memory usage of fanotify listener.
>
> I cannot tell how common it is for producer and listener to be in different
> memcgs. From Alibaba request it seems it happens...

For our use case, we didn't have producers and listeners in
different memcgs (please see the original discussion here:
https://lkml.org/lkml/2017/10/20/819). The cross-memcg accounting
problem was raised by Amir, since the accounting might be unfair if the
listeners don't consume events, and heuristic if the producer and
listener are in different memcgs.

However, we don't have a strong demand for this from our perspective
for the time being, so I didn't continue to move forward on this
approach.

Regards,
Yang


>
> Honza
>
> [1] https://lkml.org/lkml/2017/10/27/523
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>

2018-02-22 19:01:40

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] fs: fsnotify: account fsnotify metadata to kmemcg

On Thu, Feb 22, 2018 at 6:48 AM, Jan Kara <[email protected]> wrote:
> On Thu 22-02-18 14:49:44, Michal Hocko wrote:
>> On Tue 20-02-18 19:01:01, Shakeel Butt wrote:
>> > A lot of memory can be consumed by the events generated for the huge or
>> > unlimited queues if there is either no or slow listener. This can cause
>> > system level memory pressure or OOMs. So, it's better to account the
>> > fsnotify kmem caches to the memcg of the listener.
>>
>> How much memory are we talking about here?
>
> 32 bytes per event (on 64-bit) which is small but the number of events is
> not limited in any way (if the creator uses a special flag and has
> CAP_SYS_ADMIN). In the thread [1] a guy from Alibaba wanted this feature so
> among cloud people there is apparently some demand to have a way to limit
> memory usage of such application...
>
>> > There are seven fsnotify kmem caches and among them allocations from
>> > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
>> > inotify_inode_mark_cachep happens in the context of syscall from the
>> > listener. So, SLAB_ACCOUNT is enough for these caches.
>> >
>> > The objects from fsnotify_mark_connector_cachep are not accounted as
>> > they are small compared to the notification mark or events and it is
>> > unclear whom to account connector to since it is shared by all events
>> > attached to the inode.
>> >
>> > The allocations from the event caches happen in the context of the event
>> > producer. For such caches we will need to remote charge the allocations
>> > to the listener's memcg. Thus we save the memcg reference in the
>> > fsnotify_group structure of the listener.
>>
>> Is it typical that the listener lives in a different memcg and if yes
>> then cannot this cause one memcg to OOM/DoS the one with the listener?
>
> We have been through these discussions already in [1] back in November :).
> I can understand the wish to limit memory usage of an application using
> unlimited fanotify queues. And yes, it may mean that it will be easier for
> an attacker to get it oom-killed (currently the malicious app would drive
> the whole system oom which will presumably take a bit more effort as there
> is more memory to consume). But then I expect this is what admin prefers
> when he limits memory usage of fanotify listener.
>

Just one clarification: currently the kernel does not trigger the
oom-killer for allocations hitting the memcg limit in the context of
syscalls but rather returns ENOMEM (after trying memcg reclaim). Jan
has already posted a patch to handle those ENOMEMs.
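
A minimal sketch of what that means at an allocation site (illustrative
only, not the actual fsnotify code; produce_event() is a made-up helper):

#include <linux/slab.h>
#include <linux/errno.h>

/*
 * When the charge hits the (remote) memcg limit in syscall context, the
 * allocator tries memcg reclaim and then fails the charge, so the caller
 * sees NULL and the producing syscall returns -ENOMEM; the memcg OOM
 * killer is not invoked on this path.
 */
static int produce_event(struct kmem_cache *event_cachep, void **out)
{
        void *event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);

        if (!event)
                return -ENOMEM;         /* surfaced to the caller's syscall */

        *out = event;
        return 0;
}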

Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Thu, 22 Feb 2018, Jan Kara wrote:

> I don't see how task work can be used here. Firstly I don't know of a case
> where task work would be used for something else than the current task -
> and that is substantial because otherwise you have to deal with lots of
> problems like races with task exit, when work gets executed (normally it
> gets executed once task exits to userspace) etc. Or do you mean that you'd
> queue task work for current task and then somehow magically switch memcg
> there? In that case this magic switching isn't clear to me...

That's surprising since one can specify the task. If it's only for current
then why do you need a parameter?

I think a capability of executing a function in the context of another
running task could simplify a lot. In particular if something in the
kernel triggers behavior that is related to another task, like what's
happening here.

It should not be that difficult to do proper synchronization on the list
of work and then set some flag (maybe SIGPENDING) to make the task execute
what's on the task list. Signal delivery is, after all, similar.
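
A rough sketch of queueing a callback into another task via the existing
task_work_add() (signature as in kernels of this era; remote_alloc_work
and both helpers below are hypothetical). Note that the callback only
runs when the target next returns to userspace, which is part of the
objection quoted above:

#include <linux/task_work.h>
#include <linux/slab.h>
#include <linux/sched.h>

struct remote_alloc_work {
        struct callback_head    cb;     /* embedded for task_work_add() */
        size_t                  size;   /* what the producer wants allocated */
};

static void remote_alloc_fn(struct callback_head *cb)
{
        struct remote_alloc_work *w =
                container_of(cb, struct remote_alloc_work, cb);

        /* Runs in the listener's context, so charged to its memcg. */
        void *p = kmalloc(w->size, GFP_KERNEL_ACCOUNT);

        kfree(p);                       /* illustration only */
        kfree(w);
}

static int queue_remote_alloc(struct task_struct *listener, size_t size)
{
        struct remote_alloc_work *w = kmalloc(sizeof(*w), GFP_KERNEL);

        if (!w)
                return -ENOMEM;
        w->size = size;
        init_task_work(&w->cb, remote_alloc_fn);
        /* -ESRCH if the listener is already past its last task_work run. */
        return task_work_add(listener, &w->cb, true);
}

The producer also cannot get the allocation result back synchronously
this way, which is why the fsnotify event path would need restructuring
to consume events allocated like this.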



Subject: Re: [PATCH v2 0/3] Directed kmem charging

On Wed, 21 Feb 2018, Andrew Morton wrote:

> What do others think?

I think the changes to the hot paths of the slab allocators, increasing
register pressure in some of the hottest paths of the kernel, are
problematic.

It's better to do the allocation properly in the task context to which it
is finally charged. There may be other restrictions that emerge from other
fields in the task_struct that also may influence allocation and reclaim
behavior. It is cleanest to do this in the proper task context instead of
taking a piece (like the cgroup) out of context.