Currently bpf is using the memlock rlimit for memory accounting.
This approach has its downsides, and over time it has created a significant
number of problems:
1) The limit is per-user, but because most bpf operations are performed
   as root, the limit has little value.
2) It's hard to come up with a specific maximum value, especially because
   the counter is shared with non-bpf users (e.g. mlock() users).
   Any specific value is either too low, creating false failures,
   or too high and therefore useless.
3) Charging is not connected to the actual memory allocation. Bpf code
   has to manually calculate the estimated cost, precharge the counter,
   and then take care of uncharging, including on all failure paths.
   This adds to the code complexity and makes it easy to leak a charge.
4) There is no simple way to read the current value of the counter.
   We've used drgn for it, but it's far from convenient.
5) A cryptic -EPERM is returned when the limit is exceeded. Libbpf even had
   a function to "explain" this case to users; the common user-space
   workaround is sketched below.
In order to overcome these problems, let's switch to memcg-based
memory accounting for bpf objects. With the recent addition of percpu
memory accounting, it's now possible to provide comprehensive accounting
of the memory used by bpf programs and maps.
This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
better control over memory usage by different workloads. Of course, it
requires enabled cgroups, kernel memory accounting, and a properly configured
cgroup tree, but this is the default configuration for a modern Linux system.
2) The actual memory consumption is taken into account. Charging happens
automatically at allocation time if the __GFP_ACCOUNT flag is passed.
Uncharging is also performed automatically when the memory is released, so
the code on the bpf side becomes simpler and safer (see the sketch after this list).
3) There is a simple way to get the current value and statistics.
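As a minimal sketch of point 2 (the struct and function names below are
placeholders, not actual bpf code): passing __GFP_ACCOUNT, or one of the
*_ACCOUNT gfp constants, is all that's needed to charge the allocating
task's memory cgroup, and freeing the memory uncharges it automatically:

#include <linux/gfp.h>
#include <linux/slab.h>

struct example_obj {
	int foo;
};

static struct example_obj *example_alloc(void)
{
	/* charged to the current task's memcg at allocation time */
	return kzalloc(sizeof(struct example_obj), GFP_KERNEL_ACCOUNT);
}

static void example_free(struct example_obj *obj)
{
	kfree(obj);	/* uncharged automatically */
}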
In general, if a process performs a bpf operation (e.g. creates or updates
a map), its memory cgroup is charged. However, map updates performed from
an interrupt context are charged to the memory cgroup of the process
that created the map.
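In outline, the mechanism implemented by the corresponding patch below
looks like this (a simplified sketch with a placeholder function name;
see the actual patch further down for the real code and the review
discussion about its limitations):

/* At map creation time, remember the creator's memory cgroup. */
static void bpf_map_save_memcg(struct bpf_map *map)
{
	map->memcg = get_mem_cgroup_from_mm(current->mm);
}

/* When an update comes from an interrupt context, temporarily make the
 * saved memcg the "active" one, so that any allocation performed by the
 * map-specific update callback is charged to it.
 */
static int example_update(struct bpf_map *map, void *key, void *value,
			  u64 flags)
{
	struct mem_cgroup *old_memcg = NULL;
	bool in_irq = in_interrupt();
	int ret;

	if (in_irq)
		old_memcg = set_active_memcg(map->memcg);
	ret = map->ops->map_update_elem(map, key, value, flags);
	if (in_irq)
		set_active_memcg(old_memcg);
	return ret;
}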
Providing a 1:1 replacement for the rlimit-based memory accounting is
a non-goal of this patchset. Users and memory cgroups are completely
orthogonal, so it's not possible even in theory.
Memcg-based memory accounting requires a properly configured cgroup tree
to be actually useful. However, this is how memory is managed
on a modern Linux system.
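To illustrate the visibility point: once the bpf-using workload runs in
its own cgroup, the memory consumed by its maps and programs shows up in
the standard cgroup v2 interface files. A trivial reader (the cgroup path
below is hypothetical):

#include <stdio.h>

int main(void)
{
	char buf[64];
	FILE *f = fopen("/sys/fs/cgroup/my-workload/memory.current", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("memory.current: %s", buf);
	fclose(f);
	return 0;
}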
The patchset consists of the following parts:
1) 4 mm patches, which are already in the mm tree, but are required
to avoid a regression (otherwise vmallocs cannot be mapped to userspace).
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples
The first 4 patches are not supposed to be merged via the bpf tree. I'm including
them to make sure the bpf tests will pass.
v6:
- rebased to the latest version of the remote charging API
- fixed signatures, added acks
v5:
- rebased to the latest version of the remote charging API
- implemented kmem accounting from an interrupt context, by Shakeel
- rebased on top of the latest mm changes that allow mapping vmallocs to userspace
- fixed a build issue in kselftests, by Alexei
- fixed a use-after-free bug in bpf_map_free_deferred()
- added bpf line info coverage, by Shakeel
- split bpf map charging preparations into a separate patch
v4:
- covered allocations made from an interrupt context, by Daniel
- added some clarifications to the cover letter
v3:
- dropped the userspace part for further discussions/refinements,
by Andrii and Song
v2:
- fixed a build issue caused by the remaining rlimit-based accounting
for sockhash maps
Roman Gushchin (34):
mm: memcontrol: use helpers to read page's memcg data
mm: memcontrol/slab: use helpers to access slab page's memcg_data
mm: introduce page memcg flags
mm: convert page kmemcg type to a page memcg flag
bpf: memcg-based memory accounting for bpf progs
bpf: prepare for memcg-based memory accounting for bpf maps
bpf: memcg-based memory accounting for bpf maps
bpf: refine memcg-based memory accounting for arraymap maps
bpf: refine memcg-based memory accounting for cpumap maps
bpf: memcg-based memory accounting for cgroup storage maps
bpf: refine memcg-based memory accounting for devmap maps
bpf: refine memcg-based memory accounting for hashtab maps
bpf: memcg-based memory accounting for lpm_trie maps
bpf: memcg-based memory accounting for bpf ringbuffer
bpf: memcg-based memory accounting for bpf local storage maps
bpf: refine memcg-based memory accounting for sockmap and sockhash
maps
bpf: refine memcg-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for arraymap maps
bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
bpf: eliminate rlimit-based memory accounting for cpumap maps
bpf: eliminate rlimit-based memory accounting for cgroup storage maps
bpf: eliminate rlimit-based memory accounting for devmap maps
bpf: eliminate rlimit-based memory accounting for hashtab maps
bpf: eliminate rlimit-based memory accounting for lpm_trie maps
bpf: eliminate rlimit-based memory accounting for queue_stack_maps
maps
bpf: eliminate rlimit-based memory accounting for reuseport_array maps
bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
bpf: eliminate rlimit-based memory accounting for sockmap and sockhash
maps
bpf: eliminate rlimit-based memory accounting for stackmap maps
bpf: eliminate rlimit-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for bpf local storage
maps
bpf: eliminate rlimit-based memory accounting infra for bpf maps
bpf: eliminate rlimit-based memory accounting for bpf progs
bpf: samples: do not touch RLIMIT_MEMLOCK
fs/buffer.c | 2 +-
fs/iomap/buffered-io.c | 2 +-
include/linux/bpf.h | 27 +--
include/linux/memcontrol.h | 215 +++++++++++++++++-
include/linux/mm.h | 22 --
include/linux/mm_types.h | 5 +-
include/linux/page-flags.h | 11 +-
include/trace/events/writeback.h | 2 +-
kernel/bpf/arraymap.c | 30 +--
kernel/bpf/bpf_local_storage.c | 18 +-
kernel/bpf/bpf_struct_ops.c | 19 +-
kernel/bpf/core.c | 22 +-
kernel/bpf/cpumap.c | 20 +-
kernel/bpf/devmap.c | 23 +-
kernel/bpf/hashtab.c | 33 +--
kernel/bpf/helpers.c | 37 ++-
kernel/bpf/local_storage.c | 38 +---
kernel/bpf/lpm_trie.c | 17 +-
kernel/bpf/queue_stack_maps.c | 16 +-
kernel/bpf/reuseport_array.c | 12 +-
kernel/bpf/ringbuf.c | 33 +--
kernel/bpf/stackmap.c | 16 +-
kernel/bpf/syscall.c | 177 ++++----------
kernel/fork.c | 7 +-
mm/debug.c | 4 +-
mm/huge_memory.c | 4 +-
mm/memcontrol.c | 139 +++++------
mm/page_alloc.c | 8 +-
mm/page_io.c | 6 +-
mm/slab.h | 38 +---
mm/workingset.c | 2 +-
net/core/bpf_sk_storage.c | 2 +-
net/core/sock_map.c | 40 +---
net/xdp/xskmap.c | 15 +-
samples/bpf/map_perf_test_user.c | 6 -
samples/bpf/offwaketime_user.c | 6 -
samples/bpf/sockex2_user.c | 2 -
samples/bpf/sockex3_user.c | 2 -
samples/bpf/spintest_user.c | 6 -
samples/bpf/syscall_tp_user.c | 2 -
samples/bpf/task_fd_query_user.c | 5 -
samples/bpf/test_lru_dist.c | 3 -
samples/bpf/test_map_in_map_user.c | 6 -
samples/bpf/test_overhead_user.c | 2 -
samples/bpf/trace_event_user.c | 2 -
samples/bpf/tracex2_user.c | 6 -
samples/bpf/tracex3_user.c | 6 -
samples/bpf/tracex4_user.c | 6 -
samples/bpf/tracex5_user.c | 3 -
samples/bpf/tracex6_user.c | 3 -
samples/bpf/xdp1_user.c | 6 -
samples/bpf/xdp_adjust_tail_user.c | 6 -
samples/bpf/xdp_monitor_user.c | 5 -
samples/bpf/xdp_redirect_cpu_user.c | 6 -
samples/bpf/xdp_redirect_map_user.c | 6 -
samples/bpf/xdp_redirect_user.c | 6 -
samples/bpf/xdp_router_ipv4_user.c | 6 -
samples/bpf/xdp_rxq_info_user.c | 6 -
samples/bpf/xdp_sample_pkts_user.c | 6 -
samples/bpf/xdp_tx_iptunnel_user.c | 6 -
samples/bpf/xdpsock_user.c | 7 -
.../selftests/bpf/progs/bpf_iter_bpf_map.c | 2 +-
.../selftests/bpf/progs/map_ptr_kern.c | 7 -
63 files changed, 460 insertions(+), 743 deletions(-)
--
2.26.2
Include metadata and percpu data into the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/cpumap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index c61a23b564aa..563f96cc8a9d 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -97,7 +97,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
attr->map_flags & ~BPF_F_NUMA_NODE)
return ERR_PTR(-EINVAL);
- cmap = kzalloc(sizeof(*cmap), GFP_USER);
+ cmap = kzalloc(sizeof(*cmap), GFP_USER | __GFP_ACCOUNT);
if (!cmap)
return ERR_PTR(-ENOMEM);
@@ -415,7 +415,7 @@ static struct bpf_cpu_map_entry *
__cpu_map_entry_alloc(struct bpf_cpumap_val *value, u32 cpu, int map_id)
{
int numa, err, i, fd = value->bpf_prog.fd;
- gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
+ gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_NOWARN;
struct bpf_cpu_map_entry *rcpu;
struct xdp_bulk_queue *bq;
--
2.26.2
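For reference, GFP_KERNEL_ACCOUNT used in the second hunk above is simply
the accounted variant of GFP_KERNEL, so both hunks opt the cpumap
allocations into memcg accounting in the same way:

/* include/linux/gfp.h (for reference) */
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)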
Do not use rlimit-based memory accounting for queue_stack maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/queue_stack_maps.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 0ee2347ba510..f9c734aaa990 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -66,29 +66,21 @@ static int queue_stack_map_alloc_check(union bpf_attr *attr)
static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
{
- int ret, numa_node = bpf_map_attr_numa_node(attr);
- struct bpf_map_memory mem = {0};
+ int numa_node = bpf_map_attr_numa_node(attr);
struct bpf_queue_stack *qs;
- u64 size, queue_size, cost;
+ u64 size, queue_size;
size = (u64) attr->max_entries + 1;
- cost = queue_size = sizeof(*qs) + size * attr->value_size;
-
- ret = bpf_map_charge_init(&mem, cost);
- if (ret < 0)
- return ERR_PTR(ret);
+ queue_size = sizeof(*qs) + size * attr->value_size;
qs = bpf_map_area_alloc(queue_size, numa_node);
- if (!qs) {
- bpf_map_charge_finish(&mem);
+ if (!qs)
return ERR_PTR(-ENOMEM);
- }
memset(qs, 0, sizeof(*qs));
bpf_map_init_from_attr(&qs->map, attr);
- bpf_map_charge_move(&qs->map.memory, &mem);
qs->size = size;
raw_spin_lock_init(&qs->lock);
--
2.26.2
Do not use rlimit-based memory accounting for cgroup storage maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/local_storage.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 212d6dbbc39a..c28a47d5177a 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -288,8 +288,6 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
{
int numa_node = bpf_map_attr_numa_node(attr);
struct bpf_cgroup_storage_map *map;
- struct bpf_map_memory mem;
- int ret;
if (attr->key_size != sizeof(struct bpf_cgroup_storage_key) &&
attr->key_size != sizeof(__u64))
@@ -309,18 +307,10 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
/* max_entries is not used and enforced to be 0 */
return ERR_PTR(-EINVAL);
- ret = bpf_map_charge_init(&mem, sizeof(struct bpf_cgroup_storage_map));
- if (ret < 0)
- return ERR_PTR(ret);
-
map = kmalloc_node(sizeof(struct bpf_cgroup_storage_map),
__GFP_ZERO | GFP_USER | __GFP_ACCOUNT, numa_node);
- if (!map) {
- bpf_map_charge_finish(&mem);
+ if (!map)
return ERR_PTR(-ENOMEM);
- }
-
- bpf_map_charge_move(&map->map.memory, &mem);
/* copy mandatory map attributes */
bpf_map_init_from_attr(&map->map, attr);
@@ -509,9 +499,6 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
size = bpf_cgroup_storage_calculate_size(map, &pages);
- if (bpf_map_charge_memlock(map, pages))
- return ERR_PTR(-EPERM);
-
storage = kmalloc_node(sizeof(struct bpf_cgroup_storage), gfp,
map->numa_node);
if (!storage)
@@ -533,7 +520,6 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
return storage;
enomem:
- bpf_map_uncharge_memlock(map, pages);
kfree(storage);
return ERR_PTR(-ENOMEM);
}
@@ -560,16 +546,11 @@ void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage)
{
enum bpf_cgroup_storage_type stype;
struct bpf_map *map;
- u32 pages;
if (!storage)
return;
map = &storage->map->map;
-
- bpf_cgroup_storage_calculate_size(map, &pages);
- bpf_map_uncharge_memlock(map, pages);
-
stype = cgroup_storage_type(map);
if (stype == BPF_CGROUP_STORAGE_SHARED)
call_rcu(&storage->rcu, free_shared_cgroup_storage_rcu);
--
2.26.2
Do not use rlimit-based memory accounting for bpf ringbuffer.
It has been replaced with the memcg-based memory accounting.
bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM)
or a valid pointer, so to simplify the code make it return NULL
in the first case. This allows dropping a couple of lines in
ringbuf_map_alloc() and also makes it look similar to other
memory-allocating functions like kmalloc().
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
---
kernel/bpf/ringbuf.c | 24 ++++--------------------
1 file changed, 4 insertions(+), 20 deletions(-)
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index ee5f55d9276e..c8892b58501e 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -48,7 +48,6 @@ struct bpf_ringbuf {
struct bpf_ringbuf_map {
struct bpf_map map;
- struct bpf_map_memory memory;
struct bpf_ringbuf *rb;
};
@@ -135,7 +134,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
if (!rb)
- return ERR_PTR(-ENOMEM);
+ return NULL;
spin_lock_init(&rb->spinlock);
init_waitqueue_head(&rb->waitq);
@@ -151,8 +150,6 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
{
struct bpf_ringbuf_map *rb_map;
- u64 cost;
- int err;
if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
return ERR_PTR(-EINVAL);
@@ -174,26 +171,13 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
bpf_map_init_from_attr(&rb_map->map, attr);
- cost = sizeof(struct bpf_ringbuf_map) +
- sizeof(struct bpf_ringbuf) +
- attr->max_entries;
- err = bpf_map_charge_init(&rb_map->map.memory, cost);
- if (err)
- goto err_free_map;
-
rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
- if (IS_ERR(rb_map->rb)) {
- err = PTR_ERR(rb_map->rb);
- goto err_uncharge;
+ if (!rb_map->rb) {
+ kfree(rb_map);
+ return ERR_PTR(-ENOMEM);
}
return &rb_map->map;
-
-err_uncharge:
- bpf_map_charge_finish(&rb_map->map.memory);
-err_free_map:
- kfree(rb_map);
- return ERR_PTR(err);
}
static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
--
2.26.2
In the absolute majority of cases, if a process is making a kernel
allocation, its memory cgroup is charged.
Bpf maps can be updated from an interrupt context, and in such a case
there is no process which can be charged. This makes the memory
accounting of bpf maps non-trivial.
Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and b87d8cefe43c
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.
To do it, a pointer to the memory cgroup of the process that created
the map is saved, and this cgroup is charged for all
allocations made from an interrupt context.
Allocations made from a process context are accounted in the usual way.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
include/linux/bpf.h | 4 ++++
kernel/bpf/helpers.c | 37 ++++++++++++++++++++++++++++++++++++-
kernel/bpf/syscall.c | 25 +++++++++++++++++++++++++
3 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 581b2a2e78eb..1d6e7b125877 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -37,6 +37,7 @@ struct bpf_iter_aux_info;
struct bpf_local_storage;
struct bpf_local_storage_map;
struct kobject;
+struct mem_cgroup;
extern struct idr btf_idr;
extern spinlock_t btf_idr_lock;
@@ -161,6 +162,9 @@ struct bpf_map {
u32 btf_value_type_id;
struct btf *btf;
struct bpf_map_memory memory;
+#ifdef CONFIG_MEMCG_KMEM
+ struct mem_cgroup *memcg;
+#endif
char name[BPF_OBJ_NAME_LEN];
u32 btf_vmlinux_value_type_id;
bool bypass_spec_v1;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 25520f5eeaf6..b6327cbe7e41 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -14,6 +14,7 @@
#include <linux/jiffies.h>
#include <linux/pid_namespace.h>
#include <linux/proc_ns.h>
+#include <linux/sched/mm.h>
#include "../../lib/kstrtox.h"
@@ -41,11 +42,45 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
.arg2_type = ARG_PTR_TO_MAP_KEY,
};
+#ifdef CONFIG_MEMCG_KMEM
+static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
+ void *value, u64 flags)
+{
+ struct mem_cgroup *old_memcg;
+ bool in_interrupt;
+ int ret;
+
+ /*
+ * If update from an interrupt context results in a memory allocation,
+ * the memory cgroup to charge can't be determined from the context
+ * of the current task. Instead, we charge the memory cgroup, which
+ * contained a process created the map.
+ */
+ in_interrupt = in_interrupt();
+ if (in_interrupt)
+ old_memcg = set_active_memcg(map->memcg);
+
+ ret = map->ops->map_update_elem(map, key, value, flags);
+
+ if (in_interrupt)
+ set_active_memcg(old_memcg);
+
+ return ret;
+}
+#else
+static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
+ void *value, u64 flags)
+{
+ return map->ops->map_update_elem(map, key, value, flags);
+}
+#endif
+
BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
void *, value, u64, flags)
{
WARN_ON_ONCE(!rcu_read_lock_held());
- return map->ops->map_update_elem(map, key, value, flags);
+
+ return __bpf_map_update_elem(map, key, value, flags);
}
const struct bpf_func_proto bpf_map_update_elem_proto = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f3fe9f53f93c..2d77fc2496da 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,7 @@
#include <linux/poll.h>
#include <linux/bpf-netns.h>
#include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
(map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -456,6 +457,27 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
__release(&map_idr_lock);
}
+#ifdef CONFIG_MEMCG_KMEM
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+ map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+ mem_cgroup_put(map->memcg);
+}
+
+#else
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+#endif
+
/* called from workqueue */
static void bpf_map_free_deferred(struct work_struct *work)
{
@@ -464,6 +486,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
bpf_map_charge_move(&mem, &map->memory);
security_bpf_map_free(map);
+ bpf_map_release_memcg(map);
/* implementation dependent freeing */
map->ops->map_free(map);
bpf_map_charge_finish(&mem);
@@ -875,6 +898,8 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map_sec;
+ bpf_map_save_memcg(map);
+
err = bpf_map_new_fd(map, f_flags);
if (err < 0) {
/* failed to allocate fd.
--
2.26.2
Do not use rlimit-based memory accounting for cpumap maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/cpumap.c | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 563f96cc8a9d..7103d89a7d41 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -84,8 +84,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
u32 value_size = attr->value_size;
struct bpf_cpu_map *cmap;
int err = -ENOMEM;
- u64 cost;
- int ret;
if (!bpf_capable())
return ERR_PTR(-EPERM);
@@ -109,26 +107,14 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
goto free_cmap;
}
- /* make sure page count doesn't overflow */
- cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
-
- /* Notice returns -EPERM on if map size is larger than memlock limit */
- ret = bpf_map_charge_init(&cmap->map.memory, cost);
- if (ret) {
- err = ret;
- goto free_cmap;
- }
-
/* Alloc array for possible remote "destination" CPUs */
cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
sizeof(struct bpf_cpu_map_entry *),
cmap->map.numa_node);
if (!cmap->cpu_map)
- goto free_charge;
+ goto free_cmap;
return &cmap->map;
-free_charge:
- bpf_map_charge_finish(&cmap->map.memory);
free_cmap:
kfree(cmap);
return ERR_PTR(err);
--
2.26.2
On 11/17/20 4:40 AM, Roman Gushchin wrote:
> In the absolute majority of cases if a process is making a kernel
> allocation, it's memory cgroup is getting charged.
>
> Bpf maps can be updated from an interrupt context and in such
> case there is no process which can be charged. It makes the memory
> accounting of bpf maps non-trivial.
>
> Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> memcg accounting from interrupt contexts") and b87d8cefe43c
> ("mm, memcg: rework remote charging API to support nesting")
> it's finally possible.
>
> To do it, a pointer to the memory cgroup of the process which created
> the map is saved, and this cgroup is getting charged for all
> allocations made from an interrupt context.
>
> Allocations made from a process context will be accounted in a usual way.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Acked-by: Song Liu <[email protected]>
[...]
>
> +#ifdef CONFIG_MEMCG_KMEM
> +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> + void *value, u64 flags)
> +{
> + struct mem_cgroup *old_memcg;
> + bool in_interrupt;
> + int ret;
> +
> + /*
> + * If update from an interrupt context results in a memory allocation,
> + * the memory cgroup to charge can't be determined from the context
> + * of the current task. Instead, we charge the memory cgroup, which
> + * contained a process created the map.
> + */
> + in_interrupt = in_interrupt();
> + if (in_interrupt)
> + old_memcg = set_active_memcg(map->memcg);
> +
> + ret = map->ops->map_update_elem(map, key, value, flags);
> +
> + if (in_interrupt)
> + set_active_memcg(old_memcg);
> +
> + return ret;
Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
retpoline for lookup/update/delete calls on maps") which removes the indirect
call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
not invoked for the vast majority of cases.
> +}
> +#else
> +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> + void *value, u64 flags)
> +{
> + return map->ops->map_update_elem(map, key, value, flags);
> +}
> +#endif
> +
> BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
> void *, value, u64, flags)
> {
> WARN_ON_ONCE(!rcu_read_lock_held());
> - return map->ops->map_update_elem(map, key, value, flags);
> +
> + return __bpf_map_update_elem(map, key, value, flags);
> }
>
> const struct bpf_func_proto bpf_map_update_elem_proto = {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index f3fe9f53f93c..2d77fc2496da 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -31,6 +31,7 @@
> #include <linux/poll.h>
> #include <linux/bpf-netns.h>
> #include <linux/rcupdate_trace.h>
> +#include <linux/memcontrol.h>
>
> #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
> (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
> @@ -456,6 +457,27 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
> __release(&map_idr_lock);
> }
>
> +#ifdef CONFIG_MEMCG_KMEM
> +static void bpf_map_save_memcg(struct bpf_map *map)
> +{
> + map->memcg = get_mem_cgroup_from_mm(current->mm);
> +}
> +
> +static void bpf_map_release_memcg(struct bpf_map *map)
> +{
> + mem_cgroup_put(map->memcg);
> +}
> +
> +#else
> +static void bpf_map_save_memcg(struct bpf_map *map)
> +{
> +}
> +
> +static void bpf_map_release_memcg(struct bpf_map *map)
> +{
> +}
> +#endif
> +
> /* called from workqueue */
> static void bpf_map_free_deferred(struct work_struct *work)
> {
> @@ -464,6 +486,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
>
> bpf_map_charge_move(&mem, &map->memory);
> security_bpf_map_free(map);
> + bpf_map_release_memcg(map);
> /* implementation dependent freeing */
> map->ops->map_free(map);
> bpf_map_charge_finish(&mem);
> @@ -875,6 +898,8 @@ static int map_create(union bpf_attr *attr)
> if (err)
> goto free_map_sec;
>
> + bpf_map_save_memcg(map);
> +
> err = bpf_map_new_fd(map, f_flags);
> if (err < 0) {
> /* failed to allocate fd.
>
On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
> On 11/17/20 4:40 AM, Roman Gushchin wrote:
> > In the absolute majority of cases if a process is making a kernel
> > allocation, it's memory cgroup is getting charged.
> >
> > Bpf maps can be updated from an interrupt context and in such
> > case there is no process which can be charged. It makes the memory
> > accounting of bpf maps non-trivial.
> >
> > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> > memcg accounting from interrupt contexts") and b87d8cefe43c
> > ("mm, memcg: rework remote charging API to support nesting")
> > it's finally possible.
> >
> > To do it, a pointer to the memory cgroup of the process which created
> > the map is saved, and this cgroup is getting charged for all
> > allocations made from an interrupt context.
> >
> > Allocations made from a process context will be accounted in a usual way.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Acked-by: Song Liu <[email protected]>
> [...]
> > +#ifdef CONFIG_MEMCG_KMEM
> > +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> > + void *value, u64 flags)
> > +{
> > + struct mem_cgroup *old_memcg;
> > + bool in_interrupt;
> > + int ret;
> > +
> > + /*
> > + * If update from an interrupt context results in a memory allocation,
> > + * the memory cgroup to charge can't be determined from the context
> > + * of the current task. Instead, we charge the memory cgroup, which
> > + * contained a process created the map.
> > + */
> > + in_interrupt = in_interrupt();
> > + if (in_interrupt)
> > + old_memcg = set_active_memcg(map->memcg);
> > +
> > + ret = map->ops->map_update_elem(map, key, value, flags);
> > +
> > + if (in_interrupt)
> > + set_active_memcg(old_memcg);
> > +
> > + return ret;
>
> Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
> retpoline for lookup/update/delete calls on maps") which removes the indirect
> call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
> not invoked for the vast majority of cases.
I see. Well, the first option is to move these calls into map-specific update
functions, but the list is relatively long:
nsim_map_update_elem()
cgroup_storage_update_elem()
htab_map_update_elem()
htab_percpu_map_update_elem()
dev_map_update_elem()
dev_map_hash_update_elem()
trie_update_elem()
cpu_map_update_elem()
bpf_pid_task_storage_update_elem()
bpf_fd_inode_storage_update_elem()
bpf_fd_sk_storage_update_elem()
sock_map_update_elem()
xsk_map_update_elem()
Alternatively, we can set the active memcg for the whole duration of bpf
execution. It's simpler, but will add some overhead. Maybe we can somehow
mark programs calling into update helpers and skip all others?
What do you think?
Thanks!
On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:
> On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
> > On 11/17/20 4:40 AM, Roman Gushchin wrote:
> > > In the absolute majority of cases if a process is making a kernel
> > > allocation, it's memory cgroup is getting charged.
> > >
> > > Bpf maps can be updated from an interrupt context and in such
> > > case there is no process which can be charged. It makes the memory
> > > accounting of bpf maps non-trivial.
> > >
> > > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> > > memcg accounting from interrupt contexts") and b87d8cefe43c
> > > ("mm, memcg: rework remote charging API to support nesting")
> > > it's finally possible.
> > >
> > > To do it, a pointer to the memory cgroup of the process which created
> > > the map is saved, and this cgroup is getting charged for all
> > > allocations made from an interrupt context.
> > >
> > > Allocations made from a process context will be accounted in a usual way.
> > >
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > Acked-by: Song Liu <[email protected]>
> > [...]
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> > > + void *value, u64 flags)
> > > +{
> > > + struct mem_cgroup *old_memcg;
> > > + bool in_interrupt;
> > > + int ret;
> > > +
> > > + /*
> > > + * If update from an interrupt context results in a memory allocation,
> > > + * the memory cgroup to charge can't be determined from the context
> > > + * of the current task. Instead, we charge the memory cgroup, which
> > > + * contained a process created the map.
> > > + */
> > > + in_interrupt = in_interrupt();
> > > + if (in_interrupt)
> > > + old_memcg = set_active_memcg(map->memcg);
> > > +
> > > + ret = map->ops->map_update_elem(map, key, value, flags);
> > > +
> > > + if (in_interrupt)
> > > + set_active_memcg(old_memcg);
> > > +
> > > + return ret;
> >
> > Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
> > retpoline for lookup/update/delete calls on maps") which removes the indirect
> > call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
> > not invoked for the vast majority of cases.
>
> I see. Well, the first option is to move these calls into map-specific update
> functions, but the list is relatively long:
> nsim_map_update_elem()
> cgroup_storage_update_elem()
> htab_map_update_elem()
> htab_percpu_map_update_elem()
> dev_map_update_elem()
> dev_map_hash_update_elem()
> trie_update_elem()
> cpu_map_update_elem()
> bpf_pid_task_storage_update_elem()
> bpf_fd_inode_storage_update_elem()
> bpf_fd_sk_storage_update_elem()
> sock_map_update_elem()
> xsk_map_update_elem()
>
> Alternatively, we can set the active memcg for the whole duration of bpf
> execution. It's simpler, but will add some overhead. Maybe we can somehow
> mark programs calling into update helpers and skip all others?
Actually, this is problematic if a program updates several maps, because
in theory they can belong to different cgroups.
So it seems that the first option is the way to go. Do you agree?
On Tue, Nov 17, 2020 at 5:07 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:
> > On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
> > > On 11/17/20 4:40 AM, Roman Gushchin wrote:
> > > > In the absolute majority of cases if a process is making a kernel
> > > > allocation, it's memory cgroup is getting charged.
> > > >
> > > > Bpf maps can be updated from an interrupt context and in such
> > > > case there is no process which can be charged. It makes the memory
> > > > accounting of bpf maps non-trivial.
> > > >
> > > > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> > > > memcg accounting from interrupt contexts") and b87d8cefe43c
> > > > ("mm, memcg: rework remote charging API to support nesting")
> > > > it's finally possible.
> > > >
> > > > To do it, a pointer to the memory cgroup of the process which created
> > > > the map is saved, and this cgroup is getting charged for all
> > > > allocations made from an interrupt context.
> > > >
> > > > Allocations made from a process context will be accounted in a usual way.
> > > >
> > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > Acked-by: Song Liu <[email protected]>
> > > [...]
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> > > > + void *value, u64 flags)
> > > > +{
> > > > + struct mem_cgroup *old_memcg;
> > > > + bool in_interrupt;
> > > > + int ret;
> > > > +
> > > > + /*
> > > > + * If update from an interrupt context results in a memory allocation,
> > > > + * the memory cgroup to charge can't be determined from the context
> > > > + * of the current task. Instead, we charge the memory cgroup, which
> > > > + * contained a process created the map.
> > > > + */
> > > > + in_interrupt = in_interrupt();
> > > > + if (in_interrupt)
> > > > + old_memcg = set_active_memcg(map->memcg);
> > > > +
> > > > + ret = map->ops->map_update_elem(map, key, value, flags);
> > > > +
> > > > + if (in_interrupt)
> > > > + set_active_memcg(old_memcg);
> > > > +
> > > > + return ret;
> > >
> > > Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
> > > retpoline for lookup/update/delete calls on maps") which removes the indirect
> > > call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
> > > not invoked for the vast majority of cases.
> >
> > I see. Well, the first option is to move these calls into map-specific update
> > functions, but the list is relatively long:
> > nsim_map_update_elem()
> > cgroup_storage_update_elem()
> > htab_map_update_elem()
> > htab_percpu_map_update_elem()
> > dev_map_update_elem()
> > dev_map_hash_update_elem()
> > trie_update_elem()
> > cpu_map_update_elem()
> > bpf_pid_task_storage_update_elem()
> > bpf_fd_inode_storage_update_elem()
> > bpf_fd_sk_storage_update_elem()
> > sock_map_update_elem()
> > xsk_map_update_elem()
> >
> > Alternatively, we can set the active memcg for the whole duration of bpf
> > execution. It's simpler, but will add some overhead. Maybe we can somehow
> > mark programs calling into update helpers and skip all others?
>
> Actually, this is problematic if a program updates several maps, because
> in theory they can belong to different cgroups.
> So it seems that the first option is the way to go. Do you agree?
May be instead of kmalloc_node() that is used by most of the map updates
introduce bpf_map_kmalloc_node() that takes a map pointer as an argument?
And do set_memcg inside?
On Tue, Nov 17, 2020 at 05:11:00PM -0800, Alexei Starovoitov wrote:
> On Tue, Nov 17, 2020 at 5:07 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:
> > > On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
> > > > On 11/17/20 4:40 AM, Roman Gushchin wrote:
> > > > > In the absolute majority of cases if a process is making a kernel
> > > > > allocation, it's memory cgroup is getting charged.
> > > > >
> > > > > Bpf maps can be updated from an interrupt context and in such
> > > > > case there is no process which can be charged. It makes the memory
> > > > > accounting of bpf maps non-trivial.
> > > > >
> > > > > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> > > > > memcg accounting from interrupt contexts") and b87d8cefe43c
> > > > > ("mm, memcg: rework remote charging API to support nesting")
> > > > > it's finally possible.
> > > > >
> > > > > To do it, a pointer to the memory cgroup of the process which created
> > > > > the map is saved, and this cgroup is getting charged for all
> > > > > allocations made from an interrupt context.
> > > > >
> > > > > Allocations made from a process context will be accounted in a usual way.
> > > > >
> > > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > > Acked-by: Song Liu <[email protected]>
> > > > [...]
> > > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > > +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> > > > > + void *value, u64 flags)
> > > > > +{
> > > > > + struct mem_cgroup *old_memcg;
> > > > > + bool in_interrupt;
> > > > > + int ret;
> > > > > +
> > > > > + /*
> > > > > + * If update from an interrupt context results in a memory allocation,
> > > > > + * the memory cgroup to charge can't be determined from the context
> > > > > + * of the current task. Instead, we charge the memory cgroup, which
> > > > > + * contained a process created the map.
> > > > > + */
> > > > > + in_interrupt = in_interrupt();
> > > > > + if (in_interrupt)
> > > > > + old_memcg = set_active_memcg(map->memcg);
> > > > > +
> > > > > + ret = map->ops->map_update_elem(map, key, value, flags);
> > > > > +
> > > > > + if (in_interrupt)
> > > > > + set_active_memcg(old_memcg);
> > > > > +
> > > > > + return ret;
> > > >
> > > > Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
> > > > retpoline for lookup/update/delete calls on maps") which removes the indirect
> > > > call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
> > > > not invoked for the vast majority of cases.
> > >
> > > I see. Well, the first option is to move these calls into map-specific update
> > > functions, but the list is relatively long:
> > > nsim_map_update_elem()
> > > cgroup_storage_update_elem()
> > > htab_map_update_elem()
> > > htab_percpu_map_update_elem()
> > > dev_map_update_elem()
> > > dev_map_hash_update_elem()
> > > trie_update_elem()
> > > cpu_map_update_elem()
> > > bpf_pid_task_storage_update_elem()
> > > bpf_fd_inode_storage_update_elem()
> > > bpf_fd_sk_storage_update_elem()
> > > sock_map_update_elem()
> > > xsk_map_update_elem()
> > >
> > > Alternatively, we can set the active memcg for the whole duration of bpf
> > > execution. It's simpler, but will add some overhead. Maybe we can somehow
> > > mark programs calling into update helpers and skip all others?
> >
> > Actually, this is problematic if a program updates several maps, because
> > in theory they can belong to different cgroups.
> > So it seems that the first option is the way to go. Do you agree?
>
> May be instead of kmalloc_node() that is used by most of the map updates
> introduce bpf_map_kmalloc_node() that takes a map pointer as an argument?
> And do set_memcg inside?
I suspect it's not only kmalloc_node(), but if there will be 2-3 allocation
helpers, it sounds like a good idea to me! I'll try and get back with v7 soon.
Thanks!
On 11/18/20 2:28 AM, Roman Gushchin wrote:
> On Tue, Nov 17, 2020 at 05:11:00PM -0800, Alexei Starovoitov wrote:
>> On Tue, Nov 17, 2020 at 5:07 PM Roman Gushchin <[email protected]> wrote:
>>> On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:
>>>> On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
>>>>> On 11/17/20 4:40 AM, Roman Gushchin wrote:
>>>>>> In the absolute majority of cases if a process is making a kernel
>>>>>> allocation, it's memory cgroup is getting charged.
>>>>>>
>>>>>> Bpf maps can be updated from an interrupt context and in such
>>>>>> case there is no process which can be charged. It makes the memory
>>>>>> accounting of bpf maps non-trivial.
>>>>>>
>>>>>> Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
>>>>>> memcg accounting from interrupt contexts") and b87d8cefe43c
>>>>>> ("mm, memcg: rework remote charging API to support nesting")
>>>>>> it's finally possible.
>>>>>>
>>>>>> To do it, a pointer to the memory cgroup of the process which created
>>>>>> the map is saved, and this cgroup is getting charged for all
>>>>>> allocations made from an interrupt context.
>>>>>>
>>>>>> Allocations made from a process context will be accounted in a usual way.
>>>>>>
>>>>>> Signed-off-by: Roman Gushchin <[email protected]>
>>>>>> Acked-by: Song Liu <[email protected]>
>>>>> [...]
>>>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>>>> +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
>>>>>> + void *value, u64 flags)
>>>>>> +{
>>>>>> + struct mem_cgroup *old_memcg;
>>>>>> + bool in_interrupt;
>>>>>> + int ret;
>>>>>> +
>>>>>> + /*
>>>>>> + * If update from an interrupt context results in a memory allocation,
>>>>>> + * the memory cgroup to charge can't be determined from the context
>>>>>> + * of the current task. Instead, we charge the memory cgroup, which
>>>>>> + * contained a process created the map.
>>>>>> + */
>>>>>> + in_interrupt = in_interrupt();
>>>>>> + if (in_interrupt)
>>>>>> + old_memcg = set_active_memcg(map->memcg);
>>>>>> +
>>>>>> + ret = map->ops->map_update_elem(map, key, value, flags);
>>>>>> +
>>>>>> + if (in_interrupt)
>>>>>> + set_active_memcg(old_memcg);
>>>>>> +
>>>>>> + return ret;
>>>>>
>>>>> Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
>>>>> retpoline for lookup/update/delete calls on maps") which removes the indirect
>>>>> call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
>>>>> not invoked for the vast majority of cases.
>>>>
>>>> I see. Well, the first option is to move these calls into map-specific update
>>>> functions, but the list is relatively long:
>>>> nsim_map_update_elem()
>>>> cgroup_storage_update_elem()
>>>> htab_map_update_elem()
>>>> htab_percpu_map_update_elem()
>>>> dev_map_update_elem()
>>>> dev_map_hash_update_elem()
>>>> trie_update_elem()
>>>> cpu_map_update_elem()
>>>> bpf_pid_task_storage_update_elem()
>>>> bpf_fd_inode_storage_update_elem()
>>>> bpf_fd_sk_storage_update_elem()
>>>> sock_map_update_elem()
>>>> xsk_map_update_elem()
>>>>
>>>> Alternatively, we can set the active memcg for the whole duration of bpf
>>>> execution. It's simpler, but will add some overhead. Maybe we can somehow
>>>> mark programs calling into update helpers and skip all others?
>>>
>>> Actually, this is problematic if a program updates several maps, because
>>> in theory they can belong to different cgroups.
>>> So it seems that the first option is the way to go. Do you agree?
>>
>> May be instead of kmalloc_node() that is used by most of the map updates
>> introduce bpf_map_kmalloc_node() that takes a map pointer as an argument?
>> And do set_memcg inside?
>
> I suspect it's not only kmalloc_node(), but if there will be 2-3 allocation
> helpers, it sounds like a good idea to me! I'll try and get back with v7 soon.
Could this be baked into kmalloc*() API itself given we also need to pass in
__GFP_ACCOUNT everywhere, so we'd have a new API with additional argument where
we pass the memcg pointer to tell it directly where to account it for instead of
having to have the extra set_active_memcg() set/restore dance via BPF wrapper?
It seems there would be not much specifics on BPF itself and if it's more generic
it could also be used by other subsystems.
Thanks,
Daniel
On Wed, Nov 18, 2020 at 11:22:55AM +0100, Daniel Borkmann wrote:
> On 11/18/20 2:28 AM, Roman Gushchin wrote:
> > On Tue, Nov 17, 2020 at 05:11:00PM -0800, Alexei Starovoitov wrote:
> > > On Tue, Nov 17, 2020 at 5:07 PM Roman Gushchin <[email protected]> wrote:
> > > > On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:
> > > > > On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
> > > > > > On 11/17/20 4:40 AM, Roman Gushchin wrote:
> > > > > > > In the absolute majority of cases if a process is making a kernel
> > > > > > > allocation, it's memory cgroup is getting charged.
> > > > > > >
> > > > > > > Bpf maps can be updated from an interrupt context and in such
> > > > > > > case there is no process which can be charged. It makes the memory
> > > > > > > accounting of bpf maps non-trivial.
> > > > > > >
> > > > > > > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> > > > > > > memcg accounting from interrupt contexts") and b87d8cefe43c
> > > > > > > ("mm, memcg: rework remote charging API to support nesting")
> > > > > > > it's finally possible.
> > > > > > >
> > > > > > > To do it, a pointer to the memory cgroup of the process which created
> > > > > > > the map is saved, and this cgroup is getting charged for all
> > > > > > > allocations made from an interrupt context.
> > > > > > >
> > > > > > > Allocations made from a process context will be accounted in a usual way.
> > > > > > >
> > > > > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > > > > Acked-by: Song Liu <[email protected]>
> > > > > > [...]
> > > > > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > > > > +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> > > > > > > + void *value, u64 flags)
> > > > > > > +{
> > > > > > > + struct mem_cgroup *old_memcg;
> > > > > > > + bool in_interrupt;
> > > > > > > + int ret;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * If update from an interrupt context results in a memory allocation,
> > > > > > > + * the memory cgroup to charge can't be determined from the context
> > > > > > > + * of the current task. Instead, we charge the memory cgroup, which
> > > > > > > + * contained a process created the map.
> > > > > > > + */
> > > > > > > + in_interrupt = in_interrupt();
> > > > > > > + if (in_interrupt)
> > > > > > > + old_memcg = set_active_memcg(map->memcg);
> > > > > > > +
> > > > > > > + ret = map->ops->map_update_elem(map, key, value, flags);
> > > > > > > +
> > > > > > > + if (in_interrupt)
> > > > > > > + set_active_memcg(old_memcg);
> > > > > > > +
> > > > > > > + return ret;
> > > > > >
> > > > > > Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
> > > > > > retpoline for lookup/update/delete calls on maps") which removes the indirect
> > > > > > call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
> > > > > > not invoked for the vast majority of cases.
> > > > >
> > > > > I see. Well, the first option is to move these calls into map-specific update
> > > > > functions, but the list is relatively long:
> > > > > nsim_map_update_elem()
> > > > > cgroup_storage_update_elem()
> > > > > htab_map_update_elem()
> > > > > htab_percpu_map_update_elem()
> > > > > dev_map_update_elem()
> > > > > dev_map_hash_update_elem()
> > > > > trie_update_elem()
> > > > > cpu_map_update_elem()
> > > > > bpf_pid_task_storage_update_elem()
> > > > > bpf_fd_inode_storage_update_elem()
> > > > > bpf_fd_sk_storage_update_elem()
> > > > > sock_map_update_elem()
> > > > > xsk_map_update_elem()
> > > > >
> > > > > Alternatively, we can set the active memcg for the whole duration of bpf
> > > > > execution. It's simpler, but will add some overhead. Maybe we can somehow
> > > > > mark programs calling into update helpers and skip all others?
> > > >
> > > > Actually, this is problematic if a program updates several maps, because
> > > > in theory they can belong to different cgroups.
> > > > So it seems that the first option is the way to go. Do you agree?
> > >
> > > May be instead of kmalloc_node() that is used by most of the map updates
> > > introduce bpf_map_kmalloc_node() that takes a map pointer as an argument?
> > > And do set_memcg inside?
> >
> > I suspect it's not only kmalloc_node(), but if there will be 2-3 allocation
> > helpers, it sounds like a good idea to me! I'll try and get back with v7 soon.
>
> Could this be baked into kmalloc*() API itself given we also need to pass in
> __GFP_ACCOUNT everywhere, so we'd have a new API with additional argument where
> we pass the memcg pointer to tell it directly where to account it for instead of
> having to have the extra set_active_memcg() set/restore dance via BPF wrapper?
> It seems there would be not much specifics on BPF itself and if it's more generic
> it could also be used by other subsystems.
Actually BPF is the first example of kernel memory accounting from an interrupt
context. There are a few examples where we do indirect charging (charging an
arbitrary memory cgroup, not the current one), but not so many. And usually it's
easier to wrap everything in set_active_memcg() rather than pass a memcg argument
into every function which can do a memory allocation. Also, with !CONFIG_MEMCG or
!CONFIG_MEMCG_KMEM it's easy to compile out set_active_memcg() and everything
memcg-related, but not easy to remove a memcg argument from many functions all
over the code.
In this particular case the only reason why it's not easy to wrap everything in a
set_active_memcg() pair is the function call "inlining", which is very bpf-specific.
So for now I'd go with what Alexei suggested.
Thanks!
Thanks!