Currently bpf is using the memlock rlimit for memory accounting.
This approach has its downsides, and over time it has created a significant
number of problems:
1) The limit is per-user, but because most bpf operations are performed
as root, the limit has little value.
2) It's hard to come up with a specific maximum value, especially because
the counter is shared with non-bpf users (e.g. mlock() users).
Any specific value is either too low, creating false failures,
or too high, making it useless.
3) Charging is not connected to the actual memory allocation. Bpf code
must manually calculate the estimated cost, precharge the counter,
and then take care of uncharging, including on all failure paths.
This adds to the code complexity and makes it easy to leak a charge.
4) There is no simple way of getting the current value of the counter.
We've used drgn for it, but it's far from convenient.
5) A cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case to users.
To overcome these problems, let's switch to memcg-based memory accounting
for bpf objects. With the recent addition of percpu memory accounting, it's
now possible to provide comprehensive accounting of the memory used by bpf
programs and maps.
This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's far more flexible and allows
better control over memory usage by different workloads.
2) The actual memory consumption is taken into account. Charging happens
automatically at allocation time if the __GFP_ACCOUNT flag is passed.
Uncharging is also performed automatically when the memory is released. So
the code on the bpf side becomes simpler and safer (see the sketch below).
3) There is a simple way to get the current value and statistics.
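For illustration, a typical conversion in this series looks roughly like the
following simplified sketch (bpf_map_charge_init()/bpf_map_charge_finish()
are the existing rlimit-based helpers being removed):

	/* before: estimate the cost, precharge the rlimit-based counter,
	 * and remember to uncharge by hand on every failure path
	 */
	err = bpf_map_charge_init(&mem, cost);
	if (err)
		return ERR_PTR(err);
	map = kzalloc(size, GFP_USER | __GFP_NOWARN);
	if (!map) {
		bpf_map_charge_finish(&mem);
		return ERR_PTR(-ENOMEM);
	}

	/* after: the memcg charge happens inside the allocator and is
	 * undone automatically when the memory is freed
	 */
	map = kzalloc(size, GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
	if (!map)
		return ERR_PTR(-ENOMEM);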
The patchset consists of the following parts:
1) memcg-based accounting for various bpf objects: progs and maps
2) removal of the rlimit-based accounting
3) removal of rlimit adjustments in userspace samples
v3:
- dropped the userspace part for further discussions/refinements,
  as suggested by Andrii and Song
v2:
- fixed a build issue caused by the remaining rlimit-based accounting
for sockhash maps
Roman Gushchin (29):
bpf: memcg-based memory accounting for bpf progs
bpf: memcg-based memory accounting for bpf maps
bpf: refine memcg-based memory accounting for arraymap maps
bpf: refine memcg-based memory accounting for cpumap maps
bpf: memcg-based memory accounting for cgroup storage maps
bpf: refine memcg-based memory accounting for devmap maps
bpf: refine memcg-based memory accounting for hashtab maps
bpf: memcg-based memory accounting for lpm_trie maps
bpf: memcg-based memory accounting for bpf ringbuffer
bpf: memcg-based memory accounting for socket storage maps
bpf: refine memcg-based memory accounting for sockmap and sockhash
maps
bpf: refine memcg-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for arraymap maps
bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
bpf: eliminate rlimit-based memory accounting for cpumap maps
bpf: eliminate rlimit-based memory accounting for cgroup storage maps
bpf: eliminate rlimit-based memory accounting for devmap maps
bpf: eliminate rlimit-based memory accounting for hashtab maps
bpf: eliminate rlimit-based memory accounting for lpm_trie maps
bpf: eliminate rlimit-based memory accounting for queue_stack_maps
maps
bpf: eliminate rlimit-based memory accounting for reuseport_array maps
bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
bpf: eliminate rlimit-based memory accounting for sockmap and sockhash
maps
bpf: eliminate rlimit-based memory accounting for stackmap maps
bpf: eliminate rlimit-based memory accounting for socket storage maps
bpf: eliminate rlimit-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting infra for bpf maps
bpf: eliminate rlimit-based memory accounting for bpf progs
bpf: samples: do not touch RLIMIT_MEMLOCK
include/linux/bpf.h | 23 ---
kernel/bpf/arraymap.c | 30 +---
kernel/bpf/bpf_struct_ops.c | 19 +--
kernel/bpf/core.c | 20 +--
kernel/bpf/cpumap.c | 20 +--
kernel/bpf/devmap.c | 23 +--
kernel/bpf/hashtab.c | 33 +---
kernel/bpf/local_storage.c | 38 ++---
kernel/bpf/lpm_trie.c | 17 +-
kernel/bpf/queue_stack_maps.c | 16 +-
kernel/bpf/reuseport_array.c | 12 +-
kernel/bpf/ringbuf.c | 33 ++--
kernel/bpf/stackmap.c | 16 +-
kernel/bpf/syscall.c | 152 ++----------------
net/core/bpf_sk_storage.c | 23 +--
net/core/sock_map.c | 40 ++---
net/xdp/xskmap.c | 13 +-
samples/bpf/map_perf_test_user.c | 11 --
samples/bpf/offwaketime_user.c | 2 -
samples/bpf/sockex2_user.c | 2 -
samples/bpf/sockex3_user.c | 2 -
samples/bpf/spintest_user.c | 2 -
samples/bpf/syscall_tp_user.c | 2 -
samples/bpf/task_fd_query_user.c | 5 -
samples/bpf/test_lru_dist.c | 3 -
samples/bpf/test_map_in_map_user.c | 9 --
samples/bpf/test_overhead_user.c | 2 -
samples/bpf/trace_event_user.c | 2 -
samples/bpf/tracex2_user.c | 6 -
samples/bpf/tracex3_user.c | 6 -
samples/bpf/tracex4_user.c | 6 -
samples/bpf/tracex5_user.c | 3 -
samples/bpf/tracex6_user.c | 3 -
samples/bpf/xdp1_user.c | 6 -
samples/bpf/xdp_adjust_tail_user.c | 6 -
samples/bpf/xdp_monitor_user.c | 6 -
samples/bpf/xdp_redirect_cpu_user.c | 6 -
samples/bpf/xdp_redirect_map_user.c | 6 -
samples/bpf/xdp_redirect_user.c | 6 -
samples/bpf/xdp_router_ipv4_user.c | 6 -
samples/bpf/xdp_rxq_info_user.c | 6 -
samples/bpf/xdp_sample_pkts_user.c | 6 -
samples/bpf/xdp_tx_iptunnel_user.c | 6 -
samples/bpf/xdpsock_user.c | 7 -
.../selftests/bpf/progs/map_ptr_kern.c | 5 -
45 files changed, 94 insertions(+), 572 deletions(-)
--
2.26.2
Include metadata and percpu data into the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/cpumap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index f1c46529929b..74ae9fcbe82e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -99,7 +99,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
attr->map_flags & ~BPF_F_NUMA_NODE)
return ERR_PTR(-EINVAL);
- cmap = kzalloc(sizeof(*cmap), GFP_USER);
+ cmap = kzalloc(sizeof(*cmap), GFP_USER | __GFP_ACCOUNT);
if (!cmap)
return ERR_PTR(-ENOMEM);
@@ -418,7 +418,7 @@ static struct bpf_cpu_map_entry *
__cpu_map_entry_alloc(struct bpf_cpumap_val *value, u32 cpu, int map_id)
{
int numa, err, i, fd = value->bpf_prog.fd;
- gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
+ gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_NOWARN;
struct bpf_cpu_map_entry *rcpu;
struct xdp_bulk_queue *bq;
--
2.26.2
Extend xskmap memory accounting to include the memory taken by
the xsk_map_node structure.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
net/xdp/xskmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 8367adbbe9df..e574b22defe5 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -28,7 +28,8 @@ static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
struct xsk_map_node *node;
int err;
- node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
+ node = kzalloc(sizeof(*node),
+ GFP_ATOMIC | __GFP_NOWARN | __GFP_ACCOUNT);
if (!node)
return ERR_PTR(-ENOMEM);
--
2.26.2
Do not use rlimit-based memory accounting for bpf_struct_ops maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/bpf_struct_ops.c | 19 +++----------------
1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 969c5d47f81f..22bfa236683b 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -550,12 +550,10 @@ static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
{
const struct bpf_struct_ops *st_ops;
- size_t map_total_size, st_map_size;
+ size_t st_map_size;
struct bpf_struct_ops_map *st_map;
const struct btf_type *t, *vt;
- struct bpf_map_memory mem;
struct bpf_map *map;
- int err;
if (!bpf_capable())
return ERR_PTR(-EPERM);
@@ -575,20 +573,11 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
* struct bpf_struct_ops_tcp_congestions_ops
*/
(vt->size - sizeof(struct bpf_struct_ops_value));
- map_total_size = st_map_size +
- /* uvalue */
- sizeof(vt->size) +
- /* struct bpf_progs **progs */
- btf_type_vlen(t) * sizeof(struct bpf_prog *);
- err = bpf_map_charge_init(&mem, map_total_size);
- if (err < 0)
- return ERR_PTR(err);
st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
- if (!st_map) {
- bpf_map_charge_finish(&mem);
+ if (!st_map)
return ERR_PTR(-ENOMEM);
- }
+
st_map->st_ops = st_ops;
map = &st_map->map;
@@ -599,14 +588,12 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
if (!st_map->uvalue || !st_map->progs || !st_map->image) {
bpf_struct_ops_map_free(map);
- bpf_map_charge_finish(&mem);
return ERR_PTR(-ENOMEM);
}
mutex_init(&st_map->lock);
set_vm_flush_reset_perms(st_map->image);
bpf_map_init_from_attr(map, attr);
- bpf_map_charge_move(&map->memory, &mem);
return map;
}
--
2.26.2
Do not use rlimit-based memory accounting for queue_stack maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/queue_stack_maps.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 44184f82916a..92e73c35a34a 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -66,29 +66,21 @@ static int queue_stack_map_alloc_check(union bpf_attr *attr)
static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
{
- int ret, numa_node = bpf_map_attr_numa_node(attr);
- struct bpf_map_memory mem = {0};
+ int numa_node = bpf_map_attr_numa_node(attr);
struct bpf_queue_stack *qs;
- u64 size, queue_size, cost;
+ u64 size, queue_size;
size = (u64) attr->max_entries + 1;
- cost = queue_size = sizeof(*qs) + size * attr->value_size;
-
- ret = bpf_map_charge_init(&mem, cost);
- if (ret < 0)
- return ERR_PTR(ret);
+ queue_size = sizeof(*qs) + size * attr->value_size;
qs = bpf_map_area_alloc(queue_size, numa_node);
- if (!qs) {
- bpf_map_charge_finish(&mem);
+ if (!qs)
return ERR_PTR(-ENOMEM);
- }
memset(qs, 0, sizeof(*qs));
bpf_map_init_from_attr(&qs->map, attr);
- bpf_map_charge_move(&qs->map.memory, &mem);
qs->size = size;
raw_spin_lock_init(&qs->lock);
--
2.26.2
Include percpu arrays and auxiliary data into the memcg-based memory
accounting.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/arraymap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 8ff419b632a6..9597fecff8da 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -28,12 +28,12 @@ static void bpf_array_free_percpu(struct bpf_array *array)
static int bpf_array_alloc_percpu(struct bpf_array *array)
{
+ const gfp_t gfp = GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT;
void __percpu *ptr;
int i;
for (i = 0; i < array->map.max_entries; i++) {
- ptr = __alloc_percpu_gfp(array->elem_size, 8,
- GFP_USER | __GFP_NOWARN);
+ ptr = __alloc_percpu_gfp(array->elem_size, 8, gfp);
if (!ptr) {
bpf_array_free_percpu(array);
return -ENOMEM;
@@ -969,7 +969,7 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
struct bpf_array_aux *aux;
struct bpf_map *map;
- aux = kzalloc(sizeof(*aux), GFP_KERNEL);
+ aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
if (!aux)
return ERR_PTR(-ENOMEM);
--
2.26.2
On 7/30/20 11:22 PM, Roman Gushchin wrote:
> Currently bpf is using the memlock rlimit for memory accounting.
> This approach has its downsides, and over time it has created a significant
> number of problems:
>
> 1) The limit is per-user, but because most bpf operations are performed
> as root, the limit has little value.
>
> 2) It's hard to come up with a specific maximum value, especially because
> the counter is shared with non-bpf users (e.g. mlock() users).
> Any specific value is either too low, creating false failures,
> or too high, making it useless.
>
> 3) Charging is not connected to the actual memory allocation. Bpf code
> must manually calculate the estimated cost, precharge the counter,
> and then take care of uncharging, including on all failure paths.
> This adds to the code complexity and makes it easy to leak a charge.
>
> 4) There is no simple way of getting the current value of the counter.
> We've used drgn for it, but it's far from convenient.
>
> 5) A cryptic -EPERM is returned on exceeding the limit. Libbpf even had
> a function to "explain" this case to users.
>
> To overcome these problems, let's switch to memcg-based memory accounting
> for bpf objects. With the recent addition of percpu memory accounting, it's
> now possible to provide comprehensive accounting of the memory used by bpf
> programs and maps.
>
> This approach has the following advantages:
> 1) The limit is per-cgroup and hierarchical. It's far more flexible and allows
> better control over memory usage by different workloads.
>
> 2) The actual memory consumption is taken into account. Charging happens
> automatically at allocation time if the __GFP_ACCOUNT flag is passed.
> Uncharging is also performed automatically when the memory is released. So
> the code on the bpf side becomes simpler and safer.
>
> 3) There is a simple way to get the current value and statistics.
>
> The patchset consists of the following parts:
> 1) memcg-based accounting for various bpf objects: progs and maps
> 2) removal of the rlimit-based accounting
> 3) removal of rlimit adjustments in userspace samples
The diff stat looks nice & I agree that rlimit sucks, but I'm missing how this set
is supposed to work reliably; at least I currently fail to see it. Elaborating on this
in more depth, especially for the case of unprivileged users, should be a /fundamental/
part of the commit message.
Let's take an example: an unprivileged user adds a max sized hashtable to one of its
programs, and configures the map so that it will perform runtime allocation. The load
succeeds as it doesn't surpass the limits set for the current memcg. The kernel then
processes packets from softirq. Given the runtime allocations, we end up mischarging
to whoever ended up triggering __do_softirq(). If that is, for example, the ksoftirqd
thread, then it's probably reasonable to assume that this might not be accounted, e.g.
limits are not imposed on the root cgroup. If so, we would probably need to drag the
context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably.
Otherwise, how do you prevent unprivileged users from OOMing the machine?
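(For reference, the runtime allocation in question looks roughly like this,
simplified from the hashtab update path in kernel/bpf/hashtab.c:)

	/* called via bpf_map_update_elem(), potentially from softirq; with
	 * __GFP_ACCOUNT added, the charge goes to the memcg of current,
	 * i.e. whatever task happened to be interrupted
	 */
	l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
			     htab->map.numa_node);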
Similarly, what happens to unprivileged users if kmemcg was not configured into the
kernel or has been disabled?
Thanks,
Daniel
On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:
> On 7/30/20 11:22 PM, Roman Gushchin wrote:
> > [...]
Hi Daniel,
>
> The diff stat looks nice & I agree that rlimit sucks, but I'm missing how this set
> is supposed to work reliably; at least I currently fail to see it. Elaborating on this
> in more depth, especially for the case of unprivileged users, should be a /fundamental/
> part of the commit message.
>
> Let's take an example: an unprivileged user adds a max sized hashtable to one of its
> programs, and configures the map so that it will perform runtime allocation. The load
> succeeds as it doesn't surpass the limits set for the current memcg. The kernel then
> processes packets from softirq. Given the runtime allocations, we end up mischarging
> to whoever ended up triggering __do_softirq(). If that is, for example, the ksoftirqd
> thread, then it's probably reasonable to assume that this might not be accounted, e.g.
> limits are not imposed on the root cgroup. If so, we would probably need to drag the
> context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably.
> Otherwise, how do you prevent unprivileged users from OOMing the machine?
this is a valid concern, thank you for bringing it up. It can be resolved by
associating a map with a memory cgroup on creation, so that we can charge
this memory cgroup later, even from a soft-irq context. The question here is
whether we want to do it for all maps, or just for dynamic hashtables
(or any similar cases, if there are any). I think the second option
is better. With the first option we have to annotate all memory allocations
in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
so it's easy to mess it up in the future.
What do you think?
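A rough sketch of what I mean (the memcg pointer in struct bpf_map is
hypothetical at this point, and the current memalloc_use_memcg() helpers
are process-context only, so softirq support would still need to be added):

	/* at map creation time, remember the memcg of the creating task;
	 * the reference would be dropped with mem_cgroup_put() on map free
	 */
	map->memcg = get_mem_cgroup_from_mm(current->mm);

	/* later, e.g. on a map update from softirq context, charge the
	 * map's memcg instead of whatever task was interrupted
	 */
	memalloc_use_memcg(map->memcg);
	l_new = kmalloc_node(htab->elem_size,
			     GFP_ATOMIC | __GFP_NOWARN | __GFP_ACCOUNT,
			     htab->map.numa_node);
	memalloc_unuse_memcg();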
>
> Similarly, what happens to unprivileged users if kmemcg was not configured into the
> kernel or has been disabled?
Well, I don't think we can address it. Memcg-based memory accounting requires
enabled memory cgroups, a properly configured cgroup tree and also the kernel
memory accounting turned on to function properly.
Because we at Facebook are using cgroup for the memory accounting and control
everywhere, I might be biased. If there are real !memcg systems which are
actively using non-privileged bpf, we should keep the old system in place
and make it optional, so everyone can choose between having both accounting
systems or just the new one. Or we can disable the rlimit-based accounting
for root. But eliminating it completely looks so much nicer to me.
Thanks!
On 8/3/20 5:34 PM, Roman Gushchin wrote:
> On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:
>> On 7/30/20 11:22 PM, Roman Gushchin wrote:
>>> [...]
>
>> The diff stat looks nice & I agree that rlimit sucks, but I'm missing how this set
>> is supposed to work reliably; at least I currently fail to see it. Elaborating on this
>> in more depth, especially for the case of unprivileged users, should be a /fundamental/
>> part of the commit message.
>>
>> Let's take an example: an unprivileged user adds a max sized hashtable to one of its
>> programs, and configures the map so that it will perform runtime allocation. The load
>> succeeds as it doesn't surpass the limits set for the current memcg. The kernel then
>> processes packets from softirq. Given the runtime allocations, we end up mischarging
>> to whoever ended up triggering __do_softirq(). If that is, for example, the ksoftirqd
>> thread, then it's probably reasonable to assume that this might not be accounted, e.g.
>> limits are not imposed on the root cgroup. If so, we would probably need to drag the
>> context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably.
>> Otherwise, how do you prevent unprivileged users from OOMing the machine?
>
> this is a valid concern, thank you for bringing it up. It can be resolved by
> associating a map with a memory cgroup on creation, so that we can charge
> this memory cgroup later, even from a soft-irq context. The question here is
> whether we want to do it for all maps, or just for dynamic hashtables
> (or any similar cases, if there are any). I think the second option
> is better. With the first option we have to annotate all memory allocations
> in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
> so it's easy to mess it up in the future.
> What do you think?
We would need to do it for all maps that are configured with non-prealloc, e.g. not
only hash/LRU tables but also others like LPM maps etc. I wonder whether program entry/
exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then everything
would be accounted against the prog's memcg from the runtime side, but then there's the
usual issue with 'unuse'-restore on tail calls, and it doesn't solve the syscall side.
But it seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight
anyway compared to some of the other map update work such as taking the bucket lock etc.
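(As a sketch of the entry/exit idea; prog->aux->memcg does not exist today
and is purely hypothetical:)

	/* switch the active memcg around program execution so that runtime
	 * allocations get charged to the prog's memcg; tail calls make the
	 * restore part tricky
	 */
	memalloc_use_memcg(prog->aux->memcg);
	ret = BPF_PROG_RUN(prog, ctx);
	memalloc_unuse_memcg();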
>> Similarly, what happens to unprivileged users if kmemcg was not configured into the
>> kernel or has been disabled?
>
> Well, I don't think we can address it. Memcg-based memory accounting requires
> enabled memory cgroups, a properly configured cgroup tree and also the kernel
> memory accounting turned on to function properly.
> Because we at Facebook are using cgroup for the memory accounting and control
> everywhere, I might be biased. If there are real !memcg systems which are
> actively using non-privileged bpf, we should keep the old system in place
> and make it optional, so everyone can choose between having both accounting
> systems or just the new one. Or we can disable the rlimit-based accounting
> for root. But eliminating it completely looks so much nicer to me.
Eliminating it entirely feels better indeed. Another option could be that BPF kconfig
would select memcg, so it's always built with it. Perhaps that is an acceptable tradeoff.
Thanks,
Daniel
On Mon, Aug 03, 2020 at 06:39:01PM +0200, Daniel Borkmann wrote:
> On 8/3/20 5:34 PM, Roman Gushchin wrote:
> > On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:
> > > On 7/30/20 11:22 PM, Roman Gushchin wrote:
> > > > [...]
> >
> > > The diff stat looks nice & I agree that rlimit sucks, but I'm missing how this set
> > > is supposed to work reliably; at least I currently fail to see it. Elaborating on this
> > > in more depth, especially for the case of unprivileged users, should be a /fundamental/
> > > part of the commit message.
> > >
> > > Let's take an example: an unprivileged user adds a max sized hashtable to one of its
> > > programs, and configures the map so that it will perform runtime allocation. The load
> > > succeeds as it doesn't surpass the limits set for the current memcg. The kernel then
> > > processes packets from softirq. Given the runtime allocations, we end up mischarging
> > > to whoever ended up triggering __do_softirq(). If that is, for example, the ksoftirqd
> > > thread, then it's probably reasonable to assume that this might not be accounted, e.g.
> > > limits are not imposed on the root cgroup. If so, we would probably need to drag the
> > > context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably.
> > > Otherwise, how do you prevent unprivileged users from OOMing the machine?
> >
> > this is a valid concern, thank you for bringing it up. It can be resolved by
> > associating a map with a memory cgroup on creation, so that we can charge
> > this memory cgroup later, even from a soft-irq context. The question here is
> > whether we want to do it for all maps, or just for dynamic hashtables
> > (or any similar cases, if there are any). I think the second option
> > is better. With the first option we have to annotate all memory allocations
> > in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
> > so it's easy to mess it up in the future.
> > What do you think?
>
> We would need to do it for all maps that are configured with non-prealloc, e.g. not
> only hash/LRU tables but also others like LPM maps etc. I wonder whether program entry/
> exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then everything
> would be accounted against the prog's memcg from the runtime side, but then there's the
> usual issue with 'unuse'-restore on tail calls, and it doesn't solve the syscall side.
> But it seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight
> anyway compared to some of the other map update work such as taking the bucket lock etc.
I'll explore it and address it in the next version. Thank you for the suggestions!
>
> > > Similarly, what happens to unprivileged users if kmemcg was not configured into the
> > > kernel or has been disabled?
> >
> > Well, I don't think we can address it. Memcg-based memory accounting requires
> > enabled memory cgroups, a properly configured cgroup tree and also the kernel
> > memory accounting turned on to function properly.
> > Because we at Facebook are using cgroup for the memory accounting and control
> > everywhere, I might be biased. If there are real !memcg systems which are
> > actively using non-privileged bpf, we should keep the old system in place
> > and make it optional, so everyone can choose between having both accounting
> > systems or just the new one. Or we can disable the rlimit-based accounting
> > for root. But eliminating it completely looks so much nicer to me.
>
> Eliminating it entirely feels better indeed. Another option could be that BPF kconfig
> would select memcg, so it's always built with it. Perhaps that is an acceptable tradeoff.
But wouldn't it limit the usage of bpf on embedded devices, where memory
cgroups are probably not used but bpf can still be useful, e.g. for tracing?
Adding this build dependency doesn't really guarantee anything (e.g. cgroupfs
can simply not be mounted on the system), so I'm not sure if we really need it.
Maybe we can print a warning if memcg is not properly configured and somebody
is creating a map? Idk.
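Something like this on the map creation path, as a rough sketch:

	/* hypothetical: warn once if kmem accounting is unavailable, since
	 * the map memory would then be effectively unlimited
	 */
	if (!memcg_kmem_enabled())
		pr_warn_once("bpf: memcg-based memory accounting is unavailable\n");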
Thanks!
On 8/3/20 7:05 PM, Roman Gushchin wrote:
> On Mon, Aug 03, 2020 at 06:39:01PM +0200, Daniel Borkmann wrote:
>> On 8/3/20 5:34 PM, Roman Gushchin wrote:
>>> On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:
>>>> On 7/30/20 11:22 PM, Roman Gushchin wrote:
>>>>> [...]
>>>
>>>> The diff stat looks nice & I agree that rlimit sucks, but I'm missing how this set
>>>> is supposed to work reliably; at least I currently fail to see it. Elaborating on this
>>>> in more depth, especially for the case of unprivileged users, should be a /fundamental/
>>>> part of the commit message.
>>>>
>>>> Let's take an example: an unprivileged user adds a max sized hashtable to one of its
>>>> programs, and configures the map so that it will perform runtime allocation. The load
>>>> succeeds as it doesn't surpass the limits set for the current memcg. The kernel then
>>>> processes packets from softirq. Given the runtime allocations, we end up mischarging
>>>> to whoever ended up triggering __do_softirq(). If that is, for example, the ksoftirqd
>>>> thread, then it's probably reasonable to assume that this might not be accounted, e.g.
>>>> limits are not imposed on the root cgroup. If so, we would probably need to drag the
>>>> context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably.
>>>> Otherwise, how do you prevent unprivileged users from OOMing the machine?
>>>
>>> this is a valid concern, thank you for bringing it up. It can be resolved by
>>> associating a map with a memory cgroup on creation, so that we can charge
>>> this memory cgroup later, even from a soft-irq context. The question here is
>>> whether we want to do it for all maps, or just for dynamic hashtables
>>> (or any similar cases, if there are any). I think the second option
>>> is better. With the first option we have to annotate all memory allocations
>>> in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
>>> so it's easy to mess it up in the future.
>>> What do you think?
>>
>> We would need to do it for all maps that are configured with non-prealloc, e.g. not
>> only hash/LRU tables but also others like LPM maps etc. I wonder whether program entry/
>> exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then everything
>> would be accounted against the prog's memcg from the runtime side, but then there's the
>> usual issue with 'unuse'-restore on tail calls, and it doesn't solve the syscall side.
>> But it seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight
>> anyway compared to some of the other map update work such as taking the bucket lock etc.
>
> I'll explore it and address it in the next version. Thank you for the suggestions!
Ok.
I'm probably still missing one more thing, but could you elaborate on what limits would
be enforced if an unprivileged user creates a prog/map on the host (w/o further action
such as moving to a specific cgroup)?
From what I can tell via looking at systemd:
$ cat /proc/self/cgroup
11:cpuset:/
10:hugetlb:/
9:devices:/user.slice
8:cpu,cpuacct:/
7:freezer:/
6:pids:/user.slice/user-1000.slice/[email protected]
5:memory:/user.slice/user-1000.slice/[email protected]
4:net_cls,net_prio:/
3:perf_event:/
2:blkio:/
1:name=systemd:/user.slice/user-1000.slice/[email protected]/gnome-terminal-server.service
0::/user.slice/user-1000.slice/[email protected]/gnome-terminal-server.service
And then:
$ systemctl cat user-1000.slice
# /usr/lib/systemd/system/user-.slice.d/10-defaults.conf
# SPDX-License-Identifier: LGPL-2.1+
#
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
[Unit]
Description=User Slice of UID %j
Documentation=man:[email protected](5)
After=systemd-user-sessions.service
StopWhenUnneeded=yes
[Slice]
TasksMax=33%
So that has a PID limit in place by default, but it does not say anything about memory. I
presume the accounting relevant to us is tracked in memory.kmem.limit_in_bytes and
memory.kmem.usage_in_bytes, is that correct? If true, it looks like the default would
not prevent an OOM, no?
$ cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/[email protected]/memory.kmem.usage_in_bytes
257966080
$ cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/[email protected]/memory.kmem.limit_in_bytes
9223372036854771712
>>>> Similarly, what happens to unprivileged users if kmemcg was not configured into the
>>>> kernel or has been disabled?
>>>
>>> Well, I don't think we can address it. Memcg-based memory accounting requires
>>> enabled memory cgroups, a properly configured cgroup tree and also the kernel
>>> memory accounting turned on to function properly.
>>> Because we at Facebook are using cgroup for the memory accounting and control
>>> everywhere, I might be biased. If there are real !memcg systems which are
>>> actively using non-privileged bpf, we should keep the old system in place
>>> and make it optional, so everyone can choose between having both accounting
>>> systems or just the new one. Or we can disable the rlimit-based accounting
>>> for root. But eliminating it completely looks so much nicer to me.
>>
>> Eliminating it entirely feels better indeed. Another option could be that BPF kconfig
>> would select memcg, so it's always built with it. Perhaps that is an acceptable tradeoff.
>
> But wouldn't it limit the usage of bpf on embedded devices, where memory
> cgroups are probably not used but bpf can still be useful, e.g. for tracing?
>
> Adding this build dependency doesn't really guarantee anything (e.g. cgroupfs
> can simply not be mounted on the system), so I'm not sure if we really need it.
Argh, true as well. :/ Is there some fallback accounting/limitation that could be done,
either explicitly or, ideally, hidden via __GFP_ACCOUNT, for unprivileged users? We still
need to prevent unprivileged users from easily causing OOM damage in those situations, too.
Thanks,
Daniel
On Mon, Aug 03, 2020 at 08:37:15PM +0200, Daniel Borkmann wrote:
> On 8/3/20 7:05 PM, Roman Gushchin wrote:
> > On Mon, Aug 03, 2020 at 06:39:01PM +0200, Daniel Borkmann wrote:
> > > On 8/3/20 5:34 PM, Roman Gushchin wrote:
> > > > On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:
> > > > > On 7/30/20 11:22 PM, Roman Gushchin wrote:
> > > > > > [...]
> > > >
> > > > > The diff stat looks nice & I agree that rlimit sucks, but I'm missing how this set
> > > > > is supposed to work reliably; at least I currently fail to see it. Elaborating on this
> > > > > in more depth, especially for the case of unprivileged users, should be a /fundamental/
> > > > > part of the commit message.
> > > > >
> > > > > Let's take an example: an unprivileged user adds a max sized hashtable to one of its
> > > > > programs, and configures the map so that it will perform runtime allocation. The load
> > > > > succeeds as it doesn't surpass the limits set for the current memcg. The kernel then
> > > > > processes packets from softirq. Given the runtime allocations, we end up mischarging
> > > > > to whoever ended up triggering __do_softirq(). If that is, for example, the ksoftirqd
> > > > > thread, then it's probably reasonable to assume that this might not be accounted, e.g.
> > > > > limits are not imposed on the root cgroup. If so, we would probably need to drag the
> > > > > context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably.
> > > > > Otherwise, how do you prevent unprivileged users from OOMing the machine?
> > > >
> > > > this is a valid concern, thank you for bringing it up. It can be resolved by
> > > > associating a map with a memory cgroup on creation, so that we can charge
> > > > this memory cgroup later, even from a soft-irq context. The question here is
> > > > whether we want to do it for all maps, or just for dynamic hashtables
> > > > (or any similar cases, if there are any). I think the second option
> > > > is better. With the first option we have to annotate all memory allocations
> > > > in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
> > > > so it's easy to mess it up in the future.
> > > > What do you think?
> > >
> > > We would need to do it for all maps that are configured with non-prealloc, e.g. not
> > > only hash/LRU tables but also others like LPM maps etc. I wonder whether program entry/
> > > exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then everything
> > > would be accounted against the prog's memcg from the runtime side, but then there's the
> > > usual issue with 'unuse'-restore on tail calls, and it doesn't solve the syscall side.
> > > But it seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight
> > > anyway compared to some of the other map update work such as taking the bucket lock etc.
> >
> > I'll explore it and address it in the next version. Thank you for the suggestions!
>
> Ok.
>
> I'm probably still missing one more thing, but could you elaborate on what limits would
> be enforced if an unprivileged user creates a prog/map on the host (w/o further action
> such as moving to a specific cgroup)?
If cgroups are not configured properly, no limits can be enforced. Memory cgroups
are completely orthogonal to users.
However, in the most common case (at least in our setup), where all bpf operations
are performed by root, per-user accounting is useless.
>
> From what I can tell via looking at systemd:
>
> $ cat /proc/self/cgroup
> 11:cpuset:/
> 10:hugetlb:/
> 9:devices:/user.slice
> 8:cpu,cpuacct:/
> 7:freezer:/
> 6:pids:/user.slice/user-1000.slice/[email protected]
> 5:memory:/user.slice/user-1000.slice/[email protected]
> 4:net_cls,net_prio:/
> 3:perf_event:/
> 2:blkio:/
> 1:name=systemd:/user.slice/user-1000.slice/[email protected]/gnome-terminal-server.service
> 0::/user.slice/user-1000.slice/[email protected]/gnome-terminal-server.service
>
> And then:
>
> $ systemctl cat user-1000.slice
> # /usr/lib/systemd/system/user-.slice.d/10-defaults.conf
> # SPDX-License-Identifier: LGPL-2.1+
> #
> # This file is part of systemd.
> #
> # systemd is free software; you can redistribute it and/or modify it
> # under the terms of the GNU Lesser General Public License as published by
> # the Free Software Foundation; either version 2.1 of the License, or
> # (at your option) any later version.
>
> [Unit]
> Description=User Slice of UID %j
> Documentation=man:[email protected](5)
> After=systemd-user-sessions.service
> StopWhenUnneeded=yes
>
> [Slice]
> TasksMax=33%
>
> So that has a PID limit in place by default, but it does not say anything about memory. I
> presume the accounting relevant to us is tracked in memory.kmem.limit_in_bytes and
> memory.kmem.usage_in_bytes, is that correct? If true, it looks like the default would
> not prevent an OOM, no?
Yeah, it's true.
Also, in general we're moving from setting hard limits to pressure-based oom handling,
where we detect continuously high memory pressure in a cgroup using PSI metrics
and handle it in userspace. Memory.high can be used to slow down the workload
to avoid extensively overshooting the limit.
So the most important "feature" is that bpf memory is accounted in
memory.current (cgroup v2) and memory.usage_in_bytes (cgroup v1).
>
> $ cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/[email protected]/memory.kmem.usage_in_bytes
> 257966080
> $ cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/[email protected]/memory.kmem.limit_in_bytes
> 9223372036854771712
>
> > > > > Similarly, what happens to unprivileged users if kmemcg was not configured into the
> > > > > kernel or has been disabled?
> > > >
> > > > Well, I don't think we can address it. Memcg-based memory accounting requires
> > > > enabled memory cgroups, a properly configured cgroup tree and also the kernel
> > > > memory accounting turned on to function properly.
> > > > Because we at Facebook are using cgroup for the memory accounting and control
> > > > everywhere, I might be biased. If there are real !memcg systems which are
> > > > actively using non-privileged bpf, we should keep the old system in place
> > > > and make it optional, so everyone can choose between having both accounting
> > > > systems or just the new one. Or we can disable the rlimit-based accounting
> > > > for root. But eliminating it completely looks so much nicer to me.
> > >
> > > Eliminating it entirely feels better indeed. Another option could be that BPF kconfig
> > > would select memcg, so it's always built with it. Perhaps that is an acceptable tradeoff.
> >
> > But wouldn't it limit the usage of bpf on embedded devices, where memory
> > cgroups are probably not used but bpf can still be useful, e.g. for tracing?
> >
> > Adding this build dependency doesn't really guarantee anything (e.g. cgroupfs
> > can simply not be mounted on the system), so I'm not sure if we really need it.
>
> Argh, true as well. :/ Is there some fallback accounting/limitation that could be done,
> either explicitly or, ideally, hidden via __GFP_ACCOUNT, for unprivileged users? We still
> need to prevent unprivileged users from easily causing OOM damage in those situations, too.
Users and memory cgroups are orthogonal, so an unprivileged user can have a process
in the root memory cgroup, and it shouldn't be limited. And the opposite: a root process
in a non-root memory cgroup might be limited. We can't really emulate the old semantics
using cgroups.
But I'm not sure if it's a problem: there are other ways to exhaust (kernel) memory
besides bpf. So if a user is not limited by a memory cgroup with kernel memory
accounting enabled, it's not completely safe anyway.
If we want to preserve the old behavior, I think the best thing is to keep it as it is
and only add an option (sysctl?) to disable it, which everybody who relies on cgroups
can do to avoid all this hassle with rlimits.
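As a sketch, assuming a hypothetical knob (the name is made up):

	/* hypothetical sysctl making the rlimit-based charging optional */
	if (sysctl_bpf_rlimit_accounting) {
		err = bpf_map_charge_init(&map->memory, cost);
		if (err)
			return err;
	}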
I actually wonder: does anybody rely on this memlock limit?
Or is everybody just bumping it to be "big enough" to avoid getting errors?
Thanks!