2020-11-19 17:44:41

by Roman Gushchin

Subject: [PATCH bpf-next v7 00/34] bpf: switch to memcg-based memory accounting

Currently, bpf uses the memlock rlimit for memory accounting.
This approach has its downsides and over time has created a significant
number of problems:

1) The limit is per-user, but because most bpf operations are performed
as root, the limit has little value.

2) It's hard to come up with a specific maximum value, especially because
the counter is shared with non-bpf users (e.g. mlock() users).
Any specific value is either too low, creating false failures,
or too high and therefore useless.

3) Charging is not connected to the actual memory allocation. Bpf code
must manually calculate the estimated cost, precharge the counter,
and then take care of uncharging, including on all failure paths.
This adds code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
We've used drgn for it, but it's far from being convenient.

5) A cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case to users.

In order to overcome these problems, let's switch to memcg-based
memory accounting for bpf objects. With the recent addition of percpu
memory accounting, it is now possible to provide comprehensive accounting
of the memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's far more flexible and allows
better control over memory usage by different workloads. Of course, it
requires cgroups and kernel memory accounting to be enabled and the cgroup
tree to be properly configured, but that's the default on a modern Linux system.

2) The actual memory consumption is taken into account. Charging happens
automatically at allocation time if the __GFP_ACCOUNT flag is passed, and
uncharging is performed automatically when the memory is released. The code
on the bpf side therefore becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

In general, if a process performs a bpf operation (e.g. creates or updates
a map), its memory cgroup is charged. However, map updates performed from
an interrupt context are charged to the memory cgroup of the process
that created the map.

Providing a 1:1 replacement for the rlimit-based memory accounting is
a non-goal of this patchset. Users and memory cgroups are completely
orthogonal, so it's not possible even in theory.
Memcg-based memory accounting requires a properly configured cgroup tree
to be actually useful; however, that is how memory is managed
on a modern Linux system.
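For concreteness, a "properly configured cgroup tree" for this purpose
might look as follows (an illustrative sketch: the cgroup name and the
limit value are hypothetical, and cgroup v2 is assumed to be mounted at
/sys/fs/cgroup):

```shell
# Create a cgroup, cap its memory (which, with kernel memory accounting
# enabled, includes accounted bpf allocations), and move a workload into it.
mkdir /sys/fs/cgroup/bpf-workload
echo 512M > /sys/fs/cgroup/bpf-workload/memory.max
echo "$WORKLOAD_PID" > /sys/fs/cgroup/bpf-workload/cgroup.procs

# Advantage 3 in practice: usage and statistics are a file read away.
cat /sys/fs/cgroup/bpf-workload/memory.current
grep -E '^(slab|percpu)' /sys/fs/cgroup/bpf-workload/memory.stat
```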


The patchset consists of the following parts:
1) 4 mm patches, which are already in the mm tree but are required
to avoid a regression (otherwise vmallocs cannot be mapped to userspace).
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples

The first 4 patches are not supposed to be merged via the bpf tree. I'm including
them to make sure the bpf tests pass.

v7:
- introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
- switched allocations made from an interrupt context to new helpers,
by Daniel
- rebase and minor fixes

v6:
- rebased to the latest version of the remote charging API
- fixed signatures, added acks

v5:
- rebased to the latest version of the remote charging API
- implemented kmem accounting from an interrupt context, by Shakeel
- rebased to latest changes in mm allowed to map vmallocs to userspace
- fixed a build issue in kselftests, by Alexei
- fixed a use-after-free bug in bpf_map_free_deferred()
- added bpf line info coverage, by Shakeel
- split bpf map charging preparations into a separate patch

v4:
- covered allocations made from an interrupt context, by Daniel
- added some clarifications to the cover letter

v3:
- dropped the userspace part for further discussion/refinement,
by Andrii and Song

v2:
- fixed a build issue caused by the remaining rlimit-based accounting
for sockhash maps


Roman Gushchin (34):
mm: memcontrol: use helpers to read page's memcg data
mm: memcontrol/slab: use helpers to access slab page's memcg_data
mm: introduce page memcg flags
mm: convert page kmemcg type to a page memcg flag
bpf: memcg-based memory accounting for bpf progs
bpf: prepare for memcg-based memory accounting for bpf maps
bpf: memcg-based memory accounting for bpf maps
bpf: refine memcg-based memory accounting for arraymap maps
bpf: refine memcg-based memory accounting for cpumap maps
bpf: memcg-based memory accounting for cgroup storage maps
bpf: refine memcg-based memory accounting for devmap maps
bpf: refine memcg-based memory accounting for hashtab maps
bpf: memcg-based memory accounting for lpm_trie maps
bpf: memcg-based memory accounting for bpf ringbuffer
bpf: memcg-based memory accounting for bpf local storage maps
bpf: refine memcg-based memory accounting for sockmap and sockhash
maps
bpf: refine memcg-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for arraymap maps
bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
bpf: eliminate rlimit-based memory accounting for cpumap maps
bpf: eliminate rlimit-based memory accounting for cgroup storage maps
bpf: eliminate rlimit-based memory accounting for devmap maps
bpf: eliminate rlimit-based memory accounting for hashtab maps
bpf: eliminate rlimit-based memory accounting for lpm_trie maps
bpf: eliminate rlimit-based memory accounting for queue_stack_maps
maps
bpf: eliminate rlimit-based memory accounting for reuseport_array maps
bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
bpf: eliminate rlimit-based memory accounting for sockmap and sockhash
maps
bpf: eliminate rlimit-based memory accounting for stackmap maps
bpf: eliminate rlimit-based memory accounting for xskmap maps
bpf: eliminate rlimit-based memory accounting for bpf local storage
maps
bpf: eliminate rlimit-based memory accounting infra for bpf maps
bpf: eliminate rlimit-based memory accounting for bpf progs
bpf: samples: do not touch RLIMIT_MEMLOCK

fs/buffer.c | 2 +-
fs/iomap/buffered-io.c | 2 +-
include/linux/bpf.h | 49 ++--
include/linux/memcontrol.h | 215 ++++++++++++++++-
include/linux/mm.h | 22 --
include/linux/mm_types.h | 5 +-
include/linux/page-flags.h | 11 +-
include/trace/events/writeback.h | 2 +-
kernel/bpf/arraymap.c | 30 +--
kernel/bpf/bpf_local_storage.c | 23 +-
kernel/bpf/bpf_struct_ops.c | 19 +-
kernel/bpf/core.c | 22 +-
kernel/bpf/cpumap.c | 39 ++-
kernel/bpf/devmap.c | 25 +-
kernel/bpf/hashtab.c | 34 +--
kernel/bpf/local_storage.c | 43 +---
kernel/bpf/lpm_trie.c | 20 +-
kernel/bpf/queue_stack_maps.c | 16 +-
kernel/bpf/reuseport_array.c | 12 +-
kernel/bpf/ringbuf.c | 33 +--
kernel/bpf/stackmap.c | 16 +-
kernel/bpf/syscall.c | 228 +++++++-----------
kernel/fork.c | 7 +-
mm/debug.c | 4 +-
mm/huge_memory.c | 4 +-
mm/memcontrol.c | 139 +++++------
mm/page_alloc.c | 8 +-
mm/page_io.c | 6 +-
mm/slab.h | 38 +--
mm/workingset.c | 2 +-
net/core/sock_map.c | 42 +---
net/xdp/xskmap.c | 16 +-
samples/bpf/map_perf_test_user.c | 6 -
samples/bpf/offwaketime_user.c | 6 -
samples/bpf/sockex2_user.c | 2 -
samples/bpf/sockex3_user.c | 2 -
samples/bpf/spintest_user.c | 6 -
samples/bpf/syscall_tp_user.c | 2 -
samples/bpf/task_fd_query_user.c | 5 -
samples/bpf/test_lru_dist.c | 3 -
samples/bpf/test_map_in_map_user.c | 6 -
samples/bpf/test_overhead_user.c | 2 -
samples/bpf/trace_event_user.c | 2 -
samples/bpf/tracex2_user.c | 6 -
samples/bpf/tracex3_user.c | 6 -
samples/bpf/tracex4_user.c | 6 -
samples/bpf/tracex5_user.c | 3 -
samples/bpf/tracex6_user.c | 3 -
samples/bpf/xdp1_user.c | 6 -
samples/bpf/xdp_adjust_tail_user.c | 6 -
samples/bpf/xdp_monitor_user.c | 5 -
samples/bpf/xdp_redirect_cpu_user.c | 6 -
samples/bpf/xdp_redirect_map_user.c | 6 -
samples/bpf/xdp_redirect_user.c | 6 -
samples/bpf/xdp_router_ipv4_user.c | 6 -
samples/bpf/xdp_rxq_info_user.c | 6 -
samples/bpf/xdp_sample_pkts_user.c | 6 -
samples/bpf/xdp_tx_iptunnel_user.c | 6 -
samples/bpf/xdpsock_user.c | 7 -
.../selftests/bpf/progs/bpf_iter_bpf_map.c | 2 +-
.../selftests/bpf/progs/map_ptr_kern.c | 7 -
61 files changed, 519 insertions(+), 756 deletions(-)

--
2.26.2


2020-11-19 17:44:46

by Roman Gushchin

Subject: [PATCH bpf-next v7 10/34] bpf: memcg-based memory accounting for cgroup storage maps

Account the memory used by cgroup storage maps, including metadata
structures.

Account the percpu memory for the percpu flavor of cgroup storage.

Signed-off-by: Roman Gushchin <[email protected]>
---
kernel/bpf/local_storage.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 571bb351ed3b..aae17d29538e 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -164,10 +164,10 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key,
return 0;
}

- new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
- map->value_size,
- __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
- map->numa_node);
+ new = bpf_map_kmalloc_node(map, sizeof(struct bpf_storage_buffer) +
+ map->value_size, __GFP_ZERO | GFP_ATOMIC |
+ __GFP_NOWARN | __GFP_ACCOUNT,
+ map->numa_node);
if (!new)
return -ENOMEM;

@@ -313,7 +313,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
return ERR_PTR(ret);

map = kmalloc_node(sizeof(struct bpf_cgroup_storage_map),
- __GFP_ZERO | GFP_USER, numa_node);
+ __GFP_ZERO | GFP_USER | __GFP_ACCOUNT, numa_node);
if (!map) {
bpf_map_charge_finish(&mem);
return ERR_PTR(-ENOMEM);
@@ -496,9 +496,9 @@ static size_t bpf_cgroup_storage_calculate_size(struct bpf_map *map, u32 *pages)
struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
enum bpf_cgroup_storage_type stype)
{
+ const gfp_t gfp = __GFP_ZERO | GFP_USER | __GFP_ACCOUNT;
struct bpf_cgroup_storage *storage;
struct bpf_map *map;
- gfp_t flags;
size_t size;
u32 pages;

@@ -511,20 +511,18 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
if (bpf_map_charge_memlock(map, pages))
return ERR_PTR(-EPERM);

- storage = kmalloc_node(sizeof(struct bpf_cgroup_storage),
- __GFP_ZERO | GFP_USER, map->numa_node);
+ storage = kmalloc_node(sizeof(struct bpf_cgroup_storage), gfp,
+ map->numa_node);
if (!storage)
goto enomem;

- flags = __GFP_ZERO | GFP_USER;
-
if (stype == BPF_CGROUP_STORAGE_SHARED) {
- storage->buf = kmalloc_node(size, flags, map->numa_node);
+ storage->buf = kmalloc_node(size, gfp, map->numa_node);
if (!storage->buf)
goto enomem;
check_and_init_map_lock(map, storage->buf->data);
} else {
- storage->percpu_buf = __alloc_percpu_gfp(size, 8, flags);
+ storage->percpu_buf = __alloc_percpu_gfp(size, 8, gfp);
if (!storage->percpu_buf)
goto enomem;
}
--
2.26.2

2020-11-19 17:45:59

by Roman Gushchin

Subject: [PATCH bpf-next v7 08/34] bpf: refine memcg-based memory accounting for arraymap maps

Include percpu arrays and auxiliary data in the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/arraymap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index c6c81eceb68f..92b650123c22 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -30,12 +30,12 @@ static void bpf_array_free_percpu(struct bpf_array *array)

static int bpf_array_alloc_percpu(struct bpf_array *array)
{
+ const gfp_t gfp = GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT;
void __percpu *ptr;
int i;

for (i = 0; i < array->map.max_entries; i++) {
- ptr = __alloc_percpu_gfp(array->elem_size, 8,
- GFP_USER | __GFP_NOWARN);
+ ptr = __alloc_percpu_gfp(array->elem_size, 8, gfp);
if (!ptr) {
bpf_array_free_percpu(array);
return -ENOMEM;
@@ -1018,7 +1018,7 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
struct bpf_array_aux *aux;
struct bpf_map *map;

- aux = kzalloc(sizeof(*aux), GFP_KERNEL);
+ aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
if (!aux)
return ERR_PTR(-ENOMEM);

--
2.26.2

2020-11-19 17:46:06

by Roman Gushchin

Subject: [PATCH bpf-next v7 18/34] bpf: eliminate rlimit-based memory accounting for arraymap maps

Do not use rlimit-based memory accounting for arraymap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Song Liu <[email protected]>
---
kernel/bpf/arraymap.c | 24 ++++--------------------
1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 92b650123c22..20f751a1d993 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -81,11 +81,10 @@ int array_map_alloc_check(union bpf_attr *attr)
static struct bpf_map *array_map_alloc(union bpf_attr *attr)
{
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
- int ret, numa_node = bpf_map_attr_numa_node(attr);
+ int numa_node = bpf_map_attr_numa_node(attr);
u32 elem_size, index_mask, max_entries;
bool bypass_spec_v1 = bpf_bypass_spec_v1();
- u64 cost, array_size, mask64;
- struct bpf_map_memory mem;
+ u64 array_size, mask64;
struct bpf_array *array;

elem_size = round_up(attr->value_size, 8);
@@ -126,44 +125,29 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
}
}

- /* make sure there is no u32 overflow later in round_up() */
- cost = array_size;
- if (percpu)
- cost += (u64)attr->max_entries * elem_size * num_possible_cpus();
-
- ret = bpf_map_charge_init(&mem, cost);
- if (ret < 0)
- return ERR_PTR(ret);
-
/* allocate all map elements and zero-initialize them */
if (attr->map_flags & BPF_F_MMAPABLE) {
void *data;

/* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
data = bpf_map_area_mmapable_alloc(array_size, numa_node);
- if (!data) {
- bpf_map_charge_finish(&mem);
+ if (!data)
return ERR_PTR(-ENOMEM);
- }
array = data + PAGE_ALIGN(sizeof(struct bpf_array))
- offsetof(struct bpf_array, value);
} else {
array = bpf_map_area_alloc(array_size, numa_node);
}
- if (!array) {
- bpf_map_charge_finish(&mem);
+ if (!array)
return ERR_PTR(-ENOMEM);
- }
array->index_mask = index_mask;
array->map.bypass_spec_v1 = bypass_spec_v1;

/* copy mandatory map attributes */
bpf_map_init_from_attr(&array->map, attr);
- bpf_map_charge_move(&array->map.memory, &mem);
array->elem_size = elem_size;

if (percpu && bpf_array_alloc_percpu(array)) {
- bpf_map_charge_finish(&array->map.memory);
bpf_map_area_free(array);
return ERR_PTR(-ENOMEM);
}
--
2.26.2

2020-11-20 01:41:13

by Song Liu

Subject: Re: [PATCH bpf-next v7 10/34] bpf: memcg-based memory accounting for cgroup storage maps



> On Nov 19, 2020, at 9:37 AM, Roman Gushchin <[email protected]> wrote:
>
> Account memory used by cgroup storage maps including metadata
> structures.
>
> Account the percpu memory for the percpu flavor of cgroup storage.
>
> Signed-off-by: Roman Gushchin <[email protected]>
>

Acked-by: Song Liu <[email protected]>

2020-11-24 00:10:51

by Daniel Borkmann

Subject: Re: [PATCH bpf-next v7 00/34] bpf: switch to memcg-based memory accounting

On 11/19/20 6:37 PM, Roman Gushchin wrote:
> Currently bpf is using the memlock rlimit for the memory accounting.
> This approach has its downsides and over time has created a significant
> amount of problems:
>
> 1) The limit is per-user, but because most bpf operations are performed
> as root, the limit has little value.
>
> 2) It's hard to come up with a specific maximum value. Especially because
> the counter is shared with non-bpf users (e.g. memlock() users).
> Any specific value is either too low and creates false failures
> or too high and useless.
>
> 3) Charging is not connected to the actual memory allocation. Bpf code
> should manually calculate the estimated cost and precharge the counter,
> and then take care of uncharging, including all fail paths.
> It adds to the code complexity and makes it easy to leak a charge.
>
> 4) There is no simple way of getting the current value of the counter.
> We've used drgn for it, but it's far from being convenient.
>
> 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
> a function to "explain" this case for users.
>
> In order to overcome these problems let's switch to the memcg-based
> memory accounting of bpf objects. With the recent addition of the percpu
> memory accounting, now it's possible to provide a comprehensive accounting
> of the memory used by bpf programs and maps.
>
> This approach has the following advantages:
> 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
> a better control over memory usage by different workloads. Of course, it
> requires enabled cgroups and kernel memory accounting and properly configured
> cgroup tree, but it's a default configuration for a modern Linux system.
>
> 2) The actual memory consumption is taken into account. It happens automatically
> at allocation time if the __GFP_ACCOUNT flag is passed. Uncharging is also
> performed automatically on releasing the memory. So the code on the bpf side
> becomes simpler and safer.
>
> 3) There is a simple way to get the current value and statistics.
>
> In general, if a process performs a bpf operation (e.g. creates or updates
> a map), its memory cgroup is charged. However, map updates performed from
> an interrupt context are charged to the memory cgroup of the process
> that created the map.
>
> Providing a 1:1 replacement for the rlimit-based memory accounting is
> a non-goal of this patchset. Users and memory cgroups are completely
> orthogonal, so it's not possible even in theory.
> Memcg-based memory accounting requires a properly configured cgroup tree
> to be actually useful. However, it's the way how the memory is managed
> on a modern Linux system.

The cover letter here only describes the advantages of this series but leaves
out a discussion of the disadvantages. They definitely must be part of the series
to provide readers with a clear description of the semantic changes. Last time we
discussed them, they were i) no mem limits in general on unprivileged users when
memory cgroups were not configured in the kernel, and ii) no mem limits by default
if not configured in the cgroup specifically. Did we make any progress on these
in the meantime? How do we want to address them? What is the concrete justification
for not addressing them?

Also, I wonder about the risk of regressions here: for example, if an existing
orchestrator has configured memory cgroup limits that are tailored to the application's
needs, will BPF now start to interfere after a kernel upgrade, e.g. if a BPF program
attached to cgroups (connect/sendmsg/recvmsg or the general cgroup skb egress hook)
starts charging to the process's memcg due to map updates?

[0] https://lore.kernel.org/bpf/[email protected]/

> The patchset consists of the following parts:
> 1) 4 mm patches, which are already in the mm tree, but are required
> to avoid a regression (otherwise vmallocs cannot be mapped to userspace).
> 2) memcg-based accounting for various bpf objects: progs and maps
> 3) removal of the rlimit-based accounting
> 4) removal of rlimit adjustments in userspace samples
>
> First 4 patches are not supposed to be merged via the bpf tree. I'm including
> them to make sure bpf tests will pass.
>
> v7:
> - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
> - switched allocations made from an interrupt context to new helpers,
> by Daniel
> - rebase and minor fixes

2020-11-24 00:43:55

by Roman Gushchin

Subject: Re: [PATCH bpf-next v7 00/34] bpf: switch to memcg-based memory accounting

On Mon, Nov 23, 2020 at 02:30:09PM +0100, Daniel Borkmann wrote:
> On 11/19/20 6:37 PM, Roman Gushchin wrote:
> > Currently bpf is using the memlock rlimit for the memory accounting.
> > This approach has its downsides and over time has created a significant
> > amount of problems:
> >
> > 1) The limit is per-user, but because most bpf operations are performed
> > as root, the limit has little value.
> >
> > 2) It's hard to come up with a specific maximum value. Especially because
> > the counter is shared with non-bpf users (e.g. memlock() users).
> > Any specific value is either too low and creates false failures
> > or too high and useless.
> >
> > 3) Charging is not connected to the actual memory allocation. Bpf code
> > should manually calculate the estimated cost and precharge the counter,
> > and then take care of uncharging, including all fail paths.
> > It adds to the code complexity and makes it easy to leak a charge.
> >
> > 4) There is no simple way of getting the current value of the counter.
> > We've used drgn for it, but it's far from being convenient.
> >
> > 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
> > a function to "explain" this case for users.
> >
> > In order to overcome these problems let's switch to the memcg-based
> > memory accounting of bpf objects. With the recent addition of the percpu
> > memory accounting, now it's possible to provide a comprehensive accounting
> > of the memory used by bpf programs and maps.
> >
> > This approach has the following advantages:
> > 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
> > a better control over memory usage by different workloads. Of course, it
> > requires enabled cgroups and kernel memory accounting and properly configured
> > cgroup tree, but it's a default configuration for a modern Linux system.
> >
> > 2) The actual memory consumption is taken into account. It happens automatically
> > at allocation time if the __GFP_ACCOUNT flag is passed. Uncharging is also
> > performed automatically on releasing the memory. So the code on the bpf side
> > becomes simpler and safer.
> >
> > 3) There is a simple way to get the current value and statistics.
> >
> > In general, if a process performs a bpf operation (e.g. creates or updates
> > a map), its memory cgroup is charged. However, map updates performed from
> > an interrupt context are charged to the memory cgroup of the process
> > that created the map.
> >
> > Providing a 1:1 replacement for the rlimit-based memory accounting is
> > a non-goal of this patchset. Users and memory cgroups are completely
> > orthogonal, so it's not possible even in theory.
> > Memcg-based memory accounting requires a properly configured cgroup tree
> > to be actually useful. However, it's the way how the memory is managed
> > on a modern Linux system.

Hi Daniel!

>
> The cover letter here only describes the advantages of this series, but leaves
> out discussion of the disadvantages. They definitely must be part of the series
> to provide a clear description of the semantic changes to readers.

Honestly, I don't see them as disadvantages. Cgroups are the basic units in which
resource control limits/guarantees/accounting are expressed. If there are
no cgroups created and configured in the system, it's obvious (maybe only to me)
that no limits are applied.

Users (rlimits) are to some extent similar units, but they do not provide
a comprehensive resource control system. Some parts are deprecated (like rss limits),
and some parts are simply missing. Aside from bpf, nobody uses users to control
memory as a physical resource. It simply doesn't work (and never did).
If somebody expects that a non-privileged user can't harm the system by depleting
its memory (and other resources), that's simply not correct: there are multiple ways
of doing so, and accounting or not accounting bpf maps doesn't really change anything.
If we see these limits not as a real security mechanism but as a way to prevent
"mistakes" that can harm the system, that's to some extent legitimate. The question
is only whether that justifies the number of problems we've had with these limits.

Switching to memory cgroups, which are how memory control is expressed these days,
IMO doesn't need additional justification. During the last year I remember 2 or 3
occasions when various people (internally at Fb and on public mailing lists) asked
why bpf memory is not accounted to memory cgroups. I think it's basically expected
by now.

I'll try to make it more obvious that we're switching from users to cgroups and
describe the consequences of this on an unconfigured system. I'll update the cover letter.

> Last time we
> discussed them, they were i) no mem limits in general on unprivileged users when
> memory cgroups were not configured in the kernel, and ii) no mem limits by default
> if not configured in the cgroup specifically.
> Did we make any progress on these
> in the meantime? How do we want to address them? What is the concrete justification
> for not addressing them?

I don't see how they can or should be addressed.
Cgroups are the way the resource consumption of a group of processes is
limited. If there are no cgroups configured, all resources are available
to everyone. Maybe a user wants to use the whole memory for a bpf map? Why not?

Do you have any specific use case in mind?
If you see real value in the old system (I don't) that justifies the additional
complexity of keeping both in a working state, we can discuss that option too.
We can make the switch in a few steps if you think it's too risky.

>
> Also, I wonder about the risk of regressions here: for example, if an existing
> orchestrator has configured memory cgroup limits that are tailored to the application's
> needs, will BPF now start to interfere after a kernel upgrade, e.g. if a BPF program
> attached to cgroups (connect/sendmsg/recvmsg or the general cgroup skb egress hook)
> starts charging to the process's memcg due to map updates?

Well, if somebody has a tight memory limit and large bpf map(s), they may see a "regression".
However, kernel memory usage and the accounting implementation details vary from version
to version, so nobody should expect that limits, once set, will work forever.
If for some strange reason this creates a critical problem, it's possible, as a workaround,
to disable kernel memory accounting as a whole (via a boot option).
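For reference, the boot option being referred to here should be the
cgroup.memory kernel command-line parameter:

```
# kernel command line (config fragment): disable kernel memory
# accounting as a whole
cgroup.memory=nokmem
```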

Actually, it seems that the usefulness of strict limits is limited in general, because
it's hard to pick and assign any specific value: limits are always either too relaxed
(and have no value) or too strict (and cause production issues). Memory cgroups
are generally moving towards soft limits and protections. But that's a separate topic...

Thanks!