2022-03-25 18:26:31

by Hao Luo

[permalink] [raw]
Subject: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

Some map types support mmap operation, which allows userspace to
communicate with BPF programs directly. Currently only arraymap
and ringbuf have mmap implemented.

However, in some use cases, when multiple program instances can
run concurrently, global mmapable memory can cause race. In that
case, userspace needs to provide necessary synchronizations to
coordinate the usage of mapped global data. This can be a source
of bottleneck.

It would be great to have a mmapable local storage in that case.
This patch adds that.

Mmap isn't BPF syscall, so unpriv users can also use it to
interact with maps.

Currently the only way of allocating mmapable map area is using
vmalloc() and it's only used at map allocation time. Vmalloc()
may sleep, therefore it's not suitable for maps that may allocate
memory in an atomic context such as local storage. Local storage
uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
uses kmalloc() with GFP_ATOMIC as well for mmapable map area.

Allocating mmapable memory has requirment on page alignment. So we
have to deliberately allocate more memory than necessary to obtain
an address that has sdata->data aligned at page boundary. The
calculations for mmapable allocation size, and the actual
allocation/deallocation are packaged in three functions:

- bpf_map_mmapable_alloc_size()
- bpf_map_mmapable_kzalloc()
- bpf_map_mmapable_kfree()

BPF local storage uses them to provide generic mmap API:

- bpf_local_storage_mmap()

And task local storage adds the mmap callback:

- task_storage_map_mmap()

When application calls mmap on a task local storage, it gets its
own local storage.

Overall, mmapable local storage trades off memory with flexibility
and efficiency. It brings memory fragmentation but can make programs
stateless. Therefore useful in some cases.

Hao Luo (2):
bpf: Mmapable local storage.
selftests/bpf: Test mmapable task local storage.

include/linux/bpf.h | 4 +
include/linux/bpf_local_storage.h | 5 +-
kernel/bpf/bpf_local_storage.c | 73 +++++++++++++++++--
kernel/bpf/bpf_task_storage.c | 40 ++++++++++
kernel/bpf/syscall.c | 67 +++++++++++++++++
.../bpf/prog_tests/task_local_storage.c | 38 ++++++++++
.../bpf/progs/task_local_storage_mmapable.c | 38 ++++++++++
7 files changed, 257 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c

--
2.35.1.1021.g381101b075-goog


2022-03-25 20:26:59

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.



On 3/24/22 4:41 PM, Hao Luo wrote:
> Some map types support mmap operation, which allows userspace to
> communicate with BPF programs directly. Currently only arraymap
> and ringbuf have mmap implemented.
>
> However, in some use cases, when multiple program instances can
> run concurrently, global mmapable memory can cause race. In that
> case, userspace needs to provide necessary synchronizations to
> coordinate the usage of mapped global data. This can be a source
> of bottleneck.

I can see your use case here. Each calling process can get the
corresponding bpf program task local storage data through
mmap interface. As you mentioned, there is a tradeoff
between more memory vs. non-global synchronization.

I am thinking that another bpf_iter approach can retrieve
the similar result. We could implement a bpf_iter
for task local storage map, optionally it can provide
a tid to retrieve the data for that particular tid.
This way, user space needs an explicit syscall, but
does not need to allocate more memory than necessary.

WDYT?

>
> It would be great to have a mmapable local storage in that case.
> This patch adds that.
>
> Mmap isn't BPF syscall, so unpriv users can also use it to
> interact with maps.
>
> Currently the only way of allocating mmapable map area is using
> vmalloc() and it's only used at map allocation time. Vmalloc()
> may sleep, therefore it's not suitable for maps that may allocate
> memory in an atomic context such as local storage. Local storage
> uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
>
> Allocating mmapable memory has requirment on page alignment. So we
> have to deliberately allocate more memory than necessary to obtain
> an address that has sdata->data aligned at page boundary. The
> calculations for mmapable allocation size, and the actual
> allocation/deallocation are packaged in three functions:
>
> - bpf_map_mmapable_alloc_size()
> - bpf_map_mmapable_kzalloc()
> - bpf_map_mmapable_kfree()
>
> BPF local storage uses them to provide generic mmap API:
>
> - bpf_local_storage_mmap()
>
> And task local storage adds the mmap callback:
>
> - task_storage_map_mmap()
>
> When application calls mmap on a task local storage, it gets its
> own local storage.
>
> Overall, mmapable local storage trades off memory with flexibility
> and efficiency. It brings memory fragmentation but can make programs
> stateless. Therefore useful in some cases.
>
> Hao Luo (2):
> bpf: Mmapable local storage.
> selftests/bpf: Test mmapable task local storage.
>
> include/linux/bpf.h | 4 +
> include/linux/bpf_local_storage.h | 5 +-
> kernel/bpf/bpf_local_storage.c | 73 +++++++++++++++++--
> kernel/bpf/bpf_task_storage.c | 40 ++++++++++
> kernel/bpf/syscall.c | 67 +++++++++++++++++
> .../bpf/prog_tests/task_local_storage.c | 38 ++++++++++
> .../bpf/progs/task_local_storage_mmapable.c | 38 ++++++++++
> 7 files changed, 257 insertions(+), 8 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
>

2022-03-28 21:47:15

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

Hi Yonghong,

On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
>
> On 3/24/22 4:41 PM, Hao Luo wrote:
> > Some map types support mmap operation, which allows userspace to
> > communicate with BPF programs directly. Currently only arraymap
> > and ringbuf have mmap implemented.
> >
> > However, in some use cases, when multiple program instances can
> > run concurrently, global mmapable memory can cause race. In that
> > case, userspace needs to provide necessary synchronizations to
> > coordinate the usage of mapped global data. This can be a source
> > of bottleneck.
>
> I can see your use case here. Each calling process can get the
> corresponding bpf program task local storage data through
> mmap interface. As you mentioned, there is a tradeoff
> between more memory vs. non-global synchronization.
>
> I am thinking that another bpf_iter approach can retrieve
> the similar result. We could implement a bpf_iter
> for task local storage map, optionally it can provide
> a tid to retrieve the data for that particular tid.
> This way, user space needs an explicit syscall, but
> does not need to allocate more memory than necessary.
>
> WDYT?
>

Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:

- mmap prevents the calling task from reading other task's value.
Using bpf_iter, one can pass other task's tid to get their values. I
assume there are two potential ways of passing tid to bpf_iter: one is
to use global data in bpf prog, the other is adding tid parameterized
iter_link. For the first, it's not easy for unpriv tasks to use. For
the second, we need to create one iter_link object for each interested
tid. It may not be easy to use either.

- Regarding adding an explicit syscall. I thought about adding
write/read syscalls for task local storage maps, just like reading
values from iter_link. Writing or reading task local storage map
updates/reads the current task's value. I think this could achieve the
same effect as mmap.

Hao

> >
> > It would be great to have a mmapable local storage in that case.
> > This patch adds that.
> >
> > Mmap isn't BPF syscall, so unpriv users can also use it to
> > interact with maps.
> >
> > Currently the only way of allocating mmapable map area is using
> > vmalloc() and it's only used at map allocation time. Vmalloc()
> > may sleep, therefore it's not suitable for maps that may allocate
> > memory in an atomic context such as local storage. Local storage
> > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> >
> > Allocating mmapable memory has requirment on page alignment. So we
> > have to deliberately allocate more memory than necessary to obtain
> > an address that has sdata->data aligned at page boundary. The
> > calculations for mmapable allocation size, and the actual
> > allocation/deallocation are packaged in three functions:
> >
> > - bpf_map_mmapable_alloc_size()
> > - bpf_map_mmapable_kzalloc()
> > - bpf_map_mmapable_kfree()
> >
> > BPF local storage uses them to provide generic mmap API:
> >
> > - bpf_local_storage_mmap()
> >
> > And task local storage adds the mmap callback:
> >
> > - task_storage_map_mmap()
> >
> > When application calls mmap on a task local storage, it gets its
> > own local storage.
> >
> > Overall, mmapable local storage trades off memory with flexibility
> > and efficiency. It brings memory fragmentation but can make programs
> > stateless. Therefore useful in some cases.
> >
> > Hao Luo (2):
> > bpf: Mmapable local storage.
> > selftests/bpf: Test mmapable task local storage.
> >
> > include/linux/bpf.h | 4 +
> > include/linux/bpf_local_storage.h | 5 +-
> > kernel/bpf/bpf_local_storage.c | 73 +++++++++++++++++--
> > kernel/bpf/bpf_task_storage.c | 40 ++++++++++
> > kernel/bpf/syscall.c | 67 +++++++++++++++++
> > .../bpf/prog_tests/task_local_storage.c | 38 ++++++++++
> > .../bpf/progs/task_local_storage_mmapable.c | 38 ++++++++++
> > 7 files changed, 257 insertions(+), 8 deletions(-)
> > create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> >

2022-03-28 22:35:26

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
>
> Hi Yonghong,
>
> On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> >
> > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > Some map types support mmap operation, which allows userspace to
> > > communicate with BPF programs directly. Currently only arraymap
> > > and ringbuf have mmap implemented.
> > >
> > > However, in some use cases, when multiple program instances can
> > > run concurrently, global mmapable memory can cause race. In that
> > > case, userspace needs to provide necessary synchronizations to
> > > coordinate the usage of mapped global data. This can be a source
> > > of bottleneck.
> >
> > I can see your use case here. Each calling process can get the
> > corresponding bpf program task local storage data through
> > mmap interface. As you mentioned, there is a tradeoff
> > between more memory vs. non-global synchronization.
> >
> > I am thinking that another bpf_iter approach can retrieve
> > the similar result. We could implement a bpf_iter
> > for task local storage map, optionally it can provide
> > a tid to retrieve the data for that particular tid.
> > This way, user space needs an explicit syscall, but
> > does not need to allocate more memory than necessary.
> >
> > WDYT?
> >
>
> Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
>
> - mmap prevents the calling task from reading other task's value.
> Using bpf_iter, one can pass other task's tid to get their values. I
> assume there are two potential ways of passing tid to bpf_iter: one is
> to use global data in bpf prog, the other is adding tid parameterized
> iter_link. For the first, it's not easy for unpriv tasks to use. For
> the second, we need to create one iter_link object for each interested
> tid. It may not be easy to use either.
>
> - Regarding adding an explicit syscall. I thought about adding
> write/read syscalls for task local storage maps, just like reading
> values from iter_link. Writing or reading task local storage map
> updates/reads the current task's value. I think this could achieve the
> same effect as mmap.
>

Actually, my use case of using mmap on task local storage is to allow
userspace to pass FDs into bpf prog. Some of the helpers I want to add
need to take an FD as parameter and the bpf progs can run
concurrently, thus using global data is racy. Mmapable task local
storage is the best solution I can find for this purpose.

Song also mentioned to me offline, that mmapable task local storage
may be useful for his use case.

I am actually open to other proposals.

> > >
> > > It would be great to have a mmapable local storage in that case.
> > > This patch adds that.
> > >
> > > Mmap isn't BPF syscall, so unpriv users can also use it to
> > > interact with maps.
> > >
> > > Currently the only way of allocating mmapable map area is using
> > > vmalloc() and it's only used at map allocation time. Vmalloc()
> > > may sleep, therefore it's not suitable for maps that may allocate
> > > memory in an atomic context such as local storage. Local storage
> > > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> > >
> > > Allocating mmapable memory has requirment on page alignment. So we
> > > have to deliberately allocate more memory than necessary to obtain
> > > an address that has sdata->data aligned at page boundary. The
> > > calculations for mmapable allocation size, and the actual
> > > allocation/deallocation are packaged in three functions:
> > >
> > > - bpf_map_mmapable_alloc_size()
> > > - bpf_map_mmapable_kzalloc()
> > > - bpf_map_mmapable_kfree()
> > >
> > > BPF local storage uses them to provide generic mmap API:
> > >
> > > - bpf_local_storage_mmap()
> > >
> > > And task local storage adds the mmap callback:
> > >
> > > - task_storage_map_mmap()
> > >
> > > When application calls mmap on a task local storage, it gets its
> > > own local storage.
> > >
> > > Overall, mmapable local storage trades off memory with flexibility
> > > and efficiency. It brings memory fragmentation but can make programs
> > > stateless. Therefore useful in some cases.
> > >
> > > Hao Luo (2):
> > > bpf: Mmapable local storage.
> > > selftests/bpf: Test mmapable task local storage.
> > >
> > > include/linux/bpf.h | 4 +
> > > include/linux/bpf_local_storage.h | 5 +-
> > > kernel/bpf/bpf_local_storage.c | 73 +++++++++++++++++--
> > > kernel/bpf/bpf_task_storage.c | 40 ++++++++++
> > > kernel/bpf/syscall.c | 67 +++++++++++++++++
> > > .../bpf/prog_tests/task_local_storage.c | 38 ++++++++++
> > > .../bpf/progs/task_local_storage_mmapable.c | 38 ++++++++++
> > > 7 files changed, 257 insertions(+), 8 deletions(-)
> > > create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > >

2022-03-29 11:41:43

by Kumar Kartikeya Dwivedi

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> >
> > Hi Yonghong,
> >
> > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > >
> > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > Some map types support mmap operation, which allows userspace to
> > > > communicate with BPF programs directly. Currently only arraymap
> > > > and ringbuf have mmap implemented.
> > > >
> > > > However, in some use cases, when multiple program instances can
> > > > run concurrently, global mmapable memory can cause race. In that
> > > > case, userspace needs to provide necessary synchronizations to
> > > > coordinate the usage of mapped global data. This can be a source
> > > > of bottleneck.
> > >
> > > I can see your use case here. Each calling process can get the
> > > corresponding bpf program task local storage data through
> > > mmap interface. As you mentioned, there is a tradeoff
> > > between more memory vs. non-global synchronization.
> > >
> > > I am thinking that another bpf_iter approach can retrieve
> > > the similar result. We could implement a bpf_iter
> > > for task local storage map, optionally it can provide
> > > a tid to retrieve the data for that particular tid.
> > > This way, user space needs an explicit syscall, but
> > > does not need to allocate more memory than necessary.
> > >
> > > WDYT?
> > >
> >
> > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> >
> > - mmap prevents the calling task from reading other task's value.
> > Using bpf_iter, one can pass other task's tid to get their values. I
> > assume there are two potential ways of passing tid to bpf_iter: one is
> > to use global data in bpf prog, the other is adding tid parameterized
> > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > the second, we need to create one iter_link object for each interested
> > tid. It may not be easy to use either.
> >
> > - Regarding adding an explicit syscall. I thought about adding
> > write/read syscalls for task local storage maps, just like reading
> > values from iter_link. Writing or reading task local storage map
> > updates/reads the current task's value. I think this could achieve the
> > same effect as mmap.
> >
>
> Actually, my use case of using mmap on task local storage is to allow
> userspace to pass FDs into bpf prog. Some of the helpers I want to add
> need to take an FD as parameter and the bpf progs can run
> concurrently, thus using global data is racy. Mmapable task local
> storage is the best solution I can find for this purpose.
>
> Song also mentioned to me offline, that mmapable task local storage
> may be useful for his use case.
>
> I am actually open to other proposals.
>

You could also use a syscall prog, and use bpf_prog_test_run to update local
storage for current. Data can be passed for that specific prog invocation using
ctx. You might have to enable bpf_task_storage helpers in it though, since they
are not allowed to be called right now.

> > > >
> > > > It would be great to have a mmapable local storage in that case.
> > > > This patch adds that.
> > > >
> > > > Mmap isn't BPF syscall, so unpriv users can also use it to
> > > > interact with maps.
> > > >
> > > > Currently the only way of allocating mmapable map area is using
> > > > vmalloc() and it's only used at map allocation time. Vmalloc()
> > > > may sleep, therefore it's not suitable for maps that may allocate
> > > > memory in an atomic context such as local storage. Local storage
> > > > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > > > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> > > >
> > > > Allocating mmapable memory has requirment on page alignment. So we
> > > > have to deliberately allocate more memory than necessary to obtain
> > > > an address that has sdata->data aligned at page boundary. The
> > > > calculations for mmapable allocation size, and the actual
> > > > allocation/deallocation are packaged in three functions:
> > > >
> > > > - bpf_map_mmapable_alloc_size()
> > > > - bpf_map_mmapable_kzalloc()
> > > > - bpf_map_mmapable_kfree()
> > > >
> > > > BPF local storage uses them to provide generic mmap API:
> > > >
> > > > - bpf_local_storage_mmap()
> > > >
> > > > And task local storage adds the mmap callback:
> > > >
> > > > - task_storage_map_mmap()
> > > >
> > > > When application calls mmap on a task local storage, it gets its
> > > > own local storage.
> > > >
> > > > Overall, mmapable local storage trades off memory with flexibility
> > > > and efficiency. It brings memory fragmentation but can make programs
> > > > stateless. Therefore useful in some cases.
> > > >
> > > > Hao Luo (2):
> > > > bpf: Mmapable local storage.
> > > > selftests/bpf: Test mmapable task local storage.
> > > >
> > > > include/linux/bpf.h | 4 +
> > > > include/linux/bpf_local_storage.h | 5 +-
> > > > kernel/bpf/bpf_local_storage.c | 73 +++++++++++++++++--
> > > > kernel/bpf/bpf_task_storage.c | 40 ++++++++++
> > > > kernel/bpf/syscall.c | 67 +++++++++++++++++
> > > > .../bpf/prog_tests/task_local_storage.c | 38 ++++++++++
> > > > .../bpf/progs/task_local_storage_mmapable.c | 38 ++++++++++
> > > > 7 files changed, 257 insertions(+), 8 deletions(-)
> > > > create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > > >

--
Kartikeya

2022-03-30 15:17:54

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> <[email protected]> wrote:
> >
> > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > >
> > > > Hi Yonghong,
> > > >
> > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > >
> > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > and ringbuf have mmap implemented.
> > > > > >
> > > > > > However, in some use cases, when multiple program instances can
> > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > of bottleneck.
> > > > >
> > > > > I can see your use case here. Each calling process can get the
> > > > > corresponding bpf program task local storage data through
> > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > between more memory vs. non-global synchronization.
> > > > >
> > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > the similar result. We could implement a bpf_iter
> > > > > for task local storage map, optionally it can provide
> > > > > a tid to retrieve the data for that particular tid.
> > > > > This way, user space needs an explicit syscall, but
> > > > > does not need to allocate more memory than necessary.
> > > > >
> > > > > WDYT?
> > > > >
> > > >
> > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > >
> > > > - mmap prevents the calling task from reading other task's value.
> > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > the second, we need to create one iter_link object for each interested
> > > > tid. It may not be easy to use either.
> > > >
> > > > - Regarding adding an explicit syscall. I thought about adding
> > > > write/read syscalls for task local storage maps, just like reading
> > > > values from iter_link. Writing or reading task local storage map
> > > > updates/reads the current task's value. I think this could achieve the
> > > > same effect as mmap.
> > > >
> > >
> > > Actually, my use case of using mmap on task local storage is to allow
> > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > need to take an FD as parameter and the bpf progs can run
> > > concurrently, thus using global data is racy. Mmapable task local
> > > storage is the best solution I can find for this purpose.
> > >
> > > Song also mentioned to me offline, that mmapable task local storage
> > > may be useful for his use case.
> > >
> > > I am actually open to other proposals.
> > >
> >
> > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > storage for current. Data can be passed for that specific prog invocation using
> > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > are not allowed to be called right now.
> >
>
> The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> thinking of allowing any thread including unpriv ones to be able to
> pass data to the prog and update their own storage.

If I understand the use case correctly all of this mmap-ing is only to
allow unpriv userspace to access a priv map via unpriv mmap() syscall.
But the map can be accessed as unpriv already.
Pin it with the world read creds and do map_lookup sys_bpf cmd on it.

2022-03-30 21:15:29

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
<[email protected]> wrote:
>
> On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > >
> > > Hi Yonghong,
> > >
> > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > >
> > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > Some map types support mmap operation, which allows userspace to
> > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > and ringbuf have mmap implemented.
> > > > >
> > > > > However, in some use cases, when multiple program instances can
> > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > case, userspace needs to provide necessary synchronizations to
> > > > > coordinate the usage of mapped global data. This can be a source
> > > > > of bottleneck.
> > > >
> > > > I can see your use case here. Each calling process can get the
> > > > corresponding bpf program task local storage data through
> > > > mmap interface. As you mentioned, there is a tradeoff
> > > > between more memory vs. non-global synchronization.
> > > >
> > > > I am thinking that another bpf_iter approach can retrieve
> > > > the similar result. We could implement a bpf_iter
> > > > for task local storage map, optionally it can provide
> > > > a tid to retrieve the data for that particular tid.
> > > > This way, user space needs an explicit syscall, but
> > > > does not need to allocate more memory than necessary.
> > > >
> > > > WDYT?
> > > >
> > >
> > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > >
> > > - mmap prevents the calling task from reading other task's value.
> > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > to use global data in bpf prog, the other is adding tid parameterized
> > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > the second, we need to create one iter_link object for each interested
> > > tid. It may not be easy to use either.
> > >
> > > - Regarding adding an explicit syscall. I thought about adding
> > > write/read syscalls for task local storage maps, just like reading
> > > values from iter_link. Writing or reading task local storage map
> > > updates/reads the current task's value. I think this could achieve the
> > > same effect as mmap.
> > >
> >
> > Actually, my use case of using mmap on task local storage is to allow
> > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > need to take an FD as parameter and the bpf progs can run
> > concurrently, thus using global data is racy. Mmapable task local
> > storage is the best solution I can find for this purpose.
> >
> > Song also mentioned to me offline, that mmapable task local storage
> > may be useful for his use case.
> >
> > I am actually open to other proposals.
> >
>
> You could also use a syscall prog, and use bpf_prog_test_run to update local
> storage for current. Data can be passed for that specific prog invocation using
> ctx. You might have to enable bpf_task_storage helpers in it though, since they
> are not allowed to be called right now.
>

The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
thinking of allowing any thread including unpriv ones to be able to
pass data to the prog and update their own storage.

> > > > >
> > > > > It would be great to have a mmapable local storage in that case.
> > > > > This patch adds that.
> > > > >
> > > > > Mmap isn't BPF syscall, so unpriv users can also use it to
> > > > > interact with maps.
> > > > >
> > > > > Currently the only way of allocating mmapable map area is using
> > > > > vmalloc() and it's only used at map allocation time. Vmalloc()
> > > > > may sleep, therefore it's not suitable for maps that may allocate
> > > > > memory in an atomic context such as local storage. Local storage
> > > > > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > > > > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> > > > >
> > > > > Allocating mmapable memory has requirment on page alignment. So we
> > > > > have to deliberately allocate more memory than necessary to obtain
> > > > > an address that has sdata->data aligned at page boundary. The
> > > > > calculations for mmapable allocation size, and the actual
> > > > > allocation/deallocation are packaged in three functions:
> > > > >
> > > > > - bpf_map_mmapable_alloc_size()
> > > > > - bpf_map_mmapable_kzalloc()
> > > > > - bpf_map_mmapable_kfree()
> > > > >
> > > > > BPF local storage uses them to provide generic mmap API:
> > > > >
> > > > > - bpf_local_storage_mmap()
> > > > >
> > > > > And task local storage adds the mmap callback:
> > > > >
> > > > > - task_storage_map_mmap()
> > > > >
> > > > > When application calls mmap on a task local storage, it gets its
> > > > > own local storage.
> > > > >
> > > > > Overall, mmapable local storage trades off memory with flexibility
> > > > > and efficiency. It brings memory fragmentation but can make programs
> > > > > stateless. Therefore useful in some cases.
> > > > >
> > > > > Hao Luo (2):
> > > > > bpf: Mmapable local storage.
> > > > > selftests/bpf: Test mmapable task local storage.
> > > > >
> > > > > include/linux/bpf.h | 4 +
> > > > > include/linux/bpf_local_storage.h | 5 +-
> > > > > kernel/bpf/bpf_local_storage.c | 73 +++++++++++++++++--
> > > > > kernel/bpf/bpf_task_storage.c | 40 ++++++++++
> > > > > kernel/bpf/syscall.c | 67 +++++++++++++++++
> > > > > .../bpf/prog_tests/task_local_storage.c | 38 ++++++++++
> > > > > .../bpf/progs/task_local_storage_mmapable.c | 38 ++++++++++
> > > > > 7 files changed, 257 insertions(+), 8 deletions(-)
> > > > > create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > > > >
>
> --
> Kartikeya

2022-03-31 02:31:24

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <[email protected]> wrote:
> >
> > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Yonghong,
> > > > > > >
> > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > >
> > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > of bottleneck.
> > > > > > > >
> > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > >
> > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > >
> > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > tid. It may not be easy to use either.
> > > > > > >
> > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > same effect as mmap.
> > > > > > >
> > > > > >
> > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > storage is the best solution I can find for this purpose.
> > > > > >
> > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > may be useful for his use case.
> > > > > >
> > > > > > I am actually open to other proposals.
> > > > > >
> > > > >
> > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > are not allowed to be called right now.
> > > > >
> > > >
> > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > thinking of allowing any thread including unpriv ones to be able to
> > > > pass data to the prog and update their own storage.
> > >
> > > If I understand the use case correctly all of this mmap-ing is only to
> > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > But the map can be accessed as unpriv already.
> > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> >
> > Right, but, if I understand correctly, with
> > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > use of __sys_bpf(). Is there anything I missed?
>
> That sysctl is a heavy hammer. Let's fix it instead.
> map lookup/update/delete can be allowed for unpriv for certain map types.
> There are permissions checks in corresponding lookup/update calls already.

This sounds great. If we can allow basic map operations for some map
types, it will change many use cases I'm looking at. Let me take a
look and report back.

2022-03-31 04:18:42

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > <[email protected]> wrote:
> > >
> > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > >
> > > > > Hi Yonghong,
> > > > >
> > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > >
> > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > and ringbuf have mmap implemented.
> > > > > > >
> > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > of bottleneck.
> > > > > >
> > > > > > I can see your use case here. Each calling process can get the
> > > > > > corresponding bpf program task local storage data through
> > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > between more memory vs. non-global synchronization.
> > > > > >
> > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > the similar result. We could implement a bpf_iter
> > > > > > for task local storage map, optionally it can provide
> > > > > > a tid to retrieve the data for that particular tid.
> > > > > > This way, user space needs an explicit syscall, but
> > > > > > does not need to allocate more memory than necessary.
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > >
> > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > >
> > > > > - mmap prevents the calling task from reading other task's value.
> > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > the second, we need to create one iter_link object for each interested
> > > > > tid. It may not be easy to use either.
> > > > >
> > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > write/read syscalls for task local storage maps, just like reading
> > > > > values from iter_link. Writing or reading task local storage map
> > > > > updates/reads the current task's value. I think this could achieve the
> > > > > same effect as mmap.
> > > > >
> > > >
> > > > Actually, my use case of using mmap on task local storage is to allow
> > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > need to take an FD as parameter and the bpf progs can run
> > > > concurrently, thus using global data is racy. Mmapable task local
> > > > storage is the best solution I can find for this purpose.
> > > >
> > > > Song also mentioned to me offline, that mmapable task local storage
> > > > may be useful for his use case.
> > > >
> > > > I am actually open to other proposals.
> > > >
> > >
> > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > storage for current. Data can be passed for that specific prog invocation using
> > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > are not allowed to be called right now.
> > >
> >
> > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > thinking of allowing any thread including unpriv ones to be able to
> > pass data to the prog and update their own storage.
>
> If I understand the use case correctly all of this mmap-ing is only to
> allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> But the map can be accessed as unpriv already.
> Pin it with the world read creds and do map_lookup sys_bpf cmd on it.

Right, but, if I understand correctly, with
sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
use of __sys_bpf(). Is there anything I missed?

2022-03-31 04:20:07

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Tue, Mar 29, 2022 at 2:45 PM Martin KaFai Lau <[email protected]> wrote:
>
> On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > <[email protected]> wrote:
> > >
> > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > >
> > > > > Hi Yonghong,
> > > > >
> > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > >
> > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > and ringbuf have mmap implemented.
> > > > > > >
> > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > of bottleneck.
> > > > > >
> > > > > > I can see your use case here. Each calling process can get the
> > > > > > corresponding bpf program task local storage data through
> > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > between more memory vs. non-global synchronization.
> > > > > >
> > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > the similar result. We could implement a bpf_iter
> > > > > > for task local storage map, optionally it can provide
> > > > > > a tid to retrieve the data for that particular tid.
> > > > > > This way, user space needs an explicit syscall, but
> > > > > > does not need to allocate more memory than necessary.
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > >
> > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > >
> > > > > - mmap prevents the calling task from reading other task's value.
> > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > the second, we need to create one iter_link object for each interested
> > > > > tid. It may not be easy to use either.
> > > > >
> > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > write/read syscalls for task local storage maps, just like reading
> > > > > values from iter_link. Writing or reading task local storage map
> > > > > updates/reads the current task's value. I think this could achieve the
> > > > > same effect as mmap.
> > > > >
> > > >
> > > > Actually, my use case of using mmap on task local storage is to allow
> > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > need to take an FD as parameter and the bpf progs can run
> > > > concurrently, thus using global data is racy. Mmapable task local
> > > > storage is the best solution I can find for this purpose.
> Some more details is needed about the use case. As long as there is
> storage local to an individual task, racing within this one task's
> specific storage is a non issue?
>

Race is still possible. In the use case I was thinking, the workflow is:

1. Current task mmaps a local storage map, writes a value to its local storage.
2. Current task then makes a syscall, which triggers a bpf prog.
3. The bpf prog reads the current task's local storage and fetches the
value stored by the task.

The steps above are sequential, therefore no race between task and bpf
prog. If a task accesses other task's local storage, there is still
race.

> The patch 2 example is doable with the current api and is pretty far from
> the above use case description. The existing bpf_map_update_elem() and
> bpf_map_lookup_elem() can not solve your use case?
>

With sysctl_unprivileged_bpf_disabled, tasks without CAP_BPF are not
able to make use of __sys_bpf(). bpf_map_update_elem() and
bpf_map_lookup_elem() call __sys_bpf underneath IIRC. So unpriv tasks
can't use these two syscalls.

> or the current bpf_map_{update,lookup}_elem() works but
> prefer a direct data read/write interface?
>
> btw, how delete is going to look like ?
>

Good question. Deletion is not done from the userspace. It's done at
bpf prog side. The task mmaps its storage (which creates the storage),
writes a value. The bpf prog reads the value and deletes the storage.

> and do you see the mmap could be used with sk and inode storage instead
> of the 'current' task?
>

Yes. I think this patch can certainly extend to sk or inode or cgroup
local storage. But I don't have a use case yet.

> > > >
> > > > Song also mentioned to me offline, that mmapable task local storage
> > > > may be useful for his use case.
> > > >
> > > > I am actually open to other proposals.
> If the arraymap is local to an individual task, does it solve
> your use case? Have you thought about storing an arraymap (which is mmap-able)
> in the local storage? That could then be used to store ringbuf map and
> other bpf maps in local storage. It is logically similar to map-in-map.
> The outer map is the pseudo local storage map here.
>

I haven't thought about this. IMHO, I feel it might be complicated to
properly set up the set of maps.


> > > >
> > >
> > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > storage for current. Data can be passed for that specific prog invocation using
> > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > are not allowed to be called right now.
> > >
> >
> > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > thinking of allowing any thread including unpriv ones to be able to
> > pass data to the prog and update their own storage.

2022-03-31 04:38:27

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> <[email protected]> wrote:
> >
> > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > >
> > > > Hi Yonghong,
> > > >
> > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > >
> > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > and ringbuf have mmap implemented.
> > > > > >
> > > > > > However, in some use cases, when multiple program instances can
> > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > of bottleneck.
> > > > >
> > > > > I can see your use case here. Each calling process can get the
> > > > > corresponding bpf program task local storage data through
> > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > between more memory vs. non-global synchronization.
> > > > >
> > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > the similar result. We could implement a bpf_iter
> > > > > for task local storage map, optionally it can provide
> > > > > a tid to retrieve the data for that particular tid.
> > > > > This way, user space needs an explicit syscall, but
> > > > > does not need to allocate more memory than necessary.
> > > > >
> > > > > WDYT?
> > > > >
> > > >
> > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > >
> > > > - mmap prevents the calling task from reading other task's value.
> > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > the second, we need to create one iter_link object for each interested
> > > > tid. It may not be easy to use either.
> > > >
> > > > - Regarding adding an explicit syscall. I thought about adding
> > > > write/read syscalls for task local storage maps, just like reading
> > > > values from iter_link. Writing or reading task local storage map
> > > > updates/reads the current task's value. I think this could achieve the
> > > > same effect as mmap.
> > > >
> > >
> > > Actually, my use case of using mmap on task local storage is to allow
> > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > need to take an FD as parameter and the bpf progs can run
> > > concurrently, thus using global data is racy. Mmapable task local
> > > storage is the best solution I can find for this purpose.
Some more details is needed about the use case. As long as there is
storage local to an individual task, racing within this one task's
specific storage is a non issue?

The patch 2 example is doable with the current api and is pretty far from
the above use case description. The existing bpf_map_update_elem() and
bpf_map_lookup_elem() can not solve your use case?

or the current bpf_map_{update,lookup}_elem() works but
prefer a direct data read/write interface?

btw, how delete is going to look like ?

and do you see the mmap could be used with sk and inode storage instead
of the 'current' task?

> > >
> > > Song also mentioned to me offline, that mmapable task local storage
> > > may be useful for his use case.
> > >
> > > I am actually open to other proposals.
If the arraymap is local to an individual task, does it solve
your use case? Have you thought about storing an arraymap (which is mmap-able)
in the local storage? That could then be used to store ringbuf map and
other bpf maps in local storage. It is logically similar to map-in-map.
The outer map is the pseudo local storage map here.

> > >
> >
> > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > storage for current. Data can be passed for that specific prog invocation using
> > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > are not allowed to be called right now.
> >
>
> The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> thinking of allowing any thread including unpriv ones to be able to
> pass data to the prog and update their own storage.

2022-03-31 04:49:35

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <[email protected]> wrote:
>
> On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > > >
> > > > > > Hi Yonghong,
> > > > > >
> > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > > >
> > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > and ringbuf have mmap implemented.
> > > > > > > >
> > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > of bottleneck.
> > > > > > >
> > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > corresponding bpf program task local storage data through
> > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > between more memory vs. non-global synchronization.
> > > > > > >
> > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > for task local storage map, optionally it can provide
> > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > does not need to allocate more memory than necessary.
> > > > > > >
> > > > > > > WDYT?
> > > > > > >
> > > > > >
> > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > >
> > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > the second, we need to create one iter_link object for each interested
> > > > > > tid. It may not be easy to use either.
> > > > > >
> > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > same effect as mmap.
> > > > > >
> > > > >
> > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > need to take an FD as parameter and the bpf progs can run
> > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > storage is the best solution I can find for this purpose.
> > > > >
> > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > may be useful for his use case.
> > > > >
> > > > > I am actually open to other proposals.
> > > > >
> > > >
> > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > storage for current. Data can be passed for that specific prog invocation using
> > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > are not allowed to be called right now.
> > > >
> > >
> > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > thinking of allowing any thread including unpriv ones to be able to
> > > pass data to the prog and update their own storage.
> >
> > If I understand the use case correctly all of this mmap-ing is only to
> > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > But the map can be accessed as unpriv already.
> > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
>
> Right, but, if I understand correctly, with
> sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> use of __sys_bpf(). Is there anything I missed?

That sysctl is a heavy hammer. Let's fix it instead.
map lookup/update/delete can be allowed for unpriv for certain map types.
There are permissions checks in corresponding lookup/update calls already.

2022-04-01 17:40:23

by KP Singh

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Wed, Mar 30, 2022 at 8:26 PM Hao Luo <[email protected]> wrote:
>
> On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <[email protected]> wrote:
> > >
> > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi Yonghong,
> > > > > > > >
> > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > >
> > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > of bottleneck.
> > > > > > > > >
> > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > >
> > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > >
> > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > tid. It may not be easy to use either.
> > > > > > > >
> > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > same effect as mmap.
> > > > > > > >
> > > > > > >
> > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > storage is the best solution I can find for this purpose.
> > > > > > >
> > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > may be useful for his use case.
> > > > > > >
> > > > > > > I am actually open to other proposals.
> > > > > > >
> > > > > >
> > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > are not allowed to be called right now.
> > > > > >
> > > > >
> > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > pass data to the prog and update their own storage.
> > > >
> > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > But the map can be accessed as unpriv already.
> > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > >
> > > Right, but, if I understand correctly, with
> > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > use of __sys_bpf(). Is there anything I missed?
> >
> > That sysctl is a heavy hammer. Let's fix it instead.
> > map lookup/update/delete can be allowed for unpriv for certain map types.
> > There are permissions checks in corresponding lookup/update calls already.
>

(Adding Jann)

I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow the writes to
only maps that are explicitly marked as writable by unprivileged processes.

We will have task local storage in LSM programs that we
won't like unprivileged processes to write to as well.

struct {
__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
__type(key, int);
__type(value, struct fd_storage);
} task_fd_storage_map SEC(".maps");

- KP

> This sounds great. If we can allow basic map operations for some map
> types, it will change many use cases I'm looking at. Let me take a
> look and report back.

2022-04-02 17:12:56

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Thu, Mar 31, 2022 at 3:32 PM KP Singh <[email protected]> wrote:
>
> On Wed, Mar 30, 2022 at 8:26 PM Hao Luo <[email protected]> wrote:
> >
> > On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi Yonghong,
> > > > > > > > >
> > > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > > >
> > > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > > of bottleneck.
> > > > > > > > > >
> > > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > > >
> > > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > > >
> > > > > > > > > > WDYT?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > > >
> > > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > > tid. It may not be easy to use either.
> > > > > > > > >
> > > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > > same effect as mmap.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > > storage is the best solution I can find for this purpose.
> > > > > > > >
> > > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > > may be useful for his use case.
> > > > > > > >
> > > > > > > > I am actually open to other proposals.
> > > > > > > >
> > > > > > >
> > > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > > are not allowed to be called right now.
> > > > > > >
> > > > > >
> > > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > > pass data to the prog and update their own storage.
> > > > >
> > > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > > But the map can be accessed as unpriv already.
> > > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > > >
> > > > Right, but, if I understand correctly, with
> > > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > > use of __sys_bpf(). Is there anything I missed?
> > >
> > > That sysctl is a heavy hammer. Let's fix it instead.
> > > map lookup/update/delete can be allowed for unpriv for certain map types.
> > > There are permissions checks in corresponding lookup/update calls already.
> >
>
> (Adding Jann)
>
> I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow the writes to
> only maps that are explicitly marked as writable by unprivileged processes.

I think it's overkill for existing unpriv maps like hash and array.
These maps by themself don't pose a security threat.
The sysctl was/is in the wrong place.

> We will have task local storage in LSM programs that we
> won't like unprivileged processes to write to as well.
>
> struct {
> __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
> __type(key, int);
> __type(value, struct fd_storage);
> } task_fd_storage_map SEC(".maps");

local storage map was not exposed to unpriv before.
This would be a different consideration.
But even in such a case the extra flag looks unnecessary.

2022-04-05 03:19:50

by KP Singh

[permalink] [raw]
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.

On Fri, Apr 1, 2022 at 1:06 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 3:32 PM KP Singh <[email protected]> wrote:
> >
> > On Wed, Mar 30, 2022 at 8:26 PM Hao Luo <[email protected]> wrote:
> > >
> > > On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <[email protected]> wrote:
> > > > >
> > > > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Yonghong,
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > > > >
> > > > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > > > of bottleneck.
> > > > > > > > > > >
> > > > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > > > >
> > > > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > > > >
> > > > > > > > > > > WDYT?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > > > >
> > > > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > > > tid. It may not be easy to use either.
> > > > > > > > > >
> > > > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > > > same effect as mmap.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > > > storage is the best solution I can find for this purpose.
> > > > > > > > >
> > > > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > > > may be useful for his use case.
> > > > > > > > >
> > > > > > > > > I am actually open to other proposals.
> > > > > > > > >
> > > > > > > >
> > > > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > > > are not allowed to be called right now.
> > > > > > > >
> > > > > > >
> > > > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > > > pass data to the prog and update their own storage.
> > > > > >
> > > > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > > > But the map can be accessed as unpriv already.
> > > > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > > > >
> > > > > Right, but, if I understand correctly, with
> > > > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > > > use of __sys_bpf(). Is there anything I missed?
> > > >
> > > > That sysctl is a heavy hammer. Let's fix it instead.
> > > > map lookup/update/delete can be allowed for unpriv for certain map types.
> > > > There are permissions checks in corresponding lookup/update calls already.
> > >
> >
> > (Adding Jann)
> >
> > I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow the writes to
> > only maps that are explicitly marked as writable by unprivileged processes.
>
> I think it's overkill for existing unpriv maps like hash and array.
> These maps by themself don't pose a security threat.
> The sysctl was/is in the wrong place.
>
> > We will have task local storage in LSM programs that we
> > won't like unprivileged processes to write to as well.
> >
> > struct {
> > __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
> > __type(key, int);
> > __type(value, struct fd_storage);
> > } task_fd_storage_map SEC(".maps");
>
> local storage map was not exposed to unpriv before.
> This would be a different consideration.
> But even in such a case the extra flag looks unnecessary.

I took a look at the code and it makes sense to allow maps like hash and array
which are already mmapable. These should not really need the BPF_F_UNPRIVILEGED.

Hao, I think you can send a patch that removes these map operations
from the scope of
the sysctl. Regarding the local storages, let's do them separately
since it would be a newer
access surface.