On Wed, Sep 8, 2021 at 5:16 PM Prakash Sangappa
<[email protected]> wrote:
>
> Including liunx-kernel..
>
> Resending RFC. This patchset is not final. I am looking for feedback on
> this proposal to share thread specific data for us in latency sensitive
> codepath.
Hi Prakash,
I'd like to add here that Jann and I have been discussing a similar
feature for my UMCG patchset:
https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
In short, due to the need to read/write to the userspace from
non-sleepable contexts in the kernel it seems that we need to have some
form of per task/thread kernel/userspace shared memory that is pinned,
similar to what your sys_task_getshared does.
Do you think your sys_task_getshared can be tweaked to return an
arbitrarily-sized block of memory (subject to overall constraints)
rather than a fixed number of "options"?
On a more general note, we have a kernel extension internally at
Google, named "kuchannel", that is similar to what you propose here:
per task/thread shared memory with counters and other stat fields that
the kernel populates and the userspace reads (and some additional
functionality that is not too relevant to the discussion).
Thanks,
Peter
[...]
> On Sep 10, 2021, at 8:18 AM, Peter Oskolkov <[email protected]> wrote:
>
> On Wed, Sep 8, 2021 at 5:16 PM Prakash Sangappa
> <[email protected]> wrote:
>>
>> Including liunx-kernel..
>>
>> Resending RFC. This patchset is not final. I am looking for feedback on
>> this proposal to share thread specific data for us in latency sensitive
>> codepath.
>
> Hi Prakash,
>
> I'd like to add here that Jann and I have been discussing a similar
> feature for my UMCG patchset:
>
> https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
Hi Peter,
I will take a look.
>
> In short, due to the need to read/write to the userspace from
> non-sleepable contexts in the kernel it seems that we need to have some
> form of per task/thread kernel/userspace shared memory that is pinned,
> similar to what your sys_task_getshared does.
Exactly. For this reason wanted kernel to allocate the pinned memory.
Didn’t want to deal with files etc as a large number threads will be using
the shared structure mechanism.
>
> Do you think your sys_task_getshared can be tweaked to return an
> arbitrarily-sized block of memory (subject to overall constraints)
> rather than a fixed number of "options"?
I suppose it could. How big of a size? We don’t want to hold on to
arbitrarily large amount of pinned memory. The preference would
be for the kernel to decide what is going to be shared based on
what functionality/data sharing is supported. In that sense the size
is pre defined not something the userspace/application can ask.
I have not looked at your use case.
>
> On a more general note, we have a kernel extension internally at
> Google, named "kuchannel", that is similar to what you propose here:
> per task/thread shared memory with counters and other stat fields that
> the kernel populates and the userspace reads (and some additional
> functionality that is not too relevant to the discussion).
We have few other use cases for this we are looking at, which I can
describe later.
-Prakash
>
> Thanks,
> Peter
>
> [...]
On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
<[email protected]> wrote:
>
>
>
> > On Sep 10, 2021, at 8:18 AM, Peter Oskolkov <[email protected]> wrote:
> >
> > On Wed, Sep 8, 2021 at 5:16 PM Prakash Sangappa
> > <[email protected]> wrote:
> >>
> >> Including liunx-kernel..
> >>
> >> Resending RFC. This patchset is not final. I am looking for feedback on
> >> this proposal to share thread specific data for us in latency sensitive
> >> codepath.
> >
> > Hi Prakash,
>
>
> >
> > I'd like to add here that Jann and I have been discussing a similar
> > feature for my UMCG patchset:
> >
> > https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
>
> Hi Peter,
>
> I will take a look.
>
> >
> > In short, due to the need to read/write to the userspace from
> > non-sleepable contexts in the kernel it seems that we need to have some
> > form of per task/thread kernel/userspace shared memory that is pinned,
> > similar to what your sys_task_getshared does.
>
> Exactly. For this reason wanted kernel to allocate the pinned memory.
> Didn’t want to deal with files etc as a large number threads will be using
> the shared structure mechanism.
>
> >
> > Do you think your sys_task_getshared can be tweaked to return an
> > arbitrarily-sized block of memory (subject to overall constraints)
> > rather than a fixed number of "options"?
>
> I suppose it could. How big of a size? We don’t want to hold on to
> arbitrarily large amount of pinned memory. The preference would
> be for the kernel to decide what is going to be shared based on
> what functionality/data sharing is supported. In that sense the size
> is pre defined not something the userspace/application can ask.
There could be a sysctl or some other mechanism that limits the amount
of memory pinned per mm (or per task). Having "options" hardcoded for
such a generally useful feature seems limiting...
>
> I have not looked at your use case.
>
> >
> > On a more general note, we have a kernel extension internally at
> > Google, named "kuchannel", that is similar to what you propose here:
> > per task/thread shared memory with counters and other stat fields that
> > the kernel populates and the userspace reads (and some additional
> > functionality that is not too relevant to the discussion).
>
> We have few other use cases for this we are looking at, which I can
> describe later.
>
> -Prakash
>
> >
> > Thanks,
> > Peter
> >
> > [...]
>
* Peter Oskolkov:
> In short, due to the need to read/write to the userspace from
> non-sleepable contexts in the kernel it seems that we need to have some
> form of per task/thread kernel/userspace shared memory that is pinned,
> similar to what your sys_task_getshared does.
In glibc, we'd also like to have this for PID and TID. Eventually,
rt_sigprocmask without kernel roundtrip in most cases would be very nice
as well. For performance and simplicity in userspace, it would be best
if the memory region could be at the same offset from the TCB for all
threads.
For KTLS, the idea was that the auxiliary vector would contain size and
alignment of the KTLS. Userspace would reserve that memory, register it
with the kernel like rseq (or the robust list pointers), and pass its
address to the vDSO functions that need them. The last part ensures
that the vDSO functions do not need non-global data to determine the
offset from the TCB. Registration is still needed for the caches.
I think previous discussions (in the KTLS and rseq context) did not have
the pinning constraint.
Thanks,
Florian
----- On Sep 10, 2021, at 12:37 PM, Florian Weimer [email protected] wrote:
> * Peter Oskolkov:
>
>> In short, due to the need to read/write to the userspace from
>> non-sleepable contexts in the kernel it seems that we need to have some
>> form of per task/thread kernel/userspace shared memory that is pinned,
>> similar to what your sys_task_getshared does.
>
> In glibc, we'd also like to have this for PID and TID. Eventually,
> rt_sigprocmask without kernel roundtrip in most cases would be very nice
> as well. For performance and simplicity in userspace, it would be best
> if the memory region could be at the same offset from the TCB for all
> threads.
>
> For KTLS, the idea was that the auxiliary vector would contain size and
> alignment of the KTLS. Userspace would reserve that memory, register it
> with the kernel like rseq (or the robust list pointers), and pass its
> address to the vDSO functions that need them. The last part ensures
> that the vDSO functions do not need non-global data to determine the
> offset from the TCB. Registration is still needed for the caches.
>
> I think previous discussions (in the KTLS and rseq context) did not have
> the pinning constraint.
If this data is per-thread, and read from user-space, why is it relevant
to update this data from non-sleepable kernel context rather than update it as
needed on return-to-userspace ? When returning to userspace, sleeping due to a
page fault is entirely acceptable. This is what we currently do for rseq.
In short, the data could be accessible from the task struct. Flags in the
task struct can let return-to-userspace know that it has outdated ktls
data. So before returning to userspace, the kernel can copy the relevant data
from the task struct to the shared memory area, without requiring any pinning.
What am I missing ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Fri, Sep 10, 2021 at 10:33 AM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Sep 10, 2021, at 12:37 PM, Florian Weimer [email protected] wrote:
>
> > * Peter Oskolkov:
> >
> >> In short, due to the need to read/write to the userspace from
> >> non-sleepable contexts in the kernel it seems that we need to have some
> >> form of per task/thread kernel/userspace shared memory that is pinned,
> >> similar to what your sys_task_getshared does.
> >
> > In glibc, we'd also like to have this for PID and TID. Eventually,
> > rt_sigprocmask without kernel roundtrip in most cases would be very nice
> > as well. For performance and simplicity in userspace, it would be best
> > if the memory region could be at the same offset from the TCB for all
> > threads.
> >
> > For KTLS, the idea was that the auxiliary vector would contain size and
> > alignment of the KTLS. Userspace would reserve that memory, register it
> > with the kernel like rseq (or the robust list pointers), and pass its
> > address to the vDSO functions that need them. The last part ensures
> > that the vDSO functions do not need non-global data to determine the
> > offset from the TCB. Registration is still needed for the caches.
> >
> > I think previous discussions (in the KTLS and rseq context) did not have
> > the pinning constraint.
>
> If this data is per-thread, and read from user-space, why is it relevant
> to update this data from non-sleepable kernel context rather than update it as
> needed on return-to-userspace ? When returning to userspace, sleeping due to a
> page fault is entirely acceptable. This is what we currently do for rseq.
>
> In short, the data could be accessible from the task struct. Flags in the
> task struct can let return-to-userspace know that it has outdated ktls
> data. So before returning to userspace, the kernel can copy the relevant data
> from the task struct to the shared memory area, without requiring any pinning.
>
> What am I missing ?
I can't speak about other use cases, but in the context of userspace
scheduling, the information that a task has blocked in the kernel and
is going to be removed from its runqueue cannot wait to be delivered
to the userspace until the task wakes up, as the userspace scheduler
needs to know of the even when it happened so that it can schedule
another task in place of the blocked one. See the discussion here:
https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
>
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
----- On Sep 10, 2021, at 1:48 PM, Peter Oskolkov [email protected] wrote:
> On Fri, Sep 10, 2021 at 10:33 AM Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> ----- On Sep 10, 2021, at 12:37 PM, Florian Weimer [email protected] wrote:
>>
>> > * Peter Oskolkov:
>> >
>> >> In short, due to the need to read/write to the userspace from
>> >> non-sleepable contexts in the kernel it seems that we need to have some
>> >> form of per task/thread kernel/userspace shared memory that is pinned,
>> >> similar to what your sys_task_getshared does.
>> >
>> > In glibc, we'd also like to have this for PID and TID. Eventually,
>> > rt_sigprocmask without kernel roundtrip in most cases would be very nice
>> > as well. For performance and simplicity in userspace, it would be best
>> > if the memory region could be at the same offset from the TCB for all
>> > threads.
>> >
>> > For KTLS, the idea was that the auxiliary vector would contain size and
>> > alignment of the KTLS. Userspace would reserve that memory, register it
>> > with the kernel like rseq (or the robust list pointers), and pass its
>> > address to the vDSO functions that need them. The last part ensures
>> > that the vDSO functions do not need non-global data to determine the
>> > offset from the TCB. Registration is still needed for the caches.
>> >
>> > I think previous discussions (in the KTLS and rseq context) did not have
>> > the pinning constraint.
>>
>> If this data is per-thread, and read from user-space, why is it relevant
>> to update this data from non-sleepable kernel context rather than update it as
>> needed on return-to-userspace ? When returning to userspace, sleeping due to a
>> page fault is entirely acceptable. This is what we currently do for rseq.
>>
>> In short, the data could be accessible from the task struct. Flags in the
>> task struct can let return-to-userspace know that it has outdated ktls
>> data. So before returning to userspace, the kernel can copy the relevant data
>> from the task struct to the shared memory area, without requiring any pinning.
>>
>> What am I missing ?
>
> I can't speak about other use cases, but in the context of userspace
> scheduling, the information that a task has blocked in the kernel and
> is going to be removed from its runqueue cannot wait to be delivered
> to the userspace until the task wakes up, as the userspace scheduler
> needs to know of the even when it happened so that it can schedule
> another task in place of the blocked one. See the discussion here:
>
> https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
OK, just to confirm my understanding, so the use-case here is per-thread
state which can be read by other threads (in this case the userspace scheduler) ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Fri, Sep 10, 2021 at 10:55 AM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Sep 10, 2021, at 1:48 PM, Peter Oskolkov [email protected] wrote:
>
> > On Fri, Sep 10, 2021 at 10:33 AM Mathieu Desnoyers
> > <[email protected]> wrote:
> >>
> >> ----- On Sep 10, 2021, at 12:37 PM, Florian Weimer [email protected] wrote:
> >>
> >> > * Peter Oskolkov:
> >> >
> >> >> In short, due to the need to read/write to the userspace from
> >> >> non-sleepable contexts in the kernel it seems that we need to have some
> >> >> form of per task/thread kernel/userspace shared memory that is pinned,
> >> >> similar to what your sys_task_getshared does.
> >> >
> >> > In glibc, we'd also like to have this for PID and TID. Eventually,
> >> > rt_sigprocmask without kernel roundtrip in most cases would be very nice
> >> > as well. For performance and simplicity in userspace, it would be best
> >> > if the memory region could be at the same offset from the TCB for all
> >> > threads.
> >> >
> >> > For KTLS, the idea was that the auxiliary vector would contain size and
> >> > alignment of the KTLS. Userspace would reserve that memory, register it
> >> > with the kernel like rseq (or the robust list pointers), and pass its
> >> > address to the vDSO functions that need them. The last part ensures
> >> > that the vDSO functions do not need non-global data to determine the
> >> > offset from the TCB. Registration is still needed for the caches.
> >> >
> >> > I think previous discussions (in the KTLS and rseq context) did not have
> >> > the pinning constraint.
> >>
> >> If this data is per-thread, and read from user-space, why is it relevant
> >> to update this data from non-sleepable kernel context rather than update it as
> >> needed on return-to-userspace ? When returning to userspace, sleeping due to a
> >> page fault is entirely acceptable. This is what we currently do for rseq.
> >>
> >> In short, the data could be accessible from the task struct. Flags in the
> >> task struct can let return-to-userspace know that it has outdated ktls
> >> data. So before returning to userspace, the kernel can copy the relevant data
> >> from the task struct to the shared memory area, without requiring any pinning.
> >>
> >> What am I missing ?
> >
> > I can't speak about other use cases, but in the context of userspace
> > scheduling, the information that a task has blocked in the kernel and
> > is going to be removed from its runqueue cannot wait to be delivered
> > to the userspace until the task wakes up, as the userspace scheduler
> > needs to know of the even when it happened so that it can schedule
> > another task in place of the blocked one. See the discussion here:
> >
> > https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
>
> OK, just to confirm my understanding, so the use-case here is per-thread
> state which can be read by other threads (in this case the userspace scheduler) ?
Yes, exactly! And sometimes these other threads have to read/write the
state while they are themselves in preempt_disabled regions in the
kernel. There could be a way to do that asynchronously (e.g. via
workpools), but this will add latency and complexity.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
On Fri, Sep 10, 2021 at 6:28 PM Peter Oskolkov <[email protected]> wrote:
> On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
> <[email protected]> wrote:
> > > Do you think your sys_task_getshared can be tweaked to return an
> > > arbitrarily-sized block of memory (subject to overall constraints)
> > > rather than a fixed number of "options"?
> >
> > I suppose it could. How big of a size? We don’t want to hold on to
> > arbitrarily large amount of pinned memory. The preference would
> > be for the kernel to decide what is going to be shared based on
> > what functionality/data sharing is supported. In that sense the size
> > is pre defined not something the userspace/application can ask.
>
> There could be a sysctl or some other mechanism that limits the amount
> of memory pinned per mm (or per task). Having "options" hardcoded for
> such a generally useful feature seems limiting...
That seems like it'll just create trouble a few years down the line
when the arbitrarily-chosen limit that nobody is monitoring blows up
in someone's production environment.
If this area is used for specific per-thread items, then the kernel
should be able to enforce that you only allocate as much space as is
needed for all threads of the process (based on the maximum number
that have ever been running in parallel in the process), right? Which
would probably work best if the kernel managed those allocations.
On Fri, Sep 10, 2021 at 12:12 PM Jann Horn <[email protected]> wrote:
>
> On Fri, Sep 10, 2021 at 6:28 PM Peter Oskolkov <[email protected]> wrote:
> > On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
> > <[email protected]> wrote:
> > > > Do you think your sys_task_getshared can be tweaked to return an
> > > > arbitrarily-sized block of memory (subject to overall constraints)
> > > > rather than a fixed number of "options"?
> > >
> > > I suppose it could. How big of a size? We don’t want to hold on to
> > > arbitrarily large amount of pinned memory. The preference would
> > > be for the kernel to decide what is going to be shared based on
> > > what functionality/data sharing is supported. In that sense the size
> > > is pre defined not something the userspace/application can ask.
> >
> > There could be a sysctl or some other mechanism that limits the amount
> > of memory pinned per mm (or per task). Having "options" hardcoded for
> > such a generally useful feature seems limiting...
>
> That seems like it'll just create trouble a few years down the line
> when the arbitrarily-chosen limit that nobody is monitoring blows up
> in someone's production environment.
>
> If this area is used for specific per-thread items, then the kernel
> should be able to enforce that you only allocate as much space as is
> needed for all threads of the process (based on the maximum number
> that have ever been running in parallel in the process), right? Which
> would probably work best if the kernel managed those allocations.
This sounds, again, as if the kernel should be aware of the kind of
items being allocated; having a more generic mechanism of allocating
pinned memory for the userspace to use at its discretion would be more
generally useful, I think. But how then the kernel/system should be
protected from a buggy or malicious process trying to grab too much?
One option would be to have a generic in-kernel mechanism for this,
but expose it to the userspace via domain-specific syscalls that do
the accounting you hint at. This sounds a bit like an over-engineered
solution, though...
> On Sep 10, 2021, at 12:36 PM, Peter Oskolkov <[email protected]> wrote:
>
> On Fri, Sep 10, 2021 at 12:12 PM Jann Horn <[email protected]> wrote:
>>
>> On Fri, Sep 10, 2021 at 6:28 PM Peter Oskolkov <[email protected]> wrote:
>>> On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
>>> <[email protected]> wrote:
>>>>> Do you think your sys_task_getshared can be tweaked to return an
>>>>> arbitrarily-sized block of memory (subject to overall constraints)
>>>>> rather than a fixed number of "options"?
>>>>
>>>> I suppose it could. How big of a size? We don’t want to hold on to
>>>> arbitrarily large amount of pinned memory. The preference would
>>>> be for the kernel to decide what is going to be shared based on
>>>> what functionality/data sharing is supported. In that sense the size
>>>> is pre defined not something the userspace/application can ask.
>>>
>>> There could be a sysctl or some other mechanism that limits the amount
>>> of memory pinned per mm (or per task). Having "options" hardcoded for
>>> such a generally useful feature seems limiting...
>>
>> That seems like it'll just create trouble a few years down the line
>> when the arbitrarily-chosen limit that nobody is monitoring blows up
>> in someone's production environment.
>>
>> If this area is used for specific per-thread items, then the kernel
>> should be able to enforce that you only allocate as much space as is
>> needed for all threads of the process (based on the maximum number
>> that have ever been running in parallel in the process), right? Which
>> would probably work best if the kernel managed those allocations.
>
> This sounds, again, as if the kernel should be aware of the kind of
> items being allocated; having a more generic mechanism of allocating
> pinned memory for the userspace to use at its discretion would be more
> generally useful, I think. But how then the kernel/system should be
> protected from a buggy or malicious process trying to grab too much?
>
> One option would be to have a generic in-kernel mechanism for this,
> but expose it to the userspace via domain-specific syscalls that do
> the accounting you hint at. This sounds a bit like an over-engineered
> solution, though…
What will this pinned memory be used for in your use case,
can you explain?
On Mon, Sep 13, 2021 at 10:36 AM Prakash Sangappa
<[email protected]> wrote:
[...]
> > This sounds, again, as if the kernel should be aware of the kind of
> > items being allocated; having a more generic mechanism of allocating
> > pinned memory for the userspace to use at its discretion would be more
> > generally useful, I think. But how then the kernel/system should be
> > protected from a buggy or malicious process trying to grab too much?
> >
> > One option would be to have a generic in-kernel mechanism for this,
> > but expose it to the userspace via domain-specific syscalls that do
> > the accounting you hint at. This sounds a bit like an over-engineered
> > solution, though…
>
>
> What will this pinned memory be used for in your use case,
> can you explain?
For userspace scheduling, to share thread/task state information
between the kernel and the userspace. This memory will be allocated
per task/thread; both the kernel and the userspace will write to the
shared memory, and these reads/writes will happen not only in the
memory regions belonging to the "current" task/thread, but also to
remote tasks/threads.
Somewhat detailed doc/rst is here:
https://lore.kernel.org/lkml/[email protected]/
> On Sep 13, 2021, at 11:00 AM, Peter Oskolkov <[email protected]> wrote:
>
> On Mon, Sep 13, 2021 at 10:36 AM Prakash Sangappa
> <[email protected]> wrote:
>
> [...]
>
>>> This sounds, again, as if the kernel should be aware of the kind of
>>> items being allocated; having a more generic mechanism of allocating
>>> pinned memory for the userspace to use at its discretion would be more
>>> generally useful, I think. But how then the kernel/system should be
>>> protected from a buggy or malicious process trying to grab too much?
>>>
>>> One option would be to have a generic in-kernel mechanism for this,
>>> but expose it to the userspace via domain-specific syscalls that do
>>> the accounting you hint at. This sounds a bit like an over-engineered
>>> solution, though…
>>
>>
>> What will this pinned memory be used for in your use case,
>> can you explain?
>
> For userspace scheduling, to share thread/task state information
> between the kernel and the userspace. This memory will be allocated
> per task/thread; both the kernel and the userspace will write to the
> shared memory, and these reads/writes will happen not only in the
> memory regions belonging to the "current" task/thread, but also to
> remote tasks/threads.
>
> Somewhat detailed doc/rst is here:
> https://lore.kernel.org/lkml/[email protected]/
(Resending reply)
From what I could glean from the link above, looks like you will need the
entire 'struct umcg_task’(which is 24 bytes in size) in the per thread shared
mapped space(pinned memory?) Accessed/updated both in user space
and kernel. Appears the state transitions here are specific to umcg. So,
may not be usable in other use cases that are interested in just checking
if a thread is executing on cpu or blocked.
We have a requirement to share thread state as well(on or off cpu) in the
shared structure, which also will be accessed by other threads in the user
space. Kernel updates the state when the thread blocks or resumes execution.
Need to see if may be the task state you have could be repurposed when
not used by umcg threads.
Regarding use of pinned memory, it is not arbitrary amount per thread then
right? Basically you need 24 bytes per thread. The proposed task_getshared()
allocates pinned memory pages to accommodate requests from as many
threads in a process that need to use the shared structure
(padded to 128 bytes). The amount of memory/pages consumed will be
bound by the number threads a process can create. As I mentioned in the
cover letter multiple shared structures are fit/allocated from a page.