by Thomas Gleixner

[permalink] [raw]

On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
> > User Interrupts Introduction
> > ============================
> >
> > User Interrupts (Uintr) is a hardware technology that enables delivering
> > interrupts directly to user space.
> >
> > Today, virtually all communication across privilege boundaries happens by going
> > through the kernel. These include signals, pipes, remote procedure calls and
> > hardware interrupt based notifications. User interrupts provide the foundation
> > for more efficient (low latency and low CPU utilization) versions of these
> > common operations by avoiding transitions through the kernel.
> >
>
> ...
>
> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:
>
> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI? Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them? I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

I was wondering the same thing. One thing came to mind:

An application that wants to be *interrupted* from what it's doing
rather than waiting until the next polling point. For example,
applications that are CPU-intensive and have green threads. I can't name
a real application like this though :P.

Stefan

Attachments:

(No filename) (1.56 kB)
signature.asc (499.00 B)
Download all attachments

2021-09-30 19:07:12

by Andy Lutomirski

[permalink] [raw]

Subject: Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
> On 9/28/2021 8:30 PM, Andy Lutomirski wrote:
>> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
>>> Add a new system call to allow applications to block in the kernel and
>>> wait for user interrupts.
>>>
>> ...
>>
>>> When the application makes this syscall the notification vector is
>>> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
>>> interrupt which is then used to wake up the process.
>> Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt. Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic. Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.
>>
>> Am I missing something?
>
>
> The current kernel implementation reserves 2 notification vectors (NV)
> for the 2 states of a thread (running vs blocked).
>
> NV-1 – used only for tasks that are running. (results in a user
> interrupt or a spurious kernel interrupt)
>
> NV-2 – used only for a tasks that are blocked in the kernel. (always
> results in a kernel interrupt)
>
> The UPID.UINV bits are switched between NV-1 and NV-2 based on the state
> of the task.

Aha, cute. So NV-1 is only sent if the target is directly paying attention and, assuming all the atomics are done right, NV-2 will be sent for tasks that are asleep.

Logically, I think these are the possible states for a receiving task:

1. Running. SENDUIPI will actually deliver the event directly (or not if uintr is masked). If the task just stopped running and the atomics are right, then the schedule-out code can, I think, notice.

2. Not running, but either runnable or not currently waiting for uintr (e.g. blocked in an unrelated syscall). This is straightforward -- no IPI or other action is needed other than setting the uintr-pending bit.

3. Blocked and waiting for uintr. For this to work right, anyone trying to send with SENDUIPI (or maybe a vdso or similar clever wrapper around it) needs to result in either a fault or an IPI so the kernel can process the wakeup.

(Note that, depending on how fancy we get with file descriptors and polling, we need to watch out for the running-and-also-waiting-for-kernel-notification state. That one will never work right.)

3 is the nasty case, and your patch makes it work with this NV-2 trick. The trick is a bit gross for a couple reasons. First, it conveys no useful information to the kernel except that an unknown task did SENDUIPI and maybe that the target was most recently on a given CPU. So a big list search is needed. Also, it hits an essentially arbitrary and possibly completely innocent victim CPU and task, and people doing any sort of task isolation workload will strongly dislike this. For some of those users, "strongly" may mean "treat system as completely failed, fail over to something else and call expensive tech support." So we can't do that.

I think we have three choices:

Use a fancy wrapper around SENDUIPI. This is probably a bad idea.

Treat the NV-2 as a real interrupt and honor affinity settings. This will be annoying and slow, I think, if it's even workable at all.

Handle this case with faults instead of interrupts. We could set a reserved bit in UPID so that SENDUIPI results in #GP, decode it, and process it. This puts the onus on the actual task causing trouble, which is nice, and it lets us find the UPID and target directly instead of walking all of them. I don't know how well it would play with hypothetical future hardware-initiated uintrs, though.

2021-09-30 19:57:56

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

On Thu, Sep 30 2021 at 11:08, Andy Lutomirski wrote:
> On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
> I think we have three choices:
>
> Use a fancy wrapper around SENDUIPI. This is probably a bad idea.
>
> Treat the NV-2 as a real interrupt and honor affinity settings. This
> will be annoying and slow, I think, if it's even workable at all.

We can make it a real interrupt in form of a per CPU interrupt, but
affinity settings are not really feasible because the affinity is in the
UPID.ndst field. So, yes we can target it to some CPU, but that's racy.

> Handle this case with faults instead of interrupts. We could set a
> reserved bit in UPID so that SENDUIPI results in #GP, decode it, and
> process it. This puts the onus on the actual task causing trouble,
> which is nice, and it lets us find the UPID and target directly
> instead of walking all of them. I don't know how well it would play
> with hypothetical future hardware-initiated uintrs, though.

I thought about that as well and dismissed it due to the hardware
initiated ones but thinking more about it, those need some translation
unit (e.g. irq remapping) anyway, so it might be doable to catch those
as well. So we could just ignore them for now and go for the #GP trick
and deal with the device initiated ones later when they come around :)

But even with that we still need to keep track of the armed ones per CPU
so we can handle CPU hotunplug correctly. Sigh...

Thanks,

tglx

2021-09-30 21:47:40

Hi!

> Instructions
> ------------
> senduipi <index> - send a user IPI to a target task based on the UITT index.
>
> clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
>
> stui - Unmask user interrupts by setting UIF.
>
> testui - Test current value of UIF.
>
> uiret - return from a user interrupt handler.

Are other CPU vendors allowed to implement compatible instructions?

If not, we should probably have VDSO entries so kernel can abstract
differences between CPUs.

> Untrusted processes
> -------------------
> The current implementation expects only trusted and cooperating processes to
> communicate using user interrupts. Coordination is expected between processes
> for a connection teardown. In situations where coordination doesn't happen
> (say, due to abrupt process exit), the kernel would end up keeping shared
> resources (like UPID) allocated to avoid faults.

Keeping resources allocated after process exit is a no-no.

Best regards,
Pavel
--
http://www.livejournal.com/~pavelmachek

Attachments:

(No filename) (1.04 kB)
signature.asc (188.00 B)
Digital signature Download all attachments

2021-10-01 15:18:16

by Andy Lutomirski

[permalink] [raw]

Subject: Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>
>> Now that I read the docs some more, I'm seriously concerned about this
>> XSAVE design. XSAVES with UINTR is destructive -- it clears UINV. If
>> we actually use this, then the whole last_cpu "preserve the state in
>> registers" optimization goes out the window. So does anything that
>> happens to assume that merely saving the state doesn't destroy it on
>> respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
>> makes me nervous and would need a serious audit of our XRSTORS paths.
>
> I have no idea what you are fantasizing about. You can XRSTORS five
> times in a row as long as your XSTATE memory image is correct.

I'm just reading TFM, which is some kind of dystopian fantasy.

11.8.2.4 XRSTORS

Before restoring the user-interrupt state component, XRSTORS verifies that UINV is 0. If it is not, XRSTORS
causes a general-protection fault (#GP) before loading any part of the user-interrupt state component. (UINV
is IA32_UINTR_MISC[39:32]; XRSTORS does not check the contents of the remainder of that MSR.)

So if UINV is set in the memory image and you XRSTORS five times in a row, the first one will work assuming UINV was zero. The second one will #GP. And:

11.8.2.3 XSAVES
After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
XSAVES does not modify the remainder of that MSR.)

So if we're running a UPID-enabled user task and we switch to a kernel thread, we do XSAVES and UINV is cleared. Then we switch back to the same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC) and UINV is still clear.

And we had better clear UINV when running a kernel thread because the UPID might get freed or the kernel thread might do some CPL3 shenanigans (via EFI, perhaps? I don't know if any firmwares actually do this).

So all this seems to put UINV into the "independent" category of feature along with LBR. And the 512-byte wastes from extra copies of the legacy area and the loss of the XMODIFIED optimization will just be collateral damage.

2021-10-01 16:38:13

Hi Sohil,

Thank you for your reply and the clarification.

>>
>> We have been exploring how user-level interrupts (UIs) can be used to
>> improve performance and programmability in several different areas:
>> e.g., parallel programming, memory management, I/O, and floating-point
>> libraries.
>
> Can you please share more details on this? It would really help
> improve the API design.
>
Of course! Below we describe a few use cases for both user-level
interrupts (UIs) and user-level exceptions (UEs). We realize that the
current proposal is targeted towards UIs, but we also describe some UEs
use cases because we believe handling exceptions without going through
the kernel may provide even more of a benefit than UIs. We hope these
use cases can influence the direction of the API so that it can be made
forward compatible for future hardware revisions.

To be clear, we distinguish between interrupts (generated from an
external source, such as another core or Device) that are most likely
imprecise and asynchronous and exceptions (generated by the currently
executing program) that need to be precise and synchronous.

# UST2UST

A UI is a mechanism to allow two or more threads to communicate with one
another asynchronously without requiring the intervention of the kernel
or a change in privilege. We believe that having UIs can help integrate
the shared memory model and message passing model for multicore
processors. This integration makes it easier to build parallel programs,
allowing developers to take advantage of both models. The shared memory
model provides an easy way to share data between threads, while the
message passing model can be used for synchronization between threads.

In the following section, we will describe two use cases for UIs.
- We show how UIs can be used to improve parallel program performance by
reducing the overhead of exposing parallelism.
- We show how UIs can be used to build efficient active messages.

Both of the use cases we present below require the receiver of a UI to
know which thread issued it. At the end of the email we describe how we
would implement this using the current API and suggest an alternative,
and possibly more streamlined approach.

## Lazy Work Stealing

One of the hurdles in writing parallel programs is ensuring that the
cost of parallelizing the code does not become a bottleneck in program
performance. Some of these overheads come from unnecessarily exposing
too much parallelism, even if all cores are busy. One mechanism to
reduce this overhead is to lazily expose parallelism only when it is
needed. This can be done through stack unwinding (similar to how
Exception Handling works). Whenever a thread (thief) asks for work from
another thread (victim), the victim will perform stack unwinding,
creating the work for the thief. This approach to lazy thread creation
requires some mechanism for the thief to ask for work.

We have implemented a prototype compiler and runtime for this mechanism.
Our runtime requires a mechanism for the thief to signal the victim when
it needs work. We implement this signaling through polls because the
current IPI mechanism is too expensive to use. However, requiring the
victim to poll can introduce excessive polling overhead and/or introduce
significant latency between the request and the response. The compiler
tries to keep the overhead of polling low (<5%) while still ensuring
that the latency between a work-stealing request and its response is as
low as possible. Currently, we essentially only poll for work requests
in the function prologue, keeping the overhead to about 2% of execution
time on average. This works well for almost all applications. However,
in some applications, this can add 100 of microseconds of latency to the
response of a work-stealing request.

One reason we use polling today, instead of the victim just taking work,
is that there are points in the program where work-stealing is not
allowed. So, in addition to having an inexpensive mechanism to request
work, we need an inexpensive method to disallow the requests. IOW, the
compiler only inserts polls at points where it is safe to do so. With
the UI mechanism in the proposed API, we could signal a work-stealing
request with a UST2UST UI and disallow such requests by disabling
interrupts. One nice advantage of the current proposal is that disabling
interrupts is a *local* operation, making it very inexpensive and not
causing any interference with the rest of the threads. In other words,
an important benefit of the proposed UI mechanism is that we can ensure
atomicity (with respect to work stealing) without having to do any
global communication.

## Implementing Active Message

Active messages can efficiently support the message passing parallel
programming model. With the proposed API, the UI could signal that an AM
is being delivered while shared memory data structures could be used for
the payload. As described in the above use case, this would allow
receiving threads to provide atomicity by disabling interrupts without
any global communication.

Clearly, having a shared address space makes data access and management
easier for parallel programming. On the other hand, controlling access
to that data can often be cleaner to implement in message passing
models. Dogan et al. have shown promising improvements by using explicit
messaging hardware to accelerate Machine Learning and Graphs workloads
(see [DHAKWETAL17,DAKK19]). Explicit messaging is used as a
synchronization mechanism and has better scalability than shared
memory-based synchronization. The current proposal would support this
model integration with significantly lower overheads and lower latencies
compared to what are available on today's machines.

------------------------------------------------------------------------
# D2UST

Applications that frequently interact with external devices can benefit
from UIs. To achieve high performance, conventional IO approaches
through the kernel are not appealing as it incurs high overhead. It
requires context switching and data transfer from kernel-space to
user-space, possibly polluting the cache and TLB. One improvement to
bypass the kernel is by pinning pages to specific physical addresses,
where these pages act as a buffer between user-space and device.
However, since the device cannot directly interrupt the UST, the UST
needs to poll to check if the data from the device is available.
However, having the UST poll can easily erase any potential performance
improvements offered by bypassing the kernel in the first place.
Allowing a device to interrupt a UST under the proposed API will
eliminate the need to poll and support atomicity as required which could
significantly improve application performance.

This would be particularly useful when an application uses a GPU as an
accelerator for parallel computation and CPU for serial computation (see
[WBSACETAL08]). An example would be K-means. Finding which clusters each
point belongs to is computed in the GPU (in parallel), while computing
the mean is computed in the CPU (in serial). As this process is
iterative, there are multiple computation transitions between the CPU
and GPU. Without UIs, the only real option is to poll for GPU task
completion, complicating control flow if there is also other work for
the CPU thread to do. With UIs, keeping the GPU busy can be handled by
the UI handler. The result would be cleaner code and better load
balancing. To make this work, the D2UST interrupt will have to ensure
that the process that started the task on the GPU is the same one that
is currently running. When a different process is running, the interrupt
will have to be saved by the kernel so it can be delivered to the UST
when is is next scheduled.

------------------------------------------------------------------------

# CPU2UST

Providing a low cost user-level exception mechanism could fundamentally
change the approach to implementing many algorithms. Examples range
across many common tasks, e.g., checking for valid pointers,
preprocessing floating-point data, garbage collection, etc. Today, due
to the high cost of exception handling, programmers go to great lengths
to ensure that exceptions do not happen. Unfortunately, this leads to
more code and often less performance. Below we describe different
scenarios where UEs could potentially reduce programming effort and/or
improve performance.

## API for CPU2UST

For the examples, below we propose a small modification to the proposed
API to support exceptions. We propose that a handler be registered for a
particular fault to distinguish the exception type. Potentially, the
`flags` argument could hold the `signum`, or a bit in the `flags`
argument could indicate that a third parameter was being included with
the `signum`. We suggest including `signum` in the current API for
future use.

```
int uintr_register_handler(u64 handler_address, unsigned int flags, int
signum);
```

Since each handler is registered for a particular exception, the handler
itself would only have one argument, a pointer to the `__uintr_frame`.
In some cases, the handler might need the `error_code` information
(e.g., for a page-fault), which could be obtained using a new function,
`unsigned long long __get_ue_errorcode(void)`.

```
__attribute__ ((interrupt))
void
handlerFunction (struct __uintr_frame *ui_frame)
{
// Get error code if needed
// unsigned long long error_code = __get_ue_errorcode();
...
}
```

We envision four ways for the user handler to manipulate the thread's
state. Here we assume that a UE is handled by the thread that causes the
exception.

1. Continuing the faulting thread of control.
2. Suspending a faulting thread or continuing another thread in the same
process.
3. Deferring processing of the fault back to the kernel.
4. Or, finally, terminating the thread of control.

In case 1, where the faulting thread is continued, the handler can
simply use the uiret (It could potentially modify the return address on
the stack to change where execution continues). For case 2, we do not
have a proposed API yet, but potentially some set of functions that
extend pthreads might be appropriate. For case 3, the handler would use
a trap to signal that the kernel should continue processing the
exception. The compiler would have to restore registers appropriately
before the trap is executed.

## Binary rewriting

Binary rewriting is a valuable technique for debugging, optimizing,
repairing, emulating, and hardening (tightening security) a program
[WMUW19]. One implementation of binary rewriting is to replace the
probed instructions (instrumentation points) with a redirect instruction
(either jump or trap) to the patch instructions. Most developers use
jump instructions instead of traps due to their lower cost. However,
because instructions have variable encoding lengths, inserting jump
instructions requires care, e.g., "instruction punning" [CSDN17] with a
combination of padding and eviction [DGR20]. On the other hand, the trap
instruction is only a single byte, allowing it to replace any patched
instruction. If the trap can be made inexpensive, this would potentially
allow a simpler approach to binary rewriting without control flow recovery.

## Binary Emulation for forward/backward compatibility

Some processor families have an all encompassing ISA of which only a
subset is implemented in hardware for some instances of the family.
Applications built for the processor family either have to be recompiled
for each instance or software emulation must handle the unimplemented
instructions. If there is a UE for the illegal instruction fault, this
can potentially be made inexpensive enough to avoid recompilation.
Furthermore, it could be a way to handle legacy code and allow future
generations to avoid the older crufty instructions that are no longer
commonly used.

## Floating-Point Performance

Today, floating-point algorithms often preprocess the data in order to
avoid underflow (or overflow) exceptions. If UEs were low enough cost,
it is possible that these time consuming data preparation steps could be
removed and only run if an exception was generated. A simple example is
the calculation of the Root Mean Square (RMS) of a vector [HFT94,H96].
The common approach to calculating a vector's RMS is to scan the input
vector and then potentially scale it to avoid underflow/overflow. For
many applications, the common case is that the data does not require
rescaling. In those cases, one could calculate the RMS on the unscaled
data and only scale it if a UE was generated.

## Memory: garbage collection and watch points

User-level Page Fault exceptions (ULPF) is one essential component for
improving the performance of a wide variety of applications. For
example, in [AL20], we describe a solution that shows how ULPFs when
combined with a mechanism that allows the user a limited ability to
change a page's permissions without kernel intervention, can be used to
implement an unlimited number of efficient software watchpoints. Our
experiments were performed using GEM5, where we made changes to the MMU
and TLB. However, Intel's Memory Protection Keys for User (MPK) [MPK17]
combined with UE could also potentially do the trick.

Another example of an application that could benefit from ULPF is
Concurrent Garbage Collection. Concurrent Garbage collection allows both
the program (aka mutator) threads and the collector to run in parallel.
To implement concurrent GC, a read barrier or write barrier is often
needed (these are GC terms and should not be confused with hardware
memory barriers). These barriers ensure that the GC invariants are
maintained before a read or write operations. The write barrier prevents
the GC from reclaiming a live object that was recently accessed by the
mutator (in the case of a concurrent mark-sweep) [BDS91]. The read
barrier prevents the mutator from reading stale objects (in the case of
concurrent mark-compact) [AL91]. Both read and write barriers can be
implemented using ULPF. The programmer can use the permission bit in the
user-level page tables to cheaply turn on/off memory protection (e.g.,
inside the handler).

Belay et al. [BBMTMETAL12] has shown how to implement Boehm GC[BDS91] (a
mostly parallel mark-sweep GC used in the Mono project [Mono18] and
Objective-C [Objc15]) on their platform, Dune. Dune is a platform that
allows user-space direct access to exceptions and privileged hardware
features. The results show both speedup and slowdown, where the slowdown
is attributed to their platform's inherent overhead. On the other hand,
Click et al. [CTW05] and Tene et al. [TIW11] have built a custom system
to build a Pauseless GC. This custom system allows fast page fault
handling. The mechanism described in [AL20] could be extended to
implement a similar approach.

# References

- [AL91] Appel, Andrew W. and Li, Kai, Virtual Memory Primitives for
User Programs (1991)
- [AL20] Li, Qingyang, User Level Page Faults (2020),
http://reports-archive.adm.cs.cmu.edu/anon/2020/CMU-CS-20-124.pdf
- [BBMTMETAL12] Belay, Adam and Bittau, Andrea and Mashtizadeh, Ali and
Terei, David and Mazi\`{e}res, David and Kozyrakis, Christos, Dune: Safe
User-Level Access to Privileged CPU Features (2012)
- [BDS91] Boehm, Hans-J. and Demers, Alan J. and Shenker, Scott, Mostly
Parallel Garbage Collection (1991)
- [CSDN17] Chamith, Buddhika and Svensson, Bo Joel and Dalessandro, Luke
and Newton, Ryan R., Instruction Punning: Lightweight Instrumentation
for X86-64 (2017)
- [CTW05] Click, Cliff and Tene, Gil and Wolf, Michael, The Pauseless GC
Algorithm (2005)
- [DAKK19] Dogan, Halit and Ahmad, Masab and Kahne, Brian and Khan,
Omer, Accelerating Synchronization Using Moving Compute to Data Model at
1,000-core Multicore Scale (2019)
- [DHAKWETAL17] Dogan, Halit and Hijaz, Farrukh and Ahmad, Masab and
Kahne, Brian and Wilson, Peter and Khan, Omer, Accelerating Graph and
Machine Learning Workloads Using a Shared Memory Multicore Architecture
with Auxiliary Support for in-Hardware Explicit Messaging (2017)
- [DGR20] Duck, Gregory J. and Gao, Xiang and Roychoudhury, Abhik,
Binary Rewriting without Control Flow Recovery (2020)
- [ECGS92] von Eicken, Thorsten and Culler, David E. and Goldstein, Seth
Copen and Schauser, Klaus Erik, Active Messages: A Mechanism for
Integrated Communication and Computation (1992)
- [H96] Hauser, John R., Handling Floating-Point Exceptions in Numeric
Programs (1996)
- [HFT94] Hull, T. E. and Fairgrieve, Thomas F. and Tang, Ping-Tak
Peter, Implementing Complex Elementary Functions Using Exception
Handling (1994)
- [Mono18] https://www.mono-project.com/docs/advanced/runtime/ (2018)
- [MPK17]
https://www.kernel.org/doc/Documentation/x86/protection-keys.txt (2017)
- [Objc15]
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Garbage-Collection.html (2015)
- [TIW11] Tene, Gil and Iyengar, Balaji and Wolf, Michael, C4: The
Continuously Concurrent Compacting Collector (2011)
- [WBSACETAL08] Wong, Henry and Bracy, Anne and Schuchman, Ethan and
Aamodt, Tor M. and Collins, Jamison D. and Wang, Perry H. and Chinya,
Gautham and Groen, Ankur Khandelwal and Jiang, Hong and Wang, Hong ,
Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor (2008)
- [WMUW19] Wenzl, Matthias and Merzdovnik, Georg and Ullrich, Johanna
and Weippl, Edgar, From Hack to Elaborate Technique—A Survey on Binary
Rewriting (2019)

>> # Preparing for future use cases
>> If someone could point out an example for Kernel to
>> user-space thread (K2UST) UI, we would appreciate it.
>>
>
> The idea here is improve the kernel-to-user event notification
> latency. Theoretically, this can be useful when the kernel sees event
> completion on one cpu but it want to signal (notify) a thread actively
> running on some other CPU. The receiver thread can save some cycles by
> avoiding ring transitions to receive the event.
>
> IO_URING is one of the examples for kernel-to-user event
> notifications. We are evaluating whether providing a UINTR based
> completion mechanism can have benefit over eventfd based completions.
> The benefits in practice are yet to be measured and proven.
>
Thank you for the clarification.

- QUESTION: If the processor has D2UST capability, would this allow the
device to directly send the interrupt to the target process (the process
that initiates the I/O through io_uring) instead of the kernel?

>> In our work, we have also been exploring precise UIs from the
>> currently running thread. We call these CPU to UST (CPU2UST) UIs.
>> For example, a SIGSEGV generated by writing to a read-only page, a
>> SIGFPE generated by dividing a number by zero.
>>
>
> It is definitely possible in future to delivery CPU events as User
> Interrupts. The hardware architecture for this is still being worked
> on internally.
>
> Though our focus isn't on exceptions being delivered as User
> Interrupts. Do you have details on what type of benefit is expected?
>
Described in the use-cases we mentioned above.

>> - QUESTION: Is there is a rough draft/plan that we can refer to that
>> describes the
>> current thinking on these three cases.
>>
>> - QUESTION: Are there use cases for K2UST, or is K2UST the same as
>> CPU2UST?
>>
>
> No, K2UST isn't the same as CPU2UST. We would expect limited benefits
> from K2UST but on the other hand CPU2UST can provide significant
> speedup since it avoids the kernel completely.
>
> Unfortunately, due to the large scope of the feature, the hardware
> architecture development is happening in stages. I don't have detailed
> plans for each of the sources of User Interrupts.
>
> Here is our rough plan:
>
> 1. Provide a common infrastructure to receive User Interrupts. This is
> independent of the source of the interrupt. The intention here is to
> keep the software APIs generic and extendable so that future sources
> can be added without causing much disturbance to the older APIs.
>
> 2. Introduce various sources of User Interrupts in stages:
>
> UST2UST - This RFC. Available in the upcoming Sapphire Rapids processor.
>
> K2UST - Also available in upcoming Sapphire Rapids. Working towards
> proving the value before sending something out.
>
> D2UST - Future processor. Hardware architecture being worked on
> internally. Not much to share right now.
>
> CPU2UST - Future processor. Hardware architecture being worked on
> internally. Not much to share right now.

Thank you for the update, really appreciate it.

>
> The saving and restoring of the registers is done by gcc when the
> muintr flag along with the 'interrupt' attribute is used. Applications
> can choose to save floating point registers as part of the interrupt
> handler as well.
>
> To make it easier for applications we are working on implementing a
> thin library that can help with some of this common functionality like
> saving floating point registers or redirecting to 64 sub-handlers.

- QUESTION: Would this thin library also provide a mechanism to share
data between sender and receiver through shared memory (similar to
implementing Active message)?
- QUESTION: Is there a plan in the future to allow data to be
transmitted along with the interrupt?

>> # Multi-threaded parallel programming example
>>
>> One of the uses for UIs that we have been exploring is combining the
>> message-passing and shared memory models for parallel programming. In
>> our approach, message-passing is used for synchronization and shared
>> memory for data sharing. The message passing part of the programming
>> pattern is based loosely on Active Messages (See ISCA92), where a
>> particular thread can turn off/on interrupts to ignore incoming
>> messages so they can execute critical sections without having to
>> notify any other threads in the system.
>>
>
> This look like a good fit for the User IPI (UST2UST) implementation in
> this RFC. Have you had a chance to evaluate the current API design for
> this usage?

Our approach requires point-to-point communication to implement the
UST2UST use cases described above. From my understanding, the current
API requires (n-1)*n descriptors to enable point-to-point communication
(assuming a private UITT). Here, each receiver assigns a vector to the
UI file descriptor (uifd) and shares it with the appropriate sender.
This way, the receivers know the sender based on the vector.

Have other approaches to handling the case where the receiver needs to
know the sender's identity been explored? In particular, approaches that
do not require n^2 descriptors be created? In the context of the RFC,
one possibility we have thought about would be where the sender assigns
a vector to uifd (maybe based on its cpuid) and shares this information
to all receivers. This would possibly only require n descriptors.

> Also, is any of the above work publicly available?

Not yet. We are still working on it and hope to update you on it.

Best regards,
Chrisma and Seth