2023-04-03 14:11:51

by Lai Jiangshan

Subject: [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST

From: Lai Jiangshan <[email protected]>

0 Purpose
---------

Make the entry code handle super exceptions safely with atomic
stack switching.

Make the entry code more robust and make TDX and SEV more resistant to
hypervisor manipulation when using IST.


1 What's the problem
--------------------

Thomas Gleixner's complaint about x86's syscall/exception design:
[RFD] x86: Curing the exception and syscall trainwreck in hardware
https://lore.kernel.org/lkml/[email protected]/

In the email thread, Linus Torvalds also despised the design with a
lengthy description.

Andrew Cooper's detailed documentation about the syscall gap and the
stack switching problem:
x86 Stack Switching Issues and Improvements
https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit#
(highly recommended reading if you want to know the details of the problem)


In short, x86_64 has the syscall gap issue, so IST has to be used for
super exceptions, which exposes them to the IST recursion issues.

Fixing hardware is the only way to remove the syscall gap, while the IST
recursion issues can actually be fixed via software approaches.


2 What are the current approaches to fix the IST recursion issues
-----------------------------------------------------------------

2.1 NMI
-------

Steven Rostedt's NMI stack-switching approach (recommended reading)
The x86 NMI iret problem
https://lwn.net/Articles/484932/

Nested NMIs lead to CVE-2015-3290 (bad iret to userspace)
https://lwn.net/Articles/654418/

Linux x86_64 NMI security issues
https://lore.kernel.org/lkml/CALCETrXViSiMG79NtqN79NauDN9B2k9nOQN18496h9pJg+78+g@mail.gmail.com/

If all other super exceptions are ignored, the NMI approach is excellent,
except that it crams two stacks (the hardware entry stack, 8*12=96 bytes,
and the software handler stack, 4000 bytes) into a single 4K stack and
relies too heavily on ASM code, which is inconvenient; e.g. it uses
X86_EFLAGS_DF to avoid being misled by the syscall gap rather than using
ip_within_syscall_gap().

And the atomicity in it only covers NMI itself, not all of the super
exceptions.

2.2 MCE and DB
--------------

The approach for fixing the kernel mode #DB recursion issue is to totally
disable #DB recursion in kernel mode, which is considered to be the best
and strongest solution.

The approach for fixing the kernel mode MCE recursion issue is to just
ignore it because MCE in kernel mode is considered to be fatal. It is
an acceptable solution.

2.3 #VE
-------

The approach for fixing the kernel mode #VE recursion issue is to just
NOT use IST for #VE, although #VE is also considered to be one of the
super exceptions, which has raised some worries:
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
https://lore.kernel.org/lkml/[email protected]/

To allay these worries, SEPT_VE_DISABLE is currently forced on, which
also disables useful abilities (accept-on-demand and memory ballooning,
which are critical to lightweight VMs like Kata Containers):
https://lore.kernel.org/lkml/YCb0%2FDg28uI7TRD%[email protected]/

2.4 #VC and #HV
---------------

The approach for fixing the kernel mode #VC/#HV recursion issue is even
more complicated and gross. It dynamically changes the IST pointers in
the TSS; it keeps a linked list on the stacks to link them together; it
has code spilled into too many places. And it doesn't fix the problem
with respect to NMI and MCE (a #VC in the NMI prologue re-enables NMI on
its IRET, and a nested NMI may then corrupt the NMI prologue).
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/CALCETrWw-we3O4_upDoXJ4NzZHsBqNO69ht6nBp3y+QFhwPgKw@mail.gmail.com/

2.5 #DF
-------

The approach for fixing the kernel mode #DF recursion issue is to move it
out of the software stack switching scope, because #DF is destructively
fatal and is the last resort for all the other problems. This approach
is always the best way.

2.6 summary for current approaches
----------------------------------

The problem with these approaches is that each super exception uses a
different approach, each approach takes care of itself only, and the code
is spread all over, even into the high-level handlers. Above all, they
don't fix the problem; the system can behave unexpectedly except on
bare metal, where there is no #VE, #VC, or #HV and NMI is the only
recoverable recursive super exception. When there is more than one
recoverable recursive super exception, scattered approaches can't work.
Moreover, the current approaches are obstacles to implementing a
supervisor mode shadow stack.


3 A journey to find a safe stack-switching approach
---------------------------------------------------

3.1 the simplest way, use hardware stack switching only
-------------------------------------------------------

Use only the IST mechanism for super exceptions, but it is not
reentrance-safe:

+---------+ +---------+
+--------> | event A | +-> | event B |
| |hw entry | | |hw entry |
| +----+----+ | +----+----+
| | + | |
+ +----v----+ | | +----v----+
handle the event | prepare | | | | prepare |
on the IST stack +----+----+ +-+ +----+----+
+ + | | |
| | +----v----+ | +----v----+
| +---> |handling | | |handling |
| +---------+ + |or iret |
| +----+----+
| +---------+ |
+-------> | event A | <----------------+
| reenter | |reenable |
+----+----+ |or trigger|
| | event A |
v +----------+
reenter on the
same stack top Event B (might not IST)
!oops! occurs when Event A is
preparing or handling


It is unsafe for the handler to stick on the IST stack.

3.2 switch off the IST stack ASAP
---------------------------------

+---------+ +---------+
| event A | +-> | event B |
|hw entry | | |hw entry |
+----+----+ | +----+----+
| | |
+----v----+ | +----v----+
|prepare& | | | prepare |
switch to +-> |switch SP| | | |
a special +----+----+ | +----+----+
stack | | |
+----v----+ | +----v----+
|handling | +-+ |handling |
+---------+ |or iret |
+----+----+
+---------+ |
| event A | <----------------+
| reenter | |reenable |
+----+----+ |or trigger|
| | event A |
| +----------+
v
safe switch SP Event B (might not IST)
and handling occurs when Event A is
handling


It seems to work.
But this approach still fails in the following case:

+---------+ +---------+ Event B (also IST)
| event A | +-> | event B | occurs when Event A is
|hw entry | | |hw entry | switching stack, but
+----+----+ | +----+----+ has not finished yet
| | |
+----v----+ | +----v----+
|prepare& | | |prepare& |
Switch to +-> |switching|---+ |switch SP|
a special +----+----+
stack |
+---------+ +----V----+
| event A | <---+ |handling |
| reenter | | |or iret |
+----+----+ | +----+----+
| | |
v +------------+
Reenter on the |reenable |
same stack top |or trigger|
!oops! | event A |
+----------+

In this example, if event A (IST) is interrupted by event B while it is
preparing or switching, the stack is still corrupted.

So the switching-off needs to be atomic.

Currently, NMI and #VC are using this approach.

NMI uses the hardware-entry IST stack, 8*12=96 bytes, and the software
handler stack, 4000 bytes, connected in a single 4K page. NMI sets up
both stacks and runs the high-level handler on the 4000-byte stack in an
atomic way.

But this "atomical way" is only atomic in the view of NMI itself,
not in the view of all super exceptions. If event B (#MC, #VC)
occurs when event A (NMI) is preparing (NMI's prologue), the NMI
is reenabled on B's return, and if a nested NMI is delivered,
the stack is still corrupted.

#VC switches off the IST stack totally with more hacks, but not
really fix the problem.

So the switching-off needs to be really atomic, where "really atomic"
means atomic in the view of all super exceptions.
(#DF is not counted; it is more super than the other super exceptions.)

3.3 atomically switch off the IST stack
---------------------------------------

We need a really atomic way to switch the stacks.

This new atomic way is named atomic-IST-entry and is applied to all
IST events except #DF. For convenience, aIST denotes the events that
use the atomic-IST-entry approach, i.e. all IST events except #DF.

+-------------------------------+ +-------------------------------+
| +---------+ | | +---------+ |
| atomic-IST-entry | event A | | | | event B | atomic-IST-entry |
| |hw entry | | | |hw entry | |
| +----+----+ | | +----+----+ |
| | | | | |
| atomic switch +----v----+ | | +----v----+ Also help |
| to a special +-> | switch | | | | switch | <-+complete A's |
| stack | stack | +----> | stack | atomic entry |
| +---------+ | | +---------+ |
+-------------------------------+ +---------+---------------------+
|
+---------+ +---------+ +----v----+
|handle A <--+B return <--+handle B |
+---------+ +---------+ +---------+

atomic-IST-entry consists of the hardware entry procedure which delivers
the event on the TSS-configured IST stack and a software stack switching
procedure which switches the stack from the TSS-configured IST stack to
a special stack.

If it interrupts an incomplete atomic-IST-entry, it helps complete
what the interrupted atomic-IST-entry should have done and adjusts its
return point so that, when it returns, it returns to the point as
though the interrupted aIST had fully completed its atomic-IST-entry.

Note, doing "what the interrupted atomic-IST-entry should have done" may
do it recursively if the interrupted aIST had also interrupted yet
another aIST. Since the maximum number of nested aIST is limited, the
recursive procedure would be implemented as a loop rather than a
recursive function call.

This approach ensures that every atomic-IST-entry completes, no matter
whether it completes by itself or is completed by a nested aIST, so it
is atomic.
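
Conceptually (helper names below are illustrative, not the real API; the
real loop is ist_copy_regs_to_main_stack() in patch 3), every
atomic-IST-entry runs the same recipe: identify how many entries it has
interrupted, replay their work onto the target stack from the outmost one
inward, and commit once for all of them:

	/* Illustrative sketch of the atomic-IST-entry rule (not the real code) */
	nr = identify_interrupted_entries(regs, stacks, ctx);  /* walk back to the outmost */
	copy_outmost(target_frame(nr - 1), ctx[nr - 1].regs, gp_saved);
	for (i = nr - 2; i >= 0; i--)                          /* replay the nested ones */
		copy_nested(target_frame(i), ctx[i].regs, ctx[i + 1].commit_ip);
	commit_switch(target_frame(0));                        /* commits them all at once */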

4 atomic-IST-entry: how to implement it
---------------------------------------

4.1 Types of atomic-IST-entry
-----------------------------

outmost atomic-IST-entry
The atomic-IST-entry that interrupts any non-atomic-IST-entry
context is the outmost atomic-IST-entry.

interrupted atomic-IST-entry
The atomic-IST-entry interrupted by an IST event before
committing is an interrupted atomic-IST-entry. All
atomic-IST-entries except for the ongoing one are interrupted
atomic-IST-entries.

nested atomic-IST-entry
The atomic-IST-entry that interrupts another atomic-IST-entry is
a nested atomic-IST-entry. All atomic-IST-entries except for
the outmost are nested atomic-IST-entries.

ongoing atomic-IST-entry
The atomic-IST-entry running on the CPU before committing is
an ongoing atomic-IST-entry. If it is interrupted by an IST event
before committing, it will become an interrupted atomic-IST-entry
and the new event will be the new ongoing atomic-IST-entry.


4.2 atomic and proxy
--------------------

Atomicity is the most fundamental strategy and attribute of the stack
switching procedure for aIST entries. The software maintains it like an
atomic hardware entry, as if the architecture delivered the aIST event
on a target stack atomically.

To make an interrupted atomic-IST-entry atomic in software, the nested
atomic-IST-entry works as a proxy of the interrupted atomic-IST-entry
and does all its work. The interrupted atomic-IST-entry may also be a
nested atomic-IST-entry, so the ongoing nested atomic-IST-entry should
do all of the work of all interrupted atomic-IST-entries.

4.3 abortable, replicable and idempotent
----------------------------------------

When an atomic-IST-entry is interrupted, it is never resumed at the
point where it was interrupted, since all of its work will already have
been done when the interrupting event returns. So any point in the
atomic-IST-entry must be abortable. All of its work has to be completed
with the same result by the ongoing nested atomic-IST-entry, so any work
in an atomic-IST-entry must be replicable and idempotent.

4.4 two views of the atomic-IST-entry: logical and reality
----------------------------------------------------------

logical view +------> atomic by aborting and replicating

+-------------+ + +---------------+ +---------------+
| event A | | | event A | interrupt | event B |
| hw entry | | | hw entry | +------> | hw entry |
+------+------+ | +------+--------+ + | +------+--------+
| | | | | |
+------v------+ | +------v--------+ | | +------v--------+
|save pt_regs | | | save pt_regs | | | | save pt_regs |
+------+------+ | +------+--------+ | | +------+--------+
| | | +-+ |
| | +------v--------+ | +------v--------+
| | |for all entries| | |for all entries|
+------v------+ | | replicate | | | replicate |
|copy pt_regs | | | copy pt_regs | | | copy pt_regs |
+------+------+ | +------+--------+ | +------+--------+
| | | | |
+------v------+ | +------v--------+ + +------v--------+
| switch | | | commit switch | | commit switch |
+-------------+ | +---------------+ abort to +---------------+
+ the interrupted --------> the nested
atomic-IST-entry <-------- atomic-IST-entry
replicate (when ongoing)


4.5 copy-on-write and commit design pattern
-------------------------------------------

Generally, abortable, replicable, and idempotent programs are often
implemented via copy-on-write and commit design pattern. Copy-on-write
is usually applied for memory or persistent storage where the write
operations are performed on the copied version of the data and the
original version is kept read-only.

With regard to interrupts and exceptions, the "copy-on-write" for
registers is a little different: the registers are copied/saved to
memory and kept read-only there, and the later write operations are
performed on the bare registers, because when the interrupt or the
exception returns, the saved version is considered the original and is
restored on return.

In the entry code, the procedure has to use the basic registers to run
itself, so doing "copy-on-write" for the basic registers sounds like
something of a catch-22 or a chicken-and-egg problem.

Luckily, there are two inherent features that can be used to bootstrap
the procedure and overcome the problem. Firstly, the hardware-entry has
saved the exception head onto the IST stack, which is already a
copy-on-write for the %rflags, %rsp, and %rip registers. Secondly,
there are scratchable areas to write to without needing to save them
first. Scratchable areas are crucial for copy-on-write to do the copy.
The IST stack itself is a scratchable area to save everything and run
the procedure. The unused part of the target stack is also a scratchable
area to copy the context to before committing the stack switch.
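
A minimal sketch of how the pattern is applied here (all helper names
below are hypothetical; the real commit is the single "movq %rax, %rsp"
in the idtentry_ist macro of patch 3): every write happens either on a
saved copy in a scratchable area or on registers that have already been
saved, and the only state the rest of the kernel can observe changes at
one commit instruction.

	/* Sketch: copy-on-write, then a single-instruction commit. */
	saved  = save_context_to_ist_stack();      /* COW: registers -> memory       */
	target = scratch_area_on_ist_main_stack(); /* unused area, safe to dirty     */
	build_final_context(target, saved);        /* all the work is on the copy    */
	switch_rsp_to(target);                     /* commit: one write publishes it */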


4.6 save-touch-replicate-copy-commit assembly line
--------------------------------------------------

Combining the copy-on-write and commit design pattern with the logical
procedure that an atomic-IST-entry is supposed to do (push pt_regs,
copy pt_regs, and switch stacks), the save-touch-replicate-copy-commit
assembly line comes into the world:

save:
To ensure proper handling of registers and data, it is crucial to save
everything before making any changes. This involves multiple save-stages,
and everything must be saved to memory.

save-stage-1: hardware-entry (COW for %rflags, %rsp, %rip):
the hardware entry is atomic and saves %rflags, %rsp, and %rip, allowing
them to be safely touched afterward.

save-stage-2: push general registers (COW for the general registers):
save all general registers at once since they must be touched during
the atomic-IST-entry procedure. It is recommended to save all general
registers in one go before making any changes to them; otherwise,
additional save-stages may be needed.

save-stage-3: save other things (COW for other things):
If the procedure also touches anything aside from the general registers
or scratchable areas, this stage is necessary. This stage is separate
because saving these additional items often involves touching general
registers. Using this stage is not recommended, since touching extra
things complicates the atomic-IST-entry. Note that if the
atomic-IST-entry needs to switch to the kernel GSBASE, kernel CR3, or
kernel SPEC_CTRL, this stage is required.


The save operation needs to be designed with the ability to be aborted
and replicated:
o The save-stages are obviously inherently abortable;
o Aborted save-stages can be replicated easily, as the data has not
yet been touched and can be replicated and stored in another
location within nested contexts;
o Complete save-stages don't need to be replicated directly, but the
complete saved version must be locatable; the ability in nested
contexts to identify, locate and access the complete saved data can
be considered a form of replication;

touch:
Write to the registers or the other data. This can easily be kept
abortable, replicable, and idempotent with the help of the save-stages.

replicate:
Replicate all the work (save-touch-copy-commit) of the interrupted
atomic-IST-entry. The replicate operation focuses only on the final
result. The supposed final resulting context of the interrupted
atomic-IST-entry needs to be copied to the target stack, so the
replicate operation directly operates on the copying target, which is
still in a scratchable area before committing, to avoid dirtying any
saved area.

Replicating the commit operation does not necessarily mean directly
committing it. Rather, it involves ensuring that when a nested
atomic-IST-entry returns to the interrupted atomic-IST-entry, it
returns to the commit point by replicating all the work as if the
interrupted aIST had just successfully committed before being
interrupted.

The replicate operation is obviously inherently abortable and replicable.

copy:
(COW for the saved context and prepare the target pt_regs on the target
stack for switching):
Copy all the supposed saved contexts onto the target stack at a
deterministically located place. In the implementation, the copy
operation is squashed with the replicate operation since the replicate
operation is designed to directly operate on the copying target.

The copy operation must be abortable and replicable, i.e. the source
must be locatable or replicable and the destination must be locatable
and touchable.

commit:
Commit the copied result and switch to the target stack. Since
replicating the commit operation does not necessarily mean directly
committing it, the commit operation actually does an overall commit:
when the ongoing atomic-IST-entry succeeds in committing, all the
interrupted atomic-IST-entries are committed right away as well.
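
Putting the stages together, the fast path (the outmost atomic-IST-entry
committing without being interrupted) roughly reduces to the following
sketch; the real code is the idtentry_ist macro and
ist_copy_regs_to_main_stack() in patch 3:

	/* save-stage-1: hardware entry pushed ss/rsp/rflags/cs/rip (+ error code)   */
	/* save-stage-2: PUSH_REGS saves the general registers on the IST stack      */
	/* touch:        from here on, the general registers may be modified         */
	/* replicate+copy: build the supposed pt_regs frame(s) on the IST main stack */
	target = ist_copy_regs_to_main_stack(regs, stacks);
	/* commit: one stack switch ("movq %rax, %rsp") publishes everything         */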

4.7 identify and locate
-----------------------

The atomic-IST-entry code has to identify the type (outmost or nested)
of each atomic-IST-entry, from itself back to the outmost
atomic-IST-entry.

For each interrupted atomic-IST-entry, the event vector, the saved
context (mainly pt_regs) on the TSS-configured IST stack, the commit_ip,
and which save-stage it had completed have to be identified/located.

The identifying methods can be implemented based on RSP, RIP, and SSP.
In theory, it is not required to use all of these methods, but using
multiple methods increases robustness in case the entry code, the
hardware, or any non-entry code goes wrong (e.g. buggy code sets RSP to
one of the IST stacks). If different methods give different results,
forcibly triggering #DF is the last resort.

Note, RSP can only be examined when the interrupted context is in ring0
and not in the syscall gap.
Note, if multiple vectors share the same IST stack, or there are
multiple instructions in the commit stage, the RIP-based method must be
used.

The current code uses the RSP-based method only. It is very easy to add
the RIP-based method later.
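
A condensed sketch of the RSP-based check, mirroring IN_CEA_ESTACK() and
the DEFINE_IDENTIFY_IST() helpers in patch 3: an interrupted context is
treated as an interrupted atomic-IST-entry of a given vector if its saved
RSP lies within that vector's TSS-configured IST stack, and the
interrupted entry's pt_regs then sits at the top of that IST stack.

	/* Sketch, mirroring DEFINE_IDENTIFY_IST(NMI, nmi, ...) in patch 3 */
	if (IN_CEA_ESTACK(stacks, NMI, regs->sp)) {
		/* the interrupted entry's pt_regs is at the top of the NMI IST stack */
		ctx->regs      = (struct pt_regs *)CEA_ESTACK_TOP(stacks, NMI) - 1;
		ctx->commit_ip = (unsigned long)commit_asm_exc_nmi;
	}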

To make the location for copying easy to determine, only one target
stack is used. It is named the IST main stack and is shared by all
aIST events. The atomic-IST-entry copies the pt_regs to the IST main
stack and commit-switches to it. (When you re-read the above paragraphs,
you can replace "the target stack" with "the IST main stack".) There
might be multiple events on the IST main stack, so the stack is larger.

To make the identifying and locating code work correctly, there are
requirements for the code outside the atomic-IST-entry:
o No SP/IP mess: do not set SP/IP to the IST stacks or to the IST
event entry code; only the code of the atomic-IST-entry runs
on the TSS-configured IST stacks (except #DF's IST stack).
o Leave IST stacks completely: no usable data is left behind when
switching off the IST stacks.


4.8 Hardware Requirement: not nested in atomic-IST-entry
--------------------------------------------------------

An aIST event can NOT be nested inside its own atomic-IST-entry: no IRET
instruction in the atomic-IST-entry to re-enable NMI; no write to the
specific MSR to re-enable #MC; no debugging allowed in the
atomic-IST-entry to re-trigger #DB; no TDCALL or anything else to
re-enable #VE, #VC, or #HV. If an aIST event is nested inside its own
not-yet-committed atomic-IST-entry, the hardware should morph it into a
double fault; otherwise there is a bug in the hardware.

The identifying code will resort to a double fault if this requirement
is not met.


4.9 Copy the supposed saved context
-----------------------------------

With the identifying and locating code, we have two types of copies:
copy_outmost() and copy_nested(), for the outmost and nested
atomic-IST-entries respectively. And they are implemented in such a way
that the code to replicate them is themselves, i.e.
replicate(copy_outmost) = copy_outmost
replicate(copy_nested) = copy_nested

copy_outmost:
Do the work as the outmost atomic-IST-entry to copy the supposed pt_regs
of the interrupted context to the IST main stack. (If the ongoing
atomic-IST-entry is the outmost one, the work is literally doing copy as
the outmost, if not, the work is replicating the outmost.)

The hardware-entry of the outmost atomic-IST-entry has already saved the
exception head of the pt_regs. If the outmost atomic-IST-entry was
unfortunately interrupted before fully saving all the general registers,
the general registers are untouched and must be saved by one of the
subsequent nested atomic-IST-entries. The identifying code can just
examine all the nested atomic-IST-entries to find which one has saved
the general registers.
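
The check itself is simple; it mirrors identify_if_gp_registers_saved()
in patch 3: if the nested event's saved RSP is at or below the
interrupted entry's pt_regs, the interrupted entry had already finished
pushing the general registers when it was hit.

	/* Mirrors patch 3: had @interrupted finished PUSH_REGS when @nested hit? */
	static __always_inline
	bool identify_if_gp_registers_saved(const struct pt_regs *nested,
					    const struct pt_regs *interrupted)
	{
		return nested->sp <= (unsigned long)(void *)interrupted;
	}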

copy_nested:
Do the work as a nested atomic-IST-entry to copy the supposed pt_regs
of the interrupted context to the IST main stack.

The hardware-entry of the nested atomic-IST-entry has already saved
the exception head of the pt_regs of the interrupted context (inside
the interrupted atomic-IST-entry). To maintain the atomic attribute
of the atomic-IST-entry, the copy_nested() (of the ongoing nested
atomic-IST-entry) has to replicate all that the interrupted
atomic-IST-entries should have done up to the commit point and
copy the supposed saved context (pt_regs).

To avoid touching any saved pt_regs, the replicating is actually
directly applied on the target pt_regs.

4.10 Full view
--------------

non-atomic-IST-entry context (start point)
+
| outmost nested nested
| +-----------+ +-----------+ +-----------+
+> | event A | interrupt | event B | interrupt | event C |
| hw entry | +---> | hw entry | +---> | hw entry |
+-----+-----+ + | +-----+-----+ + | +-----+-----+
| | | | | | |
+-----v-------+ | | +-----v-------+ | | +-----v-------+
|save pt_regs | | | |save pt_regs | | | |save pt_regs |
+-----+-------+ | | +-----+-------+ | | +-----+-------+
| +-+ | +-+ |
+-----v-------+ | +-----v-------+ | +-----v-------+
| identify | | +-----+ identify | | +-----+ identify |
| | | | +-------------+ | | +-------------+
|copy_outmost | | | | |
+-----+-------+ | | +-------------+ | | +-------------+
| | | +-> |copy_nested | | | +-> |copy_nested |
+-----v-------+ + | | +-----+-------+ | | | +-----+-------+
|commit switch| | | | | | | |
+-------------+ | | +-----v-------+ + | | +-----v-------+
| | |commit switch| | | |commit switch|
+-------------+ <-+ | +-------------+ | | +-----+-------+
|copy_outmost | | | | |
+-------------+ +---+ +-------------------+ | V
| | succeed to commit
+-------------+ <-----+ +-------------+ | (end point)
|copy_outmost | |copy_nested | |
+-------------+ +-----> +-------------+ +---+


Full view with data movement:


(start point) EventA's EventB's EventC's IST main
+----------+ ISTstack ISTstack ISTstack stack
| |
|non atomic| <----------------------------------+ +------+
|IST entry | <--------------------------------+ | | |
| context | <---+ +------+ | | | |
| | <-+ | |ss | copy outmost | | |ss |context
+---+------+ | +-+rsp | exc head | +--+rsp |inter-
| | |rflags| ============# | |rflags|rupted
| +-+ |cs | # +--+ |cs |by A
+---v---+ +-+rip | #copy_ +-+rip |
| |hw entry |ecode | #outmost |ecode |
|Event A| ======> |gp | <--------+ #======> |gp |
| |ASM push |regs | | # |regs |
| |(aborted) +------+ | # +------+ <+
| | ^ | | +------+ | # +------+ |
+---+---+ | | | |ss | | # copy_ |ss | |
| | |rsp +-+ # nested |rsp +--+
| +------------+ |rflags| == # ======> |rflags|
+---v---+ | |cs | # |cs | +----+
| | hw entry +--+rip |copy#outmost |rip +-+ v
|Event B| ======> |ecode | gp#regs |ecode |Event A's
| | ASM push |gp |====# |gp |commit ip
| rep | <--+ (pushed) |regs | |regs |
| copy | | +------+ <--------+ +------+ <+
+---+---+ | | | +------+ | +------+ |
| | | | |ss | | |ss | |
| | |rsp +-+ |rsp +--+
+---v---+ +--------------------+ |rflags| |rflags|
| | | |cs |copy_ |cs | +----+
|Event C| hw entry +--+rip |nested|rip +-+ v
| | ======> |ecode |====> |ecode |Event B's
| rep | ASM push |gp | |gp |commit ip
| copy | (pushed) |regs | |regs |
+---+---+ +------+ +-> +------+
| | | | | |
| | | | | |
|succeed to | | |
| commit rsp +------------------+
+-----------------> rip = Event C's commit ip final context
(end point) after commit



4.11 minimal procedure environment
----------------------------------

To avoid complicating the atomic-IST-entry too much, it accesses only
the general purpose registers and the stacks, which is also the minimal
environment required for a C function to run.

So the atomic-IST-entry can possibly run with the user GSBASE (so don't
use PERCPU), with KPTI's user CR3 (so don't access anything outside the
CPU ENTRY AREA), and without the IBRS bit set. Be careful!

It is possible to make the atomic-IST-entry switch GSBASE, CR3, and
SPEC_CTRL, but that needs save-stage-3, more code to switch GSBASE, CR3,
and SPEC_CTRL, and more code to replicate switching GSBASE, CR3, and
SPEC_CTRL.

4.12 C-function entry code
--------------------------

As seen above, the work of atomic-IST-entry is hard to implement in ASM
code, so the major part of the atomic-IST-entry is implemented in C code
as a C-function. The ASM code does the save-stages and then calls the
C-function.

The C-function has a RET instruction before IBRS_ENTER. I (Lai Jiangshan)
am still trying to find out why IBRS_ENTER "Must be called before the
first RET instruction" (comment in entry_64.S). No clue so far, and help
is needed. The only way to fix it is to use save-stage-3 and make the
atomic-IST-entry switch GSBASE, CR3, and SPEC_CTRL. (Not hugely
cumbersome, but I don't like it.)


5 Supervisor Shadow stack
-------------------------

The current approach to handling the IST stack (including NMI_executing
and #VC's stack switching) presents a challenge for implementing the
supervisor shadow stack. However, this obstacle is removed by the
introduction of this atomic-IST-entry.

The implementation of the supervisor shadow stack can be accomplished
using a similar software-based atomic approach. First, a shadow stack
(the IST main shadow stack) must be created and associated with the IST
main stack for each CPU, besides the shadow stacks for the TSS-configured
IST stacks. Next, the locating code within the atomic-IST-entry can
determine where to write on the IST main shadow stack when identifying
the interrupted context.

The atomic-IST-entry can then write (WRSS) values corresponding to the
copied pt_regs to the IST main shadow stack and save the resulting SSP
on the IST main stack, in the extended portion below or above the
pt_regs, as if the hardware had delivered the event on the IST main
stack and the IST main shadow stack. The commit stage is extended to
multiple instructions that commit both stacks: they switch RSP first and
then the shadow stack (obtaining the resulting SSP from the extended
portion below or above the copied pt_regs).
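
As a rough sketch of the idea (all helper names below are hypothetical,
and the three-word cs/lip/ssp shadow-stack frame is an assumption based
on the architectural shadow-stack format, not something defined by this
series):

	/*
	 * Hypothetical sketch: build, with WRSS, the shadow-stack frame the
	 * hardware would have pushed for the copied pt_regs, and record the
	 * resulting SSP next to the copied pt_regs so that the extended
	 * commit stage can load it after switching %rsp.
	 */
	new_ssp = ist_main_shstk_top - 3 * 8;
	wrss(target->cs, new_ssp + 16);        /* hypothetical wrss() helper  */
	wrss(target->ip, new_ssp + 8);         /* LIP of the copied context   */
	wrss(prev_ssp,   new_ssp + 0);         /* SSP of the copied context   */
	ist_frame_ext(target)->ssp = new_ssp;  /* hypothetical extension slot */
	/* extended commit: switch %rsp first, then load SSP from the slot */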

If the multiple-instruction commit stage is interrupted, it is
considered an interrupted atomic-IST-entry, and the RIP-based
identifying and locating code needs to travel inside the outer
interrupted atomic-IST-entry. It could also be considered a
non-atomic-IST-entry context, but that affects the identifying and
locating code differently and requires special replicating code for this
special non-atomic-IST-entry context, which makes it less preferable.

Finally, the code must clear the busy bits of all the aIST shadow
stacks before entering the handling code, to ensure that the shadow
stacks are ready for the next hardware-entry and atomic-IST-entry.

By implementing these steps, the supervisor shadow stack can be
successfully used along with the IST stacks.


6 FRED
------

FRED stands for Flexible Return and Event Delivery, and it is Intel's
attempt to address the problem of switching stacks for super exceptions
among other things such as improving overall performance and response
time and ensuring that event delivery establishes the full supervisor
context and that event return establishes the full user context.

https://cdrdv2-public.intel.com/678938/346446-flexible-return-and-event-delivery.pdf
https://lore.kernel.org/lkml/[email protected]/

The FRED approach to address the problem of switching stacks for super
exceptions is to introduce the concept of a stack level. The current
stack level (CSL) is a value in the range 0–3 that the processor tracks
when CPL = 0. FRED event delivery determines the stack level associated
with the event being delivered and, if it is greater than the CSL (or if
CPL had been 3), loads the stack pointer (RSP) from a new FRED RSP MSR
associated with the event’s stack level. (If supervisor shadow stacks
are enabled, the stack level applies also to the shadow-stack pointer,
SSP, which would be loaded from a FRED SSP MSR.) The FRED return
instruction ERETS restores the old stack level.
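
In pseudocode, the stack-selection rule described above is roughly the
following (illustrative only, not the architectural definition;
FRED_RSP[]/FRED_SSP[] stand in for the per-level MSRs):

	/* FRED event delivery, stack selection only (sketch) */
	if (cpl_was_3 || event_stack_level > current_stack_level) {
		rsp = FRED_RSP[event_stack_level];
		if (supervisor_shadow_stack_enabled)
			ssp = FRED_SSP[event_stack_level];
	}
	/* ... handle the event; ERETS later restores the old stack level */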

Comparing the software atomic-IST-entry and FRED:
o atomic-IST-entry is focused solely on stack switching for IST events,
while FRED offers a variety of features.
o While, obviously, atomic-IST-entry may not improve overall
performance and response time like FRED, it also does not worsen them.
In the fast path, where the outmost atomic-IST-entry succeeds in
committing, the atomic-IST-entry's major work is not much different
from a normal interrupt.
o FRED attempts to establish the full supervisor context when entering
from ring3, while atomic-IST-entry doesn't even handle the supervisor
GSBASE. It should be noted that neither atomic-IST-entry nor FRED
handles CR3 for KPTI or RCU or other context-tracking bits. GSBASE
is not more crucial than CR3, and less crucial than the stacks, which
are the minimal basic environment for everything else.
o While FRED may only be available on future platforms,
atomic-IST-entry is available on all existing x86_64 platforms,
including non-Intel platforms.

All in all, atomic-IST-entry only does atomic stack switching for IST,
while FRED provides more abilities. If you have FRED, just use it. If
you don't, just don't fret too much either.


7 summary
---------

There are multiple gaps in event delivery:
1) stack gap type 1 (syscall gap): RSP is not kernel RSP (on syscall
event)
2) stack gap type 2 (reentrance safe gap): the stack is in danger of
corruption by nested super events.
3) GSBASE gap: GSBASE is not kernel GSBASE
4) CR3 gap: CR3 is not kernel CR3
5) IBRS gap: SPEC_CTRL_IBRS bit is not set for kernel
6) RCU gap: RCU is not enabled for kernel
7) MSR gap: some other MSR is not yet switched to the corresponding
kernel MSR value
8) kernel context gap: some other context is not switched to the kernel
context

Gaps 3-8 are robustly solved by the noinstrument facility with careful
interrupted-context tracking (paranoid_entry and .Lerror_bad_iret are
among the examples) and build-time objtool checking. They are not as
essential and tough as the stack gaps (except maybe for the IBRS gap).

Gap 1 (stack gap type 1, the syscall gap) could have been addressed by
hardware, ensuring a proper kernel RSP on syscall events, as demonstrated
by FRED. The gap is short and contained by the entry/noinstrument
facility, only presenting issues in the face of super events, which can
be solved by IST. However, IST causes the issue of Gap 2.

Gap 2 (stack gap type 2, the reentrance-safe gap) must be addressed on
any system that wants to switch stacks on super events. FRED's solution
involves using kernel stack levels. On a system without kernel stack
levels, gap 2 can now be robustly solved by this new atomic-IST-entry
and the noinstrument facility. Actually, atomic-IST-entry can be
considered a special form of kernel stack levels: the #DF stack is the
highest stack level, the IST main stack for the other super events is
the next highest stack level, and all the other events are assigned the
lowest stack level.

If gap 5 (the IBRS gap) turns out to be problematic within the
atomic-IST-entry, which uses a RET instruction, then save-stage-3 and
the corresponding touching (writing the MSR) and replicating code should
be added to the atomic-IST-entry.

Lai Jiangshan (7):
x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry
x86/entry: Add IST main stack
x86/entry: Implement atomic-IST-entry
x86/entry: Use atomic-IST-entry for NMI
x86/entry: Use atomic-IST-entry for MCE and DB
x86/entry: Use atomic-IST-entry for VC
x86/entry: Test atomic-IST-entry via KVM

Documentation/x86/kernel-stacks.rst | 2 +
arch/x86/entry/Makefile | 3 +
arch/x86/entry/entry_64.S | 615 +++++++-------------------
arch/x86/entry/ist_entry.c | 299 +++++++++++++
arch/x86/include/asm/cpu_entry_area.h | 8 +-
arch/x86/include/asm/idtentry.h | 15 +-
arch/x86/include/asm/sev.h | 14 -
arch/x86/include/asm/traps.h | 1 -
arch/x86/kernel/asm-offsets_64.c | 7 +
arch/x86/kernel/callthunks.c | 2 +
arch/x86/kernel/dumpstack_64.c | 6 +-
arch/x86/kernel/nmi.c | 8 -
arch/x86/kernel/sev.c | 108 -----
arch/x86/kernel/traps.c | 43 --
arch/x86/kvm/vmx/vmx.c | 441 +++++++++++++++++-
arch/x86/kvm/x86.c | 10 +-
arch/x86/mm/cpu_entry_area.c | 2 +-
tools/objtool/check.c | 7 +-
18 files changed, 937 insertions(+), 654 deletions(-)
create mode 100644 arch/x86/entry/ist_entry.c


base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
--
2.19.1.6.gb485710b


2023-04-03 14:12:23

by Lai Jiangshan

Subject: [RFC PATCH 2/7] x86/entry: Add IST main stack

From: Lai Jiangshan <[email protected]>

Add IST main stack for atomic-IST-entry.

The size is THREAD_SIZE since there might be multiple super
exceptions being handled on the stack.
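
As a rough sense of scale (sketch arithmetic based on patch 3, not a
statement from this patch): IST_FRAME_SIZE = ALIGN(sizeof(struct
pt_regs), 16) = ALIGN(168, 16) = 176 bytes, and the identification loop
caps nesting at 8 entries (ist_ctx[8]), so the copied pt_regs frames
alone take at most 8 * 176 = 1408 bytes; the rest of THREAD_SIZE is
headroom for the handlers running on this stack.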

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/x86/kernel-stacks.rst | 2 ++
arch/x86/include/asm/cpu_entry_area.h | 5 +++++
arch/x86/kernel/dumpstack_64.c | 6 ++++--
arch/x86/mm/cpu_entry_area.c | 1 +
4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst
index 6b0bcf027ff1..be89acf302da 100644
--- a/Documentation/x86/kernel-stacks.rst
+++ b/Documentation/x86/kernel-stacks.rst
@@ -105,6 +105,8 @@ The currently assigned IST stacks are:
middle of switching stacks. Using IST for MCE events avoids making
assumptions about the previous state of the kernel stack.

+* ESTACK_IST. IST main stack shared by NMI/#DB/#MC/#VC atomic-IST-entry.
+
For more details see the Intel IA32 or AMD AMD64 architecture manuals.


diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..a373e8c37e25 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,8 @@

#ifdef CONFIG_X86_64

+#define IST_MAIN_STKSZ THREAD_SIZE
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
#else
@@ -30,6 +32,8 @@
char VC_stack[optional_stack_size]; \
char VC2_stack_guard[guardsize]; \
char VC2_stack[optional_stack_size]; \
+ char IST_stack_guard[guardsize]; \
+ char IST_stack[IST_MAIN_STKSZ]; \
char IST_top_guard[guardsize]; \

/* The exception stacks' physical storage. No guard pages required */
@@ -52,6 +56,7 @@ enum exception_stack_ordering {
ESTACK_MCE,
ESTACK_VC,
ESTACK_VC2,
+ ESTACK_IST,
N_EXCEPTION_STACKS
};

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index f05339fee778..3413b23fa9f1 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -26,11 +26,12 @@ static const char * const exception_stack_names[] = {
[ ESTACK_MCE ] = "#MC",
[ ESTACK_VC ] = "#VC",
[ ESTACK_VC2 ] = "#VC2",
+ [ ESTACK_IST ] = "#IST",
};

const char *stack_type_name(enum stack_type type)
{
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != ARRAY_SIZE(exception_stack_names));

if (type == STACK_TYPE_TASK)
return "TASK";
@@ -89,6 +90,7 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
EPAGERANGE(MCE),
EPAGERANGE(VC),
EPAGERANGE(VC2),
+ EPAGERANGE(IST),
};

static __always_inline bool in_exception_stack(unsigned long *stack, struct stack_info *info)
@@ -98,7 +100,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
struct pt_regs *regs;
unsigned int k;

- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 7);

begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
/*
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 7316a8224259..62341cb819ab 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -148,6 +148,7 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+ cea_map_stack(IST);

if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
--
2.19.1.6.gb485710b

2023-04-03 14:12:29

by Lai Jiangshan

Subject: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry

From: Lai Jiangshan <[email protected]>

See the comments in the cover-letter. They will be moved into the code
and changelog here when improved.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/Makefile | 3 +
arch/x86/entry/entry_64.S | 193 ++++++++++++++++++++
arch/x86/entry/ist_entry.c | 299 +++++++++++++++++++++++++++++++
arch/x86/kernel/asm-offsets_64.c | 7 +
arch/x86/kernel/callthunks.c | 2 +
tools/objtool/check.c | 7 +-
6 files changed, 510 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/entry/ist_entry.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index ca2fe186994b..7cc1254ca519 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -8,11 +8,14 @@ UBSAN_SANITIZE := n
KCOV_INSTRUMENT := n

CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS)

CFLAGS_common.o += -fno-stack-protector
+CFLAGS_ist_entry.o += -fno-stack-protector

obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o
obj-y += common.o
+obj-$(CONFIG_X86_64) += ist_entry.o

obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 49ddc4dd3117..50a24cc83581 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym)
idtentry \vector asm_\cfunc \cfunc has_error_code=0
.endm

+/**
+ * idtentry_ist - Macro to generate entry stubs for IST exceptions except #DF
+ * @vector: Vector number
+ * @asmsym: ASM symbol for the entry point
+ * @cfunc: C function to be called when it occurs in kernel
+ * @user_cfunc: C function to be called when it occurs in userspace
+ * @has_error_code: Hardware pushed error code on stack
+ * @stack_offset: Offset of the IST stack top in struct cea_exception_stacks
+ *
+ * The macro emits code to set up the kernel context for IST exceptions.
+ *
+ * From the hardware entry of the event to SYM_INNER_LABEL(commit_\asmsym)
+ * is the atomic-IST-entry (note: the atomic-IST-entry starts at the
+ * hardware entry, not merely at the first instruction of this macro).
+ *
+ * The atomic-IST-entry pushes pt_regs and copies the pt_regs to the IST
+ * main stack, and switches to it. If the atomic-IST-entry is interrupted
+ * by another IST event (except #DF), the new atomic-IST-entry will
+ * replicate the interrupted one as if every atomic-IST-entry is atomic.
+ *
+ * See the comments in ist_entry.c.
+ *
+ * When the CPU is on any IST stack or the IST main stack, %rsp cannot be
+ * switched away from it except when interrupted by an IST exception or
+ * when totally switching off (no usable data left).
+ *
+ * If the entry comes from user space, it eventually takes the normal
+ * entry path on its kernel stack, including the return-to-user-space
+ * work and preemption checks on exit. The macro idtentry_body ensures
+ * that the IST main stack is totally switched off (no usable data left)
+ * at the same time it switches to the kernel stack.
+ *
+ * If it hits in kernel mode, it needs to go through the paranoid
+ * entry as the exception can hit any random state. No preemption
+ * check on exit, to keep the paranoid path simple.
+ */
+.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req
+SYM_CODE_START(\asmsym)
+ UNWIND_HINT_IRET_REGS offset=\has_error_code*8
+ ENDBR
+
+ /*
+ * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX.
+ *
+ * The code setting ORIG_RAX will not be replicated if interrupted.
+ */
+ ASM_CLAC
+ cld
+
+ .if \has_error_code == 0
+ pushq $-1 /* ORIG_RAX: no syscall to restart */
+ .endif
+
+ /*
+ * No register can be touched except %rsp,%rflags,%rip before
+ * pushing all the registers. It is indispensable for nested
+ * atomic-IST-entry to replicate pushing the registers.
+ */
+ PUSH_REGS
+
+ /*
+ * Finished pushing register, all registers can be touched by now.
+ *
+ * Clear registers for the C function ist_copy_regs_to_main_stack()
+ * and the handler to avoid any possible exploitation of any
+ * speculation attack.
+ */
+ CLEAR_REGS
+
+ /*
+ * Copy the pt_regs to the IST main stack, including the pt_regs of
+ * the interrupted atomic-IST-entries, if any, by replicating.
+ */
+ movq %rsp, %rdi /* pt_regs pointer on its own IST stack */
+ leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */
+ call ist_copy_regs_to_main_stack
+
+ /*
+ * Commit stage.
+ */
+SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL)
+ /*
+ * Switches to the IST main stack. Before the switching is done,
+ * %rax is the copied pt_regs pointer in IST main stack.
+ */
+ movq %rax, %rsp
+
+ /*
+ * The label should be immediately after the instruction that switches
+ * the stack, since there is code assuming there is only one single
+ * instruction in the commit stage and the code assumes "%rsp in the
+ * IST main stack is also the sign of ending an atomic-IST-entry".
+ * (The code will be removed in future when %rip-based identifying
+ * is added.)
+ */
+SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL)
+
+ /*
+ * Now, it is on the IST main stack. For the whole kernel, the entries
+ * of the IST exceptions can be seen from here because the inside
+ * of the atomic-IST-entry can not be seen from the whole kernel
+ * except in the atomic-IST-entry or #DF.
+ */
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
+ /*
+ * The code setting ORIG_RAX will not be replicated if interrupted.
+ * So redo it here.
+ */
+ .if \has_error_code == 0
+ movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */
+ .endif
+
+ /*
+ * If the entry is from userspace, switch stacks and treat it as
+ * a normal entry.
+ */
+ testb $3, CS(%rsp)
+ jnz .Lfrom_usermode_switch_stack_\@
+
+ /*
+ * paranoid_entry returns GS/CR3/SPEC_CTL information for
+ * paranoid_exit in RBX/R14/R15.
+ */
+ call paranoid_entry
+
+ movq %rsp, %rdi /* pt_regs pointer */
+ .if \has_error_code == 1
+ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ .endif
+ call \cfunc
+
+ jmp paranoid_exit
+
+.Lfrom_usermode_switch_stack_\@:
+ /* Switch context: GS_BASE, CR3, SPEC_CTL. */
+ swapgs
+ FENCE_SWAPGS_USER_ENTRY
+ /* We have user CR3. Change to kernel CR3. */
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+ IBRS_ENTER
+ UNTRAIN_RET
+
+ /* Put the pt_regs onto the kernel task stack. */
+ movq %rsp, %rdi /* arg0 = pt_regs pointer */
+ call sync_regs
+
+ /*
+ * Switch to the kernel task stack and use the user entry point.
+ *
+ * When coming from user mode, the procedure has to atomically switch
+ * off the TSS-configured IST stacks too, so it switches to the IST
+ * main stack first, and then switches off the IST main stack in an atomic
+ * fashion: when %rsp leaves the IST main stack, the IST main stack is
+ * totally free.
+ */
+ movq %rax, %rsp
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
+ movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
+ .if \has_error_code == 1
+ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ .endif
+ call \user_cfunc
+
+ /* For some configurations \user_cfunc ends up being a noreturn. */
+ REACHABLE
+
+ jmp error_return
+
+_ASM_NOKPROBE(\asmsym)
+SYM_CODE_END(\asmsym)
+.endm
+
/**
* idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
* @vector: Vector number
@@ -586,8 +764,23 @@ SYM_CODE_END(\asmsym)
*/
.macro idtentry_df vector asmsym cfunc
SYM_CODE_START(\asmsym)
+
+ /*
+ * This unwind-hint is incorrect if it is the soft double fault raised
+ * from ist_double_fault(). It doesn't matter since it is an
+ * unrecoverable double fault.
+ */
UNWIND_HINT_IRET_REGS offset=8
ENDBR
+
+ /*
+ * Set %rsp = %rsp - 8 if it is the soft double fault raised from
+ * ist_double_fault(). The CPU doesn't push an error code in that case
+ * since it is injected by an INT instruction.
+ */
+ btr $3, %rsp
+ UNWIND_HINT_IRET_REGS offset=8
+
ASM_CLAC
cld

diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
new file mode 100644
index 000000000000..e1b06306ac51
--- /dev/null
+++ b/arch/x86/entry/ist_entry.c
@@ -0,0 +1,299 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022-2023 Lai Jiangshan, Ant Group
+ *
+ * Handle entries and exits for hardware traps and faults.
+ *
+ * It is as low level as entry_64.S and its code can run in environments
+ * where the GS base is a user-controlled value, or the CR3 is the PTI
+ * user CR3, or both.
+ */
+#include <asm/traps.h>
+
+#define IST_DOUBLE_FAULT_VECTOR 8
+
+static __always_inline void ist_double_fault(void)
+{
+ asm volatile ("int $" __stringify(IST_DOUBLE_FAULT_VECTOR));
+}
+
+#define IN_CEA_ESTACK(ceastp, name, sp) \
+ ((CEA_ESTACK_BOT(ceastp, name) <= (sp)) && \
+ ((sp) < CEA_ESTACK_TOP(ceastp, name)))
+
+struct ist_ctx {
+ const struct pt_regs *regs;
+ unsigned long commit_ip;
+};
+
+#define DEFINE_IDENTIFY_IST(stack_name, sym_name, is_enabled) \
+extern char commit_asm_exc_##sym_name[]; \
+static __always_inline bool identify_ist_##sym_name( \
+ const struct pt_regs *regs, struct cea_exception_stacks *stacks,\
+ struct ist_ctx *ctx) \
+{ \
+ if (!(is_enabled)) \
+ return false; \
+ if (!IN_CEA_ESTACK(stacks, stack_name, regs->sp)) \
+ return false; \
+ ctx->regs = (struct pt_regs *)CEA_ESTACK_TOP(stacks, stack_name) - 1; \
+ ctx->commit_ip = (unsigned long)commit_asm_exc_##sym_name; \
+ return true; \
+}
+
+DEFINE_IDENTIFY_IST(NMI, nmi, false)
+DEFINE_IDENTIFY_IST(DB, debug, false)
+DEFINE_IDENTIFY_IST(MCE, machine_check, false)
+DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
+
+static __always_inline bool identify_ist(
+ const struct pt_regs *regs, struct cea_exception_stacks *stacks,
+ struct ist_ctx *ctx)
+{
+ return identify_ist_nmi(regs, stacks, ctx) ||
+ identify_ist_debug(regs, stacks, ctx) ||
+ identify_ist_machine_check(regs, stacks, ctx) ||
+ identify_ist_vmm_communication(regs, stacks, ctx);
+}
+
+/*
+ * identify if an interrupted atomic-IST-entry had successfully saved
+ * the general registers onto its IST stack.
+ *
+ * Generally, the outmost atomic-IST-entry has likely successfully saved
+ * the general registers. If not, one of the nested atomic-IST-entries
+ * must have saved the general registers of the context that the
+ * outmost atomic-IST-entry had interrupted.
+ *
+ * Arguments:
+ * @nested: the nested atomic-IST-entry who had interrupted @interrupted
+ * @interrupted: the interrupted atomic-IST-entry.
+ *
+ * Returns:
+ * true: the interrupted atomic-IST-entry had saved the general registers.
+ * false: the interrupted atomic-IST-entry had not yet saved the general registers.
+ */
+static __always_inline
+bool identify_if_gp_registers_saved(const struct pt_regs *nested, const struct pt_regs *interrupted)
+{
+ return nested->sp <= (unsigned long)(void *)interrupted;
+}
+
+static __always_inline
+void copy_regs_exception_head(struct pt_regs *target, const struct pt_regs *from)
+{
+ target->ss = from->ss;
+ target->sp = from->sp;
+ target->flags = from->flags;
+ target->cs = from->cs;
+ target->ip = from->ip;
+ target->orig_ax = from->orig_ax;
+}
+
+static __always_inline
+void copy_regs_general_registers(struct pt_regs *target, const struct pt_regs *from)
+{
+ target->di = from->di;
+ target->si = from->si;
+ target->dx = from->dx;
+ target->cx = from->cx;
+ target->ax = from->ax;
+ target->r8 = from->r8;
+ target->r9 = from->r9;
+ target->r10 = from->r10;
+ target->r11 = from->r11;
+ target->bx = from->bx;
+ target->bp = from->bp;
+ target->r12 = from->r12;
+ target->r13 = from->r13;
+ target->r14 = from->r14;
+ target->r15 = from->r15;
+}
+
+/*
+ * Do the work as the outmost atomic-IST-entry to copy the supposed pt_regs
+ * of the interrupted context to the IST main stack. (If the ongoing
+ * atomic-IST-entry is the outmost one, the work is literally doing copy as
+ * the outmost, if not, the work is replicating the outmost.)
+ *
+ * The hardware-entry of the outmost atomic-IST-entry has already saved the
+ * exception head of the pt_regs. If the outmost atomic-IST-entry was
+ * unfortunately interrupted before fully saving all the general registers,
+ * the general registers are untouched and must be saved by one of the
+ * subsequent nested atomic-IST-entries. The identifying code can just
+ * examine all the nested atomic-IST-entries to find which one has saved
+ * the general registers.
+ */
+static __always_inline
+void copy_outmost(struct pt_regs *target, const struct pt_regs *outmost, const struct pt_regs *gp)
+{
+ copy_regs_exception_head(target, outmost);
+ copy_regs_general_registers(target, gp);
+}
+
+/*
+ * Replicate the interrupted atomic-IST-entry's CLAC and CLD in the ASM
+ * code. Even if SMAP is not enabled, CLAC is replicated unconditionally
+ * since doing so does no harm.
+ */
+static __always_inline void replicate_clac_cld(struct pt_regs *target)
+{
+ target->flags &= ~(unsigned long)(X86_EFLAGS_AC | X86_EFLAGS_DF);
+}
+
+/* Replicate the interrupted atomic-IST-entry's CLEAR_REGS macro. */
+static __always_inline void replicate_clear_regs(struct pt_regs *target)
+{
+ target->di = 0;
+ target->si = 0;
+ target->dx = 0;
+ target->cx = 0;
+ target->ax = 0;
+ target->r8 = 0;
+ target->r9 = 0;
+ target->r10 = 0;
+ target->r11 = 0;
+ target->bx = 0;
+ target->bp = 0;
+ target->r12 = 0;
+ target->r13 = 0;
+ target->r14 = 0;
+ target->r15 = 0;
+}
+
+/*
+ * Replicate the fact that the interrupted atomic-IST-entry's
+ * ist_copy_regs_to_main_stack() clobbers caller-saved registers.
+ */
+static __always_inline void replicate_func_clobber(struct pt_regs *target)
+{
+ /* nothing needs to be done. */
+}
+
+/*
+ * Replicate the copy operation in the interrupted atomic-IST-entry's
+ * ist_copy_regs_to_main_stack()
+ */
+static __always_inline void replicate_func_copy(struct pt_regs *target)
+{
+ /*
+ * To avoid recursive function calls with __always_inline, the
+ * copy operation for the interrupted atomic-IST-entry has been
+ * done in the caller of copy_nested(). Nothing needs to be done.
+ */
+}
+
+#define IST_FRAME_SIZE ALIGN(sizeof(struct pt_regs), 16)
+
+/*
+ * Replicate the return result of the interrupted atomic-IST-entry's
+ * ist_copy_regs_to_main_stack() in %rax and the commit operation.
+ */
+static __always_inline void replicate_func_result_and_commit(struct pt_regs *target, unsigned long commit_ip)
+{
+ void *target_of_interrupted = (void *)target + IST_FRAME_SIZE;
+
+ /* return result in %rax */
+ target->ax = (unsigned long)target_of_interrupted;
+ /* move %rax, %rsp */
+ target->sp = (unsigned long)target_of_interrupted;
+ /* the %rip advances to commit point */
+ target->ip = commit_ip;
+}
+
+/*
+ * Do the work as a nested atomic-IST-entry to copy the supposed pt_regs
+ * of the interrupted context to the IST main stack.
+ *
+ * The hardware-entry of the nested atomic-IST-entry has already saved
+ * the exception head of the pt_regs of the interrupted context (inside
+ * the interrupted atomic-IST-entry). To maintain the atomic attribute
+ * of the atomic-IST-entry, the copy_nested() (of the ongoing nested
+ * atomic-IST-entry) has to replicate all that the interrupted
+ * atomic-IST-entries should have done up to the commit point and
+ * copy the supposed saved context (pt_regs).
+ *
+ * To avoid touching any saved pt_regs, the replicating is actually
+ * directly applied on the target pt_regs.
+ */
+static __always_inline
+void copy_nested(struct pt_regs *target, const struct pt_regs *nested, unsigned long commit_ip)
+{
+ copy_regs_exception_head(target, nested);
+ replicate_clac_cld(target);
+ replicate_clear_regs(target);
+ replicate_func_clobber(target);
+ replicate_func_copy(target);
+ replicate_func_result_and_commit(target, commit_ip);
+}
+
+asmlinkage __visible __noinstr_section(".entry.text")
+struct pt_regs *ist_copy_regs_to_main_stack(
+ const struct pt_regs *regs, struct cea_exception_stacks *stacks)
+{
+ unsigned long ist_main_sp = CEA_ESTACK_TOP(stacks, IST);
+ struct ist_ctx ist_ctx[8];
+ const struct pt_regs *gp_saved;
+ struct pt_regs *target;
+ int nr_entries, i;
+
+ /*
+ * Identify all of the atomic-IST-entries.
+ *
+ * The current ongoing atomic-IST-entry doesn't need to be identified,
+ * but is also put in the @ist_ctx[0] for later convenience.
+ *
+ * The for-loop identifies what kind of context @regs has interrupted.
+ * It travels back to the outmost atomic-IST-entry.
+ *
+ * Result:
+ * Identified result is put in ist_ctx[i].
+ * ist_ctx[0] is the current ongoing atomic-IST-entry.
+ * ist_ctx[nr_entries-1] is the outmost atomic-IST-entry.
+ * gp_saved is the atomic-IST-entry that has saved the general registers.
+ */
+ ist_ctx[0].regs = regs;
+ ist_ctx[0].commit_ip = -1; /* unused */
+ nr_entries = 1;
+ gp_saved = regs;
+ for (;;) {
+ if (user_mode((struct pt_regs *)regs))
+ break;
+ if (ip_within_syscall_gap((struct pt_regs *)regs))
+ break;
+ if (!identify_ist(regs, stacks, &ist_ctx[nr_entries])) {
+ /* locate the top of copying target pt_regs */
+ if (IN_CEA_ESTACK(stacks, IST, regs->sp))
+ ist_main_sp = ALIGN_DOWN(regs->sp, 16);
+ break;
+ }
+ if (identify_if_gp_registers_saved(regs, ist_ctx[nr_entries].regs))
+ gp_saved = ist_ctx[nr_entries].regs;
+ regs = ist_ctx[nr_entries].regs;
+ nr_entries++;
+ if (nr_entries >= ARRAY_SIZE(ist_ctx))
+ ist_double_fault();
+ }
+
+ if (!IN_CEA_ESTACK(stacks, IST, ist_main_sp - IST_FRAME_SIZE * nr_entries))
+ ist_double_fault();
+
+ /*
+ * Copy the saved pt_regs to the IST main stack.
+ *
+ * For each atomic-IST-entry including the interrupted ones and
+ * the current ongoing one, call either copy_outmost() or copy_nested()
+ * to copy the pt_regs of what should have been saved, by replicating
+ * if needed, to the IST main stack.
+ */
+ ist_main_sp -= IST_FRAME_SIZE;
+ target = (void *)ist_main_sp;
+ copy_outmost(target, ist_ctx[nr_entries - 1].regs, gp_saved);
+ for (i = nr_entries - 2; unlikely(i >= 0); i--) {
+ ist_main_sp -= IST_FRAME_SIZE;
+ target = (void *)ist_main_sp;
+ copy_nested(target, ist_ctx[i].regs, ist_ctx[i+1].commit_ip);
+ }
+
+ return target;
+}
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bb65371ea9df..f861a56c0002 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -60,5 +60,12 @@ int main(void)
OFFSET(FIXED_stack_canary, fixed_percpu_data, stack_canary);
BLANK();
#endif
+
+ DEFINE(CEA_stacks_NMI, offsetofend(struct cea_exception_stacks, NMI_stack));
+ DEFINE(CEA_stacks_DB, offsetofend(struct cea_exception_stacks, DB_stack));
+ DEFINE(CEA_stacks_MCE, offsetofend(struct cea_exception_stacks, MCE_stack));
+ DEFINE(CEA_stacks_VC, offsetofend(struct cea_exception_stacks, VC_stack));
+ BLANK();
+
return 0;
}
diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
index ffea98f9064b..e756c89996d8 100644
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -123,6 +123,8 @@ static bool skip_addr(void *dest)
{
if (dest == error_entry)
return true;
+ if (dest == ist_copy_regs_to_main_stack)
+ return true;
if (dest == paranoid_entry)
return true;
if (dest == xen_error_entry)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index f937be1afe65..8dfa627d4b41 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3998,6 +3998,11 @@ static int validate_unret(struct objtool_file *file)
return warnings;
}

+static bool in_ist_entry(struct instruction *insn)
+{
+ return !strcmp(insn->sym->name, "ist_copy_regs_to_main_stack");
+}
+
static int validate_retpoline(struct objtool_file *file)
{
struct instruction *insn;
@@ -4016,7 +4021,7 @@ static int validate_retpoline(struct objtool_file *file)
continue;

if (insn->type == INSN_RETURN) {
- if (opts.rethunk) {
+ if (opts.rethunk && !in_ist_entry(insn)) {
WARN_FUNC("'naked' return found in RETHUNK build",
insn->sec, insn->offset);
} else
--
2.19.1.6.gb485710b

2023-04-03 14:12:39

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 4/7] x86/entry: Use atomic-IST-entry for NMI

From: Lai Jiangshan <[email protected]>

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 373 --------------------------------
arch/x86/entry/ist_entry.c | 2 +-
arch/x86/include/asm/idtentry.h | 9 +-
3 files changed, 7 insertions(+), 377 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 50a24cc83581..2bb7ab8512dc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1341,379 +1341,6 @@ SYM_CODE_START_LOCAL(error_return)
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(error_return)

-/*
- * Runs on exception stack. Xen PV does not go through this path at all,
- * so we can use real assembly here.
- *
- * Registers:
- * %r14: Used to save/restore the CR3 of the interrupted context
- * when PAGE_TABLE_ISOLATION is in use. Do not clobber.
- */
-SYM_CODE_START(asm_exc_nmi)
- UNWIND_HINT_IRET_REGS
- ENDBR
-
- /*
- * We allow breakpoints in NMIs. If a breakpoint occurs, then
- * the iretq it performs will take us out of NMI context.
- * This means that we can have nested NMIs where the next
- * NMI is using the top of the stack of the previous NMI. We
- * can't let it execute because the nested NMI will corrupt the
- * stack of the previous NMI. NMI handlers are not re-entrant
- * anyway.
- *
- * To handle this case we do the following:
- * Check the a special location on the stack that contains
- * a variable that is set when NMIs are executing.
- * The interrupted task's stack is also checked to see if it
- * is an NMI stack.
- * If the variable is not set and the stack is not the NMI
- * stack then:
- * o Set the special variable on the stack
- * o Copy the interrupt frame into an "outermost" location on the
- * stack
- * o Copy the interrupt frame into an "iret" location on the stack
- * o Continue processing the NMI
- * If the variable is set or the previous stack is the NMI stack:
- * o Modify the "iret" location to jump to the repeat_nmi
- * o return back to the first NMI
- *
- * Now on exit of the first NMI, we first clear the stack variable
- * The NMI stack will tell any nested NMIs at that point that it is
- * nested. Then we pop the stack normally with iret, and if there was
- * a nested NMI that updated the copy interrupt stack frame, a
- * jump will be made to the repeat_nmi code that will handle the second
- * NMI.
- *
- * However, espfix prevents us from directly returning to userspace
- * with a single IRET instruction. Similarly, IRET to user mode
- * can fault. We therefore handle NMIs from user space like
- * other IST entries.
- */
-
- ASM_CLAC
- cld
-
- /* Use %rdx as our temp variable throughout */
- pushq %rdx
-
- testb $3, CS-RIP+8(%rsp)
- jz .Lnmi_from_kernel
-
- /*
- * NMI from user mode. We need to run on the thread stack, but we
- * can't go through the normal entry paths: NMIs are masked, and
- * we don't want to enable interrupts, because then we'll end
- * up in an awkward situation in which IRQs are on but NMIs
- * are off.
- *
- * We also must not push anything to the stack before switching
- * stacks lest we corrupt the "NMI executing" variable.
- */
-
- swapgs
- FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
- movq %rsp, %rdx
- movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
- UNWIND_HINT_IRET_REGS base=%rdx offset=8
- pushq 5*8(%rdx) /* pt_regs->ss */
- pushq 4*8(%rdx) /* pt_regs->rsp */
- pushq 3*8(%rdx) /* pt_regs->flags */
- pushq 2*8(%rdx) /* pt_regs->cs */
- pushq 1*8(%rdx) /* pt_regs->rip */
- UNWIND_HINT_IRET_REGS
- pushq $-1 /* pt_regs->orig_ax */
- PUSH_AND_CLEAR_REGS rdx=(%rdx)
- ENCODE_FRAME_POINTER
-
- IBRS_ENTER
- UNTRAIN_RET
-
- /*
- * At this point we no longer need to worry about stack damage
- * due to nesting -- we're on the normal thread stack and we're
- * done with the NMI stack.
- */
-
- movq %rsp, %rdi
- movq $-1, %rsi
- call exc_nmi
-
- /*
- * Return back to user mode. We must *not* do the normal exit
- * work, because we don't want to enable interrupts.
- */
- jmp swapgs_restore_regs_and_return_to_usermode
-
-.Lnmi_from_kernel:
- /*
- * Here's what our stack frame will look like:
- * +---------------------------------------------------------+
- * | original SS |
- * | original Return RSP |
- * | original RFLAGS |
- * | original CS |
- * | original RIP |
- * +---------------------------------------------------------+
- * | temp storage for rdx |
- * +---------------------------------------------------------+
- * | "NMI executing" variable |
- * +---------------------------------------------------------+
- * | iret SS } Copied from "outermost" frame |
- * | iret Return RSP } on each loop iteration; overwritten |
- * | iret RFLAGS } by a nested NMI to force another |
- * | iret CS } iteration if needed. |
- * | iret RIP } |
- * +---------------------------------------------------------+
- * | outermost SS } initialized in first_nmi; |
- * | outermost Return RSP } will not be changed before |
- * | outermost RFLAGS } NMI processing is done. |
- * | outermost CS } Copied to "iret" frame on each |
- * | outermost RIP } iteration. |
- * +---------------------------------------------------------+
- * | pt_regs |
- * +---------------------------------------------------------+
- *
- * The "original" frame is used by hardware. Before re-enabling
- * NMIs, we need to be done with it, and we need to leave enough
- * space for the asm code here.
- *
- * We return by executing IRET while RSP points to the "iret" frame.
- * That will either return for real or it will loop back into NMI
- * processing.
- *
- * The "outermost" frame is copied to the "iret" frame on each
- * iteration of the loop, so each iteration starts with the "iret"
- * frame pointing to the final return target.
- */
-
- /*
- * Determine whether we're a nested NMI.
- *
- * If we interrupted kernel code between repeat_nmi and
- * end_repeat_nmi, then we are a nested NMI. We must not
- * modify the "iret" frame because it's being written by
- * the outer NMI. That's okay; the outer NMI handler is
- * about to about to call exc_nmi() anyway, so we can just
- * resume the outer NMI.
- */
-
- movq $repeat_nmi, %rdx
- cmpq 8(%rsp), %rdx
- ja 1f
- movq $end_repeat_nmi, %rdx
- cmpq 8(%rsp), %rdx
- ja nested_nmi_out
-1:
-
- /*
- * Now check "NMI executing". If it's set, then we're nested.
- * This will not detect if we interrupted an outer NMI just
- * before IRET.
- */
- cmpl $1, -8(%rsp)
- je nested_nmi
-
- /*
- * Now test if the previous stack was an NMI stack. This covers
- * the case where we interrupt an outer NMI after it clears
- * "NMI executing" but before IRET. We need to be careful, though:
- * there is one case in which RSP could point to the NMI stack
- * despite there being no NMI active: naughty userspace controls
- * RSP at the very beginning of the SYSCALL targets. We can
- * pull a fast one on naughty userspace, though: we program
- * SYSCALL to mask DF, so userspace cannot cause DF to be set
- * if it controls the kernel's RSP. We set DF before we clear
- * "NMI executing".
- */
- lea 6*8(%rsp), %rdx
- /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
- cmpq %rdx, 4*8(%rsp)
- /* If the stack pointer is above the NMI stack, this is a normal NMI */
- ja first_nmi
-
- subq $EXCEPTION_STKSZ, %rdx
- cmpq %rdx, 4*8(%rsp)
- /* If it is below the NMI stack, it is a normal NMI */
- jb first_nmi
-
- /* Ah, it is within the NMI stack. */
-
- testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
- jz first_nmi /* RSP was user controlled. */
-
- /* This is a nested NMI. */
-
-nested_nmi:
- /*
- * Modify the "iret" frame to point to repeat_nmi, forcing another
- * iteration of NMI handling.
- */
- subq $8, %rsp
- leaq -10*8(%rsp), %rdx
- pushq $__KERNEL_DS
- pushq %rdx
- pushfq
- pushq $__KERNEL_CS
- pushq $repeat_nmi
-
- /* Put stack back */
- addq $(6*8), %rsp
-
-nested_nmi_out:
- popq %rdx
-
- /* We are returning to kernel mode, so this cannot result in a fault. */
- iretq
-
-first_nmi:
- /* Restore rdx. */
- movq (%rsp), %rdx
-
- /* Make room for "NMI executing". */
- pushq $0
-
- /* Leave room for the "iret" frame */
- subq $(5*8), %rsp
-
- /* Copy the "original" frame to the "outermost" frame */
- .rept 5
- pushq 11*8(%rsp)
- .endr
- UNWIND_HINT_IRET_REGS
-
- /* Everything up to here is safe from nested NMIs */
-
-#ifdef CONFIG_DEBUG_ENTRY
- /*
- * For ease of testing, unmask NMIs right away. Disabled by
- * default because IRET is very expensive.
- */
- pushq $0 /* SS */
- pushq %rsp /* RSP (minus 8 because of the previous push) */
- addq $8, (%rsp) /* Fix up RSP */
- pushfq /* RFLAGS */
- pushq $__KERNEL_CS /* CS */
- pushq $1f /* RIP */
- iretq /* continues at repeat_nmi below */
- UNWIND_HINT_IRET_REGS
-1:
-#endif
-
-repeat_nmi:
- ANNOTATE_NOENDBR // this code
- /*
- * If there was a nested NMI, the first NMI's iret will return
- * here. But NMIs are still enabled and we can take another
- * nested NMI. The nested NMI checks the interrupted RIP to see
- * if it is between repeat_nmi and end_repeat_nmi, and if so
- * it will just return, as we are about to repeat an NMI anyway.
- * This makes it safe to copy to the stack frame that a nested
- * NMI will update.
- *
- * RSP is pointing to "outermost RIP". gsbase is unknown, but, if
- * we're repeating an NMI, gsbase has the same value that it had on
- * the first iteration. paranoid_entry will load the kernel
- * gsbase if needed before we call exc_nmi(). "NMI executing"
- * is zero.
- */
- movq $1, 10*8(%rsp) /* Set "NMI executing". */
-
- /*
- * Copy the "outermost" frame to the "iret" frame. NMIs that nest
- * here must not modify the "iret" frame while we're writing to
- * it or it will end up containing garbage.
- */
- addq $(10*8), %rsp
- .rept 5
- pushq -6*8(%rsp)
- .endr
- subq $(5*8), %rsp
-end_repeat_nmi:
- ANNOTATE_NOENDBR // this code
-
- /*
- * Everything below this point can be preempted by a nested NMI.
- * If this happens, then the inner NMI will change the "iret"
- * frame to point back to repeat_nmi.
- */
- pushq $-1 /* ORIG_RAX: no syscall to restart */
-
- PUSH_AND_CLEAR_REGS
- UNWIND_HINT_REGS
- ENCODE_FRAME_POINTER
-
- /*
- * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
- * as we should not be calling schedule in NMI context.
- * Even with normal interrupts enabled. An NMI should not be
- * setting NEED_RESCHED or anything that normal interrupts and
- * exceptions might do.
- */
- call paranoid_entry
-
- movq %rsp, %rdi
- movq $-1, %rsi
- call exc_nmi
-
- /* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
- IBRS_EXIT save_reg=%r15
-
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
- /*
- * The above invocation of paranoid_entry stored the GSBASE
- * related information in R/EBX depending on the availability
- * of FSGSBASE.
- *
- * If FSGSBASE is enabled, restore the saved GSBASE value
- * unconditionally, otherwise take the conditional SWAPGS path.
- */
- ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
- wrgsbase %rbx
- jmp nmi_restore
-
-nmi_no_fsgsbase:
- /* EBX == 0 -> invoke SWAPGS */
- testl %ebx, %ebx
- jnz nmi_restore
-
-nmi_swapgs:
- swapgs
-
-nmi_restore:
- POP_REGS
-
- /*
- * Skip orig_ax and the "outermost" frame to point RSP at the "iret"
- * at the "iret" frame.
- */
- addq $6*8, %rsp
-
- /*
- * Clear "NMI executing". Set DF first so that we can easily
- * distinguish the remaining code between here and IRET from
- * the SYSCALL entry and exit paths.
- *
- * We arguably should just inspect RIP instead, but I (Andy) wrote
- * this code when I had the misapprehension that Xen PV supported
- * NMIs, and Xen PV would break that approach.
- */
- std
- movq $0, 5*8(%rsp) /* clear "NMI executing" */
-
- /*
- * iretq reads the "iret" frame and exits the NMI stack in a
- * single instruction. We are returning to kernel mode, so this
- * cannot result in a fault. Similarly, we don't need to worry
- * about espfix64 on the way back to kernel mode.
- */
- iretq
-SYM_CODE_END(asm_exc_nmi)
-
#ifndef CONFIG_IA32_EMULATION
/*
* This handles SYSCALL from 32-bit code. There is no way to program
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
index e1b06306ac51..407571cc4a8c 100644
--- a/arch/x86/entry/ist_entry.c
+++ b/arch/x86/entry/ist_entry.c
@@ -41,7 +41,7 @@ static __always_inline bool identify_ist_##sym_name( \
return true; \
}

-DEFINE_IDENTIFY_IST(NMI, nmi, false)
+DEFINE_IDENTIFY_IST(NMI, nmi, true)
DEFINE_IDENTIFY_IST(DB, debug, false)
DEFINE_IDENTIFY_IST(MCE, machine_check, false)
DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b241af4ce9b4..b568f1de6da6 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -450,6 +450,9 @@ __visible noinstr void func(struct pt_regs *regs, \
idtentry_sysvec vector func

#ifdef CONFIG_X86_64
+# define DECLARE_IDTENTRY_NMI(vector, func) \
+ idtentry_ist vector asm_##func func func has_error_code=0 stack_offset=CEA_stacks_NMI
+
# define DECLARE_IDTENTRY_MCE(vector, func) \
idtentry_mce_db vector asm_##func func

@@ -475,11 +478,11 @@ __visible noinstr void func(struct pt_regs *regs, \
/* No ASM emitted for XEN hypervisor callback */
# define DECLARE_IDTENTRY_XENCB(vector, func)

-#endif
-
-/* No ASM code emitted for NMI */
+/* No ASM code emitted for NMI on X86_32 */
#define DECLARE_IDTENTRY_NMI(vector, func)

+#endif
+
/*
* ASM code to emit the common vector entry stubs where each stub is
* packed into IDT_ALIGN bytes.
--
2.19.1.6.gb485710b

2023-04-03 14:12:47

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 5/7] x86/entry: Use atomic-IST-entry for MCE and DB

From: Lai Jiangshan <[email protected]>

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 53 ---------------------------------
arch/x86/entry/ist_entry.c | 4 +--
arch/x86/include/asm/idtentry.h | 4 +--
3 files changed, 4 insertions(+), 57 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2bb7ab8512dc..e4ddc793f841 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -621,59 +621,6 @@ _ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm

-/**
- * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
- * @vector: Vector number
- * @asmsym: ASM symbol for the entry point
- * @cfunc: C function to be called
- *
- * The macro emits code to set up the kernel context for #MC and #DB
- *
- * If the entry comes from user space it uses the normal entry path
- * including the return to user space work and preemption checks on
- * exit.
- *
- * If hits in kernel mode then it needs to go through the paranoid
- * entry as the exception can hit any random state. No preemption
- * check on exit to keep the paranoid path simple.
- */
-.macro idtentry_mce_db vector asmsym cfunc
-SYM_CODE_START(\asmsym)
- UNWIND_HINT_IRET_REGS
- ENDBR
- ASM_CLAC
- cld
-
- pushq $-1 /* ORIG_RAX: no syscall to restart */
-
- /*
- * If the entry is from userspace, switch stacks and treat it as
- * a normal entry.
- */
- testb $3, CS-ORIG_RAX(%rsp)
- jnz .Lfrom_usermode_switch_stack_\@
-
- PUSH_AND_CLEAR_REGS
- UNWIND_HINT_REGS
- ENCODE_FRAME_POINTER
-
- /* paranoid_entry returns GS information for paranoid_exit in EBX. */
- call paranoid_entry
-
- movq %rsp, %rdi /* pt_regs pointer */
-
- call \cfunc
-
- jmp paranoid_exit
-
- /* Switch to the regular task stack and use the noist entry point */
-.Lfrom_usermode_switch_stack_\@:
- idtentry_body noist_\cfunc, has_error_code=0
-
-_ASM_NOKPROBE(\asmsym)
-SYM_CODE_END(\asmsym)
-.endm
-
#ifdef CONFIG_AMD_MEM_ENCRYPT
/**
* idtentry_vc - Macro to generate entry stub for #VC
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
index 407571cc4a8c..946b3b537bd5 100644
--- a/arch/x86/entry/ist_entry.c
+++ b/arch/x86/entry/ist_entry.c
@@ -42,8 +42,8 @@ static __always_inline bool identify_ist_##sym_name( \
}

DEFINE_IDENTIFY_IST(NMI, nmi, true)
-DEFINE_IDENTIFY_IST(DB, debug, false)
-DEFINE_IDENTIFY_IST(MCE, machine_check, false)
+DEFINE_IDENTIFY_IST(DB, debug, true)
+DEFINE_IDENTIFY_IST(MCE, machine_check, IS_ENABLED(CONFIG_X86_MCE))
DEFINE_IDENTIFY_IST(VC, vmm_communication, false)

static __always_inline bool identify_ist(
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b568f1de6da6..01f3152ffe82 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -454,10 +454,10 @@ __visible noinstr void func(struct pt_regs *regs, \
idtentry_ist vector asm_##func func func has_error_code=0 stack_offset=CEA_stacks_NMI

# define DECLARE_IDTENTRY_MCE(vector, func) \
- idtentry_mce_db vector asm_##func func
+ idtentry_ist vector asm_##func func noist_##func has_error_code=0 stack_offset=CEA_stacks_MCE

# define DECLARE_IDTENTRY_DEBUG(vector, func) \
- idtentry_mce_db vector asm_##func func
+ idtentry_ist vector asm_##func func noist_##func has_error_code=0 stack_offset=CEA_stacks_DB

# define DECLARE_IDTENTRY_DF(vector, func) \
idtentry_df vector asm_##func func
--
2.19.1.6.gb485710b

2023-04-03 14:13:04

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 6/7] x86/entry: Use atomic-IST-entry for VC

From: Lai Jiangshan <[email protected]>

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 83 --------------------
arch/x86/entry/ist_entry.c | 2 +-
arch/x86/include/asm/cpu_entry_area.h | 3 -
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/sev.h | 14 ----
arch/x86/include/asm/traps.h | 1 -
arch/x86/kernel/dumpstack_64.c | 4 +-
arch/x86/kernel/nmi.c | 8 --
arch/x86/kernel/sev.c | 108 --------------------------
arch/x86/kernel/traps.c | 43 ----------
arch/x86/mm/cpu_entry_area.c | 1 -
11 files changed, 3 insertions(+), 266 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e4ddc793f841..187d42efd288 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -621,89 +621,6 @@ _ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm

-#ifdef CONFIG_AMD_MEM_ENCRYPT
-/**
- * idtentry_vc - Macro to generate entry stub for #VC
- * @vector: Vector number
- * @asmsym: ASM symbol for the entry point
- * @cfunc: C function to be called
- *
- * The macro emits code to set up the kernel context for #VC. The #VC handler
- * runs on an IST stack and needs to be able to cause nested #VC exceptions.
- *
- * To make this work the #VC entry code tries its best to pretend it doesn't use
- * an IST stack by switching to the task stack if coming from user-space (which
- * includes early SYSCALL entry path) or back to the stack in the IRET frame if
- * entered from kernel-mode.
- *
- * If entered from kernel-mode the return stack is validated first, and if it is
- * not safe to use (e.g. because it points to the entry stack) the #VC handler
- * will switch to a fall-back stack (VC2) and call a special handler function.
- *
- * The macro is only used for one vector, but it is planned to be extended in
- * the future for the #HV exception.
- */
-.macro idtentry_vc vector asmsym cfunc
-SYM_CODE_START(\asmsym)
- UNWIND_HINT_IRET_REGS
- ENDBR
- ASM_CLAC
- cld
-
- /*
- * If the entry is from userspace, switch stacks and treat it as
- * a normal entry.
- */
- testb $3, CS-ORIG_RAX(%rsp)
- jnz .Lfrom_usermode_switch_stack_\@
-
- PUSH_AND_CLEAR_REGS
- UNWIND_HINT_REGS
- ENCODE_FRAME_POINTER
-
- /*
- * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
- * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
- */
- call paranoid_entry
-
- /*
- * Switch off the IST stack to make it free for nested exceptions. The
- * vc_switch_off_ist() function will switch back to the interrupted
- * stack if it is safe to do so. If not it switches to the VC fall-back
- * stack.
- */
- movq %rsp, %rdi /* pt_regs pointer */
- call vc_switch_off_ist
- movq %rax, %rsp /* Switch to new stack */
-
- ENCODE_FRAME_POINTER
- UNWIND_HINT_REGS
-
- /* Update pt_regs */
- movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
- movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
-
- movq %rsp, %rdi /* pt_regs pointer */
-
- call kernel_\cfunc
-
- /*
- * No need to switch back to the IST stack. The current stack is either
- * identical to the stack in the IRET frame or the VC fall-back stack,
- * so it is definitely mapped even with PTI enabled.
- */
- jmp paranoid_exit
-
- /* Switch to the regular task stack */
-.Lfrom_usermode_switch_stack_\@:
- idtentry_body user_\cfunc, has_error_code=1
-
-_ASM_NOKPROBE(\asmsym)
-SYM_CODE_END(\asmsym)
-.endm
-#endif
-
/*
* Double fault entry. Straight paranoid. No checks from which context
* this comes because for the espfix induced #DF this would do the wrong
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
index 946b3b537bd5..c0cbd4527033 100644
--- a/arch/x86/entry/ist_entry.c
+++ b/arch/x86/entry/ist_entry.c
@@ -44,7 +44,7 @@ static __always_inline bool identify_ist_##sym_name( \
DEFINE_IDENTIFY_IST(NMI, nmi, true)
DEFINE_IDENTIFY_IST(DB, debug, true)
DEFINE_IDENTIFY_IST(MCE, machine_check, IS_ENABLED(CONFIG_X86_MCE))
-DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
+DEFINE_IDENTIFY_IST(VC, vmm_communication, IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))

static __always_inline bool identify_ist(
const struct pt_regs *regs, struct cea_exception_stacks *stacks,
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index a373e8c37e25..618aa698eb82 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@
char MCE_stack[EXCEPTION_STKSZ]; \
char VC_stack_guard[guardsize]; \
char VC_stack[optional_stack_size]; \
- char VC2_stack_guard[guardsize]; \
- char VC2_stack[optional_stack_size]; \
char IST_stack_guard[guardsize]; \
char IST_stack[IST_MAIN_STKSZ]; \
char IST_top_guard[guardsize]; \
@@ -55,7 +53,6 @@ enum exception_stack_ordering {
ESTACK_DB,
ESTACK_MCE,
ESTACK_VC,
- ESTACK_VC2,
ESTACK_IST,
N_EXCEPTION_STACKS
};
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 01f3152ffe82..5f3250e589ec 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -466,7 +466,7 @@ __visible noinstr void func(struct pt_regs *regs, \
DECLARE_IDTENTRY(vector, func)

# define DECLARE_IDTENTRY_VC(vector, func) \
- idtentry_vc vector asm_##func func
+ idtentry_ist vector asm_##func kernel_##func user_##func has_error_code=1 stack_offset=CEA_stacks_VC

#else
# define DECLARE_IDTENTRY_MCE(vector, func) \
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ebc271bb6d8e..ce554b3a818d 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -135,18 +135,6 @@ struct snp_secrets_page_layout {

#ifdef CONFIG_AMD_MEM_ENCRYPT
extern struct static_key_false sev_es_enable_key;
-extern void __sev_es_ist_enter(struct pt_regs *regs);
-extern void __sev_es_ist_exit(void);
-static __always_inline void sev_es_ist_enter(struct pt_regs *regs)
-{
- if (static_branch_unlikely(&sev_es_enable_key))
- __sev_es_ist_enter(regs);
-}
-static __always_inline void sev_es_ist_exit(void)
-{
- if (static_branch_unlikely(&sev_es_enable_key))
- __sev_es_ist_exit();
-}
extern int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
extern void __sev_es_nmi_complete(void);
static __always_inline void sev_es_nmi_complete(void)
@@ -198,8 +186,6 @@ bool snp_init(struct boot_params *bp);
void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
#else
-static inline void sev_es_ist_enter(struct pt_regs *regs) { }
-static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..dc0da530f951 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -15,7 +15,6 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
void __init trap_init(void);
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
#endif

extern bool ibt_selftest(void);
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 3413b23fa9f1..b7ef2685f63b 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -25,7 +25,6 @@ static const char * const exception_stack_names[] = {
[ ESTACK_DB ] = "#DB",
[ ESTACK_MCE ] = "#MC",
[ ESTACK_VC ] = "#VC",
- [ ESTACK_VC2 ] = "#VC2",
[ ESTACK_IST ] = "#IST",
};

@@ -89,7 +88,6 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
EPAGERANGE(DB),
EPAGERANGE(MCE),
EPAGERANGE(VC),
- EPAGERANGE(VC2),
EPAGERANGE(IST),
};

@@ -100,7 +98,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
struct pt_regs *regs;
unsigned int k;

- BUILD_BUG_ON(N_EXCEPTION_STACKS != 7);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);

begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
/*
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 776f4b1e395b..bafd0c7ca5b7 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -514,12 +514,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
}
nmi_restart:

- /*
- * Needs to happen before DR7 is accessed, because the hypervisor can
- * intercept DR7 reads/writes, turning those into #VC exceptions.
- */
- sev_es_ist_enter(regs);
-
this_cpu_write(nmi_dr7, local_db_save());

irq_state = irqentry_nmi_enter(regs);
@@ -544,8 +538,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

local_db_restore(this_cpu_read(nmi_dr7));

- sev_es_ist_exit();
-
if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
write_cr2(this_cpu_read(nmi_cr2));
if (this_cpu_dec_return(nmi_state))
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 679026a640ef..74d55786c353 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -122,77 +122,6 @@ struct sev_config {

static struct sev_config sev_cfg __read_mostly;

-static __always_inline bool on_vc_stack(struct pt_regs *regs)
-{
- unsigned long sp = regs->sp;
-
- /* User-mode RSP is not trusted */
- if (user_mode(regs))
- return false;
-
- /* SYSCALL gap still has user-mode RSP */
- if (ip_within_syscall_gap(regs))
- return false;
-
- return ((sp >= __this_cpu_ist_bottom_va(VC)) && (sp < __this_cpu_ist_top_va(VC)));
-}
-
-/*
- * This function handles the case when an NMI is raised in the #VC
- * exception handler entry code, before the #VC handler has switched off
- * its IST stack. In this case, the IST entry for #VC must be adjusted,
- * so that any nested #VC exception will not overwrite the stack
- * contents of the interrupted #VC handler.
- *
- * The IST entry is adjusted unconditionally so that it can be also be
- * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a
- * nested sev_es_ist_exit() call may adjust back the IST entry too
- * early.
- *
- * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run
- * on the NMI IST stack, as they are only called from NMI handling code
- * right now.
- */
-void noinstr __sev_es_ist_enter(struct pt_regs *regs)
-{
- unsigned long old_ist, new_ist;
-
- /* Read old IST entry */
- new_ist = old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
-
- /*
- * If NMI happened while on the #VC IST stack, set the new IST
- * value below regs->sp, so that the interrupted stack frame is
- * not overwritten by subsequent #VC exceptions.
- */
- if (on_vc_stack(regs))
- new_ist = regs->sp;
-
- /*
- * Reserve additional 8 bytes and store old IST value so this
- * adjustment can be unrolled in __sev_es_ist_exit().
- */
- new_ist -= sizeof(old_ist);
- *(unsigned long *)new_ist = old_ist;
-
- /* Set new IST entry */
- this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], new_ist);
-}
-
-void noinstr __sev_es_ist_exit(void)
-{
- unsigned long ist;
-
- /* Read IST entry */
- ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
-
- if (WARN_ON(ist == __this_cpu_ist_top_va(VC)))
- return;
-
- /* Read back old IST entry and write it to the TSS */
- this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], *(unsigned long *)ist);
-}
-
/*
* Nothing shall interrupt this code path while holding the per-CPU
* GHCB. The backup GHCB is only for NMIs interrupting this path.
@@ -1841,26 +1770,6 @@ static __always_inline void vc_forward_exception(struct es_em_ctxt *ctxt)
}
}

-static __always_inline bool is_vc2_stack(unsigned long sp)
-{
- return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
-}
-
-static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
-{
- unsigned long sp, prev_sp;
-
- sp = (unsigned long)regs;
- prev_sp = regs->sp;
-
- /*
- * If the code was already executing on the VC2 stack when the #VC
- * happened, let it proceed to the normal handling routine. This way the
- * code executing on the VC2 stack can cause #VC exceptions to get handled.
- */
- return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
-}
-
static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
{
struct ghcb_state state;
@@ -1930,23 +1839,6 @@ DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
{
irqentry_state_t irq_state;

- /*
- * With the current implementation it is always possible to switch to a
- * safe stack because #VC exceptions only happen at known places, like
- * intercepted instructions or accesses to MMIO areas/IO ports. They can
- * also happen with code instrumentation when the hypervisor intercepts
- * #DB, but the critical paths are forbidden to be instrumented, so #DB
- * exceptions currently also only happen in safe places.
- *
- * But keep this here in case the noinstr annotations are violated due
- * to bug elsewhere.
- */
- if (unlikely(vc_from_invalid_context(regs))) {
- instrumentation_begin();
- panic("Can't handle #VC exception from unsupported context\n");
- instrumentation_end();
- }
-
/*
* Handle #DB before calling into !noinstr code to avoid recursive #DB.
*/
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..6c697c175f7a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -864,49 +864,6 @@ asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
return regs;
}

-#ifdef CONFIG_AMD_MEM_ENCRYPT
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
-{
- unsigned long sp, *stack;
- struct stack_info info;
- struct pt_regs *regs_ret;
-
- /*
- * In the SYSCALL entry path the RSP value comes from user-space - don't
- * trust it and switch to the current kernel stack
- */
- if (ip_within_syscall_gap(regs)) {
- sp = this_cpu_read(pcpu_hot.top_of_stack);
- goto sync;
- }
-
- /*
- * From here on the RSP value is trusted. Now check whether entry
- * happened from a safe stack. Not safe are the entry or unknown stacks,
- * use the fall-back stack instead in this case.
- */
- sp = regs->sp;
- stack = (unsigned long *)sp;
-
- if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
- info.type > STACK_TYPE_EXCEPTION_LAST)
- sp = __this_cpu_ist_top_va(VC2);
-
-sync:
- /*
- * Found a safe stack - switch to it as if the entry didn't happen via
- * IST stack. The code below only copies pt_regs, the real switch happens
- * in assembly code.
- */
- sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret);
-
- regs_ret = (struct pt_regs *)sp;
- *regs_ret = *regs;
-
- return regs_ret;
-}
-#endif
-
asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
{
struct pt_regs tmp, *new_stack;
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 62341cb819ab..7df1301ec343 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -153,7 +153,6 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
cea_map_stack(VC);
- cea_map_stack(VC2);
}
}
}
--
2.19.1.6.gb485710b

2023-04-03 14:14:17

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 7/7] x86/entry: Test atomic-IST-entry via KVM

From: Lai Jiangshan <[email protected]>

Not for inclusion.
It is not part of the atomic-IST-entry patch.

Test it in a VM with only one vCPU.
Launch the VM with the uncompressed vmlinux.

The test injects super exceptions (MCE NMI DB) randomly.

The test makes nested exceptions (of different vectors) occur more
frequently.

The test re-enables nesting of the same vector very eagerly:
when the CPU reaches a commit_ip, all the events are re-enabled.

usage:

parameters="$parameters exc_nmi_handler=0x$(readelf -W -s vmlinux | awk '/ exc_nmi$/ { print $2}')"
parameters="$parameters asm_exc_nmi_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_nmi$/ { print $2}')"
parameters="$parameters asm_exc_db_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_debug$/ { print $2}')"
parameters="$parameters asm_exc_mce_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_machine_check$/ { print $2}')"
parameters="$parameters exc_db_kernel_handler=0x$(readelf -W -s vmlinux | awk '/ exc_debug$/ { print $2}')"
parameters="$parameters exc_db_user_handler=0x$(readelf -W -s vmlinux | awk '/ noist_exc_debug$/ { print $2}')"
parameters="$parameters exc_mce_kernel_handler=0x$(readelf -W -s vmlinux | awk '/ exc_machine_check$/ { print $2}')"
parameters="$parameters exc_mce_user_handler=0x$(readelf -W -s vmlinux | awk '/ noist_exc_machine_check$/ { print $2}')"
parameters="$parameters asm_exc_nmi_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_nmi$/ { print $2}')"
parameters="$parameters asm_exc_db_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_debug$/ { print $2}')"
parameters="$parameters asm_exc_mce_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_machine_check$/ { print $2}')"
parameters="$parameters native_irq_return_iret_ip=0x$(readelf -W -s vmlinux | awk '/ native_irq_return_iret$/ { print $2}')"
parameters="$parameters ist_emul_test=1"

insmod arch/x86/kvm/kvm.ko
insmod arch/x86/kvm/kvm-intel.ko $parameters

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 441 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 10 +-
2 files changed, 447 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bcac3efcde41..6f0d574a137a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1252,6 +1252,441 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
}
}

+#include <linux/prandom.h>
+
+// readelf -W -s vmlinux | awk '/ asm_exc_double_fault$/ { print $2}'
+static unsigned long asm_exc_df_handler;
+module_param(asm_exc_df_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ asm_exc_nmi$/ { print $2}'
+static unsigned long asm_exc_nmi_handler;
+module_param(asm_exc_nmi_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ asm_exc_debug$/ { print $2}'
+static unsigned long asm_exc_db_handler;
+module_param(asm_exc_db_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ asm_exc_machine_check$/ { print $2}'
+static unsigned long asm_exc_mce_handler;
+module_param(asm_exc_mce_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ exc_nmi$/ { print $2}'
+static unsigned long exc_nmi_handler;
+module_param(exc_nmi_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ exc_debug$/ { print $2}'
+static unsigned long exc_db_kernel_handler;
+module_param(exc_db_kernel_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ noist_exc_debug$/ { print $2}'
+static unsigned long exc_db_user_handler;
+module_param(exc_db_user_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ exc_machine_check$/ { print $2}'
+static unsigned long exc_mce_kernel_handler;
+module_param(exc_mce_kernel_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ noist_exc_machine_check$/ { print $2}'
+static unsigned long exc_mce_user_handler;
+module_param(exc_mce_user_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ commit_asm_exc_nmi$/ { print $2}'
+static unsigned long asm_exc_nmi_commit_ip;
+module_param(asm_exc_nmi_commit_ip, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ commit_asm_exc_debug$/ { print $2}'
+static unsigned long asm_exc_db_commit_ip;
+module_param(asm_exc_db_commit_ip, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ commit_asm_exc_machine_check$/ { print $2}'
+static unsigned long asm_exc_mce_commit_ip;
+module_param(asm_exc_mce_commit_ip, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ native_irq_return_iret$/ { print $2}'
+static unsigned long native_irq_return_iret_ip;
+module_param(native_irq_return_iret_ip, ulong, S_IRUGO);
+
+#define MAX_NESTING 50
+#define IST_START_RATE 512 /* rate of queuing an IST per vm-exit if no IST queued */
+#define IST_NESTED_RATE 256 /* rate of nesting an IST per instruction during handling IST */
+
+static int instruction_count;
+static int ist_depth, ist_atomic_depth;
+static int ist_nmi_blocking, ist_db_blocking, ist_mce_blocking;
+static u8 pending_vector[MAX_NESTING];
+static unsigned long ist_interrupted_rip[MAX_NESTING];
+static unsigned long commit_ip;
+
+#define TEST_STOPPED 0
+#define TEST_REQUESTED 1
+#define TEST_ONGOING 2
+static int ist_emul_test = TEST_STOPPED;
+// set ist_emul_test = 1 (TEST_REQUESTED) when loading the module to enable the test
+// in a script which also sets all the above symbol addresses
+module_param(ist_emul_test, int, S_IRUGO);
+
+static bool test_too_early_nesting_bug;
+module_param(test_too_early_nesting_bug, bool, S_IRUGO);
+
+extern int ist_emul_test_singlestep;
+
+static bool ist_emul_injecting;
+static bool ist_emul_returning;
+
+static void print_pending_vectors(void)
+{
+ int i;
+
+ pr_cont("current vectors:");
+ for (i = 0; i < ist_depth; i++) {
+ switch (pending_vector[i]) {
+ case MC_VECTOR: pr_cont(" #MC"); break;
+ case DB_VECTOR: pr_cont(" #DB"); break;
+ case 2: pr_cont(" NMI"); break;
+ default: WARN_ON(1);
+ }
+ }
+ pr_cont(";\n");
+}
+
+#define DR7_EN(i, type, len) (((type | len) & 0xf) << (DR_CONTROL_SHIFT + i * DR_CONTROL_SIZE)) |\
+ (DR_GLOBAL_ENABLE << (i * DR_ENABLE_SIZE))
+
+#define DR7_X(i) DR7_EN(i, X86_BREAKPOINT_EXECUTE, X86_BREAKPOINT_LEN_X)
+
+static void watch_injecting(struct kvm_vcpu *vcpu)
+{
+ kvm_set_dr(vcpu, 0, asm_exc_nmi_handler);
+ kvm_set_dr(vcpu, 1, asm_exc_db_handler);
+ kvm_set_dr(vcpu, 2, asm_exc_mce_handler);
+ kvm_set_dr(vcpu, 3, asm_exc_df_handler);
+ kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1) | DR7_X(2) | DR7_X(3));
+}
+
+static void finish_watching_injecting(struct kvm_vcpu *vcpu)
+{
+ kvm_set_dr(vcpu, 0, 0);
+ kvm_set_dr(vcpu, 1, 0);
+ kvm_set_dr(vcpu, 2, 0);
+ kvm_set_dr(vcpu, 3, 0);
+ kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ kvm_set_dr(vcpu, 7, 0);
+}
+
+static void watch_handler_done(struct kvm_vcpu *vcpu)
+{
+ kvm_set_dr(vcpu, 0, ist_interrupted_rip[ist_depth-1]);
+ kvm_set_dr(vcpu, 1, asm_exc_df_handler);
+ //kvm_set_dr(vcpu, 2, exc_nmi_handler);
+ //kvm_set_dr(vcpu, 3, native_irq_return_iret_ip);
+ kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1));
+ //kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1) | DR7_X(2) | DR7_X(3));
+}
+
+static void ist_emul_test_start_singlestep(struct kvm_vcpu *vcpu)
+{
+ ist_emul_test_singlestep = 1;
+}
+
+static void ist_emul_test_stop_singlestep(struct kvm_vcpu *vcpu)
+{
+ ist_emul_test_singlestep = 0;
+ watch_handler_done(vcpu);
+ finish_watching_injecting(vcpu);
+}
+
+static void ist_emul_test_stop(struct kvm_vcpu *vcpu)
+{
+ ist_emul_test = TEST_STOPPED;
+ ist_emul_test_stop_singlestep(vcpu);
+}
+
+static bool is_entering_ip(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ return rip == asm_exc_mce_handler || rip == asm_exc_db_handler || rip == asm_exc_nmi_handler;
+}
+
+static void emul_ist_bug_in_singlestep(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ if (WARN_ON_ONCE(!ist_depth)) {
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return;
+ }
+
+ if (rip == asm_exc_df_handler || instruction_count > 1000) {
+ ist_emul_test_stop(vcpu);
+ return;
+ }
+}
+
+static unsigned long read_one(struct kvm_vcpu *vcpu, unsigned long addr)
+{
+ unsigned long ret;
+ struct x86_exception e;
+ if (kvm_read_guest_virt(vcpu, addr, &ret, sizeof(ret), &e) != X86EMUL_CONTINUE)
+ return 0xf000f123f456f789;
+ return ret;
+}
+
+static void emul_ist_kctx_handler_return(struct kvm_vcpu *vcpu)
+{
+ unsigned long rsp = kvm_rsp_read(vcpu);
+ unsigned long ret_rip;
+ struct x86_exception e;
+
+ if (kvm_read_guest_virt(vcpu, rsp, &ret_rip, sizeof(ret_rip), &e) != X86EMUL_CONTINUE)
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
+ /* It is a faked #MC/#DB/NMI; make the guest handle it as if the handler in kernel context were a NOP */
+ kvm_rsp_write(vcpu, rsp+8);
+ kvm_rip_write(vcpu, ret_rip);
+}
+
+static void emul_ist_kctx_handler_done(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+ int vector = -1;
+
+ if (rip == exc_nmi_handler)
+ vector = NMI_VECTOR;
+ if (rip == exc_db_kernel_handler || rip == exc_db_user_handler)
+ vector = DB_VECTOR;
+ if (rip == exc_mce_kernel_handler || rip == exc_mce_user_handler)
+ vector = MC_VECTOR;
+
+ if (vector > 0) {
+ if (WARN_ON(vector != pending_vector[ist_depth-1]))
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ pr_info("Enter the handler in kernel context; ");
+ print_pending_vectors();
+ emul_ist_kctx_handler_return(vcpu);
+ }
+}
+
+static void emul_ist_emul_enable_delivering(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ if (rip == commit_ip) {
+ ist_atomic_depth = 0;
+ ist_nmi_blocking = 0;
+ ist_db_blocking = 0;
+ ist_mce_blocking = 0;
+ }
+}
+
+static bool test_nesting_bug(void)
+{
+ return test_too_early_nesting_bug && ((get_random_u32() % 1024) == 0);
+}
+
+static void __ist_emul_test_queue_one(struct kvm_vcpu *vcpu, int vector)
+{
+ if (!ist_atomic_depth)
+ ist_interrupted_rip[ist_depth] = kvm_rip_read(vcpu);
+ else
+ ist_interrupted_rip[ist_depth] = commit_ip;
+ pending_vector[ist_depth] = vector;
+ ist_depth++;
+ ist_atomic_depth++;
+ instruction_count = 0;
+ pr_info("queue an IST exception, depth:%d, atomic_depth:%d; ", ist_depth, ist_atomic_depth);
+ print_pending_vectors();
+
+ // Per the Intel SDM, X86_EFLAGS_TF is cleared after the exception is delivered.
+ // Use a hw breakpoint to force an immediate exit
+ // and set X86_EFLAGS_TF after the breakpoint hits.
+ ist_emul_injecting = true;
+ watch_injecting(vcpu);
+ pr_info("interrupt ip: %lx return ip: %lx\n", kvm_rip_read(vcpu), ist_interrupted_rip[ist_depth-1]);
+}
+
+static void ist_emul_test_queue_one(struct kvm_vcpu *vcpu)
+{
+ u32 which = get_random_u32() % 3;
+
+ if (ist_depth >= MAX_NESTING)
+ return;
+
+ if (which <= 0 && (!ist_mce_blocking || test_nesting_bug())) {
+ kvm_queue_exception(vcpu, MC_VECTOR);
+ ist_mce_blocking = 1;
+ __ist_emul_test_queue_one(vcpu, MC_VECTOR);
+ commit_ip = asm_exc_mce_commit_ip;
+ return;
+ }
+
+ if (which <= 1 && (!ist_db_blocking || test_nesting_bug())) {
+ kvm_queue_exception_p(vcpu, DB_VECTOR, DR6_BS);
+ ist_db_blocking = 1;
+ __ist_emul_test_queue_one(vcpu, DB_VECTOR);
+ commit_ip = asm_exc_db_commit_ip;
+ return;
+ }
+
+ if (which <= 2 && (!ist_nmi_blocking || test_nesting_bug())) {
+ vmx_set_nmi_mask(vcpu, false);
+ kvm_inject_nmi(vcpu);
+ ist_nmi_blocking = 1;
+ __ist_emul_test_queue_one(vcpu, 2);
+ commit_ip = asm_exc_nmi_commit_ip;
+ return;
+ }
+}
+
+static void __handle_ist_emul_test_singlestep(struct kvm_vcpu *vcpu)
+{
+ instruction_count++;
+ //pr_info("singlestep ip: %lx\n", kvm_rip_read(vcpu));
+ emul_ist_emul_enable_delivering(vcpu);
+ emul_ist_kctx_handler_done(vcpu);
+ emul_ist_bug_in_singlestep(vcpu);
+
+ /* try to nest an IST exception if the previous is not done */
+ if (!ist_emul_test_singlestep)
+ return;
+
+ // no RF: the return rip needs to be watched.
+ // #DB during a repeated string instruction will get RF (sync_regs())
+ if (vmx_get_rflags(vcpu) & X86_EFLAGS_RF)
+ return;
+
+ if (get_random_u32() % IST_NESTED_RATE)
+ return;
+
+ ist_emul_test_queue_one(vcpu);
+}
+
+static bool handle_ist_emul_test_singlestep(struct kvm_vcpu *vcpu, unsigned long dr6)
+{
+ unsigned long rip;
+
+ if (!ist_emul_test_singlestep)
+ return false;
+
+ //kvm_get_dr(vcpu, 6, &dr6);
+ //pr_info("singlestep rip %lx dr6 %lx\n", kvm_rip_read(vcpu), dr6);
+ //kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ if (WARN_ON(!ist_emul_injecting && !ist_emul_returning && is_entering_ip(vcpu))) {
+ ist_emul_test_stop(vcpu);
+ return true;
+ }
+ if (ist_emul_injecting) {
+ //pr_info("xxx ip: %lx\n", kvm_rip_read(vcpu));
+ if (WARN_ON(!is_entering_ip(vcpu))) {
+ ist_emul_test_stop(vcpu);
+ return true;
+ }
+ ist_emul_injecting = false;
+ finish_watching_injecting(vcpu);
+ if (0)
+ read_one(vcpu, kvm_rsp_read(vcpu));
+ //pr_info("rip in ret %lx\n", read_one(vcpu, kvm_rsp_read(vcpu)));
+ //watch_handler_done(vcpu);
+ //if (vmx_get_rflags(vcpu) & X86_EFLAGS_TF)
+ // pr_info("xxxxxxxxxxxxxxxxxxxxxx\n");
+ vmx_set_rflags(vcpu, vmx_get_rflags(vcpu) | X86_EFLAGS_TF);
+ }
+ if (ist_emul_returning) {
+ rip = kvm_rip_read(vcpu);
+ if (rip != ist_interrupted_rip[ist_depth-1]) {
+ pr_info("rip %lx expect rip %lx\n", rip, ist_interrupted_rip[ist_depth-1]);
+ WARN_ON(1);
+ ist_emul_test_stop(vcpu);
+ return true;
+ }
+ ist_depth--;
+ pr_info("IST exception returned, depth:%d; ", ist_depth);
+ print_pending_vectors();
+
+ // print instruction_count for adjusting IST_NESTED_RATE
+ // to control the average ist_depth for the test.
+ //pr_info("instruction_count %d\n", instruction_count);
+ instruction_count = 0;
+ ist_emul_returning = false;
+
+ if (!ist_depth) {
+ ist_emul_test_stop_singlestep(vcpu);
+ return true;
+ }
+ }
+
+
+ __handle_ist_emul_test_singlestep(vcpu);
+
+ rip = kvm_rip_read(vcpu);
+ if (!ist_emul_injecting && rip == native_irq_return_iret_ip) {
+ pr_info("IST exception is going to return\n");
+ //watch_handler_done(vcpu);
+ ist_emul_returning = true;
+ }
+ return true;
+}
+
+static void handle_ist_emul_test(struct kvm_vcpu *vcpu)
+{
+ //if (ist_emul_test_singlestep)
+ // pr_info("vmexit ip: %lx\n", kvm_rip_read(vcpu));
+
+ if (ist_emul_test == TEST_STOPPED)
+ return;
+
+ if (ist_emul_test == TEST_REQUESTED) {
+ /* The test can only be started after the guest's booting process is done. */
+ if (vmx_get_cpl(vcpu))
+ ist_emul_test = TEST_ONGOING;
+ else
+ return;
+ }
+
+ /* Don't start the test if it is already started or someone
+ * else is using X86_EFLAGS_TF */
+ // no RF: the return rip needs to be watched.
+ if (vmx_get_rflags(vcpu) & (X86_EFLAGS_TF | X86_EFLAGS_RF))
+ return;
+
+ if (ist_emul_test_singlestep) {
+ //kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return;
+ }
+
+ if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+ (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS))
+ return;
+
+ if (vcpu->arch.nmi_injected || vcpu->arch.interrupt.injected ||
+ vcpu->arch.exception.pending || vcpu->arch.exception.injected)
+ return;
+
+ pr_info_once("%lx %lx %lx %lx %lx %lx\n", vcpu->arch.eff_db[0], vcpu->arch.eff_db[1],
+ vcpu->arch.eff_db[2], vcpu->arch.eff_db[3], vcpu->arch.dr6, vcpu->arch.dr7);
+ if (vcpu->arch.eff_db[0] || vcpu->arch.eff_db[1] ||
+ vcpu->arch.eff_db[2] || vcpu->arch.eff_db[3] ||
+ ((vcpu->arch.dr6 | DR6_RESERVED) != DR6_RESERVED) ||
+ ((vcpu->arch.dr7 | DR7_FIXED_1) != DR7_FIXED_1))
+ return;
+
+ if (WARN_ON_ONCE(ist_depth)) {
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return;
+ }
+
+ /* XXX: don't start the test when the guest is handling a real #MC/#DB/NMI.
+ * But how to detect that? Just skip the detection since this is a test. */
+
+ if (get_random_u32() % IST_START_RATE)
+ return;
+
+ ist_emul_test_start_singlestep(vcpu);
+ ist_emul_test_queue_one(vcpu);
+}
+
void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5258,6 +5693,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
switch (ex_no) {
case DB_VECTOR:
dr6 = vmx_get_exit_qual(vcpu);
+ if (handle_ist_emul_test_singlestep(vcpu, dr6))
+ return 1;
if (!(vcpu->guest_debug &
(KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))) {
/*
@@ -6579,6 +7016,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);

+ if (ret > 0)
+ handle_ist_emul_test(vcpu);
/*
* Exit to user space when bus lock detected to inform that there is
* a bus lock in guest.
@@ -7290,7 +7729,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
* vmentry fails as it then expects bit 14 (BS) in pending debug
* exceptions being set, but that's not correct for the guest debugging
* case. */
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ if (ist_emul_test_singlestep || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP))
vmx_set_interrupt_shadow(vcpu, 0);

kvm_load_guest_xsave_state(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7713420abab0..205a4d989046 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -814,6 +814,7 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu)
atomic_inc(&vcpu->arch.nmi_queued);
kvm_make_request(KVM_REQ_NMI, vcpu);
}
+EXPORT_SYMBOL_GPL(kvm_inject_nmi);

void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
{
@@ -12810,12 +12811,15 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip)
}
EXPORT_SYMBOL_GPL(kvm_is_linear_rip);

+int ist_emul_test_singlestep;
+EXPORT_SYMBOL_GPL(ist_emul_test_singlestep);
+
unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
{
unsigned long rflags;

rflags = static_call(kvm_x86_get_rflags)(vcpu);
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ if (ist_emul_test_singlestep || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP))
rflags &= ~X86_EFLAGS_TF;
return rflags;
}
@@ -12823,8 +12827,8 @@ EXPORT_SYMBOL_GPL(kvm_get_rflags);

static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
- kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))
+ if (ist_emul_test_singlestep || ((vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) &&
+ kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip)))
rflags |= X86_EFLAGS_TF;
static_call(kvm_x86_set_rflags)(vcpu, rflags);
}
--
2.19.1.6.gb485710b

2023-04-03 14:24:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On 4/3/23 07:05, Lai Jiangshan wrote:
> 2.3 #VE
> -------
>
> The approach for fixing the kernel mode #VE recursion issue is to just
> NOT use IST for #VE although #VE is also considered to be one of the
> super exceptions and had raised some worries:
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
> https://lore.kernel.org/lkml/[email protected]/
>
> To remit the worries, SEPT_VE_DISABLE is forced used currently and
> also disables its abilities (accept-on-demand or memory balloon which
> is critical to lightweight VMs like Kata Containers):
> https://lore.kernel.org/lkml/YCb0%2FDg28uI7TRD%[email protected]/

You don't need #VE for accept-on-demand. Pages go through _very_
well-defined software choke points before they get used *and* before
they get ballooned. Thus:

> https://lore.kernel.org/lkml/[email protected]/

BTW, _who_ considers #VE to be a "super exception"? Can you explain how
it is any more "super" than #PF? #PF can recurse. You can take #PF in
the entry paths.

I kinda don't think you should be using TDX and #VE as part of the
justification for this series.

2023-04-03 16:26:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/entry: Add IST main stack

On Mon, Apr 3, 2023 at 7:05 AM Lai Jiangshan <[email protected]> wrote:
>
> diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst
> index 6b0bcf027ff1..be89acf302da 100644
> --- a/Documentation/x86/kernel-stacks.rst
> +++ b/Documentation/x86/kernel-stacks.rst
> @@ -105,6 +105,8 @@ The currently assigned IST stacks are:
> middle of switching stacks. Using IST for MCE events avoids making
> assumptions about the previous state of the kernel stack.
>
> +* ESTACK_IST. bla bla
> +
> For more details see the Intel IA32 or AMD AMD64 architecture manuals.

Maybe the cover letter description could be used here, rather than the
"bla bla" placeholder?

Linus

2023-04-03 16:43:58

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On Mon, Apr 3, 2023 at 10:23 PM Dave Hansen <[email protected]> wrote:
>
> On 4/3/23 07:05, Lai Jiangshan wrote:
> > 2.3 #VE
> > -------
> >
> > The approach for fixing the kernel mode #VE recursion issue is to just
> > NOT use IST for #VE although #VE is also considered to be one of the
> > super exceptions and had raised some worries:
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > To remit the worries, SEPT_VE_DISABLE is forced used currently and
> > also disables its abilities (accept-on-demand or memory balloon which
> > is critical to lightweight VMs like Kata Containers):
> > https://lore.kernel.org/lkml/YCb0%2FDg28uI7TRD%[email protected]/
>
> You don't need #VE for accept-on-demand. Pages go through _very_
> well-defined software choke points before they get used *and* before
> they get ballooned. Thus:
>
> > https://lore.kernel.org/lkml/[email protected]/
>

Thanks for the information.

I will have a look to see how it supports memory balloons.

And if accept-on-demand were supported, do we still need this
CONFIG_UNACCEPTED_MEMORY?

> BTW, _who_ considers #VE to be a "super exception"? Can you explain how
> it is any more "super" than #PF? #PF can recurse. You can take #PF in
> the entry paths.
>
> I kinda don't think you should be using TDX and #VE as part of the
> justification for this series.

You are right, #VE is not a super exception anymore since SEPT_VE_DISABLE
is forced set in the Linux kernel, and it has nothing to do with this series.

But #VE was once thought to be a super exception (I will correct the
sentence in the cover letter), so it is worth mentioning.

And since SEPT_VE_DISABLE is configurable, it would allow some paranoid
users to try SEPT_VE_DISABLE=false even without FRED.
They could then use IST #VE.

Thanks
Lai

2023-04-03 17:11:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On 4/3/23 07:05, Lai Jiangshan wrote:
> Documentation/x86/kernel-stacks.rst | 2 +
> arch/x86/entry/Makefile | 3 +
> arch/x86/entry/entry_64.S | 615 +++++++-------------------
> arch/x86/entry/ist_entry.c | 299 +++++++++++++
> arch/x86/include/asm/cpu_entry_area.h | 8 +-
> arch/x86/include/asm/idtentry.h | 15 +-
> arch/x86/include/asm/sev.h | 14 -
> arch/x86/include/asm/traps.h | 1 -
> arch/x86/kernel/asm-offsets_64.c | 7 +
> arch/x86/kernel/callthunks.c | 2 +
> arch/x86/kernel/dumpstack_64.c | 6 +-
> arch/x86/kernel/nmi.c | 8 -
> arch/x86/kernel/sev.c | 108 -----
> arch/x86/kernel/traps.c | 43 --
> arch/x86/kvm/vmx/vmx.c | 441 +++++++++++++++++-
> arch/x86/kvm/x86.c | 10 +-
> arch/x86/mm/cpu_entry_area.c | 2 +-
> tools/objtool/check.c | 7 +-
> 18 files changed, 937 insertions(+), 654 deletions(-)
> create mode 100644 arch/x86/entry/ist_entry.c

Some high-level comments...

The diffstat looks a lot nastier because of the testing patch. If you
drop that patch and trim the diffstat a bit, it starts to look a _lot_ more
appealing:

> arch/x86/entry/entry_64.S | 615 ++++++++----------------------------
> arch/x86/entry/ist_entry.c | 299 +++++++++++++++++
> arch/x86/kernel/sev.c | 108 ------
> arch/x86/kernel/traps.c | 43 --
...
> 16 files changed, 490 insertions(+), 650 deletions(-)

It removes more code than it adds and also removes a bunch of assembly.
If it were me posting this, I'd have shouted that from the rooftops
instead of obscuring it with a testing patch and leaving it as an
exercise to the reader to figure out.

It also comes with testing code and a great cover letter, which are rare
and _spectacular_.

That said, I'm torn. This series makes a valiant attempt to improve one
of the creakiest corners of arch/x86/. But, there are precious few
reviewers that can _really_ dig into this stuff. This is also a lot
less valuable now than it would have been when FRED was not on the horizon.

It's also not incremental at all. It's going to be a big pain when
someone bisects their random system hang to patch 4/7. It'll mean
there's a bug buried somewhere in the 500 lines of patch 3/7.

Grumbling aside, there's too much potential upside here to ignore. As
this moves out of RFC territory, it would be great if you could:

* Look at the FRED series and determine how much this collides
* Dig up some reviewers
* Flesh out the testing story. What kind of hardware was this tested
on? How much was bare metal vs. VMs, etc...?
* Reinforce what the benefits to end users are here. Are folks
_actually_ running into the #VC fragility, for instance?

Also, let's say we queue this, it starts getting linux-next testing, and
we start getting bug reports of hangs. It'll have to get reverted if we
can't find the bug fast.

How much of a pain would it be to make atomic-IST-entry _temporarily_
selectable at boot time? It would obviously need to keep the old code
around and would not have the nice diffstat. But that way, folks would
at least have a workaround while we're bug hunting.

1. https://lore.kernel.org/all/[email protected]/

2023-04-04 03:21:38

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On Tue, Apr 4, 2023 at 12:53 AM Dave Hansen <[email protected]> wrote:
>
> On 4/3/23 07:05, Lai Jiangshan wrote:
> > Documentation/x86/kernel-stacks.rst | 2 +
> > arch/x86/entry/Makefile | 3 +
> > arch/x86/entry/entry_64.S | 615 +++++++-------------------
> > arch/x86/entry/ist_entry.c | 299 +++++++++++++
> > arch/x86/include/asm/cpu_entry_area.h | 8 +-
> > arch/x86/include/asm/idtentry.h | 15 +-
> > arch/x86/include/asm/sev.h | 14 -
> > arch/x86/include/asm/traps.h | 1 -
> > arch/x86/kernel/asm-offsets_64.c | 7 +
> > arch/x86/kernel/callthunks.c | 2 +
> > arch/x86/kernel/dumpstack_64.c | 6 +-
> > arch/x86/kernel/nmi.c | 8 -
> > arch/x86/kernel/sev.c | 108 -----
> > arch/x86/kernel/traps.c | 43 --
> > arch/x86/kvm/vmx/vmx.c | 441 +++++++++++++++++-
> > arch/x86/kvm/x86.c | 10 +-
> > arch/x86/mm/cpu_entry_area.c | 2 +-
> > tools/objtool/check.c | 7 +-
> > 18 files changed, 937 insertions(+), 654 deletions(-)
> > create mode 100644 arch/x86/entry/ist_entry.c
>
> Some high-level comments...
>
> The diffstat looks a lot nastier because of the testing patch. If you
> drop that patch and trim the diffstat a bit, it starts to look a _lot_ more
> appealing:
>
> > arch/x86/entry/entry_64.S | 615 ++++++++----------------------------
> > arch/x86/entry/ist_entry.c | 299 +++++++++++++++++
> > arch/x86/kernel/sev.c | 108 ------
> > arch/x86/kernel/traps.c | 43 --
> ...
> > 16 files changed, 490 insertions(+), 650 deletions(-)
>
> It removes more code than it adds and also removes a bunch of assembly.
> If it were me posting this, I'd have shouted that from the rooftops
> instead of obscuring it with a testing patch and leaving it as an
> exercise to the reader to figure out.

The cover letter has 800+ lines of comments. About 100-300 lines
of comments will be moved into the code which would make the diffstat
not so appealing.


P.S.

One of the reasons I didn't move them is that the comments are much more
complicated than the code, which is a sign that improvement is required.

I'm going to change the narration from save-touch-replicate-copy-commit
to save-copy-build-commit and avoid using "replicate".

With copy=copy_outmost() and build=build_interrupted(), the new narration
mainly changes the comments; it does not change the actual work. If the
new narration is not simpler, I will not send it out and add any
confusion.
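
Concretely, the intended shape is roughly the following (a sketch only:
the function name and calling convention follow the asm in patch 3/7,
while the helper names come from the proposed narration above and their
split is illustrative):

#include <asm/cpu_entry_area.h>
#include <asm/ptrace.h>

/*
 * Sketch of the save-copy-build-commit narration.  The asm entry calls
 * this with %rdi = the just-saved pt_regs on the vector's own IST stack
 * and %rsi = the per-CPU cea_exception_stacks, and it returns the
 * pt_regs pointer on the IST main stack, which the asm then commits
 * with a single "movq %rax, %rsp".
 */
struct pt_regs *ist_copy_regs_to_main_stack(struct pt_regs *regs,
					    struct cea_exception_stacks *stacks)
{
	struct pt_regs *target;

	/* copy: bring the outermost interrupted context onto the main stack */
	target = copy_outmost(stacks, regs);		/* illustrative helper */

	/* build: replay whatever an interrupted atomic-IST-entry had pushed */
	build_interrupted(target, regs);		/* illustrative helper */

	/* commit: done back in asm by switching %rsp to the returned pointer */
	return target;
}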

> * Flesh out the testing story. What kind of hardware was this tested
> on? How much was bare metal vs. VMs, etc...?

It was tested on bare metal and in VMs. It is hard to stress the
atomic-IST-entry on bare metal; the testing code exercises it in a VM,
where the super exceptions can be injected at will.
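
(For anyone wanting to reproduce coarse stress without the testing
patch: the stock KVM_NMI ioctl can already flood a vCPU with NMIs from
plain host userspace. The following is only a rough sketch, not the
series' test harness, and it cannot inject at the precise entry points
the testing patch targets.)

/* Rough sketch: flood a vCPU with NMIs via the existing KVM_NMI ioctl. */
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* vcpu_fd: a vCPU fd from KVM_CREATE_VCPU; needs KVM_CAP_USER_NMI. */
static int nmi_flood(int vcpu_fd, int count)
{
	for (int i = 0; i < count; i++) {
		if (ioctl(vcpu_fd, KVM_NMI) < 0) {
			perror("KVM_NMI");
			return -1;
		}
	}
	return 0;
}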

> * Reinforce what the benefits to end users are here. Are folks
> _actually_ running into the #VC fragility, for instance?
>
> Also, let's say we queue this, it starts getting linux-next testing, and
> we start getting bug reports of hangs. It'll have to get reverted if we
> can't find the bug fast.
>
> How much of a pain would it be to make atomic-IST-entry _temporarily_
> selectable at boot time? It would obviously need to keep the old code
> around and would not have the nice diffstat. But that way, folks would
> at least have a workaround while we're bug hunting.

It is easy to make atomic-IST-entry selectable at boot time since the
IDT is an indirect table. I will do it and temporarily keep the old code.
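
As a rough sketch of what that could look like (all names below are
hypothetical placeholders, not code from the series): keep both entry
stubs in the image and let an early parameter decide which one the IDT
gate points at.

#include <linux/cache.h>
#include <linux/init.h>
#include <linux/kstrtox.h>
#include <linux/types.h>

/* Hypothetical stubs/helpers, declared here for illustration only. */
extern char asm_exc_nmi_atomic[], asm_exc_nmi_legacy[];
void idt_set_nmi_handler(const void *addr);

/* Sketch only: hypothetical boot option. */
static bool use_atomic_ist_entry __ro_after_init = true;

static int __init parse_atomic_ist(char *arg)
{
	return kstrtobool(arg, &use_atomic_ist_entry);
}
early_param("atomic_ist_entry", parse_atomic_ist);

/* Hypothetical hook run while the final IDT is populated. */
void __init idt_select_ist_stubs(void)
{
	/* Both stubs stay around; the IDT gate just points at one of them. */
	if (use_atomic_ist_entry)
		idt_set_nmi_handler(asm_exc_nmi_atomic);
	else
		idt_set_nmi_handler(asm_exc_nmi_legacy);
}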

2023-04-04 17:11:51

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On 4/4/23 05:17, Lai Jiangshan wrote:
> The cover letter has 800+ lines of comments. About 100-300 lines
> of comments will be moved into the code which would make the diffstat
> not so appealing.

Removing assembly from arch/x86/entry/ and adding English to
Documentation/? That's _even more_ appealing. :)

Paolo

2023-04-06 10:22:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
> On 4/4/23 05:17, Lai Jiangshan wrote:
> > The cover letter has 800+ lines of comments. About 100-300 lines
> > of comments will be moved into the code which would make the diffstat
> > not so appealing.
>
> Removing assembly from arch/x86/entry/ and adding English to Documentation/?
> That's _even more_ appealing. :)

I *much* prefer in-code comments to random gibberish that's instantly
out of date squirreled away somewhere in an unreadable format in
Documentation/

2023-04-06 10:58:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On Thu, Apr 06, 2023 at 12:35:24PM +0200, Jiri Slaby wrote:
> On 06. 04. 23, 12:12, Peter Zijlstra wrote:
> > On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
> > > On 4/4/23 05:17, Lai Jiangshan wrote:
> > > > The cover letter has 800+ lines of comments. About 100-300 lines
> > > > of comments will be moved into the code which would make the diffstat
> > > > not so appealing.
> > >
> > > Removing assembly from arch/x86/entry/ and adding English to Documentation/?
> > > That's _even more_ appealing. :)
> >
> > I *much* prefer in-code comments to random gibberish that's instantly
> > out of date squirreled away somewhere in an unreadable format in
> > Documentation/
>
> +1 as one can link comments in the code to Documentation easily nowadays.
> They are sourced and end up in the generated Documentation [1] then. One
> only needs to type the kernel-doc properly.

Urgh, so that kernel doc stuff can defeat its purpose too. Some of that
is so heavily formatted it is unreadable gibberish just like
Documentation/ :/

I really detest that whole RST thing, and my solution is to explicitly
not write kerneldoc, that way the doc generation stuff doesn't complain
and I don't get random drive by patches wrecking the perfectly readable
comment.

2023-04-06 11:07:11

by Jiri Slaby

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On 06. 04. 23, 12:12, Peter Zijlstra wrote:
> On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
>> On 4/4/23 05:17, Lai Jiangshan wrote:
>>> The cover letter has 800+ lines of comments. About 100-300 lines
>>> of comments will be moved into the code which would make the diffstat
>>> not so appealing.
>>
>> Removing assembly from arch/x86/entry/ and adding English to Documentation/?
>> That's _even more_ appealing. :)
>
> I *much* prefer in-code comments to random gibberish that's instantly
> out of date squirreled away somewhere in an unreadable format in
> Documentation/

+1 as one can link comments in the code to Documentation easily
nowadays. They are sourced and end up in the generated Documentation [1]
then. One only needs to type the kernel-doc properly.

[1] https://www.kernel.org/doc/html/latest

--
js
suse labs

2023-04-06 11:12:32

by Jiri Slaby

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On 06. 04. 23, 12:47, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 12:35:24PM +0200, Jiri Slaby wrote:
>> On 06. 04. 23, 12:12, Peter Zijlstra wrote:
>>> On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
>>>> On 4/4/23 05:17, Lai Jiangshan wrote:
>>>>> The cover letter has 800+ lines of comments. About 100-300 lines
>>>>> of comments will be moved into the code which would make the diffstat
>>>>> not so appealing.
>>>>
>>>> Removing assembly from arch/x86/entry/ and adding English to Documentation/?
>>>> That's _even more_ appealing. :)
>>>
>>> I *much* prefer in-code comments to random gibberish that's instantly
>>> out of date squirreled away somewhere in an unreadable format in
>>> Documentation/
>>
>> +1 as one can link comments in the code to Documentation easily nowadays.
>> They are sourced and end up in the generated Documentation [1] then. One
>> only needs to type the kernel-doc properly.
>
> Urgh, so that kernel doc stuff can defeat its purpose too. Some of that
> is so heavily formatted it is unreadable gibberish just like
> Documentation/ :/

Definitely it _can_ defeat the purpose and be heavily formatted. But it
doesn't have to. It's like programming in perl.

What I had in mind was e.g. "DOC: TTY Struct Flags":
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/tty.h#n261

Resulting in:
https://www.kernel.org/doc/html/latest/driver-api/tty/tty_struct.html#tty-struct-flags

Both the source and the result are quite readable, IMO. And the markup
in the source is not mandatory, it's only for emphasizing and hyperlinks.
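
For the atomic-IST-entry comments specifically, the in-code side could
stay that plain; a minimal sketch of such a block:

/**
 * DOC: Atomic IST entry
 *
 * Free-form prose lives here, right next to the code it describes.
 * An .rst file can pull it into the generated documentation with a
 * kernel-doc directive; no markup is required in the comment itself.
 */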

As I wrote, you can link the comment in the code. But definitely you
don't have to, if you don't want. I like the linking in Documentation as
I can put the pieces from various sources/headers together to one place
and build a bigger picture.

> I really detest that whole RST thing, and my solution is to explicitly
> not write kerneldoc, that way the doc generation stuff doesn't complain
> and I don't get random drive by patches wrecking the perfectly readable
> comment.

Sure. Rst _sources_ are not readable, IMO. Only generated man pages or
html are.

thanks,
--
js

2023-04-06 12:49:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST

On Thu, Apr 06, 2023 at 01:04:16PM +0200, Jiri Slaby wrote:

> Definitely it _can_ defeat the purpose and be heavily formatted. But it
> doesn't have to. It's like programming in perl.
>
> What I had in mind was e.g. "DOC: TTY Struct Flags":
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/tty.h#n261

* TTY_THROTTLED
* Driver input is throttled. The ldisc should call
* :c:member:`tty_driver.unthrottle()` in order to resume reception when
* it is ready to process more data (at threshold min).

That whole :c:member:'tty_driver.unthrottle()' is an abomination and
has no place in a comment.

> Resulting in:
> https://www.kernel.org/doc/html/latest/driver-api/tty/tty_struct.html#tty-struct-flags
>
> Both the source and the result are quite readable, IMO. And the markup in
> the source is not mandatory, it's only for emphasizing and hyperlinks.
>
> As I wrote, you can link the comment in the code. But definitely you don't
> have to, if you don't want. I like the linking in Documentation as I can put
> the pieces from various sources/headers together to one place and build a
> bigger picture.
>
> > I really detest that whole RST thing, and my solution is to explicitly
> > not write kerneldoc, that way the doc generation stuff doesn't complain
> > and I don't get random drive by patches wrecking the perfectly readable
> > comment.
>
> Sure. Rst _sources_ are not readable, IMO. Only generated man pages or html
> are.

But code comments are read in a text editor, not a browser. Hence all
the markup is counterproductive.

Why would you go read something in a browser if you have the code right
there in a text editor?

2023-04-06 21:07:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/entry: Add IST main stack

On Mon, Apr 03, 2023 at 10:06:00PM +0800, Lai Jiangshan wrote:
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index f05339fee778..3413b23fa9f1 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -26,11 +26,12 @@ static const char * const exception_stack_names[] = {
> [ ESTACK_MCE ] = "#MC",
> [ ESTACK_VC ] = "#VC",
> [ ESTACK_VC2 ] = "#VC2",
> + [ ESTACK_IST ] = "#IST",
> };

Since #IST is not actually a vector, perhaps ditch the '#' ?

2023-04-06 21:09:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry

On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:

> diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
> index ca2fe186994b..7cc1254ca519 100644
> --- a/arch/x86/entry/Makefile
> +++ b/arch/x86/entry/Makefile
> @@ -8,11 +8,14 @@ UBSAN_SANITIZE := n
> KCOV_INSTRUMENT := n
>
> CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS)

This ^^^


> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 49ddc4dd3117..50a24cc83581 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym)

> +.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req
> +SYM_CODE_START(\asmsym)
> + UNWIND_HINT_IRET_REGS offset=\has_error_code*8
> + ENDBR
> +
> + /*
> + * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX.
> + *
> + * The code setting ORIG_RAX will not be replicated if interrupted.
> + */
> + ASM_CLAC
> + cld
> +
> + .if \has_error_code == 0
> + pushq $-1 /* ORIG_RAX: no syscall to restart */
> + .endif
> +
> + /*
> + * No register can be touched except %rsp,%rflags,%rip before
> + * pushing all the registers. It is indispensable for nested
> + * atomic-IST-entry to replicate pushing the registers.
> + */
> + PUSH_REGS
> +
> + /*
> + * Finished pushing register, all registers can be touched by now.
> + *
> + * Clear registers for the C function ist_copy_regs_to_main_stack()
> + * and the handler to avoid any possible exploitation of any
> + * speculation attack.
> + */
> + CLEAR_REGS
> +
> + /*
> + * Copy the pt_regs to the IST main stack including the pt_regs of
> + * the interrupted atomic-IST-entries, if any, by replicating.
> + */
> + movq %rsp, %rdi /* pt_regs pointer on its own IST stack */
> + leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */
> + call ist_copy_regs_to_main_stack

IIUC you do a CALL+RET here, before you call paranoid_entry ...

> +
> + /*
> + * Commit stage.
> + */
> +SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL)
> + /*
> + * Switches to the IST main stack. Before the switching is done,
> + * %rax is the copied pt_regs pointer in IST main stack.
> + */
> + movq %rax, %rsp
> +
> + /*
> + * The label should be immediate after the instruction that switches
> + * the stack since there is code assuming there is only one single
> + * instruction in the commit stage and the code assumes "%rsp in the
> + * IST main stack is also the sign of ending a atomic-IST-entry".
> + * (The code will be removed in future when %rip-based identifying
> + * is added.)
> + */
> +SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL)
> +
> + /*
> + * Now, it is on the IST main stack. For the whole kernel, the entries
> + * of the IST exceptions can be seen from here because the inside
> + * of the atomic-IST-entry can not be seen from the whole kernel
> + * except in the atomic-IST-entry or #DF.
> + */
> + UNWIND_HINT_REGS
> + ENCODE_FRAME_POINTER
> +
> + /*
> + * The code setting ORIG_RAX will not be replicated if interrupted.
> + * So redo it here.
> + */
> + .if \has_error_code == 0
> + movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */
> + .endif
> +
> + /*
> + * If the entry is from userspace, switch stacks and treat it as
> + * a normal entry.
> + */
> + testb $3, CS(%rsp)
> + jnz .Lfrom_usermode_switch_stack_\@
> +
> + /*
> + * paranoid_entry returns GS/CR3/SPEC_CTL information for
> + * paranoid_exit in RBX/R14/R15.
> + */
> + call paranoid_entry

... all the way down here, which will do:

IBRS_ENTER;
UNTRAIN_RET_FROM_CALL;

Which thus breaks the whole RetBleed mess, since that must not do RET
before that happens.

2023-04-06 22:01:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry

On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:
> +static __always_inline
> +void copy_regs_exception_head(struct pt_regs *target, const struct pt_regs *from)
> +{
> + target->ss = from->ss;
> + target->sp = from->sp;
> + target->flags = from->flags;
> + target->cs = from->cs;
> + target->ip = from->ip;
> + target->orig_ax = from->orig_ax;
> +}
> +
> +static __always_inline
> +void copy_regs_general_registers(struct pt_regs *target, const struct pt_regs *from)
> +{
> + target->di = from->di;
> + target->si = from->si;
> + target->dx = from->dx;
> + target->cx = from->cx;
> + target->ax = from->ax;
> + target->r8 = from->r8;
> + target->r9 = from->r9;
> + target->r10 = from->r10;
> + target->r11 = from->r11;
> + target->bx = from->bx;
> + target->bp = from->bp;
> + target->r12 = from->r12;
> + target->r13 = from->r13;
> + target->r14 = from->r14;
> + target->r15 = from->r15;
> +}

> +/* Replicate the interrupted atomic-IST-entry's CLEAR_REGS macro. */
> +static __always_inline void replicate_clear_regs(struct pt_regs *target)
> +{
> + target->di = 0;
> + target->si = 0;
> + target->dx = 0;
> + target->cx = 0;
> + target->ax = 0;
> + target->r8 = 0;
> + target->r9 = 0;
> + target->r10 = 0;
> + target->r11 = 0;
> + target->bx = 0;
> + target->bp = 0;
> + target->r12 = 0;
> + target->r13 = 0;
> + target->r14 = 0;
> + target->r15 = 0;
> +}

I think there's compilers smart enough to see through your attempts at
avoiding mem{set,cpy}() there and I think we'll end up needing something
like __inline_memset() and __inline_memcpy() like here:

https://lore.kernel.org/lkml/Y759AJ%[email protected]/
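
For reference, the shape of those helpers is roughly the following (a
sketch along the lines of the linked patch, not the exact posted code);
because the "rep movsb"/"rep stosb" asm is opaque to the compiler, it
cannot be pattern-matched back into a memcpy()/memset() call:

#include <linux/compiler.h>
#include <linux/types.h>

static __always_inline void *__inline_memcpy(void *to, const void *from, size_t len)
{
	void *ret = to;

	/* RDI = dest, RSI = src, RCX = count, as "rep movsb" requires. */
	asm volatile("rep movsb"
		     : "+D" (to), "+S" (from), "+c" (len)
		     : : "memory");
	return ret;
}

static __always_inline void *__inline_memset(void *s, int v, size_t n)
{
	void *ret = s;

	/* RDI = dest, RCX = count, AL = fill byte, as "rep stosb" requires. */
	asm volatile("rep stosb"
		     : "+D" (s), "+c" (n)
		     : "a" (v)
		     : "memory");
	return ret;
}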

2023-04-07 02:51:14

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry

On Fri, Apr 7, 2023 at 5:01 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:
>
> > diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
> > index ca2fe186994b..7cc1254ca519 100644
> > --- a/arch/x86/entry/Makefile
> > +++ b/arch/x86/entry/Makefile
> > @@ -8,11 +8,14 @@ UBSAN_SANITIZE := n
> > KCOV_INSTRUMENT := n
> >
> > CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
> > +CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS)
>
> This ^^^
>
>
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 49ddc4dd3117..50a24cc83581 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym)
>
> > +.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req
> > +SYM_CODE_START(\asmsym)
> > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8
> > + ENDBR
> > +
> > + /*
> > + * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX.
> > + *
> > + * The code setting ORIG_RAX will not be replicated if interrupted.
> > + */
> > + ASM_CLAC
> > + cld
> > +
> > + .if \has_error_code == 0
> > + pushq $-1 /* ORIG_RAX: no syscall to restart */
> > + .endif
> > +
> > + /*
> > + * No register can be touched except %rsp,%rflags,%rip before
> > + * pushing all the registers. It is indispensable for nested
> > + * atomic-IST-entry to replicate pushing the registers.
> > + */
> > + PUSH_REGS
> > +
> > + /*
> > + * Finished pushing register, all registers can be touched by now.
> > + *
> > + * Clear registers for the C function ist_copy_regs_to_main_stack()
> > + * and the handler to avoid any possible exploitation of any
> > + * speculation attack.
> > + */
> > + CLEAR_REGS
> > +
> > + /*
> > + * Copy the pt_regs to the IST main stack including the pt_regs of
> > + * the interrupted atomic-IST-entries, if any, by replicating.
> > + */
> > + movq %rsp, %rdi /* pt_regs pointer on its own IST stack */
> > + leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */
> > + call ist_copy_regs_to_main_stack
>
> IIUC you do a CALL+RET here, before you call paranoid_entry ...
>
> > +
> > + /*
> > + * Commit stage.
> > + */
> > +SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL)
> > + /*
> > + * Switches to the IST main stack. Before the switching is done,
> > + * %rax is the copied pt_regs pointer in IST main stack.
> > + */
> > + movq %rax, %rsp
> > +
> > + /*
> > + * The label should be immediate after the instruction that switches
> > + * the stack since there is code assuming there is only one single
> > + * instruction in the commit stage and the code assumes "%rsp in the
> > + * IST main stack is also the sign of ending a atomic-IST-entry".
> > + * (The code will be removed in future when %rip-based identifying
> > + * is added.)
> > + */
> > +SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL)
> > +
> > + /*
> > + * Now, it is on the IST main stack. For the whole kernel, the entries
> > + * of the IST exceptions can be seen from here because the inside
> > + * of the atomic-IST-entry can not be seen from the whole kernel
> > + * except in the atomic-IST-entry or #DF.
> > + */
> > + UNWIND_HINT_REGS
> > + ENCODE_FRAME_POINTER
> > +
> > + /*
> > + * The code setting ORIG_RAX will not be replicated if interrupted.
> > + * So redo it here.
> > + */
> > + .if \has_error_code == 0
> > + movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */
> > + .endif
> > +
> > + /*
> > + * If the entry is from userspace, switch stacks and treat it as
> > + * a normal entry.
> > + */
> > + testb $3, CS(%rsp)
> > + jnz .Lfrom_usermode_switch_stack_\@
> > +
> > + /*
> > + * paranoid_entry returns GS/CR3/SPEC_CTL information for
> > + * paranoid_exit in RBX/R14/R15.
> > + */
> > + call paranoid_entry
>
> ... all the way down here, which will do:
>
> IBRS_ENTER;
> UNTRAIN_RET_FROM_CALL;
>
> Which thus breaks the whole RetBleed mess, since that must not do RET
> before that happens.

I got it.
I will add a save-stage-3 to the atomic-IST-entry.

The benefit of the new stage: the CR3/GSBASE/IBRS switching can be done
in the C atomic-IST-entry.
(That is also the drawback: it complicates the code, and it basically needs
https://lore.kernel.org/lkml/[email protected]/ )
The IST main stack can be in the kernel CR3, not necessarily in the CEA