2020-11-06 23:34:29

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V3 05/10] x86/entry: Pass irqentry_state_t by reference

From: Ira Weiny <[email protected]>

Currently struct irqentry_state_t only contains a single bool value
which makes passing it by value is reasonable. However, future patches
propose to add information to this struct, for example the PKRS
register/thread state.

Adding information to irqentry_state_t makes passing by value less
efficient. Therefore, change the entry/exit calls to pass irq_state by
reference.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
arch/x86/entry/common.c | 8 ++++----
arch/x86/include/asm/idtentry.h | 25 ++++++++++++++----------
arch/x86/kernel/cpu/mce/core.c | 4 ++--
arch/x86/kernel/kvm.c | 6 +++---
arch/x86/kernel/nmi.c | 4 ++--
arch/x86/kernel/traps.c | 21 ++++++++++++--------
arch/x86/mm/fault.c | 6 +++---
include/linux/entry-common.h | 18 +++++++++--------
kernel/entry/common.c | 34 +++++++++++++--------------------
9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
{
struct pt_regs *old_regs;
bool inhcall;
- irqentry_state_t state;
+ irqentry_state_t irq_state;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);

instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);

inhcall = get_and_clear_inhcall();
- if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+ if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
}
#endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs, \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs, error_code); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, \
@@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs, (u8)error_code); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -234,15 +237,16 @@ static void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
run_sysvec_on_irqstack_cond(__##func, regs); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static noinline void __##func(struct pt_regs *regs)
@@ -263,15 +267,16 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs); \
__irq_exit_raw(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f5c860b1a50b..6ed2fa2ea321 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7f57ede3cb8e..ed7427c6e74d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
u32 flags = kvm_read_and_reset_apf_flags();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

if (!flags)
return false;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
}

instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
return true;
}

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bf250a339655..1fd7780e99dd 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

this_cpu_write(nmi_dr7, local_db_save());

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e1b78829d909..8481cc373794 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)

DEFINE_IDTENTRY_RAW(exc_invalid_op)
{
- irqentry_state_t state;
+ irqentry_state_t irq_state;

/*
* We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
if (!user_mode(regs) && handle_bug(regs))
return;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();
handle_invalid_op(regs);
instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}

DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -343,6 +343,7 @@ __visible void __noreturn handle_stack_overflow(const char *message,
*/
DEFINE_IDTENTRY_DF(exc_double_fault)
{
+ irqentry_state_t irq_state;
static const char str[] = "double fault";
struct task_struct *tsk = current;

@@ -405,7 +406,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

- irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

@@ -651,13 +652,15 @@ DEFINE_IDTENTRY_RAW(exc_int3)
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);

instrumentation_begin();
if (!do_int3(regs))
die("int3", regs, 0);
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}
}

@@ -852,7 +855,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* includes the entry stack is excluded for everything.
*/
unsigned long dr7 = local_db_save();
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -909,7 +914,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
regs->flags &= ~X86_EFLAGS_TF;
out:
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(dr7);
}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82bf37a5c9ec..8d20c4c13abf 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1441,7 +1441,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

prefetchw(&current->mm->mmap_lock);

@@ -1479,11 +1479,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* code reenabled RCU to avoid subsequent wreckage which helps
* debugability.
*/
- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);

instrumentation_begin();
handle_page_fault(regs, error_code, address);
instrumentation_end();

- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 66938121c4b1..1193a70bcf1b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -372,6 +372,8 @@ typedef struct irqentry_state {
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
+ * @irq_state: Pointer to an opaque object to store state information; to be
+ * passed back to irqentry_exit()
*
* Invokes:
* - lockdep irqflag state tracking as low level ASM entry disabled
@@ -397,10 +399,8 @@ typedef struct irqentry_state {
* For user mode entries irqentry_enter_from_user_mode() is invoked to
* establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
* would not be possible.
- *
- * Returns: An opaque object that must be passed to idtentry_exit()
*/
-irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
@@ -412,7 +412,7 @@ void irqentry_exit_cond_resched(void);
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()
* @regs: Pointer to pt_regs (exception entry regs)
- * @state: Return value from matching call to irqentry_enter()
+ * @irq_state: Pointer to state information passed to irqentry_enter()
*
* Depending on the return target (kernel/user) this runs the necessary
* preemption and work checks if possible and required and returns to
@@ -423,25 +423,27 @@ void irqentry_exit_cond_resched(void);
*
* Counterpart to irqentry_enter().
*/
-void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_enter - Handle NMI entry
* @regs: Pointer to currents pt_regs
+ * @irq_state: Pointer to an opaque object to store state information; to be
+ * passed back to irqentry_nmi_exit()
*
* Similar to irqentry_enter() but taking care of the NMI constraints.
*/
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_exit - Handle return from NMI handling
* @regs: Pointer to pt_regs (NMI entry regs)
- * @irq_state: Return value from matching call to irqentry_nmi_enter()
+ * @irq_state: Pointer to state information passed to irqentry_nmi_enter()
*
* Last action before returning to the low level assmebly code.
*
* Counterpart to irqentry_nmi_enter().
*/
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index fa17baadf63e..5ed9d45fb41b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -289,15 +289,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
exit_to_user_mode();
}

-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
+noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t ret = {
- .exit_rcu = false,
- };
+ irq_state->exit_rcu = false;

if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
- return ret;
+ return;
}

/*
@@ -335,8 +333,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
instrumentation_end();

- ret.exit_rcu = true;
- return ret;
+ irq_state->exit_rcu = true;
+ return;
}

/*
@@ -350,8 +348,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
rcu_irq_enter_check_tick();
trace_hardirqs_off_finish();
instrumentation_end();
-
- return ret;
}

void irqentry_exit_cond_resched(void)
@@ -366,7 +362,7 @@ void irqentry_exit_cond_resched(void)
}
}

-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
lockdep_assert_irqs_disabled();

@@ -379,7 +375,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* carefully and needs the same ordering of lockdep/tracing
* and RCU as the return to user mode path.
*/
- if (state.exit_rcu) {
+ if (irq_state->exit_rcu) {
instrumentation_begin();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
@@ -401,16 +397,14 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
- if (state.exit_rcu)
+ if (irq_state->exit_rcu)
rcu_irq_exit();
}
}

-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t irq_state;
-
- irq_state.lockdep = lockdep_hardirqs_enabled();
+ irq_state->lockdep = lockdep_hardirqs_enabled();

__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
@@ -421,15 +415,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
-
- return irq_state;
}

-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
instrumentation_begin();
ftrace_nmi_exit();
- if (irq_state.lockdep) {
+ if (irq_state->lockdep) {
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
}
@@ -437,7 +429,7 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)

rcu_nmi_exit();
lockdep_hardirq_exit();
- if (irq_state.lockdep)
+ if (irq_state->lockdep)
lockdep_hardirqs_on(CALLER_ADDR0);
__nmi_exit();
}
--
2.28.0.rc0.12.gb6a658bd00c9


2020-11-15 19:01:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 05/10] x86/entry: Pass irqentry_state_t by reference

Ira,

On Fri, Nov 06 2020 at 15:29, ira weiny wrote:

Subject prefix wants to 'entry:'. This changes generic code and the x86
part is just required to fix the generic code change.

> Currently struct irqentry_state_t only contains a single bool value
> which makes passing it by value is reasonable. However, future patches
> propose to add information to this struct, for example the PKRS
> register/thread state.
>
> Adding information to irqentry_state_t makes passing by value less
> efficient. Therefore, change the entry/exit calls to pass irq_state by
> reference.

The PKRS muck needs to add an u32 to that struct. So how is that a
problem?

The resulting struct still fits into 64bit which is by far more
efficiently passed by value than by reference. So which problem are you
solving here?

Thanks

tglx

2020-11-16 21:18:38

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V3 05/10] x86/entry: Pass irqentry_state_t by reference

On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
> Ira,
>
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
>
> Subject prefix wants to 'entry:'. This changes generic code and the x86
> part is just required to fix the generic code change.

Sorry, yes that was carried incorrectly from earlier versions.

>
> > Currently struct irqentry_state_t only contains a single bool value
> > which makes passing it by value is reasonable. However, future patches
> > propose to add information to this struct, for example the PKRS
> > register/thread state.
> >
> > Adding information to irqentry_state_t makes passing by value less
> > efficient. Therefore, change the entry/exit calls to pass irq_state by
> > reference.
>
> The PKRS muck needs to add an u32 to that struct. So how is that a
> problem?

There are more fields to be added for the kmap/pmem support. So this will be
needed eventually. Even though it is not strictly necessary in the next patch.

>
> The resulting struct still fits into 64bit which is by far more
> efficiently passed by value than by reference. So which problem are you
> solving here?

I'm getting ahead of myself a bit. I will be adding more fields for the
kmap/pmem tracking.

Would you accept just a clean up for the variable names in this patch? I could
then add the pass by reference when I add the new fields later. Or would an
update to the commit message be ok to land this now?

Ira

>
> Thanks
>
> tglx
>

2020-11-16 22:19:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 05/10] x86/entry: Pass irqentry_state_t by reference

Ira,

On Mon, Nov 16 2020 at 10:49, Ira Weiny wrote:
> On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
>> > Currently struct irqentry_state_t only contains a single bool value
>> > which makes passing it by value is reasonable. However, future patches
>> > propose to add information to this struct, for example the PKRS
>> > register/thread state.
>> >
>> > Adding information to irqentry_state_t makes passing by value less
>> > efficient. Therefore, change the entry/exit calls to pass irq_state by
>> > reference.
>>
>> The PKRS muck needs to add an u32 to that struct. So how is that a
>> problem?
>
> There are more fields to be added for the kmap/pmem support. So this will be
> needed eventually. Even though it is not strictly necessary in the next patch.

if you folks could start to write coherent changelogs with proper
explanations why something is needed and why it's going beyond the
immediate use case in the next patch, then getting stuff sorted would be
way easier.

Sorry my crystalball got lost years ago and I'm just not able to keep up
with the flurry of patch sets which have dependencies in one way or the
other.

>> The resulting struct still fits into 64bit which is by far more
>> efficiently passed by value than by reference. So which problem are you
>> solving here?
>
> I'm getting ahead of myself a bit. I will be adding more fields for the
> kmap/pmem tracking.
>
> Would you accept just a clean up for the variable names in this patch? I could
> then add the pass by reference when I add the new fields later. Or would an
> update to the commit message be ok to land this now?

Can you provide a coherent explanation for the full set of things which
needs to be added here first please?

Thanks,

tglx

2020-11-24 06:15:15

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V3.1] entry: Pass irqentry_state_t by reference

From: Ira Weiny <[email protected]>

Currently struct irqentry_state_t only contains a single bool value
which makes passing it by value is reasonable. However, future patches
add information to this struct. This includes the PKRS thread state,
included in this series, as well as information to store kmap reference
tracking and PKS global state outside this series. In total, we
anticipate 2 new 32 bit fields and an integer field to be added to the
struct beyond the existing bool value.

Adding information to irqentry_state_t makes passing by value less
efficient. Therefore, change the entry/exit calls to pass irq_state by
reference in preparation for the changes which follow.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V3
Clarify commit message regarding the need for this patch

Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
arch/x86/entry/common.c | 8 ++++----
arch/x86/include/asm/idtentry.h | 25 ++++++++++++++----------
arch/x86/kernel/cpu/mce/core.c | 4 ++--
arch/x86/kernel/kvm.c | 6 +++---
arch/x86/kernel/nmi.c | 4 ++--
arch/x86/kernel/traps.c | 21 ++++++++++++--------
arch/x86/mm/fault.c | 6 +++---
include/linux/entry-common.h | 18 +++++++++--------
kernel/entry/common.c | 34 +++++++++++++--------------------
9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
{
struct pt_regs *old_regs;
bool inhcall;
- irqentry_state_t state;
+ irqentry_state_t irq_state;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);

instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);

inhcall = get_and_clear_inhcall();
- if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+ if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
}
#endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs, \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs, error_code); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, \
@@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs, (u8)error_code); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -234,15 +237,16 @@ static void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
run_sysvec_on_irqstack_cond(__##func, regs); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static noinline void __##func(struct pt_regs *regs)
@@ -263,15 +267,16 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs); \
__irq_exit_raw(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f5c860b1a50b..6ed2fa2ea321 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7f57ede3cb8e..ed7427c6e74d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
u32 flags = kvm_read_and_reset_apf_flags();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

if (!flags)
return false;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
}

instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
return true;
}

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bf250a339655..1fd7780e99dd 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

this_cpu_write(nmi_dr7, local_db_save());

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e1b78829d909..8481cc373794 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)

DEFINE_IDTENTRY_RAW(exc_invalid_op)
{
- irqentry_state_t state;
+ irqentry_state_t irq_state;

/*
* We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
if (!user_mode(regs) && handle_bug(regs))
return;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();
handle_invalid_op(regs);
instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}

DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -343,6 +343,7 @@ __visible void __noreturn handle_stack_overflow(const char *message,
*/
DEFINE_IDTENTRY_DF(exc_double_fault)
{
+ irqentry_state_t irq_state;
static const char str[] = "double fault";
struct task_struct *tsk = current;

@@ -405,7 +406,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

- irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

@@ -651,13 +652,15 @@ DEFINE_IDTENTRY_RAW(exc_int3)
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);

instrumentation_begin();
if (!do_int3(regs))
die("int3", regs, 0);
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}
}

@@ -852,7 +855,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* includes the entry stack is excluded for everything.
*/
unsigned long dr7 = local_db_save();
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -909,7 +914,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
regs->flags &= ~X86_EFLAGS_TF;
out:
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(dr7);
}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82bf37a5c9ec..8d20c4c13abf 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1441,7 +1441,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

prefetchw(&current->mm->mmap_lock);

@@ -1479,11 +1479,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* code reenabled RCU to avoid subsequent wreckage which helps
* debugability.
*/
- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);

instrumentation_begin();
handle_page_fault(regs, error_code, address);
instrumentation_end();

- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b216e4e..dd473137e728 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -358,6 +358,8 @@ typedef struct irqentry_state {
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
+ * @irq_state: Pointer to an opaque object to store state information; to be
+ * passed back to irqentry_exit()
*
* Invokes:
* - lockdep irqflag state tracking as low level ASM entry disabled
@@ -383,10 +385,8 @@ typedef struct irqentry_state {
* For user mode entries irqentry_enter_from_user_mode() is invoked to
* establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
* would not be possible.
- *
- * Returns: An opaque object that must be passed to idtentry_exit()
*/
-irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
@@ -398,7 +398,7 @@ void irqentry_exit_cond_resched(void);
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()
* @regs: Pointer to pt_regs (exception entry regs)
- * @state: Return value from matching call to irqentry_enter()
+ * @irq_state: Pointer to state information passed to irqentry_enter()
*
* Depending on the return target (kernel/user) this runs the necessary
* preemption and work checks if possible and required and returns to
@@ -409,25 +409,27 @@ void irqentry_exit_cond_resched(void);
*
* Counterpart to irqentry_enter().
*/
-void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_enter - Handle NMI entry
* @regs: Pointer to currents pt_regs
+ * @irq_state: Pointer to an opaque object to store state information; to be
+ * passed back to irqentry_nmi_exit()
*
* Similar to irqentry_enter() but taking care of the NMI constraints.
*/
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_exit - Handle return from NMI handling
* @regs: Pointer to pt_regs (NMI entry regs)
- * @irq_state: Return value from matching call to irqentry_nmi_enter()
+ * @irq_state: Pointer to state information passed to irqentry_nmi_enter()
*
* Last action before returning to the low level assembly code.
*
* Counterpart to irqentry_nmi_enter().
*/
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 91e8fd50adf4..c5d586d5cf5b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -290,15 +290,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
exit_to_user_mode();
}

-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
+noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t ret = {
- .exit_rcu = false,
- };
+ irq_state->exit_rcu = false;

if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
- return ret;
+ return;
}

/*
@@ -336,8 +334,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
instrumentation_end();

- ret.exit_rcu = true;
- return ret;
+ irq_state->exit_rcu = true;
+ return;
}

/*
@@ -351,8 +349,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
rcu_irq_enter_check_tick();
trace_hardirqs_off_finish();
instrumentation_end();
-
- return ret;
}

void irqentry_exit_cond_resched(void)
@@ -367,7 +363,7 @@ void irqentry_exit_cond_resched(void)
}
}

-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
lockdep_assert_irqs_disabled();

@@ -380,7 +376,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* carefully and needs the same ordering of lockdep/tracing
* and RCU as the return to user mode path.
*/
- if (state.exit_rcu) {
+ if (irq_state->exit_rcu) {
instrumentation_begin();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
@@ -402,16 +398,14 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
- if (state.exit_rcu)
+ if (irq_state->exit_rcu)
rcu_irq_exit();
}
}

-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t irq_state;
-
- irq_state.lockdep = lockdep_hardirqs_enabled();
+ irq_state->lockdep = lockdep_hardirqs_enabled();

__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
@@ -422,15 +416,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
-
- return irq_state;
}

-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
instrumentation_begin();
ftrace_nmi_exit();
- if (irq_state.lockdep) {
+ if (irq_state->lockdep) {
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
}
@@ -438,7 +430,7 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)

rcu_nmi_exit();
lockdep_hardirq_exit();
- if (irq_state.lockdep)
+ if (irq_state->lockdep)
lockdep_hardirqs_on(CALLER_ADDR0);
__nmi_exit();
}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-12-13 04:43:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

On Mon, Nov 23, 2020 at 10:10 PM <[email protected]> wrote:
>
> From: Ira Weiny <[email protected]>
>
> Currently struct irqentry_state_t only contains a single bool value
> which makes passing it by value is reasonable. However, future patches
> add information to this struct. This includes the PKRS thread state,
> included in this series, as well as information to store kmap reference
> tracking and PKS global state outside this series. In total, we
> anticipate 2 new 32 bit fields and an integer field to be added to the
> struct beyond the existing bool value.
>
> Adding information to irqentry_state_t makes passing by value less
> efficient. Therefore, change the entry/exit calls to pass irq_state by
> reference in preparation for the changes which follow.
>
> While at it, make the code easier to follow by changing all the usage
> sites to consistently use the variable name 'irq_state'.

After contemplating this for a bit, I think this isn't really the
right approach. It *works*, but we've mostly just created a bit of an
unfortunate situation. Our stack, on a (possibly nested) entry looks
like:

previous frame (or empty if we came from usermode)
---
SS
RSP
FLAGS
CS
RIP
rest of pt_regs

C frame

irqentry_state_t (maybe -- the compiler is within its rights to play
almost arbitrary games here)

more C stuff


So what we've accomplished is having two distinct arch register
regions, one called pt_regs and the other stuck in irqentry_state_t.
This is annoying because it means that, if we want to access this
thing without passing a pointer around or access it at all from outer
frames, we need to do something terrible with the unwinder, and we
don't want to go there.

So I propose a somewhat different solution: lay out the stack like this.

SS
RSP
FLAGS
CS
RIP
rest of pt_regs
PKS
^^^^^^^^ extended_pt_regs points here

C frame
more C stuff
...

IOW we have:

struct extended_pt_regs {
bool rcu_whatever;
other generic fields here;
struct arch_extended_pt_regs arch_regs;
struct pt_regs regs;
};

and arch_extended_pt_regs has unsigned long pks;

and instead of passing a pointer to irqentry_state_t to the generic
entry/exit code, we just pass a pt_regs pointer. And we have a little
accessor like:

struct extended_pt_regs *extended_regs(struct pt_regs *) { return
container_of(...); }

And we tell eBPF that extended_pt_regs is NOT ABI, and we will change
it whenever we feel like just to keep you on your toes, thank you very
much.

Does this seem reasonable? You get to drop patch 7 and instead modify
the show_regs() stuff to just display one extra register.

2020-12-17 00:42:44

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

On Tue, Dec 15, 2020 at 06:09:02PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 15, 2020 at 5:32 PM Ira Weiny <[email protected]> wrote:
> >
> > On Fri, Dec 11, 2020 at 02:14:28PM -0800, Andy Lutomirski wrote:
> > > On Mon, Nov 23, 2020 at 10:10 PM <[email protected]> wrote:
>
> > > IOW we have:
> > >
> > > struct extended_pt_regs {
> > > bool rcu_whatever;
> > > other generic fields here;
> > > struct arch_extended_pt_regs arch_regs;
> > > struct pt_regs regs;
> > > };
> > >
> > > and arch_extended_pt_regs has unsigned long pks;
> > >
> > > and instead of passing a pointer to irqentry_state_t to the generic
> > > entry/exit code, we just pass a pt_regs pointer. And we have a little
> > > accessor like:
> > >
> > > struct extended_pt_regs *extended_regs(struct pt_regs *) { return
> > > container_of(...); }
> > >
> > > And we tell eBPF that extended_pt_regs is NOT ABI, and we will change
> > > it whenever we feel like just to keep you on your toes, thank you very
> > > much.
> > >
> > > Does this seem reasonable?
> >
> > Conceptually yes. But I'm failing to see how this implementation can be made
> > generic for the generic fields. The pks fields, assuming they stay x86
> > specific, would be reasonable to add in PUSH_AND_CLEAR_REGS. But the
> > rcu/lockdep field is generic. Wouldn't we have to modify every architecture to
> > add space for the rcu/lockdep bool?
> >
> > If not, where is a generic place that could be done? Basically I'm missing how
> > the effective stack structure can look like this:
> >
> > > struct extended_pt_regs {
> > > bool rcu_whatever;
> > > other generic fields here;
> > > struct arch_extended_pt_regs arch_regs;
> > > struct pt_regs regs;
> > > };
> >
> > It seems more reasonable to make it look like:
> >
> > #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > struct extended_pt_regs {
> > unsigned long pkrs;
> > struct pt_regs regs;
> > };
> > #endif
> >
> > And leave the rcu/lockdep bool passed by value as before (still in C).
>
> We could certainly do this,

I'm going to start with this basic support. Because I have 0 experience in
most of these architectures.

> but we could also allocate some generic
> space. PUSH_AND_CLEAR_REGS would get an extra instruction like:
>
> subq %rsp, $GENERIC_PTREGS_SIZE
>
> or however this should be written. That field would be defined in
> asm-offsets.c. And yes, all the generic-entry architectures would
> need to get onboard.

What do you mean by 'generic-entry' architectures? I thought they all used the
generic entry code?

Regardless I would need to start another thread on this topic with any of those
architecture maintainers to see what the work load would be for this. I don't
think I can do it on my own.

FWIW I think it is a bit unfair to hold up the PKS support in x86 for making
these generic fields part of the stack frame. So perhaps that could be made a
follow on to the PKS series?

>
> If we wanted to be fancy, we could split the generic area into
> initialize-to-zero and uninitialized for debugging purposes, but that
> might be more complication than is worthwhile.

Ok, agreed, but this is step 3 or 4 at the earliest.

Ira

2020-12-17 13:10:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
> On Mon, Nov 23, 2020 at 10:10 PM <[email protected]> wrote:
> After contemplating this for a bit, I think this isn't really the
> right approach. It *works*, but we've mostly just created a bit of an
> unfortunate situation. Our stack, on a (possibly nested) entry looks
> like:
>
> previous frame (or empty if we came from usermode)
> ---
> SS
> RSP
> FLAGS
> CS
> RIP
> rest of pt_regs
>
> C frame
>
> irqentry_state_t (maybe -- the compiler is within its rights to play
> almost arbitrary games here)
>
> more C stuff
>
> So what we've accomplished is having two distinct arch register
> regions, one called pt_regs and the other stuck in irqentry_state_t.
> This is annoying because it means that, if we want to access this
> thing without passing a pointer around or access it at all from outer
> frames, we need to do something terrible with the unwinder, and we
> don't want to go there.
>
> So I propose a somewhat different solution: lay out the stack like this.
>
> SS
> RSP
> FLAGS
> CS
> RIP
> rest of pt_regs
> PKS
> ^^^^^^^^ extended_pt_regs points here
>
> C frame
> more C stuff
> ...
>
> IOW we have:
>
> struct extended_pt_regs {
> bool rcu_whatever;
> other generic fields here;
> struct arch_extended_pt_regs arch_regs;
> struct pt_regs regs;
> };
>
> and arch_extended_pt_regs has unsigned long pks;
>
> and instead of passing a pointer to irqentry_state_t to the generic
> entry/exit code, we just pass a pt_regs pointer.

While I agree vs. PKS which is architecture specific state and needed in
other places e.g. #PF, I'm not convinced that sticking the existing
state into the same area buys us anything more than an indirect access.

Peter?

Thanks,

tglx

2020-12-17 13:21:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

On Thu, Dec 17, 2020 at 02:07:01PM +0100, Thomas Gleixner wrote:
> On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
> > On Mon, Nov 23, 2020 at 10:10 PM <[email protected]> wrote:
> > After contemplating this for a bit, I think this isn't really the
> > right approach. It *works*, but we've mostly just created a bit of an
> > unfortunate situation. Our stack, on a (possibly nested) entry looks
> > like:
> >
> > previous frame (or empty if we came from usermode)
> > ---
> > SS
> > RSP
> > FLAGS
> > CS
> > RIP
> > rest of pt_regs
> >
> > C frame
> >
> > irqentry_state_t (maybe -- the compiler is within its rights to play
> > almost arbitrary games here)
> >
> > more C stuff
> >
> > So what we've accomplished is having two distinct arch register
> > regions, one called pt_regs and the other stuck in irqentry_state_t.
> > This is annoying because it means that, if we want to access this
> > thing without passing a pointer around or access it at all from outer
> > frames, we need to do something terrible with the unwinder, and we
> > don't want to go there.
> >
> > So I propose a somewhat different solution: lay out the stack like this.
> >
> > SS
> > RSP
> > FLAGS
> > CS
> > RIP
> > rest of pt_regs
> > PKS
> > ^^^^^^^^ extended_pt_regs points here
> >
> > C frame
> > more C stuff
> > ...
> >
> > IOW we have:
> >
> > struct extended_pt_regs {
> > bool rcu_whatever;
> > other generic fields here;
> > struct arch_extended_pt_regs arch_regs;
> > struct pt_regs regs;
> > };
> >
> > and arch_extended_pt_regs has unsigned long pks;
> >
> > and instead of passing a pointer to irqentry_state_t to the generic
> > entry/exit code, we just pass a pt_regs pointer.
>
> While I agree vs. PKS which is architecture specific state and needed in
> other places e.g. #PF, I'm not convinced that sticking the existing
> state into the same area buys us anything more than an indirect access.
>
> Peter?

Agreed; that immediately solves the confusion Ira had as well. While
extending pt_regs sounds scary, I think we've isolated our pt_regs
implementation from actual ABI pretty well, but of course, that would
need an audit. We don't want to leak this into signals for example.


2020-12-17 15:37:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference


> On Dec 17, 2020, at 5:19 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Dec 17, 2020 at 02:07:01PM +0100, Thomas Gleixner wrote:
>>> On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
>>>> On Mon, Nov 23, 2020 at 10:10 PM <[email protected]> wrote:
>>> After contemplating this for a bit, I think this isn't really the
>>> right approach. It *works*, but we've mostly just created a bit of an
>>> unfortunate situation. Our stack, on a (possibly nested) entry looks
>>> like:
>>>
>>> previous frame (or empty if we came from usermode)
>>> ---
>>> SS
>>> RSP
>>> FLAGS
>>> CS
>>> RIP
>>> rest of pt_regs
>>>
>>> C frame
>>>
>>> irqentry_state_t (maybe -- the compiler is within its rights to play
>>> almost arbitrary games here)
>>>
>>> more C stuff
>>>
>>> So what we've accomplished is having two distinct arch register
>>> regions, one called pt_regs and the other stuck in irqentry_state_t.
>>> This is annoying because it means that, if we want to access this
>>> thing without passing a pointer around or access it at all from outer
>>> frames, we need to do something terrible with the unwinder, and we
>>> don't want to go there.
>>>
>>> So I propose a somewhat different solution: lay out the stack like this.
>>>
>>> SS
>>> RSP
>>> FLAGS
>>> CS
>>> RIP
>>> rest of pt_regs
>>> PKS
>>> ^^^^^^^^ extended_pt_regs points here
>>>
>>> C frame
>>> more C stuff
>>> ...
>>>
>>> IOW we have:
>>>
>>> struct extended_pt_regs {
>>> bool rcu_whatever;
>>> other generic fields here;
>>> struct arch_extended_pt_regs arch_regs;
>>> struct pt_regs regs;
>>> };
>>>
>>> and arch_extended_pt_regs has unsigned long pks;
>>>
>>> and instead of passing a pointer to irqentry_state_t to the generic
>>> entry/exit code, we just pass a pt_regs pointer.
>>
>> While I agree vs. PKS which is architecture specific state and needed in
>> other places e.g. #PF, I'm not convinced that sticking the existing
>> state into the same area buys us anything more than an indirect access.
>>
>> Peter?
>
> Agreed; that immediately solves the confusion Ira had as well. While
> extending pt_regs sounds scary, I think we've isolated our pt_regs
> implementation from actual ABI pretty well, but of course, that would
> need an audit. We don't want to leak this into signals for example.
>

I’m okay with this.

My suggestion for having an extended pt_regs that contains pt_regs is to keep extensions like this invisible to unsuspecting parts of the kernel. In particular, BPF seems to pass around struct pt_regs *, and I don’t know what the implications of effectively offsetting all the registers relative to the pointer would be.

Anything that actually broke the signal regs ABI should be noticed by the x86 selftests — the tests read and write registers through ucontext.

>

2020-12-17 17:00:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

On Mon, Nov 23 2020 at 22:09, ira weiny wrote:
> From: Ira Weiny <[email protected]>
>
> Currently struct irqentry_state_t only contains a single bool value
> which makes passing it by value is reasonable. However, future patches
> add information to this struct. This includes the PKRS thread state,
> included in this series, as well as information to store kmap reference
> tracking and PKS global state outside this series. In total, we
> anticipate 2 new 32 bit fields and an integer field to be added to the
> struct beyond the existing bool value.

Well yes, but why can't you provide at least in the comment section
below the '---' a pointer to the latest version of this reference muck
and PKS global state if you can't explain at least the concept of the
two things here?

It's one thing that you anticipate something but a different thing
whether it's the right thing to do.

Thanks,

tglx