With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
to switch away from a task inside copy_{from,to}_user. This left the CPU
with userspace access enabled until after the next IRQ or privilege
level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
switching back to the original task, the userspace access would fault:
Kernel attempted to write user page (3fff7ab68190) - exploit attempt? (uid: 65536)
------------[ cut here ]------------
Bug: Write fault blocked by KUAP!
WARNING: CPU: 56 PID: 4939 at arch/powerpc/mm/fault.c:228 ___do_page_fault+0x7b4/0xaa0
CPU: 56 PID: 4939 Comm: git Tainted: G W 5.19.8-00005-gba424747260d #1
NIP: c0000000000555e4 LR: c0000000000555e0 CTR: c00000000079d9d0
REGS: c00000008f507370 TRAP: 0700 Tainted: G W (5.19.8-00005-gba424747260d)
MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 28042222 XER: 20040000
CFAR: c000000000123780 IRQMASK: 3
NIP [c0000000000555e4] ___do_page_fault+0x7b4/0xaa0
LR [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0
Call Trace:
[c00000008f507610] [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0 (unreliable)
[c00000008f5076c0] [c000000000055938] do_page_fault+0x68/0x130
[c00000008f5076f0] [c000000000008914] data_access_common_virt+0x194/0x1f0
--- interrupt: 300 at __copy_tofrom_user_base+0x9c/0x5a4
NIP: c00000000007b1a8 LR: c00000000073f4d4 CTR: 0000000000000080
REGS: c00000008f507760 TRAP: 0300 Tainted: G W (5.19.8-00005-gba424747260d)
MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040000
CFAR: c00000000007b174 DAR: 00003fff7ab68190 DSISR: 0a000000 IRQMASK: 0
NIP [c00000000007b1a8] __copy_tofrom_user_base+0x9c/0x5a4
LR [c00000000073f4d4] copyout+0x74/0x150
--- interrupt: 300
[c00000008f507a30] [c0000000007430cc] copy_page_to_iter+0x12c/0x4b0
[c00000008f507ab0] [c0000000002c7c20] filemap_read+0x200/0x460
[c00000008f507bf0] [c0000000005f96f4] xfs_file_buffered_read+0x104/0x170
[c00000008f507c30] [c0000000005f9800] xfs_file_read_iter+0xa0/0x150
[c00000008f507c70] [c0000000003bddc8] new_sync_read+0x108/0x180
[c00000008f507d10] [c0000000003c06b0] vfs_read+0x1d0/0x240
[c00000008f507d60] [c0000000003c0ba4] ksys_read+0x84/0x140
[c00000008f507db0] [c00000000002a3fc] system_call_exception+0x15c/0x300
[c00000008f507e10] [c00000000000c63c] system_call_common+0xec/0x250
--- interrupt: c00 at 0x3fff83aa7238
NIP: 00003fff83aa7238 LR: 00003fff83a923b8 CTR: 0000000000000000
REGS: c00000008f507e80 TRAP: 0c00 Tainted: G W (5.19.8-00005-gba424747260d)
MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 80002482 XER: 00000000
IRQMASK: 0
NIP [00003fff83aa7238] 0x3fff83aa7238
LR [00003fff83a923b8] 0x3fff83a923b8
--- interrupt: c00
Instruction dump:
e87f0100 48101021 60000000 2c230000 4182fee8 408e0128 3c82ff80 3884e978
3c62ff80 3863ea78 480ce13d 60000000 <0fe00000> fb010070 fb810090 e80100c0
---[ end trace 0000000000000000 ]---
Fix this by saving and restoring the kernel-side AMR/IAMR values when
switching tasks.
Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
Signed-off-by: Samuel Holland <[email protected]>
---
I have no idea if this is the right change to make, and it could be
optimized, but my system has been stable with this patch for 5 days now.
Without the patch, I hit the bug every few minutes when my load average
is <1, and I hit it immediately if I try to do a parallel kernel build.
Because of the instability (file I/O randomly raises SIGBUS), I don't
think anyone would run a system in this configuration, so I don't think
this bug is exploitable.
arch/powerpc/kernel/process.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 0fbda89cd1bb..69b189d63124 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1150,6 +1150,12 @@ static inline void save_sprs(struct thread_struct *t)
*/
t->tar = mfspr(SPRN_TAR);
}
+ if (t->regs) {
+ if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
+ t->regs->amr = mfspr(SPRN_AMR);
+ if (mmu_has_feature(MMU_FTR_BOOK3S_KUEP))
+ t->regs->iamr = mfspr(SPRN_IAMR);
+ }
#endif
}
@@ -1228,6 +1234,13 @@ static inline void restore_sprs(struct thread_struct *old_thread,
if (cpu_has_feature(CPU_FTR_P9_TIDR) &&
old_thread->tidr != new_thread->tidr)
mtspr(SPRN_TIDR, new_thread->tidr);
+ if (new_thread->regs) {
+ if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
+ mtspr(SPRN_AMR, new_thread->regs->amr);
+ if (mmu_has_feature(MMU_FTR_BOOK3S_KUEP))
+ mtspr(SPRN_IAMR, new_thread->regs->iamr);
+ isync();
+ }
#endif
}
--
2.35.1
On 16/09/2022 at 07:05, Samuel Holland wrote:
> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
> to switch away from a task inside copy_{from,to}_user. This left the CPU
> with userspace access enabled until after the next IRQ or privilege
> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
> switching back to the original task, the userspace access would fault:
This is not supposed to happen. You never switch away from a task
magically. Task switch will always happen in an interrupt, that means
copy_{from,to}_user() get interrupted.
Whenever an interrupt is taken, kuap_save_amr_and_lock() macro is used
to save KUAP status into the stack then lock KUAP access. At interrupt
exit, kuap_kernel_restore() macro or function is used to restore KUAP
access from the stack. At the time the task switch happens, KUAP access
is expected to be locked. During task switch, the stack is switched so
the KUAP status is taken back from the new task's stack.
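In pseudo-code, the intent is roughly the following (a simplified
sketch of what the entry/exit code does on 64s radix, not the actual
macros):

	/* interrupt entry: conceptually what kuap_save_amr_and_lock does */
	regs->amr = mfspr(SPRN_AMR);		/* save the interrupted context's KUAP state */
	mtspr(SPRN_AMR, AMR_KUAP_BLOCKED);	/* lock user access while in the handler */
	isync();

	/* interrupt exit: conceptually what kuap_kernel_restore does */
	mtspr(SPRN_AMR, regs->amr);		/* put back whatever the interrupted code had */
	isync();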
Your fix suggests that there is some path where the KUAP status is not
properly saved and/or restored. Did you try running with
CONFIG_PPC_KUAP_DEBUG ? It should warn whenever a KUAP access is left
unlocked.
>
> Kernel attempted to write user page (3fff7ab68190) - exploit attempt? (uid: 65536)
> ------------[ cut here ]------------
> Bug: Write fault blocked by KUAP!
> WARNING: CPU: 56 PID: 4939 at arch/powerpc/mm/fault.c:228 ___do_page_fault+0x7b4/0xaa0
> CPU: 56 PID: 4939 Comm: git Tainted: G W 5.19.8-00005-gba424747260d #1
> NIP: c0000000000555e4 LR: c0000000000555e0 CTR: c00000000079d9d0
> REGS: c00000008f507370 TRAP: 0700 Tainted: G W (5.19.8-00005-gba424747260d)
> MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 28042222 XER: 20040000
> CFAR: c000000000123780 IRQMASK: 3
> NIP [c0000000000555e4] ___do_page_fault+0x7b4/0xaa0
> LR [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0
> Call Trace:
> [c00000008f507610] [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0 (unreliable)
> [c00000008f5076c0] [c000000000055938] do_page_fault+0x68/0x130
> [c00000008f5076f0] [c000000000008914] data_access_common_virt+0x194/0x1f0
> --- interrupt: 300 at __copy_tofrom_user_base+0x9c/0x5a4
...
>
> Fix this by saving and restoring the kernel-side AMR/IAMR values when
> switching tasks.
As explained above, KUAP access should be locked at that time, so saving
and restoring it should not have any effect. If it does, it means
something goes wrong somewhere else.
>
> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
> Signed-off-by: Samuel Holland <[email protected]>
> ---
> I have no idea if this is the right change to make, and it could be
> optimized, but my system has been stable with this patch for 5 days now.
>
> Without the patch, I hit the bug every few minutes when my load average
> is <1, and I hit it immediately if I try to do a parallel kernel build.
Great, then can you make a try with CONFIG_PPC_KUAP_DEBUG ?
>
> Because of the instability (file I/O randomly raises SIGBUS), I don't
> think anyone would run a system in this configuration, so I don't think
> this bug is exploitable.
>
> arch/powerpc/kernel/process.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 0fbda89cd1bb..69b189d63124 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1150,6 +1150,12 @@ static inline void save_sprs(struct thread_struct *t)
> */
> t->tar = mfspr(SPRN_TAR);
> }
> + if (t->regs) {
> + if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
> + t->regs->amr = mfspr(SPRN_AMR);
> + if (mmu_has_feature(MMU_FTR_BOOK3S_KUEP))
> + t->regs->iamr = mfspr(SPRN_IAMR);
> + }
> #endif
> }
>
> @@ -1228,6 +1234,13 @@ static inline void restore_sprs(struct thread_struct *old_thread,
> if (cpu_has_feature(CPU_FTR_P9_TIDR) &&
> old_thread->tidr != new_thread->tidr)
> mtspr(SPRN_TIDR, new_thread->tidr);
> + if (new_thread->regs) {
> + if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
> + mtspr(SPRN_AMR, new_thread->regs->amr);
> + if (mmu_has_feature(MMU_FTR_BOOK3S_KUEP))
> + mtspr(SPRN_IAMR, new_thread->regs->iamr);
> + isync();
> + }
> #endif
>
> }
On 9/17/22 03:16, Christophe Leroy wrote:
> On 16/09/2022 at 07:05, Samuel Holland wrote:
>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>> with userspace access enabled until after the next IRQ or privilege
>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>> switching back to the original task, the userspace access would fault:
>
> This is not supposed to happen. You never switch away from a task
> magically. Task switch will always happen in an interrupt, that means
> copy_{from,to}_user() get interrupted.
That makes sense, the interrupt handler is responsible for saving the
KUAP status. It looks like neither DEFINE_INTERRUPT_HANDLER_RAW nor any
of its users (performance_monitor_exception(), do_slb_fault()) do that.
Yet they still call one of the interrupt_return variants, which restores
the status from the stack.
> Whenever an interrupt is taken, kuap_save_amr_and_lock() macro is used
> to save KUAP status into the stack then lock KUAP access. At interrupt
> exit, kuap_kernel_restore() macro or function is used to restore KUAP
> access from the stack. At the time the task switch happens, KUAP access
> is expected to be locked. During task switch, the stack is switched so
> the KUAP status is taken back from the new task's stack.
What if another task calls schedule() from kernel process context, and
the scheduler switches to a task that had been preempted inside
copy_{from,to}_user()? Then there is no interrupt involved, and I don't
see where kuap_kernel_restore() would get called.
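The sequence I have in mind is something like this (hypothetical, I do
not have a trace showing it):

  task A: copy_to_user()
            -> allow_write_to_user()      (user access unlocked)
            -> preempted, __schedule() switches to task B
  task B: runs in kernel process context, eventually calls schedule()
            -> __schedule() switches back to task A
  task A: resumes inside copy_to_user() without going through an
          interrupt return, so nothing restores its KUAP state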
> Your fix suggests that there is some path where the KUAP status is not
> properly saved and/or restored. Did you try running with
> CONFIG_PPC_KUAP_DEBUG ? It should warn whenever a KUAP access is left
> unlocked.
>
>>
>> Kernel attempted to write user page (3fff7ab68190) - exploit attempt? (uid: 65536)
>> ------------[ cut here ]------------
>> Bug: Write fault blocked by KUAP!
>> WARNING: CPU: 56 PID: 4939 at arch/powerpc/mm/fault.c:228 ___do_page_fault+0x7b4/0xaa0
>> CPU: 56 PID: 4939 Comm: git Tainted: G W 5.19.8-00005-gba424747260d #1
>> NIP: c0000000000555e4 LR: c0000000000555e0 CTR: c00000000079d9d0
>> REGS: c00000008f507370 TRAP: 0700 Tainted: G W (5.19.8-00005-gba424747260d)
>> MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 28042222 XER: 20040000
>> CFAR: c000000000123780 IRQMASK: 3
>> NIP [c0000000000555e4] ___do_page_fault+0x7b4/0xaa0
>> LR [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0
>> Call Trace:
>> [c00000008f507610] [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0 (unreliable)
>> [c00000008f5076c0] [c000000000055938] do_page_fault+0x68/0x130
>> [c00000008f5076f0] [c000000000008914] data_access_common_virt+0x194/0x1f0
>> --- interrupt: 300 at __copy_tofrom_user_base+0x9c/0x5a4
>
> ...
>
>>
>> Fix this by saving and restoring the kernel-side AMR/IAMR values when
>> switching tasks.
>
> As explained above, KUAP access should be locked at that time, so saving
> and restoring it should not have any effect. If it does, it means
> something goes wrong somewhere else.
>
>>
>> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
>> Signed-off-by: Samuel Holland <[email protected]>
>> ---
>> I have no idea if this is the right change to make, and it could be
>> optimized, but my system has been stable with this patch for 5 days now.
>>
>> Without the patch, I hit the bug every few minutes when my load average
>> is <1, and I hit it immediately if I try to do a parallel kernel build.
>
> Great, then can you make a try with CONFIG_PPC_KUAP_DEBUG ?
Yes, I will try this out in the next few days.
Regards,
Samuel
On 17/09/2022 at 20:38, Samuel Holland wrote:
> On 9/17/22 03:16, Christophe Leroy wrote:
>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>> with userspace access enabled until after the next IRQ or privilege
>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>> switching back to the original task, the userspace access would fault:
>>
>> This is not supposed to happen. You never switch away from a task
>> magically. Task switch will always happen in an interrupt, that means
>> copy_{from,to}_user() get interrupted.
>
> That makes sense, the interrupt handler is responsible for saving the
> KUAP status. It looks like neither DEFINE_INTERRUPT_HANDLER_RAW nor any
> of its users (performance_monitor_exception(), do_slb_fault()) do that.
As far as I can see, that's done earlier, in exceptions-64s.S.
Look for kuap_save_amr_and_lock.
Now, it may be possible that one of the exception paths misses it.
> Yet they still call one of the interrupt_return variants, which restores
> the status from the stack.
>
>> Whenever an interrupt is taken, kuap_save_amr_and_lock() macro is used
>> to save KUAP status into the stack then lock KUAP access. At interrupt
>> exit, kuap_kernel_restore() macro or function is used to restore KUAP
>> access from the stack. At the time the task switch happens, KUAP access
>> is expected to be locked. During task switch, the stack is switched so
>> the KUAP status is taken back from the new task's stack.
>
> What if another task calls schedule() from kernel process context, and
> the scheduler switches to a task that had been preempted inside
> copy_{from,to}_user()? Then there is no interrupt involved, and I don't
> see where kuap_kernel_restore() would get called.
Yes, there is an interrupt involved. If that task was preempted inside
copy_from_user(), it must have gone through an interrupt, likely a
timer interrupt. So switching back to that task means doing an
interrupt return in the context of that task. That's when the KUAP
status should be restored.
>
>> Your fix suggests that there is some path where the KUAP status is not
>> properly saved and/or restored. Did you try running with
>> CONFIG_PPC_KUAP_DEBUG ? It should warn whenever a KUAP access is left
>> unlocked.
>>
>>>
>>> Kernel attempted to write user page (3fff7ab68190) - exploit attempt? (uid: 65536)
>>> ------------[ cut here ]------------
>>> Bug: Write fault blocked by KUAP!
>>> WARNING: CPU: 56 PID: 4939 at arch/powerpc/mm/fault.c:228 ___do_page_fault+0x7b4/0xaa0
>>> CPU: 56 PID: 4939 Comm: git Tainted: G W 5.19.8-00005-gba424747260d #1
>>> NIP: c0000000000555e4 LR: c0000000000555e0 CTR: c00000000079d9d0
>>> REGS: c00000008f507370 TRAP: 0700 Tainted: G W (5.19.8-00005-gba424747260d)
>>> MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 28042222 XER: 20040000
>>> CFAR: c000000000123780 IRQMASK: 3
>>> NIP [c0000000000555e4] ___do_page_fault+0x7b4/0xaa0
>>> LR [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0
>>> Call Trace:
>>> [c00000008f507610] [c0000000000555e0] ___do_page_fault+0x7b0/0xaa0 (unreliable)
>>> [c00000008f5076c0] [c000000000055938] do_page_fault+0x68/0x130
>>> [c00000008f5076f0] [c000000000008914] data_access_common_virt+0x194/0x1f0
>>> --- interrupt: 300 at __copy_tofrom_user_base+0x9c/0x5a4
>>
>> ...
>>
>>>
>>> Fix this by saving and restoring the kernel-side AMR/IAMR values when
>>> switching tasks.
>>
>> As explained above, KUAP access should be locked at that time, so saving
>> and restoring it should not have any effect. If it does, it means
>> something goes wrong somewhere else.
>>
>>>
>>> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
>>> Signed-off-by: Samuel Holland <[email protected]>
>>> ---
>>> I have no idea if this is the right change to make, and it could be
>>> optimized, but my system has been stable with this patch for 5 days now.
>>>
>>> Without the patch, I hit the bug every few minutes when my load average
>>> is <1, and I hit it immediately if I try to do a parallel kernel build.
>>
>> Great, then can you make a try with CONFIG_PPC_KUAP_DEBUG ?
>
> Yes, I will try this out in the next few days.
>
Thanks
Christophe
Christophe Leroy <[email protected]> writes:
> On 16/09/2022 at 07:05, Samuel Holland wrote:
>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>> with userspace access enabled until after the next IRQ or privilege
>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>> switching back to the original task, the userspace access would fault:
>
> This is not supposed to happen. You never switch away from a task
> magically. Task switch will always happen in an interrupt, that means
> copy_{from,to}_user() get interrupted.
Unfortunately this isn't true when CONFIG_PREEMPT=y.
We can switch away without an interrupt via:
__copy_tofrom_user()
-> __copy_tofrom_user_power7()
-> exit_vmx_usercopy()
-> preempt_enable()
-> __preempt_schedule()
-> preempt_schedule()
-> preempt_schedule_common()
-> __schedule()
I do some boot tests with CONFIG_PREEMPT=y, but I realise now those are
all on Power8, which is a bit of an oversight on my part.
And clearly no one else tests it, until now :)
I think the root of our problem is that our KUAP lock/unlock is at too
high a level, ie. we do it in C around the low-level copy to/from.
eg:
static inline unsigned long
raw_copy_to_user(void __user *to, const void *from, unsigned long n)
{
unsigned long ret;
allow_write_to_user(to, n);
ret = __copy_tofrom_user(to, (__force const void __user *)from, n);
prevent_write_to_user(to, n);
return ret;
}
There's a reason we did that, which is that we have various different
KUAP methods on different platforms, not a simple instruction like other
arches.
But that means we have that exit_vmx_usercopy() being called deep in the
guts of __copy_tofrom_user(), with KUAP disabled, and then we call into
the preempt machinery and eventually schedule.
I don't see an easy way to fix that "properly", it would be a big change
to all platforms to push the KUAP save/restore down into the low level
asm code.
But I think the patch below does fix it, although it abuses things a
little. Namely it only works because the 64s KUAP code can handle a
double call to prevent, and doesn't need the addresses or size for the
allow.
Still I think it might be our best option for an easy fix.
Samuel, can you try this on your system and check it works for you?
cheers
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 97a77b37daa3..c50080c6a136 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -432,6 +432,7 @@ int speround_handler(struct pt_regs *regs);
/* VMX copying */
int enter_vmx_usercopy(void);
int exit_vmx_usercopy(void);
+void exit_vmx_usercopy_continue(void);
int enter_vmx_ops(void);
void *exit_vmx_ops(void *dest);
diff --git a/arch/powerpc/lib/copyuser_power7.S b/arch/powerpc/lib/copyuser_power7.S
index 28f0be523c06..77804860383c 100644
--- a/arch/powerpc/lib/copyuser_power7.S
+++ b/arch/powerpc/lib/copyuser_power7.S
@@ -47,7 +47,7 @@
ld r15,STK_REG(R15)(r1)
ld r14,STK_REG(R14)(r1)
.Ldo_err3:
- bl exit_vmx_usercopy
+ bl exit_vmx_usercopy_continue
ld r0,STACKFRAMESIZE+16(r1)
mtlr r0
b .Lexit
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index f76a50291fd7..78a18b8384ff 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -8,6 +8,7 @@
*/
#include <linux/uaccess.h>
#include <linux/hardirq.h>
+#include <asm/kup.h>
#include <asm/switch_to.h>
int enter_vmx_usercopy(void)
@@ -34,12 +35,19 @@ int enter_vmx_usercopy(void)
*/
int exit_vmx_usercopy(void)
{
+ prevent_user_access(KUAP_READ_WRITE);
disable_kernel_altivec();
pagefault_enable();
preempt_enable();
return 0;
}
+void exit_vmx_usercopy_continue(void)
+{
+ exit_vmx_usercopy();
+ allow_read_write_user(NULL, NULL, 0);
+}
+
int enter_vmx_ops(void)
{
if (in_interrupt())
On 19/09/2022 at 14:37, Michael Ellerman wrote:
> Christophe Leroy <[email protected]> writes:
>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>> with userspace access enabled until after the next IRQ or privilege
>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>> switching back to the original task, the userspace access would fault:
>>
>> This is not supposed to happen. You never switch away from a task
>> magically. Task switch will always happen in an interrupt, that means
>> copy_{from,to}_user() get interrupted.
>
> Unfortunately this isn't true when CONFIG_PREEMPT=y.
Argh, yes, I wrote the above with the assumption that we properly follow
the main principle that no complex function is to be used while KUAP is
open ... Which is apparently not true here. x86 would have detected it
with objtool, but we don't have it yet in powerpc.
>
> We can switch away without an interrupt via:
> __copy_tofrom_user()
> -> __copy_tofrom_user_power7()
> -> exit_vmx_usercopy()
> -> preempt_enable()
> -> __preempt_schedule()
> -> preempt_schedule()
> -> preempt_schedule_common()
> -> __schedule()
Should we use preempt_enable_no_resched() to avoid that?
>
> I do some boot tests with CONFIG_PREEMPT=y, but I realise now those are
> all on Power8, which is a bit of an oversight on my part.
>
> And clearly no one else tests it, until now :)
>
> I think the root of our problem is that our KUAP lock/unlock is at too
> high a level, ie. we do it in C around the low-level copy to/from.
>
> eg:
>
> static inline unsigned long
> raw_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> unsigned long ret;
>
> allow_write_to_user(to, n);
> ret = __copy_tofrom_user(to, (__force const void __user *)from, n);
> prevent_write_to_user(to, n);
> return ret;
> }
>
> There's a reason we did that, which is that we have various different
> KUAP methods on different platforms, not a simple instruction like other
> arches.
>
> But that means we have that exit_vmx_usercopy() being called deep in the
> guts of __copy_tofrom_user(), with KUAP disabled, and then we call into
> the preempt machinery and eventually schedule.
>
> I don't see an easy way to fix that "properly", it would be a big change
> to all platforms to push the KUAP save/restore down into the low level
> asm code.
>
> But I think the patch below does fix it, although it abuses things a
> little. Namely it only works because the 64s KUAP code can handle a
> double call to prevent, and doesn't need the addresses or size for the
> allow.
>
> Still I think it might be our best option for an easy fix.
Wouldn't it be even easier and less abusive to use
preempt_enable_no_resched()? Or is there definitely a good reason to
resched after a VMX copy while we don't with regular copies?
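Something like this, i.e. the existing function with just the one call
swapped (untested sketch):

int exit_vmx_usercopy(void)
{
	disable_kernel_altivec();
	pagefault_enable();
	/* drop the preempt count without rescheduling while user access is open */
	preempt_enable_no_resched();
	return 0;
}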
Christophe
On 9/19/22 07:37, Michael Ellerman wrote:
> Christophe Leroy <[email protected]> writes:
>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>> with userspace access enabled until after the next IRQ or privilege
>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>> switching back to the original task, the userspace access would fault:
>>
>> This is not supposed to happen. You never switch away from a task
>> magically. Task switch will always happen in an interrupt, that means
>> copy_{from,to}_user() get interrupted.
>
> Unfortunately this isn't true when CONFIG_PREEMPT=y.
>
> We can switch away without an interrupt via:
> __copy_tofrom_user()
> -> __copy_tofrom_user_power7()
> -> exit_vmx_usercopy()
> -> preempt_enable()
> -> __preempt_schedule()
> -> preempt_schedule()
> -> preempt_schedule_common()
> -> __schedule()
>
> I do some boot tests with CONFIG_PREEMPT=y, but I realise now those are
> all on Power8, which is a bit of an oversight on my part.
>
> And clearly no one else tests it, until now :)
>
> I think the root of our problem is that our KUAP lock/unlock is at too
> high a level, ie. we do it in C around the low-level copy to/from.
>
> eg:
>
> static inline unsigned long
> raw_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> unsigned long ret;
>
> allow_write_to_user(to, n);
> ret = __copy_tofrom_user(to, (__force const void __user *)from, n);
> prevent_write_to_user(to, n);
> return ret;
> }
>
> There's a reason we did that, which is that we have various different
> KUAP methods on different platforms, not a simple instruction like other
> arches.
>
> But that means we have that exit_vmx_usercopy() being called deep in the
> guts of __copy_tofrom_user(), with KUAP disabled, and then we call into
> the preempt machinery and eventually schedule.
>
> I don't see an easy way to fix that "properly", it would be a big change
> to all platforms to push the KUAP save/restore down into the low level
> asm code.
>
> But I think the patch below does fix it, although it abuses things a
> little. Namely it only works because the 64s KUAP code can handle a
> double call to prevent, and doesn't need the addresses or size for the
> allow.
>
> Still I think it might be our best option for an easy fix.
>
> Samuel, can you try this on your system and check it works for you?
It looks like your patch works. Thanks for the correct fix!
I replaced my patch with the one below, and enabled
CONFIG_PPC_KUAP_DEBUG=y, and I was able to do several kernel builds
without any crashes or splats in dmesg.
I suppose the other calls to exit_vmx_usercopy() in copyuser_power7.S
would not cause a crash, because there is no userspace memory access
afterward, but couldn't they still leave KUAP erroneously unlocked?
Regards,
Samuel
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index 97a77b37daa3..c50080c6a136 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -432,6 +432,7 @@ int speround_handler(struct pt_regs *regs);
> /* VMX copying */
> int enter_vmx_usercopy(void);
> int exit_vmx_usercopy(void);
> +void exit_vmx_usercopy_continue(void);
> int enter_vmx_ops(void);
> void *exit_vmx_ops(void *dest);
>
> diff --git a/arch/powerpc/lib/copyuser_power7.S b/arch/powerpc/lib/copyuser_power7.S
> index 28f0be523c06..77804860383c 100644
> --- a/arch/powerpc/lib/copyuser_power7.S
> +++ b/arch/powerpc/lib/copyuser_power7.S
> @@ -47,7 +47,7 @@
> ld r15,STK_REG(R15)(r1)
> ld r14,STK_REG(R14)(r1)
> .Ldo_err3:
> - bl exit_vmx_usercopy
> + bl exit_vmx_usercopy_continue
> ld r0,STACKFRAMESIZE+16(r1)
> mtlr r0
> b .Lexit
> diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
> index f76a50291fd7..78a18b8384ff 100644
> --- a/arch/powerpc/lib/vmx-helper.c
> +++ b/arch/powerpc/lib/vmx-helper.c
> @@ -8,6 +8,7 @@
> */
> #include <linux/uaccess.h>
> #include <linux/hardirq.h>
> +#include <asm/kup.h>
> #include <asm/switch_to.h>
>
> int enter_vmx_usercopy(void)
> @@ -34,12 +35,19 @@ int enter_vmx_usercopy(void)
> */
> int exit_vmx_usercopy(void)
> {
> + prevent_user_access(KUAP_READ_WRITE);
> disable_kernel_altivec();
> pagefault_enable();
> preempt_enable();
> return 0;
> }
>
> +void exit_vmx_usercopy_continue(void)
> +{
> + exit_vmx_usercopy();
> + allow_read_write_user(NULL, NULL, 0);
> +}
> +
> int enter_vmx_ops(void)
> {
> if (in_interrupt())
>
On 21/09/2022 at 05:33, Samuel Holland wrote:
> On 9/19/22 07:37, Michael Ellerman wrote:
>> Christophe Leroy <[email protected]> writes:
>>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>>> with userspace access enabled until after the next IRQ or privilege
>>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>>> switching back to the original task, the userspace access would fault:
>>>
>>> This is not supposed to happen. You never switch away from a task
>>> magically. Task switch will always happen in an interrupt, that means
>>> copy_{from,to}_user() get interrupted.
>>
>> Unfortunately this isn't true when CONFIG_PREEMPT=y.
>>
>> We can switch away without an interrupt via:
>> __copy_tofrom_user()
>> -> __copy_tofrom_user_power7()
>> -> exit_vmx_usercopy()
>> -> preempt_enable()
>> -> __preempt_schedule()
>> -> preempt_schedule()
>> -> preempt_schedule_common()
>> -> __schedule()
>>
>> I do some boot tests with CONFIG_PREEMPT=y, but I realise now those are
>> all on Power8, which is a bit of an oversight on my part.
>>
>> And clearly no one else tests it, until now :)
>>
>> I think the root of our problem is that our KUAP lock/unlock is at too
>> high a level, ie. we do it in C around the low-level copy to/from.
>>
>> eg:
>>
>> static inline unsigned long
>> raw_copy_to_user(void __user *to, const void *from, unsigned long n)
>> {
>> unsigned long ret;
>>
>> allow_write_to_user(to, n);
>> ret = __copy_tofrom_user(to, (__force const void __user *)from, n);
>> prevent_write_to_user(to, n);
>> return ret;
>> }
>>
>> There's a reason we did that, which is that we have various different
>> KUAP methods on different platforms, not a simple instruction like other
>> arches.
>>
>> But that means we have that exit_vmx_usercopy() being called deep in the
>> guts of __copy_tofrom_user(), with KUAP disabled, and then we call into
>> the preempt machinery and eventually schedule.
>>
>> I don't see an easy way to fix that "properly", it would be a big change
>> to all platforms to push the KUAP save/restore down into the low level
>> asm code.
>>
>> But I think the patch below does fix it, although it abuses things a
>> little. Namely it only works because the 64s KUAP code can handle a
>> double call to prevent, and doesn't need the addresses or size for the
>> allow.
>>
>> Still I think it might be our best option for an easy fix.
>>
>> Samuel, can you try this on your system and check it works for you?
>
> It looks like your patch works. Thanks for the correct fix!
Instead of the patch from Michael, could you try replacing
preempt_enable() with preempt_enable_no_resched() in exit_vmx_usercopy()?
>
> I replaced my patch with the one below, and enabled
> CONFIG_PPC_KUAP_DEBUG=y, and I was able to do several kernel builds
> without any crashes or splats in dmesg.
Did you try CONFIG_PPC_KUAP_DEBUG without the patch? Did it detect any
problem?
Thanks
Christophe
>
> I suppose the other calls to exit_vmx_usercopy() in copyuser_power7.S
> would not cause a crash, because there is no userspace memory access
> afterward, but couldn't they still leave KUAP erroneously unlocked?
>
> Regards,
> Samuel
>
>> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
>> index 97a77b37daa3..c50080c6a136 100644
>> --- a/arch/powerpc/include/asm/processor.h
>> +++ b/arch/powerpc/include/asm/processor.h
>> @@ -432,6 +432,7 @@ int speround_handler(struct pt_regs *regs);
>> /* VMX copying */
>> int enter_vmx_usercopy(void);
>> int exit_vmx_usercopy(void);
>> +void exit_vmx_usercopy_continue(void);
>> int enter_vmx_ops(void);
>> void *exit_vmx_ops(void *dest);
>>
>> diff --git a/arch/powerpc/lib/copyuser_power7.S b/arch/powerpc/lib/copyuser_power7.S
>> index 28f0be523c06..77804860383c 100644
>> --- a/arch/powerpc/lib/copyuser_power7.S
>> +++ b/arch/powerpc/lib/copyuser_power7.S
>> @@ -47,7 +47,7 @@
>> ld r15,STK_REG(R15)(r1)
>> ld r14,STK_REG(R14)(r1)
>> .Ldo_err3:
>> - bl exit_vmx_usercopy
>> + bl exit_vmx_usercopy_continue
>> ld r0,STACKFRAMESIZE+16(r1)
>> mtlr r0
>> b .Lexit
>> diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
>> index f76a50291fd7..78a18b8384ff 100644
>> --- a/arch/powerpc/lib/vmx-helper.c
>> +++ b/arch/powerpc/lib/vmx-helper.c
>> @@ -8,6 +8,7 @@
>> */
>> #include <linux/uaccess.h>
>> #include <linux/hardirq.h>
>> +#include <asm/kup.h>
>> #include <asm/switch_to.h>
>>
>> int enter_vmx_usercopy(void)
>> @@ -34,12 +35,19 @@ int enter_vmx_usercopy(void)
>> */
>> int exit_vmx_usercopy(void)
>> {
>> + prevent_user_access(KUAP_READ_WRITE);
>> disable_kernel_altivec();
>> pagefault_enable();
>> preempt_enable();
>> return 0;
>> }
>>
>> +void exit_vmx_usercopy_continue(void)
>> +{
>> + exit_vmx_usercopy();
>> + allow_read_write_user(NULL, NULL, 0);
>> +}
>> +
>> int enter_vmx_ops(void)
>> {
>> if (in_interrupt())
>>
>
Christophe Leroy <[email protected]> writes:
> On 19/09/2022 at 14:37, Michael Ellerman wrote:
>> Christophe Leroy <[email protected]> writes:
>>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>>> with userspace access enabled until after the next IRQ or privilege
>>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>>> switching back to the original task, the userspace access would fault:
>>>
>>> This is not supposed to happen. You never switch away from a task
>>> magically. Task switch will always happen in an interrupt, that means
>>> copy_{from,to}_user() get interrupted.
>>
>> Unfortunately this isn't true when CONFIG_PREEMPT=y.
>
> Argh, yes, I wrote the above with the assumption that we properly follow
> the main principle that no complex function is to be used while KUAP is
> open ... Which is apparently not true here. x86 would have detected it
> with objtool, but we don't have it yet in powerpc.
Yes and yes :/
>> We can switch away without an interrupt via:
>> __copy_tofrom_user()
>> -> __copy_tofrom_user_power7()
>> -> exit_vmx_usercopy()
>> -> preempt_enable()
>> -> __preempt_schedule()
>> -> preempt_schedule()
>> -> preempt_schedule_common()
>> -> __schedule()
>
>
> Should we use preempt_enable_no_resched() to avoid that?
Good point :)
...
>>
>> Still I think it might be our best option for an easy fix.
>
> Wouldn't it be even easier and less abusive to use
> preempt_enable_no_resched()? Or is there definitely a good reason to
> resched after a VMX copy while we don't with regular copies?
I don't think it's important to reschedule there. I guess it means
another task that wants to preempt will have to wait a little longer,
but I doubt it's measurable.
One reason to do the KUAP lock is it also protects us from running
disable_kernel_altivec() with KUAP unlocked.
That in turn calls into msr_check_and_clear() and
__msr_check_and_clear(), none of which is a problem per se, but we'd
need to make that all notrace to be 100% safe.
At the moment we're running that all with KUAP unlocked anyway, so using
preempt_enable_no_resched() would fix the actual bug and is much nicer
than my patch, so we should probably do that.
We can look at making disable_kernel_altivec() etc. notrace as a
follow-up.
cheers
On 9/21/22 00:17, Christophe Leroy wrote:
> On 21/09/2022 at 05:33, Samuel Holland wrote:
>> On 9/19/22 07:37, Michael Ellerman wrote:
>>> Christophe Leroy <[email protected]> writes:
>>>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>>>> with userspace access enabled until after the next IRQ or privilege
>>>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>>>> switching back to the original task, the userspace access would fault:
>>>>
>>>> This is not supposed to happen. You never switch away from a task
>>>> magically. Task switch will always happen in an interrupt, that means
>>>> copy_{from,to}_user() get interrupted.
>>>
>>> Unfortunately this isn't true when CONFIG_PREEMPT=y.
>>>
>>> We can switch away without an interrupt via:
>>> __copy_tofrom_user()
>>> -> __copy_tofrom_user_power7()
>>> -> exit_vmx_usercopy()
>>> -> preempt_enable()
>>> -> __preempt_schedule()
>>> -> preempt_schedule()
>>> -> preempt_schedule_common()
>>> -> __schedule()
>>>
>>> I do some boot tests with CONFIG_PREEMPT=y, but I realise now those are
>>> all on Power8, which is a bit of an oversight on my part.
>>>
>>> And clearly no one else tests it, until now :)
>>>
>>> I think the root of our problem is that our KUAP lock/unlock is at too
>>> high a level, ie. we do it in C around the low-level copy to/from.
>>>
>>> eg:
>>>
>>> static inline unsigned long
>>> raw_copy_to_user(void __user *to, const void *from, unsigned long n)
>>> {
>>> unsigned long ret;
>>>
>>> allow_write_to_user(to, n);
>>> ret = __copy_tofrom_user(to, (__force const void __user *)from, n);
>>> prevent_write_to_user(to, n);
>>> return ret;
>>> }
>>>
>>> There's a reason we did that, which is that we have various different
>>> KUAP methods on different platforms, not a simple instruction like other
>>> arches.
>>>
>>> But that means we have that exit_vmx_usercopy() being called deep in the
>>> guts of __copy_tofrom_user(), with KUAP disabled, and then we call into
>>> the preempt machinery and eventually schedule.
>>>
>>> I don't see an easy way to fix that "properly", it would be a big change
>>> to all platforms to push the KUAP save/restore down into the low level
>>> asm code.
>>>
>>> But I think the patch below does fix it, although it abuses things a
>>> little. Namely it only works because the 64s KUAP code can handle a
>>> double call to prevent, and doesn't need the addresses or size for the
>>> allow.
>>>
>>> Still I think it might be our best option for an easy fix.
>>>
>>> Samuel, can you try this on your system and check it works for you?
>>
>> It looks like your patch works. Thanks for the correct fix!
>
> Instead of the patch from Michael, could you try replacing
> preempt_enable() with preempt_enable_no_resched() in exit_vmx_usercopy()?
I finally got a chance to test this, and the simpler fix of using
preempt_enable_no_resched() works as well.
>> I replaced my patch with the one below, and enabled
>> CONFIG_PPC_KUAP_DEBUG=y, and I was able to do several kernel builds
>> without any crashes or splats in dmesg.
>
> Did you try CONFIG_PPC_KUAP_DEBUG without the patch? Did it detect any
> problem?
I believe I did at one point, and it did not detect anything.
Regards,
Samuel
On 26/10/2022 at 06:27, Samuel Holland wrote:
> On 9/21/22 00:17, Christophe Leroy wrote:
>>
>> Instead of the patch from Michael, could you try replacing
>> preempt_enable() with preempt_enable_no_resched() in exit_vmx_usercopy()?
>
> I finally got a chance to test this, and the simpler fix of using
> preempt_enable_no_resched() works as well.
>
Thanks.
That should be fixed upstream now, by commit
https://github.com/torvalds/linux/commit/00ff1eaac129a24516a3f6d75adfb9df1efb55dd
Christophe