Linus,
Please pull the latest x86-topology-for-linus git tree from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-topology-for-linus
# HEAD: eb876fbc248e6eb4773a5bc80d205ff7262b1bb5 perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
Implement multi-die topology support on Intel CPUs and expose the die
topology to user-space tooling, by Len Brown, Kan Liang and Zhang Rui.
These changes should have no effect on the kernel's existing
understanding of topologies, i.e. there should be no behavioral impact on
the cache, NUMA, scheduler, perf and other topologies, nor on overall
system performance.
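The user-visible side of this is that consumers can now tell dies within a
package apart. A minimal sketch of the new accessors, using the names from
the shortlog below (illustrative only, not code from the series):

  /* Sketch: querying the new die-aware topology accessors. */
  #include <linux/topology.h>

  static void report_die_topology(unsigned int cpu)
  {
          pr_info("CPU%u: package %d, die %d (logical die %d), max %d dies/pkg\n",
                  cpu,
                  topology_physical_package_id(cpu),
                  topology_die_id(cpu),             /* die within its package */
                  topology_logical_die_id(cpu),     /* system-wide die index */
                  topology_max_die_per_package());
  }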
out-of-topic modifications in x86-topology-for-linus:
-------------------------------------------------------
drivers/base/topology.c # 2e4c54dac7b3: topology: Create core_cpus a
# b73ed8dc0597: topology: Create package_cpu
# 0e344d8c709f: cpu/topology: Export die_id
drivers/hwmon/coretemp.c # 835896a59b95: hwmon/coretemp: Cosmetic: Re
# cfcd82e63288: hwmon/coretemp: Support mult
drivers/powercap/intel_rapl.c # 9ea7612c4658: powercap/intel_rapl: Update
# 32fb480e0a2c: powercap/intel_rapl: Support
# aadf7b383371: powercap/intel_rapl: Simplif
drivers/thermal/intel/x86_pkg_temp_thermal.c # b2ce1c883df9: thermal/x86_pkg_temp_thermal
# 724adec33c24: thermal/x86_pkg_temp_thermal
include/linux/topology.h # 2e4c54dac7b3: topology: Create core_cpus a
# 0e344d8c709f: cpu/topology: Export die_id
Thanks,
Ingo
------------------>
Kan Liang (5):
perf/x86/intel/uncore: Support multi-die/package
perf/x86/intel/rapl: Support multi-die/package
perf/x86/intel/cstate: Support multi-die/package
perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
Len Brown (9):
x86/topology: Add CPUID.1F multi-die/package support
x86/topology: Create topology_max_die_per_package()
cpu/topology: Export die_id
x86/topology: Define topology_die_id()
x86/topology: Define topology_logical_die_id()
topology: Create package_cpus sysfs attribute
topology: Create core_cpus and die_cpus sysfs attributes
thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
Zhang Rui (5):
powercap/intel_rapl: Simplify rapl_find_package()
powercap/intel_rapl: Support multi-die/package
thermal/x86_pkg_temp_thermal: Support multi-die/package
powercap/intel_rapl: Update RAPL domain name and debug messages
hwmon/coretemp: Support multi-die/package
Documentation/cputopology.txt | 48 ++++++---
Documentation/x86/topology.rst | 4 +
arch/x86/events/intel/cstate.c | 14 ++-
arch/x86/events/intel/rapl.c | 20 ++--
arch/x86/events/intel/uncore.c | 80 +++++++--------
arch/x86/events/intel/uncore.h | 4 +-
arch/x86/events/intel/uncore_snbep.c | 4 +-
arch/x86/include/asm/processor.h | 4 +-
arch/x86/include/asm/smp.h | 1 +
arch/x86/include/asm/topology.h | 17 ++++
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/cpu/topology.c | 88 +++++++++++++----
arch/x86/kernel/smpboot.c | 69 +++++++++++++
arch/x86/xen/smp_pv.c | 1 +
drivers/base/topology.c | 22 +++++
drivers/hwmon/coretemp.c | 36 +++----
drivers/powercap/intel_rapl.c | 75 +++++++-------
drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++++++++++++++-------------
include/linux/topology.h | 6 ++
19 files changed, 422 insertions(+), 214 deletions(-)
The pull request you sent on Mon, 8 Jul 2019 18:27:56 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-topology-for-linus
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/222a21d29521d144f3dd7a0bc4d4020e448f0126
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
On Tue, Jul 9, 2019 at 2:20 PM Linus Torvalds
<[email protected]> wrote:
>
> Will keep you updated as bisection narrows it down.
Hmm. Already narrowed down to either of
Pull x86 asm updates from Ingo Molnar:
Pull scheduler updates from Ingo Molnar:
but still bisecting to pinpoint the exact thing.
Linus
After doing a lot of build testing and merging, I finally got around
to actually boot-testing the result.
Something in this series has broken booting for me. Or rather, things
*boot*, but then it hangs in user space, with
A start job is running for udev Wait for Complete Device Initialization
it does this on both my desktop and laptop, although the exact hang is
different (on my laptop, it hangs earlier - I don't even get to input
the disk encryption keys).
I'm bisecting, and it will take a while to pinpoint, but I can already
tell that it's from one of these pulls:
Pull x86 topology updates from Ingo Molnar:
Pull x86 platform updates from Ingo Molnar:
Pull x86 paravirt updates from Ingo Molnar:
Pull x86 AVX512 status update from Ingo Molnar:
Pull x86 cleanups from Ingo Molnar:
Pull x86 cache resource control update from Ingo Molnar:
Pull x86 build updates from Ingo Molnar:
Pull x86 asm updates from Ingo Molnar:
Pull scheduler updates from Ingo Molnar:
Will keep you updated as bisection narrows it down.
Linus
On Tue, Jul 9, 2019 at 2:26 PM Linus Torvalds
<[email protected]> wrote:
>
> but still bisecting to pinpoint the exact thing.
One of:
c21ac93288f0 (refs/bisect/bad) Merge tag 'v5.2-rc6' into x86/asm, to
refresh the branch
8dbec27a242c x86/asm: Pin sensitive CR0 bits
873d50d58f67 x86/asm: Pin sensitive CR4 bits
7b347ad4938d (HEAD) Merge tag 'v5.2-rc5' into x86/asm, to refresh the branch
9db9b76767f1 Documentation/x86: Fix path to entry_32.S
7231d0165df3 x86/asm: Remove unused TASK_TI_flags from asm-offsets.c
and I suspect it's the sensitive bit pinning. But I'll delve all the way.
Linus
On Tue, Jul 9, 2019 at 2:45 PM Linus Torvalds
<[email protected]> wrote:
>
> and I suspect it's the sensitive bit pinning. But I'll delve all the way.
Confirmed. Bisection says
873d50d58f67ef15d2777b5e7f7a5268bb1fbae2 is the first bad commit
commit 873d50d58f67ef15d2777b5e7f7a5268bb1fbae2
Author: Kees Cook <[email protected]>
Date: Mon Jun 17 21:55:02 2019 -0700
x86/asm: Pin sensitive CR4 bits
this is on a bog-standard Intel setup with F30, both desktop and
laptop (i9-9900k and i7-8565u respectively).
I haven't confirmed yet whether reverting just that one commit is
required, or if I need to revert the cr0 one too.
I also don't have any logs, because the boot never gets far enough. I
assume that there was a problem bringing up a non-boot CPU, and the
eventual hang ends up being due to that.
Linus
On Tue, Jul 9, 2019 at 3:00 PM Linus Torvalds
<[email protected]> wrote:
>
> I haven't confirmed yet whether reverting just that one commit is
> required, or if I need to revert the cr0 one too.
Oh, I can't revert just the cr4 one, because the cr0 one depends on it.
But reverting both gets my desktop back to life. My laptop (that hung
earlier) seems to be another (possibly additional) issue.
Linus
On Tue, 9 Jul 2019, Linus Torvalds wrote:
> On Tue, Jul 9, 2019 at 2:45 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > and I suspect it's the sensitive bit pinning. But I'll delve all the way.
>
> Confirmed. Bisection says
>
> 873d50d58f67ef15d2777b5e7f7a5268bb1fbae2 is the first bad commit
> commit 873d50d58f67ef15d2777b5e7f7a5268bb1fbae2
> Author: Kees Cook <[email protected]>
> Date: Mon Jun 17 21:55:02 2019 -0700
>
> x86/asm: Pin sensitive CR4 bits
>
> this is on a bog-standard Intel setup with F30, both desktop and
> laptop (i9-9900k and i7-8565u respectively).
>
> I haven't confirmed yet whether reverting just that one commit is
> required, or if I need to revert the cr0 one too.
>
> I also don't have any logs, because the boot never gets far enough. I
> assume that there was a problem bringing up a non-boot CPU, and the
> eventual hang ends up being due to that.
Hrm. I just built the tip of your tree and booted it. It hangs at:
[ 4.788678] ACPI: 4 ACPI AML tables successfully acquired and loaded
[ 4.793860] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[ 4.821476] ACPI: Dynamic OEM Table Load:
That's way after the secondary CPUs have been brought up. tip/master boots
without problems with the same config. Let me try x86/asm alone and also a
revert of the CR4/CR0 stuff.
And while writing this the softlockup detector muttered:
[ 245.849408] INFO: task swapper/0:1 blocked for more than 120 seconds.
[ 245.853406] Not tainted 5.2.0+ #69
[ 245.857318] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 245.865405] swapper/0 D 0 1 0 0x80004000
[ 245.869405] Call Trace:
[ 245.871859] ? __schedule+0x2bb/0x690
[ 245.877410] ? acpi_ps_complete_op+0x259/0x279
[ 245.881406] schedule+0x29/0x90
[ 245.884540] schedule_timeout+0x20d/0x310
[ 245.889409] ? acpi_os_release_object+0xa/0x10
[ 245.893406] ? acpi_ps_delete_parse_tree+0x2d/0x59
[ 245.897405] __down_timeout+0x9b/0x100
[ 245.901150] down_timeout+0x43/0x50
[ 245.905407] acpi_os_wait_semaphore+0x48/0x60
[ 245.909408] acpi_ut_acquire_mutex+0x45/0x89
[ 245.913406] ? acpi_ns_init_one_package+0x44/0x44
[ 245.917406] acpi_walk_namespace+0x62/0xc2
[ 245.921406] acpi_ns_initialize_objects+0x40/0x7b
[ 245.925407] acpi_ex_load_table_op+0x157/0x1b9
[ 245.933406] acpi_ex_opcode_6A_0T_1R+0x158/0x1c2
[ 245.937406] acpi_ds_exec_end_op+0xca/0x401
[ 245.941406] acpi_ps_parse_loop+0x492/0x5c6
[ 245.945406] acpi_ps_parse_aml+0x91/0x2b8
[ 245.949405] acpi_ps_execute_method+0x15d/0x191
[ 245.953406] acpi_ns_evaluate+0x1bf/0x24c
[ 245.957406] acpi_evaluate_object+0x137/0x240
[ 245.961408] ? kmem_cache_alloc_trace+0x15b/0x1c0
[ 245.965406] acpi_processor_set_pdc+0x135/0x180
[ 245.969408] early_init_pdc+0xb3/0xd0
[ 245.973064] acpi_ns_walk_namespace+0xda/0x1aa
[ 245.977407] ? set_no_mwait+0x23/0x23
[ 245.981062] ? set_no_mwait+0x23/0x23
[ 245.985406] acpi_walk_namespace+0x9a/0xc2
[ 245.989406] ? acpi_sleep_proc_init+0x24/0x24
[ 245.993407] ? do_early_param+0x8e/0x8e
[ 245.997235] acpi_early_processor_set_pdc+0x31/0x49
[ 246.001406] acpi_init+0x17c/0x321
Thanks,
tglx
On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> On Tue, 9 Jul 2019, Linus Torvalds wrote:
> > I also don't have any logs, because the boot never gets far enough. I
> > assume that there was a problem bringing up a non-boot CPU, and the
> > eventual hang ends up being due to that.
>
> Hrm. I just built the tip of your tree and booted it. It hangs at:
>
> [ 4.788678] ACPI: 4 ACPI AML tables successfully acquired and loaded
> [ 4.793860] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> [ 4.821476] ACPI: Dynamic OEM Table Load:
>
> That's way after the secondary CPUs have been brought up. tip/master boots
> without problems with the same config. Let me try x86/asm alone and also a
> revert of the CR4/CR0 stuff.
x86/asm works alone
revert of cr4/0 pinning on top of your tree gets stuck as well
> And while writing this the softlockup detector muttered:
Have not yet fully bisected it, but Jiri did and he ended up with
c522ad0637ca ("ACPICA: Update table load object initialization")
Reverting that on top of your tree makes it work again. Tony has the same
issue.
That still does not explain the cr4/0 issue you have. Can you send me your
.config please?
Thanks,
tglx
On Wed, 10 Jul 2019, Thomas Gleixner wrote:
>
> That still does not explain the cr4/0 issue you have. Can you send me your
> .config please?
Does your machine have UMIP support? None of my test boxes has. So that'd
be the difference of bits enforced in CR4. Should not matter because it's
User mode instruction prevention, but who knows.
Thanks,
tglx
On Wed, Jul 10, 2019 at 01:17:11AM +0200, Thomas Gleixner wrote:
> On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> >
> > That still does not explain the cr4/0 issue you have. Can you send me your
> > .config please?
>
> Does your machine have UMIP support? None of my test boxes has. So that'd
> be the difference of bits enforced in CR4. Should not matter because it's
> User mode instruction prevention, but who knows.
Ew. Yeah, I don't have an i9 or an i7 for testing this. I did try everything
else I had (and hibernation). Is only Linus able to reproduce this so far?
To rule out (in?) UMIP, this would remove UMIP from the pinning:
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 309b6b9b49d4..f3beedb6da8a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -380,7 +380,7 @@ static void __init setup_cr_pinning(void)
{
unsigned long mask;
- mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
+ mask = (X86_CR4_SMEP | X86_CR4_SMAP);
cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
static_key_enable(&cr_pinning.key);
}
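(For context: cr4_pinned_bits is the boot CPU's CR4 value masked by this
set, so on UMIP-capable hardware the pinned mask additionally contains
X86_CR4_UMIP; dropping it from the mask tests whether that extra enforced
bit is what makes those machines behave differently.)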
--
Kees Cook
On Tue, Jul 9, 2019 at 5:59 PM Linus Torvalds
<[email protected]> wrote:
>
> On my laptop (which I am at right now), the hang is different, and
> maybe it's similar to your ACPI hang issue. I will try that revert,
> and at least see if that solves the laptop side.
Nope, that wasn't it. Apparently there are three different issues.
I guess I'll just have to do a full bisect on the laptop too. That's
going to take forever.
Linus
On Tue, Jul 9, 2019 at 4:00 PM Thomas Gleixner <[email protected]> wrote:
>
> That still does not explain the cr4/0 issue you have. Can you send me your
> .config please?
Gaah. Now I'm off my desktop and won't be at it again until tomorrow.
And that's the one that bisected to the cr0/cr4 bits (and I did all my
bisection on that because it's so much faster).
Anyway, the kernel config itself is pretty simple. It's basically an
F30 kernel config that has been through "make localmodconfig" and then
pared down some more to remove stuff I don't use (paravirt etc). I
would expect that if it's a configuration difference, it's more likely
to be about the user mode side - since the hang does seem to happen in
user mode, not in the kernel (but a very early problem at cpu bringup
time could easily have been hidden).
On my laptop (which I am at right now), the hang is different, and
maybe it's similar to your ACPI hang issue. I will try that revert,
and at least see if that solves the laptop side.
Linus
On Tue, Jul 9, 2019 at 6:08 PM Linus Torvalds
<[email protected]> wrote:
>
> I guess I'll just have to do a full bisect on the laptop too. That's
> going to take forever.
Whee. It looks like it's bisecting to the same thing. Only partway
done, but it feels very much like my desktop. Right now it's showing
this to be due to
a1aab6f3d295 (refs/bisect/bad) Merge branch 'x86-asm-for-linus' of
git://git.kernel.org/pub/scm/..
dad1c12ed831 Merge branch 'sched-core-for-linus' of
git://git.kernel.org/pub/scm/..
which was what my desktop bisect was doing too.
So here's my config that I'm using. This really is taking forever, I
don't end up having the patience to stay at the machine when the build
takes 5-10 minutes, so I'm walking away for each bisection making it
even slower. Does anybody see anything that might be wrong?
The laptop is a i7-8565U, so there's nothing special here. Certainly
not UMIP hw that nobody else has.
Linus
On Tue, Jul 09, 2019 at 10:15:21PM -0700, Linus Torvalds wrote:
> On Tue, Jul 9, 2019 at 8:21 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > Whee. It looks like it's bisecting to the same thing. Only partway
> > done, but it feels very much like my desktop.
>
> Confirmed.
>
> With that config, I get this
>
> c21ac93288f0 (refs/bisect/bad) Merge tag 'v5.2-rc6' into x86/asm, to
> refresh the branch
> 8dbec27a242c (HEAD) x86/asm: Pin sensitive CR0 bits
> 873d50d58f67 x86/asm: Pin sensitive CR4 bits
>
> ie those "pin sensitive bits" merge is bad, but before the commits is good.
>
> I think there is _another_ problem too, and maybe it's the ACPI one,
> but this one triggers some issue before the other issue even gets to
> play..
Okay, fun. Thanks for confirming it. Well, I guess I get to try to
find some more modern hardware to track this down. Does booting with
init=/bin/sh get far enough to see if all the CPUs are online or anything
like that? I'm baffled about the "gets mostly to userspace" part. I'd
expect this to explode very badly if it misbehaved. Or maybe something
gets confused between how many CPUs are expected and how many actually
show up to the party. Hmmm.
--
Kees Cook
On Tue, Jul 9, 2019 at 8:21 PM Linus Torvalds
<[email protected]> wrote:
>
> Whee. It looks like it's bisecting to the same thing. Only partway
> done, but it feels very much like my desktop.
Confirmed.
With that config, I get this
c21ac93288f0 (refs/bisect/bad) Merge tag 'v5.2-rc6' into x86/asm, to
refresh the branch
8dbec27a242c (HEAD) x86/asm: Pin sensitive CR0 bits
873d50d58f67 x86/asm: Pin sensitive CR4 bits
ie those "pin sensitive bits" merge is bad, but before the commits is good.
I think there is _another_ problem too, and maybe it's the ACPI one,
but this one triggers some issue before the other issue even gets to
play..
Linus
On Wednesday, July 10, 2019 1:00:32 AM CEST Thomas Gleixner wrote:
> On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> > On Tue, 9 Jul 2019, Linus Torvalds wrote:
> > > I also don't have any logs, because the boot never gets far enough. I
> > > assume that there was a problem bringing up a non-boot CPU, and the
> > > eventual hang ends up being due to that.
> >
> > Hrm. I just built the tip of your tree and booted it. It hangs at:
> >
> > [ 4.788678] ACPI: 4 ACPI AML tables successfully acquired and loaded
> > [ 4.793860] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> > [ 4.821476] ACPI: Dynamic OEM Table Load:
> >
> > That's way after the secondary CPUs have been brought up. tip/master boots
> > without problems with the same config. Let me try x86/asm alone and also a
> > revert of the CR4/CR0 stuff.
>
> x86/asm works alone
>
> revert of cr4/0 pinning on top of your tree gets stuck as well
>
> > And while writing this the softlockup detector muttered:
>
> Have not yet fully bisected it, but Jiri did and he ended up with
>
> c522ad0637ca ("ACPICA: Update table load object initialization")
>
> Reverting that on top of your tree makes it work again. Tony has the same
> issue.
I have a revert of this one queued up (with a plan to push it later today).
Cheers,
Rafael
On 2019-07-09 17:31 -0700, Kees Cook wrote:
> On Wed, Jul 10, 2019 at 01:17:11AM +0200, Thomas Gleixner wrote:
> > On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> > > That still does not explain the cr4/0 issue you have. Can you send me your
> > > .config please?
> >
> > Does your machine have UMIP support? None of my test boxes has. So that'd
> > be the difference of bits enforced in CR4. Should not matter because it's
> > User mode instruction prevention, but who knows.
>
> Ew. Yeah, I don't have an i9 or an i7 for testing this. I did try everything
> else I had (and hibernation). Is only Linus able to reproduce this so far?
I can, too.
> To rule out (in?) UMIP, this would remove UMIP from the pinning:
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 309b6b9b49d4..f3beedb6da8a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -380,7 +380,7 @@ static void __init setup_cr_pinning(void)
> {
> unsigned long mask;
>
> - mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
> + mask = (X86_CR4_SMEP | X86_CR4_SMAP);
> cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
> static_key_enable(&cr_pinning.key);
> }
I'll try it.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On Wed, 10 Jul 2019, Xi Ruoyao wrote:
> On 2019-07-10 19:27 +0800, Xi Ruoyao wrote:
> > On 2019-07-09 17:31 -0700, Kees Cook wrote:
> > > On Wed, Jul 10, 2019 at 01:17:11AM +0200, Thomas Gleixner wrote:
> > > > On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> > > > > That still does not explain the cr4/0 issue you have. Can you send me
> > > > > your
> > > > > .config please?
> > > >
> > > > Does your machine have UMIP support? None of my test boxes has. So that'd
> > > > be the difference of bits enforced in CR4. Should not matter because it's
> > > > User mode instruction prevention, but who knows.
> > >
> > > Ew. Yeah, I don't have an i9 or an i7 for testing this. I did try everything
> > > else I had (and hibernation). Is only Linus able to reproduce this so far?
> >
> > I can, too.
> >
> > > To rule out (in?) UMIP, this would remove UMIP from the pinning:
> > >
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index 309b6b9b49d4..f3beedb6da8a 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -380,7 +380,7 @@ static void __init setup_cr_pinning(void)
> > > {
> > > unsigned long mask;
> > >
> > > - mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
> > > + mask = (X86_CR4_SMEP | X86_CR4_SMAP);
> > > cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
> > > static_key_enable(&cr_pinning.key);
> > > }
> >
> > I'll try it.
>
> That doesn't work, sadly.
>
> My laptop is an old i3-3217u.
>
> My .config and syslog snip are attached.
From the log:
BUG: unable to handle page fault for address: ffffffff9edc1598
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 1a20c067 P4D 1a20c067 PUD 1a20d063 PMD 8000000019e000e1
Oops: 0003 [#1] SMP PTI
CPU: 2 PID: 151 Comm: systemd-udevd Not tainted 5.2.0+ #54
Hardware name: LENOVO 20175/INVALID, BIOS 66CN54WW 01/21/2013
RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
Call Trace:
jump_label_module_notify+0x1e7/0x2b0
notifier_call_chain+0x44/0x70
blocking_notifier_call_chain+0x43/0x60
load_module+0x1bcb/0x2490
? vfs_read+0x11f/0x150
? __do_sys_finit_module+0xbf/0xe0
__do_sys_finit_module+0xbf/0xe0
do_syscall_64+0x43/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Josh, didn't you mention that yesterday or so?
RIP: 0033:0x7f992e2eeaf9
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 67 73 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffca220d288 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 00000000009b8da0 RCX: 00007f992e2eeaf9
RDX: 0000000000000000 RSI: 00007f992e464885 RDI: 0000000000000010
RBP: 0000000000020000 R08: 0000000000000000 R09: 00000000009c45c0
R10: 0000000000000010 R11: 0000000000000246 R12: 00007f992e464885
R13: 0000000000000000 R14: 00000000009acc50 R15: 00000000009b8da0
Modules linked in: kvm_intel(+) kvm irqbypass hid_sensor_hub crc32_pclmul mfd_core i2c_i801 snd_hda_intel i915(+) intel_gtt snd_hda_codec i2c_algo_bit snd_hwdep snd_hda_core drm_kms_helper snd_pcm syscopyarea sysfillrect sysimgblt fb_sys_fops drm hid_multitouch ideapad_laptop sparse_keymap hid_generic wmi efivarfs
CR2: ffffffff9edc1598
---[ end trace dbeb7e66daa9bdca ]---
RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
On 2019-07-10 19:27 +0800, Xi Ruoyao wrote:
> On 2019-07-09 17:31 -0700, Kees Cook wrote:
> > On Wed, Jul 10, 2019 at 01:17:11AM +0200, Thomas Gleixner wrote:
> > > On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> > > > That still does not explain the cr4/0 issue you have. Can you send me
> > > > your
> > > > .config please?
> > >
> > > Does your machine have UMIP support? None of my test boxes has. So that'd
> > > be the difference of bits enforced in CR4. Should not matter because it's
> > > User mode instruction prevention, but who knows.
> >
> > Ew. Yeah, I don't have an i9 or an i7 for testing this. I did try everything
> > else I had (and hibernation). Is only Linus able to reproduce this so far?
>
> I can, too.
>
> > To rule out (in?) UMIP, this would remove UMIP from the pinning:
> >
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 309b6b9b49d4..f3beedb6da8a 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -380,7 +380,7 @@ static void __init setup_cr_pinning(void)
> > {
> > unsigned long mask;
> >
> > - mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
> > + mask = (X86_CR4_SMEP | X86_CR4_SMAP);
> > cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
> > static_key_enable(&cr_pinning.key);
> > }
>
> I'll try it.
That doesn't work, sadly.
My laptop is an old i3-3217u.
My .config and syslog snip are attached.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> From the log:
>
> BUG: unable to handle page fault for address: ffffffff9edc1598
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0003) - permissions violation
> PGD 1a20c067 P4D 1a20c067 PUD 1a20d063 PMD 8000000019e000e1
> Oops: 0003 [#1] SMP PTI
> CPU: 2 PID: 151 Comm: systemd-udevd Not tainted 5.2.0+ #54
> Hardware name: LENOVO 20175/INVALID, BIOS 66CN54WW 01/21/2013
> RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
> Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
> RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
> RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
> RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
> RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
> R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
> R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
> FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
> Call Trace:
> jump_label_module_notify+0x1e7/0x2b0
> notifier_call_chain+0x44/0x70
> blocking_notifier_call_chain+0x43/0x60
> load_module+0x1bcb/0x2490
> ? vfs_read+0x11f/0x150
> ? __do_sys_finit_module+0xbf/0xe0
> __do_sys_finit_module+0xbf/0xe0
> do_syscall_64+0x43/0x110
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Josh, didn't you mention that yesterday or so?
That's what Tony indicated on IRC yesterday his system is suffering from
as well.
Adding Daniel to check whether this couldn't be some fallout of jumplabel
batching.
>
>
> RIP: 0033:0x7f992e2eeaf9
> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 67 73 0d 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffca220d288 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> RAX: ffffffffffffffda RBX: 00000000009b8da0 RCX: 00007f992e2eeaf9
> RDX: 0000000000000000 RSI: 00007f992e464885 RDI: 0000000000000010
> RBP: 0000000000020000 R08: 0000000000000000 R09: 00000000009c45c0
> R10: 0000000000000010 R11: 0000000000000246 R12: 00007f992e464885
> R13: 0000000000000000 R14: 00000000009acc50 R15: 00000000009b8da0
> Modules linked in: kvm_intel(+) kvm irqbypass hid_sensor_hub crc32_pclmul mfd_core i2c_i801 snd_hda_intel i915(+) intel_gtt snd_hda_codec i2c_algo_bit snd_hwdep snd_hda_core drm_kms_helper snd_pcm syscopyarea sysfillrect sysimgblt fb_sys_fops drm hid_multitouch ideapad_laptop sparse_keymap hid_generic wmi efivarfs
> CR2: ffffffff9edc1598
> ---[ end trace dbeb7e66daa9bdca ]---
>
> RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
> Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
> RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
> RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
> RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
> RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
> R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
> R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
> FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
>
--
Jiri Kosina
SUSE Labs
On Wed, Jul 10, 2019 at 02:31:29PM +0200, Jiri Kosina wrote:
> On Wed, 10 Jul 2019, Thomas Gleixner wrote:
>
> > From the log:
> >
> > BUG: unable to handle page fault for address: ffffffff9edc1598
> > #PF: supervisor write access in kernel mode
> > #PF: error_code(0x0003) - permissions violation
> > PGD 1a20c067 P4D 1a20c067 PUD 1a20d063 PMD 8000000019e000e1
> > Oops: 0003 [#1] SMP PTI
> > CPU: 2 PID: 151 Comm: systemd-udevd Not tainted 5.2.0+ #54
> > Hardware name: LENOVO 20175/INVALID, BIOS 66CN54WW 01/21/2013
> > RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
> > Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
> > RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
> > RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
> > RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
> > RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
> > R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
> > R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
> > FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
> > Call Trace:
> > jump_label_module_notify+0x1e7/0x2b0
> > notifier_call_chain+0x44/0x70
> > blocking_notifier_call_chain+0x43/0x60
> > load_module+0x1bcb/0x2490
> > ? vfs_read+0x11f/0x150
> > ? __do_sys_finit_module+0xbf/0xe0
> > __do_sys_finit_module+0xbf/0xe0
> > do_syscall_64+0x43/0x110
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Josh, didn't you mention that yesterday or so?
>
> That's what Tony indicated on IRC yesterday his system is suffering from
> as well.
>
> Adding Daniel to check whether this couldn't be some fallout of jumplabel
> batching.
AFAICT this is _before_ we get to patching. The function that explodes
is static_key_set_mod(), as called from jump_label_add_module().
What that function does is for all patch sites in the module, find the
corresponding key; if that key is not also in that module, allocate a
static_key_mod structure and link the module entries to the key. Such
that we can find all instances from a given key.
I don't think anything here has changed in a while.
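For reference, a rough sketch of that path, simplified from
kernel/jump_label.c (locking, batching and the already-linked-key case
omitted; names approximate):

  #include <linux/jump_label.h>
  #include <linux/module.h>
  #include <linux/slab.h>

  struct static_key_mod {
          struct static_key_mod *next;
          struct jump_entry *entries;
          struct module *mod;
  };

  /* Link a module's jump entries to keys that live outside the module. */
  static int add_module_sketch(struct module *mod)
  {
          struct jump_entry *iter;

          for (iter = mod->jump_entries;
               iter < mod->jump_entries + mod->num_jump_entries; iter++) {
                  struct static_key *key = jump_entry_key(iter);
                  struct static_key_mod *jlm;

                  if (within_module((unsigned long)key, mod))
                          continue;   /* key is in this module, nothing to link */

                  jlm = kzalloc(sizeof(*jlm), GFP_KERNEL);
                  if (!jlm)
                          return -ENOMEM;
                  jlm->mod = mod;
                  jlm->entries = iter;
                  jlm->next = static_key_mod(key);
                  /* The store below goes into the static_key itself, which
                   * is exactly what faults if the key sits in an RO section: */
                  static_key_set_mod(key, jlm);
          }
          return 0;
  }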
On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> > > BUG: unable to handle page fault for address: ffffffff9edc1598
> > > #PF: supervisor write access in kernel mode
> > > #PF: error_code(0x0003) - permissions violation
> > > PGD 1a20c067 P4D 1a20c067 PUD 1a20d063 PMD 8000000019e000e1
> > > Oops: 0003 [#1] SMP PTI
> > > CPU: 2 PID: 151 Comm: systemd-udevd Not tainted 5.2.0+ #54
> > > Hardware name: LENOVO 20175/INVALID, BIOS 66CN54WW 01/21/2013
> > > RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
> > > Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
> > > RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
> > > RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
> > > RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
> > > RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
> > > R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
> > > R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
> > > FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
> > > Call Trace:
> > > jump_label_module_notify+0x1e7/0x2b0
> > > notifier_call_chain+0x44/0x70
> > > blocking_notifier_call_chain+0x43/0x60
> > > load_module+0x1bcb/0x2490
> > > ? vfs_read+0x11f/0x150
> > > ? __do_sys_finit_module+0xbf/0xe0
> > > __do_sys_finit_module+0xbf/0xe0
> > > do_syscall_64+0x43/0x110
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > Josh, didn't you mention that yesterday or so?
> >
> > That's what Tony indicated on IRC yesterday his system is suffering from
> > as well.
> >
> > Adding Daniel to check whether this couldn't be some fallout of jumplabel
> > batching.
>
> AFAICT this is _before_ we get to patching. The function that explodes
> is static_key_set_mod(), as called from jump_label_add_module().
>
> What that function does is for all patch sites in the module, find the
> corresponding key; if that key is not also in that module, allocate a
> static_key_mod structure and link the module entries to the key. Such
> that we can find all instances from a given key.
>
> I don't think anything here has changed in a while.
Hm, and it seems to explode on dereferencing the static_key* in %rsi
21: 48 8b 37 mov (%rdi),%rsi
24: 83 e6 03 and $0x3,%esi
27: 48 09 c6 or %rax,%rsi
2a:* 48 89 37 mov %rsi,(%rdi) <-- trapping instruction
which looks odd, as it dereferenced it successfully just 3 instructions ago.
--
Jiri Kosina
SUSE Labs
On Wed, Jul 10, 2019 at 09:25:16PM +0800, Xi Ruoyao wrote:
> On 2019-07-10 14:31 +0200, Jiri Kosina wrote:
> > Adding Daniel to check whether this couldn't be some fallout of jumplabel
> > batching.
>
> I don't think so. I tried to revert Daniel's jumplabel batching commits and the
> issue wasn't solved. But reverting Kees' CR0 and CR4 commits can "fix" it
(apparently).
Xi, could you please try the below instead?
If we mark the key as RO after init, and then try and modify the key to
link module usage sites, things might go bang as described.
Thanks!
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 27d7864e7252..5bf7a8354da2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}
-DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
+DEFINE_STATIC_KEY_FALSE(cr_pinning);
EXPORT_SYMBOL(cr_pinning);
unsigned long cr4_pinned_bits __ro_after_init;
EXPORT_SYMBOL(cr4_pinned_bits);
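For context, the only difference between the two macro variants is section
placement; roughly, from include/linux/jump_label.h:

  #define DEFINE_STATIC_KEY_FALSE(name)                                 \
          struct static_key_false name = STATIC_KEY_FALSE_INIT

  #define DEFINE_STATIC_KEY_FALSE_RO(name)                              \
          struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT

__ro_after_init places the key in a section that is write-protected once
init completes, which matches the "permissions violation" #PF above.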
On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> On Wed, Jul 10, 2019 at 09:25:16PM +0800, Xi Ruoyao wrote:
> > On 2019-07-10 14:31 +0200, Jiri Kosina wrote:
> > > Adding Daniel to check whether this couldn't be some fallout of jumplabel
> > > batching.
> >
> > I don't think so. I tried to revert Daniel's jumplabel batching commits and the
> > issue wasn't solved. But reverting Kees' CR0 and CR4 commits can "fix" it
> > (apparently).
>
> Xi, could you please try the below instead?
>
> If we mark the key as RO after init, and then try and modify the key to
> link module usage sites, things might go bang as described.
Right. I finally was able to reproduce that with Linus' config (slightly
modified). Applying your patch makes it go away.
Now what puzzles me is that this never exploded in my face before and with
a debian config it JustWorks.
Both configs have:
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
so I'd expect both to crash and burn when KVM_INTEL is loaded as that has a
cr4 operation inside.
So something papers over that ... Still looking.
Thanks,
tglx
On 2019-07-10 14:31 +0200, Jiri Kosina wrote:
> Adding Daniel to check whether this couldn't be some fallout of jumplabel
> batching.
I don't think so. I tried to revert Daniel's jumplabel batching commits and the
issue wasn't solved. But reverting Kees' CR0 and CR4 commits can "fix" it
(apparently).
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On 2019-07-10 15:28 +0200, Jiri Kosina wrote:
> On Wed, 10 Jul 2019, Jiri Kosina wrote:
> > > > > BUG: unable to handle page fault for address: ffffffff9edc1598
> > > > > #PF: supervisor write access in kernel mode
> > > > > #PF: error_code(0x0003) - permissions violation
> > Hm, and it seems to explode on dereferencing the static_key* in %rsi
>
> ^^^ %rdi of
> course
>
> > 21: 48 8b 37 mov (%rdi),%rsi
> > 24: 83 e6 03 and $0x3,%esi
> > 27: 48 09 c6 or %rax,%rsi
> > 2a:* 48 89 37 mov %rsi,(%rdi) <-- trapping instruction
> >
> > which looks odd, as it dereferenced it successfully just 3 instructions ago.
It seems the MMU (I guess?) allows reading it but disallows writing it:
"supervisor write access in kernel mode". The error_code(0x0003) above
decodes to a write access (bit 1) to a present page (bit 0), i.e. a pure
permissions violation, which is consistent with the key sitting in a
readable but write-protected mapping.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On Wed, 10 Jul 2019, Jiri Kosina wrote:
> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>
> > > > BUG: unable to handle page fault for address: ffffffff9edc1598
> > > > #PF: supervisor write access in kernel mode
> > > > #PF: error_code(0x0003) - permissions violation
> > > > PGD 1a20c067 P4D 1a20c067 PUD 1a20d063 PMD 8000000019e000e1
> > > > Oops: 0003 [#1] SMP PTI
> > > > CPU: 2 PID: 151 Comm: systemd-udevd Not tainted 5.2.0+ #54
> > > > Hardware name: LENOVO 20175/INVALID, BIOS 66CN54WW 01/21/2013
> > > > RIP: 0010:static_key_set_mod.isra.0+0x10/0x30
> > > > Code: 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f0 a8 03 75 0d 48 8b 37 83 e6 03 48 09 c6 <48> 89 37 c3 0f 0b 48 8b 37 83 e6 03 48 09 c6 48 89 37 c3 66 66 2e
> > > > RSP: 0000:ffffa606c032bc98 EFLAGS: 00010286
> > > > RAX: ffff9981ddce30a0 RBX: ffffffff9edc1590 RCX: 0000000000000000
> > > > RDX: 0000000000000020 RSI: ffff9981ddce30a0 RDI: ffffffff9edc1598
> > > > RBP: ffffffffc06f4000 R08: ffff9981e6003980 R09: ffff9981ddce30a0
> > > > R10: 0000000000000000 R11: 0000000000028b56 R12: ffffffffc06f8880
> > > > R13: ffff9981ddce3080 R14: ffffffffc06f4008 R15: ffffffffc06f6dc0
> > > > FS: 00007f992dd9a680(0000) GS:ffff9981e7080000(0000) knlGS:0000000000000000
> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: ffffffff9edc1598 CR3: 00000002233aa001 CR4: 00000000001606e0
> > > > Call Trace:
> > > > jump_label_module_notify+0x1e7/0x2b0
> > > > notifier_call_chain+0x44/0x70
> > > > blocking_notifier_call_chain+0x43/0x60
> > > > load_module+0x1bcb/0x2490
> > > > ? vfs_read+0x11f/0x150
> > > > ? __do_sys_finit_module+0xbf/0xe0
> > > > __do_sys_finit_module+0xbf/0xe0
> > > > do_syscall_64+0x43/0x110
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > Josh, didn't you mention that yesterday or so?
> > >
> > > That's what Tony indicated on IRC yesterday his system is suffering from
> > > as well.
> > >
> > > Adding Daniel to check whether this couldn't be some fallout of jumplabel
> > > batching.
> >
> > AFAICT this is _before_ we get to patching. The function that explodes
> > is static_key_set_mod(), as called from jump_label_add_module().
> >
> > What that function does is for all patch sites in the module, find the
> > corresponding key; if that key is not also in that module, allocate a
> > static_key_mod structure and link the module entries to the key. Such
> > that we can find all instances from a given key.
> >
> > I don't think anything here has changed in a while.
>
> Hm, and it seems to explode on dereferencing the static_key* in %rsi
^^^ %rdi of course
> 21: 48 8b 37 mov (%rdi),%rsi
> 24: 83 e6 03 and $0x3,%esi
> 27: 48 09 c6 or %rax,%rsi
> 2a:* 48 89 37 mov %rsi,(%rdi) <-- trapping instruction
>
> which looks odd, as it dereferenced it successfully just 3 instructions ago.
>
> --
> Jiri Kosina
> SUSE Labs
>
>
--
Jiri Kosina
SUSE Labs
On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> If we mark the key as RO after init, and then try and modify the key to
> link module usage sites, things might go bang as described.
>
> Thanks!
>
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 27d7864e7252..5bf7a8354da2 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> cr4_clear_bits(X86_CR4_UMIP);
> }
>
> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
Good catch, I guess that is going to fix it.
At the same time though, it sort of destroys the original intent of Kees'
patch, right? The exploits will just have to call static_key_disable()
prior to calling native_write_cr4() again, and the protection is gone.
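i.e. roughly (hypothetical sketch, not real exploit code):

  /* With the key writable again, a malicious module could simply do: */
  static_branch_disable(&cr_pinning);                   /* pinning off   */
  native_write_cr4(native_read_cr4() & ~X86_CR4_SMEP);  /* now it sticks */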
--
Jiri Kosina
SUSE Labs
On Wed, Jul 10, 2019 at 04:22:51PM +0200, Jiri Kosina wrote:
> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>
> > If we mark the key as RO after init, and then try and modify the key to
> > link module usage sites, things might go bang as described.
> >
> > Thanks!
> >
> >
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 27d7864e7252..5bf7a8354da2 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> > cr4_clear_bits(X86_CR4_UMIP);
> > }
> >
> > -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> > +DEFINE_STATIC_KEY_FALSE(cr_pinning);
>
> Good catch, I guess that is going to fix it.
>
> At the same time though, it sort of destroys the original intent of Kees'
> patch, right? The exploits will just have to call static_key_disable()
> prior to calling native_write_cr4() again, and the protection is gone.
This is fixable by moving native_write_cr*() out-of-line, such that they
never end up in a module.
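(Out of line, the only jump entry referencing cr_pinning lives in the
built-in kernel image; a module's call to native_write_cr4() becomes an
ordinary call instruction, so jump_label_add_module() has nothing to link
and never writes to the key.)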
On Wed, 10 Jul 2019, Thomas Gleixner wrote:
> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>
> > On Wed, Jul 10, 2019 at 09:25:16PM +0800, Xi Ruoyao wrote:
> > > On 2019-07-10 14:31 +0200, Jiri Kosina wrote:
> > > > Adding Daniel to check whether this couldn't be some fallout of jumplabel
> > > > batching.
> > >
> > > I don't think so. I tried to revert Daniel's jumplabel batching commits and the
> > > issue wasn't solved. But reverting Kees' CR0 and CR4 commits can "fix" it
> > > (apparently).
> >
> > Xi, could you please try the below instead?
> >
> > If we mark the key as RO after init, and then try and modify the key to
> > link module usage sites, things might go bang as described.
>
> Right. I finally was able to reproduce that with Linus' config (slightly
> modified). Applying your patch makes it go away.
>
> Now what puzzles me is that this never exploded in my face before and with
> a debian config it JustWorks.
>
> Both configs have:
>
> CONFIG_KVM=m
> CONFIG_KVM_INTEL=m
>
> so I'd expect both to crash and burn when KVM_INTEL is loaded as that has a
> cr4 operation inside.
>
> So something papers over that ... Still looking.
Just found it. All my usual configs have the paravirt muck enabled. Linus's
config does not, and neither does Xi's.
With paravirt the CR pinning key is never in the module. It's in the
paravirt function and obviously always built in.
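Roughly, with CONFIG_PARAVIRT=y (simplified from
arch/x86/include/asm/paravirt.h):

  static inline void __write_cr4(unsigned long x)
  {
          /* Indirect call through pv_ops into the built-in copy of
           * native_write_cr4(), so the jump entry for cr_pinning is
           * emitted in the kernel image, never in the module: */
          PVOP_VCALL1(cpu.write_cr4, x);
  }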
Thanks,
tglx
On 2019-07-10 16:22 +0200, Jiri Kosina wrote:
> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>
> > If we mark the key as RO after init, and then try and modify the key to
> > link module usage sites, things might go bang as described.
> >
> > Thanks!
> >
> >
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 27d7864e7252..5bf7a8354da2 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> > cr4_clear_bits(X86_CR4_UMIP);
> > }
> >
> > -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> > +DEFINE_STATIC_KEY_FALSE(cr_pinning);
>
> Good catch, I guess that is going to fix it.
Yes it works.
> At the same time though, it sort of destroys the original intent of Kees'
> patch, right? The exploits will just have to call static_key_disable()
> prior to calling native_write_cr4() again, and the protection is gone.
I think I should do some study and try to understand the full story of Kees'
change...
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On 2019-07-10 17:13 +0200, Thomas Gleixner wrote:
> Something like the below. Builds and boots, must be perfect.
>
> Thanks,
>
> tglx
Tested-by: Xi Ruoyao <[email protected]>
> 8<----------------
>
> arch/x86/include/asm/processor.h | 1
> arch/x86/include/asm/special_insns.h | 41 -------------------
> arch/x86/kernel/cpu/common.c | 72 +++++++++++++++++++++++++++--------
> arch/x86/kernel/smpboot.c | 14 ------
> arch/x86/xen/smp_pv.c | 1
> 5 files changed, 61 insertions(+), 68 deletions(-)
>
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -741,6 +741,7 @@ extern void load_direct_gdt(int);
> extern void load_fixmap_gdt(int);
> extern void load_percpu_segment(int);
> extern void cpu_init(void);
> +extern void cr4_init(void);
>
> static inline unsigned long get_debugctlmsr(void)
> {
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -18,9 +18,7 @@
> */
> extern unsigned long __force_order;
>
> -/* Starts false and gets enabled once CPU feature detection is done. */
> -DECLARE_STATIC_KEY_FALSE(cr_pinning);
> -extern unsigned long cr4_pinned_bits;
> +void native_write_cr0(unsigned long val);
>
> static inline unsigned long native_read_cr0(void)
> {
> @@ -29,24 +27,6 @@ static inline unsigned long native_read_
> return val;
> }
>
> -static inline void native_write_cr0(unsigned long val)
> -{
> - unsigned long bits_missing = 0;
> -
> -set_register:
> - asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
> -
> - if (static_branch_likely(&cr_pinning)) {
> - if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
> - bits_missing = X86_CR0_WP;
> - val |= bits_missing;
> - goto set_register;
> - }
> - /* Warn after we've set the missing bits. */
> - WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
> - }
> -}
> -
> static inline unsigned long native_read_cr2(void)
> {
> unsigned long val;
> @@ -91,24 +71,7 @@ static inline unsigned long native_read_
> return val;
> }
>
> -static inline void native_write_cr4(unsigned long val)
> -{
> - unsigned long bits_missing = 0;
> -
> -set_register:
> - asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
> -
> - if (static_branch_likely(&cr_pinning)) {
> - if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
> - bits_missing = ~val & cr4_pinned_bits;
> - val |= bits_missing;
> - goto set_register;
> - }
> - /* Warn after we've set the missing bits. */
> - WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
> - bits_missing);
> - }
> -}
> +void native_write_cr4(unsigned long val);
>
> #ifdef CONFIG_X86_64
> static inline unsigned long native_read_cr8(void)
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -366,10 +366,62 @@ static __always_inline void setup_umip(s
> cr4_clear_bits(X86_CR4_UMIP);
> }
>
> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> -EXPORT_SYMBOL(cr_pinning);
> -unsigned long cr4_pinned_bits __ro_after_init;
> -EXPORT_SYMBOL(cr4_pinned_bits);
> +static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> +static unsigned long cr4_pinned_bits __ro_after_init;
> +
> +void native_write_cr0(unsigned long val)
> +{
> + unsigned long bits_missing = 0;
> +
> +set_register:
> + asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
> +
> + if (static_branch_likely(&cr_pinning)) {
> + if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
> + bits_missing = X86_CR0_WP;
> + val |= bits_missing;
> + goto set_register;
> + }
> + /* Warn after we've set the missing bits. */
> + WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
> + }
> +}
> +EXPORT_SYMBOL(native_write_cr0);
> +
> +void native_write_cr4(unsigned long val)
> +{
> + unsigned long bits_missing = 0;
> +
> +set_register:
> + asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
> +
> + if (static_branch_likely(&cr_pinning)) {
> + if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
> + bits_missing = ~val & cr4_pinned_bits;
> + val |= bits_missing;
> + goto set_register;
> + }
> + /* Warn after we've set the missing bits. */
> + WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
> + bits_missing);
> + }
> +}
> +EXPORT_SYMBOL(native_write_cr4);
> +
> +void cr4_init(void)
> +{
> + unsigned long cr4 = __read_cr4();
> +
> + if (boot_cpu_has(X86_FEATURE_PCID))
> + cr4 |= X86_CR4_PCIDE;
> + if (static_branch_likely(&cr_pinning))
> + cr4 |= cr4_pinned_bits;
> +
> + __write_cr4(cr4);
> +
> + /* Initialize cr4 shadow for this CPU. */
> + this_cpu_write(cpu_tlbstate.cr4, cr4);
> +}
>
> /*
> * Once CPU feature detection is finished (and boot params have been
> @@ -1723,12 +1775,6 @@ void cpu_init(void)
>
> wait_for_master_cpu(cpu);
>
> - /*
> - * Initialize the CR4 shadow before doing anything that could
> - * try to read it.
> - */
> - cr4_init_shadow();
> -
> if (cpu)
> load_ucode_ap();
>
> @@ -1823,12 +1869,6 @@ void cpu_init(void)
>
> wait_for_master_cpu(cpu);
>
> - /*
> - * Initialize the CR4 shadow before doing anything that could
> - * try to read it.
> - */
> - cr4_init_shadow();
> -
> show_ucode_info_early();
>
> pr_info("Initializing CPU#%d\n", cpu);
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -210,28 +210,16 @@ static int enable_start_cpu0;
> */
> static void notrace start_secondary(void *unused)
> {
> - unsigned long cr4 = __read_cr4();
> -
> /*
> * Don't put *anything* except direct CPU state initialization
> * before cpu_init(), SMP booting is too fragile that we want to
> * limit the things done here to the most necessary things.
> */
> - if (boot_cpu_has(X86_FEATURE_PCID))
> - cr4 |= X86_CR4_PCIDE;
> - if (static_branch_likely(&cr_pinning))
> - cr4 |= cr4_pinned_bits;
> -
> - __write_cr4(cr4);
> + cr4_init();
>
> #ifdef CONFIG_X86_32
> /* switch away from the initial page table */
> load_cr3(swapper_pg_dir);
> - /*
> - * Initialize the CR4 shadow before doing anything that could
> - * try to read it.
> - */
> - cr4_init_shadow();
> __flush_tlb_all();
> #endif
> load_current_idt();
> --- a/arch/x86/xen/smp_pv.c
> +++ b/arch/x86/xen/smp_pv.c
> @@ -58,6 +58,7 @@ static void cpu_bringup(void)
> {
> int cpu;
>
> + cr4_init();
> cpu_init();
> touch_softlockup_watchdog();
> preempt_disable();
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> On Wed, Jul 10, 2019 at 04:22:51PM +0200, Jiri Kosina wrote:
> > On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> >
> > > If we mark the key as RO after init, and then try and modify the key to
> > > link module usage sites, things might go bang as described.
> > >
> > > Thanks!
> > >
> > >
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index 27d7864e7252..5bf7a8354da2 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> > > cr4_clear_bits(X86_CR4_UMIP);
> > > }
> > >
> > > -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> > > +DEFINE_STATIC_KEY_FALSE(cr_pinning);
> >
> > Good catch, I guess that is going to fix it.
> >
> > At the same time though, it sort of destroys the original intent of Kees'
> > patch, right? The exploits will just have to call static_key_disable()
> > prior to calling native_write_cr4() again, and the protection is gone.
>
> This is fixable by moving native_write_cr*() out-of-line, such that they
> never end up in a module.
Something like the below. Builds and boots, must be perfect.
Thanks,
tglx
8<----------------
arch/x86/include/asm/processor.h | 1
arch/x86/include/asm/special_insns.h | 41 -------------------
arch/x86/kernel/cpu/common.c | 72 +++++++++++++++++++++++++++--------
arch/x86/kernel/smpboot.c | 14 ------
arch/x86/xen/smp_pv.c | 1
5 files changed, 61 insertions(+), 68 deletions(-)
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -741,6 +741,7 @@ extern void load_direct_gdt(int);
extern void load_fixmap_gdt(int);
extern void load_percpu_segment(int);
extern void cpu_init(void);
+extern void cr4_init(void);
static inline unsigned long get_debugctlmsr(void)
{
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -18,9 +18,7 @@
*/
extern unsigned long __force_order;
-/* Starts false and gets enabled once CPU feature detection is done. */
-DECLARE_STATIC_KEY_FALSE(cr_pinning);
-extern unsigned long cr4_pinned_bits;
+void native_write_cr0(unsigned long val);
static inline unsigned long native_read_cr0(void)
{
@@ -29,24 +27,6 @@ static inline unsigned long native_read_
return val;
}
-static inline void native_write_cr0(unsigned long val)
-{
- unsigned long bits_missing = 0;
-
-set_register:
- asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
-
- if (static_branch_likely(&cr_pinning)) {
- if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
- bits_missing = X86_CR0_WP;
- val |= bits_missing;
- goto set_register;
- }
- /* Warn after we've set the missing bits. */
- WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
- }
-}
-
static inline unsigned long native_read_cr2(void)
{
unsigned long val;
@@ -91,24 +71,7 @@ static inline unsigned long native_read_
return val;
}
-static inline void native_write_cr4(unsigned long val)
-{
- unsigned long bits_missing = 0;
-
-set_register:
- asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
-
- if (static_branch_likely(&cr_pinning)) {
- if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
- bits_missing = ~val & cr4_pinned_bits;
- val |= bits_missing;
- goto set_register;
- }
- /* Warn after we've set the missing bits. */
- WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
- bits_missing);
- }
-}
+void native_write_cr4(unsigned long val);
#ifdef CONFIG_X86_64
static inline unsigned long native_read_cr8(void)
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -366,10 +366,62 @@ static __always_inline void setup_umip(s
cr4_clear_bits(X86_CR4_UMIP);
}
-DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
-EXPORT_SYMBOL(cr_pinning);
-unsigned long cr4_pinned_bits __ro_after_init;
-EXPORT_SYMBOL(cr4_pinned_bits);
+static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
+static unsigned long cr4_pinned_bits __ro_after_init;
+
+void native_write_cr0(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
+ bits_missing = X86_CR0_WP;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
+ }
+}
+EXPORT_SYMBOL(native_write_cr0);
+
+void native_write_cr4(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
+ bits_missing = ~val & cr4_pinned_bits;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
+ bits_missing);
+ }
+}
+EXPORT_SYMBOL(native_write_cr4);
+
+void cr4_init(void)
+{
+ unsigned long cr4 = __read_cr4();
+
+ if (boot_cpu_has(X86_FEATURE_PCID))
+ cr4 |= X86_CR4_PCIDE;
+ if (static_branch_likely(&cr_pinning))
+ cr4 |= cr4_pinned_bits;
+
+ __write_cr4(cr4);
+
+ /* Initialize cr4 shadow for this CPU. */
+ this_cpu_write(cpu_tlbstate.cr4, cr4);
+}
/*
* Once CPU feature detection is finished (and boot params have been
@@ -1723,12 +1775,6 @@ void cpu_init(void)
wait_for_master_cpu(cpu);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
-
if (cpu)
load_ucode_ap();
@@ -1823,12 +1869,6 @@ void cpu_init(void)
wait_for_master_cpu(cpu);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
-
show_ucode_info_early();
pr_info("Initializing CPU#%d\n", cpu);
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -210,28 +210,16 @@ static int enable_start_cpu0;
*/
static void notrace start_secondary(void *unused)
{
- unsigned long cr4 = __read_cr4();
-
/*
* Don't put *anything* except direct CPU state initialization
* before cpu_init(), SMP booting is too fragile that we want to
* limit the things done here to the most necessary things.
*/
- if (boot_cpu_has(X86_FEATURE_PCID))
- cr4 |= X86_CR4_PCIDE;
- if (static_branch_likely(&cr_pinning))
- cr4 |= cr4_pinned_bits;
-
- __write_cr4(cr4);
+ cr4_init();
#ifdef CONFIG_X86_32
/* switch away from the initial page table */
load_cr3(swapper_pg_dir);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
__flush_tlb_all();
#endif
load_current_idt();
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -58,6 +58,7 @@ static void cpu_bringup(void)
{
int cpu;
+ cr4_init();
cpu_init();
touch_softlockup_watchdog();
preempt_disable();
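A note on the cr4 shadow that the patch consolidates: it is a per-CPU
cached copy of CR4 kept in cpu_tlbstate, so the cr4_set_bits() and
cr4_clear_bits() helpers can read the current value without touching the
register. A minimal sketch of the idea, simplified from
arch/x86/include/asm/tlbflush.h (the real helpers also disable interrupts
around the read-modify-write):

/* Per-CPU shadow of CR4; readers never access the register itself. */
static inline unsigned long cr4_read_shadow(void)
{
	return this_cpu_read(cpu_tlbstate.cr4);
}

static inline void cr4_set_bits(unsigned long mask)
{
	unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

	if ((cr4 | mask) != cr4) {
		cr4 |= mask;
		this_cpu_write(cpu_tlbstate.cr4, cr4);
		__write_cr4(cr4);
	}
}

That is why the shadow had to be initialized very early. After this patch
cr4_init() writes the register and the shadow in one place, which is what
makes the separate cr4_init_shadow() calls removed above redundant.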
On Tue, Jul 9, 2019 at 10:15 PM Linus Torvalds
<[email protected]> wrote:
>
> I think there is _another_ problem too, and maybe it's the ACPI one,
> but this one triggers some issue before the other issue even gets to
> play..
No, the other problem was the keyring ACL support from David Howells,
which seems to have broken encrypted disk setups.
So apparently I wasn't impacted by the ACPI thing, but had two other
issues instead. Not a great start to the merge window.
Linus
The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
The reason is that the static key which controls the pinning is marked RO
after init. The kvm_intel module contains a CR4 write which requires to
update the static key entry list. That obviously does not work when the key
is in a RO section.
With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
uses the paravirt indirection and the actual write function is built in.
As the key is intended to be immutable after init, move
native_write_cr0/3() out of line.
While at it consolidate the update of the cr4 shadow variable and store the
value right away when the pinning is initialized on a booting CPU. No point
in reading it back 20 instructions later. This allows to confine the static
key and the pinning variable to cpu/common and allows to mark them static.
Fixes: 8dbec27a242c ("x86/asm: Pin sensitive CR0 bits")
Fixes: 873d50d58f67 ("x86/asm: Pin sensitive CR4 bits")
Reported-by: Linus Torvalds <[email protected]>
Reported-by: Xi Ruoyao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Xi Ruoyao <[email protected]>
---
arch/x86/include/asm/processor.h | 1
arch/x86/include/asm/special_insns.h | 41 -------------------
arch/x86/kernel/cpu/common.c | 72 +++++++++++++++++++++++++++--------
arch/x86/kernel/smpboot.c | 14 ------
arch/x86/xen/smp_pv.c | 1
5 files changed, 61 insertions(+), 68 deletions(-)
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -741,6 +741,7 @@ extern void load_direct_gdt(int);
extern void load_fixmap_gdt(int);
extern void load_percpu_segment(int);
extern void cpu_init(void);
+extern void cr4_init(void);
static inline unsigned long get_debugctlmsr(void)
{
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -18,9 +18,7 @@
*/
extern unsigned long __force_order;
-/* Starts false and gets enabled once CPU feature detection is done. */
-DECLARE_STATIC_KEY_FALSE(cr_pinning);
-extern unsigned long cr4_pinned_bits;
+void native_write_cr0(unsigned long val);
static inline unsigned long native_read_cr0(void)
{
@@ -29,24 +27,6 @@ static inline unsigned long native_read_
return val;
}
-static inline void native_write_cr0(unsigned long val)
-{
- unsigned long bits_missing = 0;
-
-set_register:
- asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
-
- if (static_branch_likely(&cr_pinning)) {
- if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
- bits_missing = X86_CR0_WP;
- val |= bits_missing;
- goto set_register;
- }
- /* Warn after we've set the missing bits. */
- WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
- }
-}
-
static inline unsigned long native_read_cr2(void)
{
unsigned long val;
@@ -91,24 +71,7 @@ static inline unsigned long native_read_
return val;
}
-static inline void native_write_cr4(unsigned long val)
-{
- unsigned long bits_missing = 0;
-
-set_register:
- asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
-
- if (static_branch_likely(&cr_pinning)) {
- if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
- bits_missing = ~val & cr4_pinned_bits;
- val |= bits_missing;
- goto set_register;
- }
- /* Warn after we've set the missing bits. */
- WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
- bits_missing);
- }
-}
+void native_write_cr4(unsigned long val);
#ifdef CONFIG_X86_64
static inline unsigned long native_read_cr8(void)
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -366,10 +366,62 @@ static __always_inline void setup_umip(s
cr4_clear_bits(X86_CR4_UMIP);
}
-DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
-EXPORT_SYMBOL(cr_pinning);
-unsigned long cr4_pinned_bits __ro_after_init;
-EXPORT_SYMBOL(cr4_pinned_bits);
+static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
+static unsigned long cr4_pinned_bits __ro_after_init;
+
+void native_write_cr0(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
+ bits_missing = X86_CR0_WP;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
+ }
+}
+EXPORT_SYMBOL(native_write_cr0);
+
+void native_write_cr4(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
+ bits_missing = ~val & cr4_pinned_bits;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
+ bits_missing);
+ }
+}
+EXPORT_SYMBOL(native_write_cr4);
+
+void cr4_init(void)
+{
+ unsigned long cr4 = __read_cr4();
+
+ if (boot_cpu_has(X86_FEATURE_PCID))
+ cr4 |= X86_CR4_PCIDE;
+ if (static_branch_likely(&cr_pinning))
+ cr4 |= cr4_pinned_bits;
+
+ __write_cr4(cr4);
+
+ /* Initialize cr4 shadow for this CPU. */
+ this_cpu_write(cpu_tlbstate.cr4, cr4);
+}
/*
* Once CPU feature detection is finished (and boot params have been
@@ -1723,12 +1775,6 @@ void cpu_init(void)
wait_for_master_cpu(cpu);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
-
if (cpu)
load_ucode_ap();
@@ -1823,12 +1869,6 @@ void cpu_init(void)
wait_for_master_cpu(cpu);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
-
show_ucode_info_early();
pr_info("Initializing CPU#%d\n", cpu);
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -210,28 +210,16 @@ static int enable_start_cpu0;
*/
static void notrace start_secondary(void *unused)
{
- unsigned long cr4 = __read_cr4();
-
/*
* Don't put *anything* except direct CPU state initialization
* before cpu_init(), SMP booting is too fragile that we want to
* limit the things done here to the most necessary things.
*/
- if (boot_cpu_has(X86_FEATURE_PCID))
- cr4 |= X86_CR4_PCIDE;
- if (static_branch_likely(&cr_pinning))
- cr4 |= cr4_pinned_bits;
-
- __write_cr4(cr4);
+ cr4_init();
#ifdef CONFIG_X86_32
/* switch away from the initial page table */
load_cr3(swapper_pg_dir);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
__flush_tlb_all();
#endif
load_current_idt();
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -58,6 +58,7 @@ static void cpu_bringup(void)
{
int cpu;
+ cr4_init();
cpu_init();
touch_softlockup_watchdog();
preempt_disable();
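The CONFIG_PARAVIRT remark in the changelog deserves a gloss. With
paravirt enabled, __write_cr4() is an indirect call through pv_ops into
code built into the kernel image, so a module never references cr_pinning
itself; without paravirt, the old inline native_write_cr4() was compiled
into kvm_intel.ko along with its reference to the static key. Roughly the
two shapes (simplified; the real paravirt path goes through the
PVOP_VCALL macros):

#ifdef CONFIG_PARAVIRT
/* Indirect call into built-in code; the key reference stays in-kernel. */
static inline void __write_cr4(unsigned long x)
{
	pv_ops.cpu.write_cr4(x);
}
#else
/* Before the fix: inlined into every caller, including modules. */
static inline void __write_cr4(unsigned long x)
{
	native_write_cr4(x);
}
#endif

Moving native_write_cr0/4() out of line gives the !PARAVIRT case the same
property: modules call an exported function, and the key can stay
confined to cpu/common.c.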
On Wed, 10 Jul 2019, Kees Cook wrote:
> On Wed, Jul 10, 2019 at 09:42:46PM +0200, Thomas Gleixner wrote:
> > The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
> > the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
> >
> > The reason is that the static key which controls the pinning is marked RO
> > after init. The kvm_intel module contains a CR4 write which requires to
> > update the static key entry list. That obviously does not work when the key
> > is in a RO section.
> >
> > With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
> > uses the paravirt indirection and the actual write function is built in.
> >
> > As the key is intended to be immutable after init, move
> > native_write_cr0/3() out of line.
> >
> > While at it consolidate the update of the cr4 shadow variable and store the
> > value right away when the pinning is initialized on a booting CPU. No point
> > in reading it back 20 instructions later. This allows to confine the static
> > key and the pinning variable to cpu/common and allows to mark them static.
> >
> > Fixes: 8dbec27a242c ("x86/asm: Pin sensitive CR0 bits")
> > Fixes: 873d50d58f67 ("x86/asm: Pin sensitive CR4 bits")
> > Reported-by: Linus Torvalds <[email protected]>
> > Reported-by: Xi Ruoyao <[email protected]>
> > Signed-off-by: Thomas Gleixner <[email protected]>
> > Tested-by: Xi Ruoyao <[email protected]>
>
> Thank you for tracking this down and solving it!
>
> Nit: should be "cr0/4()" in Subject and in paragraph 4.
Yeah. My brain is not working today.
On Wed, Jul 10, 2019 at 09:42:46PM +0200, Thomas Gleixner wrote:
> The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
> the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
>
> The reason is that the static key which controls the pinning is marked RO
> after init. The kvm_intel module contains a CR4 write which requires to
> update the static key entry list. That obviously does not work when the key
> is in a RO section.
>
> With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
> uses the paravirt indirection and the actual write function is built in.
>
> As the key is intended to be immutable after init, move
> native_write_cr0/3() out of line.
>
> While at it consolidate the update of the cr4 shadow variable and store the
> value right away when the pinning is initialized on a booting CPU. No point
> in reading it back 20 instructions later. This allows to confine the static
> key and the pinning variable to cpu/common and allows to mark them static.
>
> Fixes: 8dbec27a242c ("x86/asm: Pin sensitive CR0 bits")
> Fixes: 873d50d58f67 ("x86/asm: Pin sensitive CR4 bits")
> Reported-by: Linus Torvalds <[email protected]>
> Reported-by: Xi Ruoyao <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Tested-by: Xi Ruoyao <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On Wed, Jul 10, 2019 at 09:42:46PM +0200, Thomas Gleixner wrote:
> The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
> the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
>
> The reason is that the static key which controls the pinning is marked RO
> after init. The kvm_intel module contains a CR4 write which requires to
> update the static key entry list. That obviously does not work when the key
> is in a RO section.
>
> With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
> uses the paravirt indirection and the actual write function is built in.
>
> As the key is intended to be immutable after init, move
> native_write_cr0/3() out of line.
>
> While at it consolidate the update of the cr4 shadow variable and store the
> value right away when the pinning is initialized on a booting CPU. No point
> in reading it back 20 instructions later. This allows to confine the static
> key and the pinning variable to cpu/common and allows to mark them static.
>
> Fixes: 8dbec27a242c ("x86/asm: Pin sensitive CR0 bits")
> Fixes: 873d50d58f67 ("x86/asm: Pin sensitive CR4 bits")
> Reported-by: Linus Torvalds <[email protected]>
> Reported-by: Xi Ruoyao <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Tested-by: Xi Ruoyao <[email protected]>
Thank you for tracking this down and solving it!
Nit: should be "cr0/4()" in Subject and in paragraph 4.
Acked-by: Kees Cook <[email protected]>
-Kees
> ---
> arch/x86/include/asm/processor.h | 1
> arch/x86/include/asm/special_insns.h | 41 -------------------
> arch/x86/kernel/cpu/common.c | 72 +++++++++++++++++++++++++++--------
> arch/x86/kernel/smpboot.c | 14 ------
> arch/x86/xen/smp_pv.c | 1
> 5 files changed, 61 insertions(+), 68 deletions(-)
>
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -741,6 +741,7 @@ extern void load_direct_gdt(int);
> extern void load_fixmap_gdt(int);
> extern void load_percpu_segment(int);
> extern void cpu_init(void);
> +extern void cr4_init(void);
>
> static inline unsigned long get_debugctlmsr(void)
> {
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -18,9 +18,7 @@
> */
> extern unsigned long __force_order;
>
> -/* Starts false and gets enabled once CPU feature detection is done. */
> -DECLARE_STATIC_KEY_FALSE(cr_pinning);
> -extern unsigned long cr4_pinned_bits;
> +void native_write_cr0(unsigned long val);
>
> static inline unsigned long native_read_cr0(void)
> {
> @@ -29,24 +27,6 @@ static inline unsigned long native_read_
> return val;
> }
>
> -static inline void native_write_cr0(unsigned long val)
> -{
> - unsigned long bits_missing = 0;
> -
> -set_register:
> - asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
> -
> - if (static_branch_likely(&cr_pinning)) {
> - if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
> - bits_missing = X86_CR0_WP;
> - val |= bits_missing;
> - goto set_register;
> - }
> - /* Warn after we've set the missing bits. */
> - WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
> - }
> -}
> -
> static inline unsigned long native_read_cr2(void)
> {
> unsigned long val;
> @@ -91,24 +71,7 @@ static inline unsigned long native_read_
> return val;
> }
>
> -static inline void native_write_cr4(unsigned long val)
> -{
> - unsigned long bits_missing = 0;
> -
> -set_register:
> - asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
> -
> - if (static_branch_likely(&cr_pinning)) {
> - if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
> - bits_missing = ~val & cr4_pinned_bits;
> - val |= bits_missing;
> - goto set_register;
> - }
> - /* Warn after we've set the missing bits. */
> - WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
> - bits_missing);
> - }
> -}
> +void native_write_cr4(unsigned long val);
>
> #ifdef CONFIG_X86_64
> static inline unsigned long native_read_cr8(void)
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -366,10 +366,62 @@ static __always_inline void setup_umip(s
> cr4_clear_bits(X86_CR4_UMIP);
> }
>
> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> -EXPORT_SYMBOL(cr_pinning);
> -unsigned long cr4_pinned_bits __ro_after_init;
> -EXPORT_SYMBOL(cr4_pinned_bits);
> +static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> +static unsigned long cr4_pinned_bits __ro_after_init;
> +
> +void native_write_cr0(unsigned long val)
> +{
> + unsigned long bits_missing = 0;
> +
> +set_register:
> + asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
> +
> + if (static_branch_likely(&cr_pinning)) {
> + if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
> + bits_missing = X86_CR0_WP;
> + val |= bits_missing;
> + goto set_register;
> + }
> + /* Warn after we've set the missing bits. */
> + WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
> + }
> +}
> +EXPORT_SYMBOL(native_write_cr0);
> +
> +void native_write_cr4(unsigned long val)
> +{
> + unsigned long bits_missing = 0;
> +
> +set_register:
> + asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
> +
> + if (static_branch_likely(&cr_pinning)) {
> + if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
> + bits_missing = ~val & cr4_pinned_bits;
> + val |= bits_missing;
> + goto set_register;
> + }
> + /* Warn after we've set the missing bits. */
> + WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
> + bits_missing);
> + }
> +}
> +EXPORT_SYMBOL(native_write_cr4);
> +
> +void cr4_init(void)
> +{
> + unsigned long cr4 = __read_cr4();
> +
> + if (boot_cpu_has(X86_FEATURE_PCID))
> + cr4 |= X86_CR4_PCIDE;
> + if (static_branch_likely(&cr_pinning))
> + cr4 |= cr4_pinned_bits;
> +
> + __write_cr4(cr4);
> +
> + /* Initialize cr4 shadow for this CPU. */
> + this_cpu_write(cpu_tlbstate.cr4, cr4);
> +}
>
> /*
> * Once CPU feature detection is finished (and boot params have been
> @@ -1723,12 +1775,6 @@ void cpu_init(void)
>
> wait_for_master_cpu(cpu);
>
> - /*
> - * Initialize the CR4 shadow before doing anything that could
> - * try to read it.
> - */
> - cr4_init_shadow();
> -
> if (cpu)
> load_ucode_ap();
>
> @@ -1823,12 +1869,6 @@ void cpu_init(void)
>
> wait_for_master_cpu(cpu);
>
> - /*
> - * Initialize the CR4 shadow before doing anything that could
> - * try to read it.
> - */
> - cr4_init_shadow();
> -
> show_ucode_info_early();
>
> pr_info("Initializing CPU#%d\n", cpu);
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -210,28 +210,16 @@ static int enable_start_cpu0;
> */
> static void notrace start_secondary(void *unused)
> {
> - unsigned long cr4 = __read_cr4();
> -
> /*
> * Don't put *anything* except direct CPU state initialization
> * before cpu_init(), SMP booting is too fragile that we want to
> * limit the things done here to the most necessary things.
> */
> - if (boot_cpu_has(X86_FEATURE_PCID))
> - cr4 |= X86_CR4_PCIDE;
> - if (static_branch_likely(&cr_pinning))
> - cr4 |= cr4_pinned_bits;
> -
> - __write_cr4(cr4);
> + cr4_init();
>
> #ifdef CONFIG_X86_32
> /* switch away from the initial page table */
> load_cr3(swapper_pg_dir);
> - /*
> - * Initialize the CR4 shadow before doing anything that could
> - * try to read it.
> - */
> - cr4_init_shadow();
> __flush_tlb_all();
> #endif
> load_current_idt();
> --- a/arch/x86/xen/smp_pv.c
> +++ b/arch/x86/xen/smp_pv.c
> @@ -58,6 +58,7 @@ static void cpu_bringup(void)
> {
> int cpu;
>
> + cr4_init();
> cpu_init();
> touch_softlockup_watchdog();
> preempt_disable();
--
Kees Cook
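For completeness: the pinning that cr4_init() consults is armed once on
the boot CPU after feature detection. A sketch along the lines of the
original CR4-pinning commit (873d50d58f67), with the bit mask as used
there:

static void __init setup_cr_pinning(void)
{
	unsigned long mask;

	/* Pin the security-relevant bits as they are set at boot. */
	mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP;
	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
	static_key_enable(&cr_pinning.key);
}

Secondary CPUs then pick the pinned bits up in cr4_init() before anything
security-sensitive runs on them.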
Commit-ID: 7652ac92018536eb807b6c2130100c85f1ba7e3b
Gitweb: https://git.kernel.org/tip/7652ac92018536eb807b6c2130100c85f1ba7e3b
Author: Thomas Gleixner <[email protected]>
AuthorDate: Wed, 10 Jul 2019 21:42:46 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 10 Jul 2019 22:15:05 +0200
x86/asm: Move native_write_cr0/4() out of line
The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
The reason is that the static key which controls the pinning is marked RO
after init. The kvm_intel module contains a CR4 write which requires to
update the static key entry list. That obviously does not work when the key
is in a RO section.
With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
uses the paravirt indirection and the actual write function is built in.
As the key is intended to be immutable after init, move
native_write_cr0/4() out of line.
While at it consolidate the update of the cr4 shadow variable and store the
value right away when the pinning is initialized on a booting CPU. No point
in reading it back 20 instructions later. This allows to confine the static
key and the pinning variable to cpu/common and allows to mark them static.
Fixes: 8dbec27a242c ("x86/asm: Pin sensitive CR0 bits")
Fixes: 873d50d58f67 ("x86/asm: Pin sensitive CR4 bits")
Reported-by: Linus Torvalds <[email protected]>
Reported-by: Xi Ruoyao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Xi Ruoyao <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/processor.h | 1 +
arch/x86/include/asm/special_insns.h | 41 +-------------------
arch/x86/kernel/cpu/common.c | 72 ++++++++++++++++++++++++++++--------
arch/x86/kernel/smpboot.c | 14 +------
arch/x86/xen/smp_pv.c | 1 +
5 files changed, 61 insertions(+), 68 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3eab6ece52b4..6e0a3b43d027 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -741,6 +741,7 @@ extern void load_direct_gdt(int);
extern void load_fixmap_gdt(int);
extern void load_percpu_segment(int);
extern void cpu_init(void);
+extern void cr4_init(void);
static inline unsigned long get_debugctlmsr(void)
{
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index b2e84d113f2a..219be88a59d2 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -18,9 +18,7 @@
*/
extern unsigned long __force_order;
-/* Starts false and gets enabled once CPU feature detection is done. */
-DECLARE_STATIC_KEY_FALSE(cr_pinning);
-extern unsigned long cr4_pinned_bits;
+void native_write_cr0(unsigned long val);
static inline unsigned long native_read_cr0(void)
{
@@ -29,24 +27,6 @@ static inline unsigned long native_read_cr0(void)
return val;
}
-static inline void native_write_cr0(unsigned long val)
-{
- unsigned long bits_missing = 0;
-
-set_register:
- asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
-
- if (static_branch_likely(&cr_pinning)) {
- if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
- bits_missing = X86_CR0_WP;
- val |= bits_missing;
- goto set_register;
- }
- /* Warn after we've set the missing bits. */
- WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
- }
-}
-
static inline unsigned long native_read_cr2(void)
{
unsigned long val;
@@ -91,24 +71,7 @@ static inline unsigned long native_read_cr4(void)
return val;
}
-static inline void native_write_cr4(unsigned long val)
-{
- unsigned long bits_missing = 0;
-
-set_register:
- asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
-
- if (static_branch_likely(&cr_pinning)) {
- if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
- bits_missing = ~val & cr4_pinned_bits;
- val |= bits_missing;
- goto set_register;
- }
- /* Warn after we've set the missing bits. */
- WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
- bits_missing);
- }
-}
+void native_write_cr4(unsigned long val);
#ifdef CONFIG_X86_64
static inline unsigned long native_read_cr8(void)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 309b6b9b49d4..11472178e17f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -366,10 +366,62 @@ out:
cr4_clear_bits(X86_CR4_UMIP);
}
-DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
-EXPORT_SYMBOL(cr_pinning);
-unsigned long cr4_pinned_bits __ro_after_init;
-EXPORT_SYMBOL(cr4_pinned_bits);
+static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
+static unsigned long cr4_pinned_bits __ro_after_init;
+
+void native_write_cr0(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr0": "+r" (val), "+m" (__force_order));
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
+ bits_missing = X86_CR0_WP;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
+ }
+}
+EXPORT_SYMBOL(native_write_cr0);
+
+void native_write_cr4(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr4": "+r" (val), "+m" (cr4_pinned_bits));
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & cr4_pinned_bits) != cr4_pinned_bits)) {
+ bits_missing = ~val & cr4_pinned_bits;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR4 bits went missing: %lx!?\n",
+ bits_missing);
+ }
+}
+EXPORT_SYMBOL(native_write_cr4);
+
+void cr4_init(void)
+{
+ unsigned long cr4 = __read_cr4();
+
+ if (boot_cpu_has(X86_FEATURE_PCID))
+ cr4 |= X86_CR4_PCIDE;
+ if (static_branch_likely(&cr_pinning))
+ cr4 |= cr4_pinned_bits;
+
+ __write_cr4(cr4);
+
+ /* Initialize cr4 shadow for this CPU. */
+ this_cpu_write(cpu_tlbstate.cr4, cr4);
+}
/*
* Once CPU feature detection is finished (and boot params have been
@@ -1723,12 +1775,6 @@ void cpu_init(void)
wait_for_master_cpu(cpu);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
-
if (cpu)
load_ucode_ap();
@@ -1823,12 +1869,6 @@ void cpu_init(void)
wait_for_master_cpu(cpu);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
-
show_ucode_info_early();
pr_info("Initializing CPU#%d\n", cpu);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f78801114ee1..259d1d2be076 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -210,28 +210,16 @@ static int enable_start_cpu0;
*/
static void notrace start_secondary(void *unused)
{
- unsigned long cr4 = __read_cr4();
-
/*
* Don't put *anything* except direct CPU state initialization
* before cpu_init(), SMP booting is too fragile that we want to
* limit the things done here to the most necessary things.
*/
- if (boot_cpu_has(X86_FEATURE_PCID))
- cr4 |= X86_CR4_PCIDE;
- if (static_branch_likely(&cr_pinning))
- cr4 |= cr4_pinned_bits;
-
- __write_cr4(cr4);
+ cr4_init();
#ifdef CONFIG_X86_32
/* switch away from the initial page table */
load_cr3(swapper_pg_dir);
- /*
- * Initialize the CR4 shadow before doing anything that could
- * try to read it.
- */
- cr4_init_shadow();
__flush_tlb_all();
#endif
load_current_idt();
diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c
index 77d81c1a63e9..802ee5bba66c 100644
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -58,6 +58,7 @@ static void cpu_bringup(void)
{
int cpu;
+ cr4_init();
cpu_init();
touch_softlockup_watchdog();
preempt_disable();
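To make the discussion below easier to follow: a static key heads the
list of its jump entries, and loading a module that uses the key must
splice the module's usage sites into that list, i.e. write into the key
itself. A simplified view of the structures (field names as in
include/linux/jump_label.h, details elided):

struct static_key {
	atomic_t enabled;
	union {
		unsigned long type;		/* low bits carry flags */
		struct jump_entry *entries;	/* built-in usage sites */
		struct static_key_mod *next;	/* module usage sites */
	};
};

/*
 * On module load, jump_label_add_module() ends up doing, in essence:
 *
 *	key->next = jlm;	// faults if the key sits in a RO section
 */

With DEFINE_STATIC_KEY_FALSE_RO() the key lives in .data..ro_after_init,
so that write is exactly what blew up when kvm_intel was loaded.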
> On Jul 10, 2019, at 7:22 AM, Jiri Kosina <[email protected]> wrote:
>
> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>
>> If we mark the key as RO after init, and then try and modify the key to
>> link module usage sites, things might go bang as described.
>>
>> Thanks!
>>
>>
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 27d7864e7252..5bf7a8354da2 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
>> cr4_clear_bits(X86_CR4_UMIP);
>> }
>>
>> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
>> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
>
> Good catch, I guess that is going to fix it.
>
> At the same time though, it sort of destroys the original intent of Kees'
> patch, right? The exploits will just have to call static_key_disable()
> prior to calling native_write_cr4() again, and the protection is gone.
Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
set_memory_rw(), make the page that holds the key writable, and then call
static_key_disable(), followed by a call to native_write_cr4().
On Thu, 11 Jul 2019, Nadav Amit wrote:
> > On Jul 10, 2019, at 7:22 AM, Jiri Kosina <[email protected]> wrote:
> >
> > On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> >
> >> If we mark the key as RO after init, and then try and modify the key to
> >> link module usage sites, things might go bang as described.
> >>
> >> Thanks!
> >>
> >>
> >> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> >> index 27d7864e7252..5bf7a8354da2 100644
> >> --- a/arch/x86/kernel/cpu/common.c
> >> +++ b/arch/x86/kernel/cpu/common.c
> >> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> >> cr4_clear_bits(X86_CR4_UMIP);
> >> }
> >>
> >> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> >> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
> >
> > Good catch, I guess that is going to fix it.
> >
> > At the same time though, it sort of destroys the original intent of Kees'
> > patch, right? The exploits will just have to call static_key_disable()
> > prior to calling native_write_cr4() again, and the protection is gone.
>
> Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
> set_memory_rw(), make the page that holds the key writable, and then call
> static_key_disable(), followed by a call to native_write_cr4().
That's true, but it's not worth the trouble.
Thanks,
tglx
On Thu, Jul 11, 2019 at 07:11:19AM +0000, Nadav Amit wrote:
> > On Jul 10, 2019, at 7:22 AM, Jiri Kosina <[email protected]> wrote:
> >
> > On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> >
> >> If we mark the key as RO after init, and then try and modify the key to
> >> link module usage sites, things might go bang as described.
> >>
> >> Thanks!
> >>
> >>
> >> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> >> index 27d7864e7252..5bf7a8354da2 100644
> >> --- a/arch/x86/kernel/cpu/common.c
> >> +++ b/arch/x86/kernel/cpu/common.c
> >> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> >> cr4_clear_bits(X86_CR4_UMIP);
> >> }
> >>
> >> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> >> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
> >
> > Good catch, I guess that is going to fix it.
> >
> > At the same time though, it sort of destroys the original intent of Kees'
> > patch, right? The exploits will just have to call static_key_disable()
> > prior to calling native_write_cr4() again, and the protection is gone.
>
> Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
> set_memory_rw(), make the page that holds the key writable, and then call
> static_key_disable(), followed by a call to native_write_cr4().
Or call text_poke_bp() with the right set of arguments.
On Thu, Jul 11, 2019 at 10:01:34AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2019 at 07:11:19AM +0000, Nadav Amit wrote:
> > > On Jul 10, 2019, at 7:22 AM, Jiri Kosina <[email protected]> wrote:
> > >
> > > On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> > >
> > >> If we mark the key as RO after init, and then try and modify the key to
> > >> link module usage sites, things might go bang as described.
> > >>
> > >> Thanks!
> > >>
> > >>
> > >> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > >> index 27d7864e7252..5bf7a8354da2 100644
> > >> --- a/arch/x86/kernel/cpu/common.c
> > >> +++ b/arch/x86/kernel/cpu/common.c
> > >> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
> > >> cr4_clear_bits(X86_CR4_UMIP);
> > >> }
> > >>
> > >> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
> > >> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
> > >
> > > Good catch, I guess that is going to fix it.
> > >
> > > At the same time though, it sort of destroys the original intent of Kees'
> > > patch, right? The exploits will just have to call static_key_disable()
> > > prior to calling native_write_cr4() again, and the protection is gone.
> >
> > Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
> > set_memory_rw(), make the page that holds the key writable, and then call
> > static_key_disable(), followed by a call to native_write_cr4().
>
> Or call text_poke_bp() with the right set of arguments.
Right -- the point is to make it defended against an arbitrary write,
not arbitrary execution. Nothing is safe from arbitrary exec, but we can
do our due diligence on making things read-only.
--
Kees Cook
> On Jul 11, 2019, at 8:08 AM, Kees Cook <[email protected]> wrote:
>
> On Thu, Jul 11, 2019 at 10:01:34AM +0200, Peter Zijlstra wrote:
>> On Thu, Jul 11, 2019 at 07:11:19AM +0000, Nadav Amit wrote:
>>>> On Jul 10, 2019, at 7:22 AM, Jiri Kosina <[email protected]> wrote:
>>>>
>>>> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>>>>
>>>>> If we mark the key as RO after init, and then try and modify the key to
>>>>> link module usage sites, things might go bang as described.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>>> index 27d7864e7252..5bf7a8354da2 100644
>>>>> --- a/arch/x86/kernel/cpu/common.c
>>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>>> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
>>>>> cr4_clear_bits(X86_CR4_UMIP);
>>>>> }
>>>>>
>>>>> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
>>>>> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
>>>>
>>>> Good catch, I guess that is going to fix it.
>>>>
>>>> At the same time though, it sort of destroys the original intent of Kees'
>>>> patch, right? The exploits will just have to call static_key_disable()
>>>> prior to calling native_write_cr4() again, and the protection is gone.
>>>
>>> Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
>>> set_memory_rw(), make the page that holds the key writable, and then call
>>> static_key_disable(), followed by a call to native_write_cr4().
>>
>> Or call text_poke_bp() with the right set of arguments.
>
> Right -- the point is to make it defended against an arbitrary write,
> not arbitrary execution. Nothing is safe from arbitrary exec, but we can
> do our due diligence on making things read-only.
I don’t understand.
In the PoC that motivated this patch [1], the attacker gained the
ability to call a function, control its first argument and used it to
disable SMEP/SMAP by calling native_write_cr4(), which he did before doing
an arbitrary write (another ability he gained).
After this patch, the attacker can instead call three functions, and by
controlling their first arguments (*) disable SMEP. I didn’t see anything
in the motivating PoC that prevented calling 3 functions one at a time.
So it seems to me that it raised the bar for an attack by very little.
--
(*) set_memory_rw() has a second argument - the number of pages - but many
non-zero values may be fine (if not, a warning from __cpa_process_fault()
might appear).
[1] https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
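Spelled out, the three-call sequence Nadav describes would look roughly
like this (a sketch of the attack surface under discussion, assuming the
attacker can already call functions with a controlled first argument and
has located the symbols; after the fix above, cr_pinning is static, so
finding it takes extra work):

/* 1) Make the page holding the key writable again. */
set_memory_rw((unsigned long)&cr_pinning, 1);
/* 2) Turn the pinning check off. */
static_key_disable(&cr_pinning.key);
/* 3) The write now sticks: clear SMEP/SMAP unchallenged. */
native_write_cr4(native_read_cr4() & ~(X86_CR4_SMEP | X86_CR4_SMAP));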