Since we now have infrastructure to analyze module text, disallow
modules that write to CRn and DRn registers.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kernel/module.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
return false;
}
+static bool insn_is_mov_CRn(struct insn *insn)
+{
+ if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
+ return true;
+
+ return false;
+}
+
+static bool insn_is_mov_DRn(struct insn *insn)
+{
+ if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x23)
+ return true;
+
+ return false;
+}
+
static int decode_module(struct module *mod, void *text, void *text_end, bool sld_safe)
{
bool allow_vmx = sld_safe || !split_lock_enabled();
@@ -285,6 +301,11 @@ static int decode_module(struct module *
return -ENOEXEC;
}
+ if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
+ pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
+ return -ENOEXEC;
+ }
+
text += insn.length;
}
On Tue, Apr 07, 2020 at 01:02:40PM +0200, Peter Zijlstra wrote:
> Since we now have infrastructure to analyze module text, disallow
> modules that write to CRn and DRn registers.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
> return false;
> }
>
> +static bool insn_is_mov_CRn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
> + return true;
I always cringe at numeric literals. Would it be overkill to add defines
for these (and the others that have comments next to them in 3/4)? It
makes stuff easier to grep, etc. (e.g. we have register names in
arch/x86/include/asm/asm.h, do we need instruction names somewhere else?
I assume objtool has a bunch of this too...)
-Kees
> +
> + return false;
> +}
> +
> +static bool insn_is_mov_DRn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x23)
> + return true;
> +
> + return false;
> +}
> +
> static int decode_module(struct module *mod, void *text, void *text_end, bool sld_safe)
> {
> bool allow_vmx = sld_safe || !split_lock_enabled();
> @@ -285,6 +301,11 @@ static int decode_module(struct module *
> return -ENOEXEC;
> }
>
> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
> + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
> + return -ENOEXEC;
> + }
> +
> text += insn.length;
> }
>
>
>
--
Kees Cook
On Tue, Apr 07, 2020 at 10:01:04AM -0700, Kees Cook wrote:
> On Tue, Apr 07, 2020 at 01:02:40PM +0200, Peter Zijlstra wrote:
> > Since we now have infrastructure to analyze module text, disallow
> > modules that write to CRn and DRn registers.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> > 1 file changed, 21 insertions(+)
> >
> > --- a/arch/x86/kernel/module.c
> > +++ b/arch/x86/kernel/module.c
> > @@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
> > return false;
> > }
> >
> > +static bool insn_is_mov_CRn(struct insn *insn)
> > +{
> > + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
> > + return true;
>
> I always cringe at numeric literals. Would it be overkill to add defines
> for these (and the others that have comments next to them in 3/4)? It
> makes stuff easier to grep, etc. (e.g. we have register names in
> arch/x86/include/asm/asm.h, do we need instruction names somewhere else?
> I assume objtool has a bunch of this too...)
objtool does not, have a peek at tools/objtool/arch/x86/decode.c
I'm not sure what the best way is here, the x86 opcode map is a
disaster. Even the mnemonic doesn't help us here, as that's just MOV :/
On Tue, Apr 07, 2020 at 08:13:25PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 10:01:04AM -0700, Kees Cook wrote:
> > On Tue, Apr 07, 2020 at 01:02:40PM +0200, Peter Zijlstra wrote:
> > > Since we now have infrastructure to analyze module text, disallow
> > > modules that write to CRn and DRn registers.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > > ---
> > > arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> > > 1 file changed, 21 insertions(+)
> > >
> > > --- a/arch/x86/kernel/module.c
> > > +++ b/arch/x86/kernel/module.c
> > > @@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
> > > return false;
> > > }
> > >
> > > +static bool insn_is_mov_CRn(struct insn *insn)
> > > +{
> > > + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
> > > + return true;
> >
> > I always cringe at numeric literals. Would it be overkill to add defines
> > for these (and the others that have comments next to them in 3/4)? It
> > makes stuff easier to grep, etc. (e.g. we have register names in
> > arch/x86/include/asm/asm.h, do we need instruction names somewhere else?
> > I assume objtool has a bunch of this too...)
>
> objtool does not, have a peek at tools/objtool/arch/x86/decode.c
Eek.
> I'm not sure what the best way is here, the x86 opcode map is a
> disaster. Even the mnemonic doesn't help us here, as that's just MOV :/
Yeah, I'm not sure either. I guess leave this as-is.
--
Kees Cook
> On Apr 7, 2020, at 4:02 AM, Peter Zijlstra <[email protected]> wrote:
>
> Since we now have infrastructure to analyze module text, disallow
> modules that write to CRn and DRn registers.
Assuming the kernel is built without CONFIG_PARAVIRT, what is the right way
for out-of-tree modules to write to CRs? Let’s say CR2?
On Tue, Apr 07, 2020 at 11:55:21AM -0700, Nadav Amit wrote:
> > On Apr 7, 2020, at 4:02 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Since we now have infrastructure to analyze module text, disallow
> > modules that write to CRn and DRn registers.
>
> Assuming the kernel is built without CONFIG_PARAVIRT, what is the right way
> for out-of-tree modules to write to CRs? Let’s say CR2?
Most of them there is no real justification for ever writing to. CR2 I
suppose we can have an exception for given a sane rationale for why
you'd need to rewrite the fault address.
> On Apr 7, 2020, at 12:38 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Apr 07, 2020 at 11:55:21AM -0700, Nadav Amit wrote:
>>> On Apr 7, 2020, at 4:02 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> Since we now have infrastructure to analyze module text, disallow
>>> modules that write to CRn and DRn registers.
>>
>> Assuming the kernel is built without CONFIG_PARAVIRT, what is the right way
>> for out-of-tree modules to write to CRs? Let’s say CR2?
>
> Most of them there is no real justification for ever writing to. CR2 I
> suppose we can have an exception for given a sane rationale for why
> you'd need to rewrite the fault address.
For the same reason that KVM writes to CR2 - to restore CR2 before entering
a guest, since CR2 not architecturally loaded from the VMCS. I suspect there
are additional use-cases which are not covered by the kernel interfaces.
On Tue, Apr 07, 2020 at 01:27:45PM -0700, Nadav Amit wrote:
> > On Apr 7, 2020, at 12:38 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Apr 07, 2020 at 11:55:21AM -0700, Nadav Amit wrote:
> >>> On Apr 7, 2020, at 4:02 AM, Peter Zijlstra <[email protected]> wrote:
> >>>
> >>> Since we now have infrastructure to analyze module text, disallow
> >>> modules that write to CRn and DRn registers.
> >>
> >> Assuming the kernel is built without CONFIG_PARAVIRT, what is the right way
> >> for out-of-tree modules to write to CRs? Let’s say CR2?
> >
> > Most of them there is no real justification for ever writing to. CR2 I
> > suppose we can have an exception for given a sane rationale for why
> > you'd need to rewrite the fault address.
>
> For the same reason that KVM writes to CR2 - to restore CR2 before entering
> a guest, since CR2 not architecturally loaded from the VMCS. I suspect there
> are additional use-cases which are not covered by the kernel interfaces.
So I'm not much of a virt guy (clearly), and *groan*, that's horrible.
I'll go make an exception for CR2.
> On Apr 7, 2020, at 1:50 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Apr 07, 2020 at 01:27:45PM -0700, Nadav Amit wrote:
>>> On Apr 7, 2020, at 12:38 PM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Tue, Apr 07, 2020 at 11:55:21AM -0700, Nadav Amit wrote:
>>>>> On Apr 7, 2020, at 4:02 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>
>>>>> Since we now have infrastructure to analyze module text, disallow
>>>>> modules that write to CRn and DRn registers.
>>>>
>>>> Assuming the kernel is built without CONFIG_PARAVIRT, what is the right way
>>>> for out-of-tree modules to write to CRs? Let’s say CR2?
>>>
>>> Most of them there is no real justification for ever writing to. CR2 I
>>> suppose we can have an exception for given a sane rationale for why
>>> you'd need to rewrite the fault address.
>>
>> For the same reason that KVM writes to CR2 - to restore CR2 before entering
>> a guest, since CR2 not architecturally loaded from the VMCS. I suspect there
>> are additional use-cases which are not covered by the kernel interfaces.
>
> So I'm not much of a virt guy (clearly), and *groan*, that's horrible.
> I'll go make an exception for CR2.
Clearly you are not a virt guy if you think that this is the horrible part
in x86 virtualization ;-)
Anyhow, I do not think it is the only use-case which is not covered by your
patches (even considering CRs/DRs alone). For example, there is no kernel
function to turn on CR4.VMXE, which is required to run hypervisors on x86.
I think a thorough analysis of existing software is needed to figure out
which use-cases are valid, and to exclude them during module scanning or to
provide alternative kernel interfaces to enable them. This may require a
transition phase in which module scanning would only issue warnings and
would not prevent the module from being loaded.
On Tue, Apr 07, 2020 at 02:22:11PM -0700, Nadav Amit wrote:
> Anyhow, I do not think it is the only use-case which is not covered by your
> patches (even considering CRs/DRs alone). For example, there is no kernel
> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
That needs an exported function; there is no way we'll allow random
writes to CR4, there's too much dodgy stuff in there.
On Tue, 07 Apr 2020 13:02:40 +0200
Peter Zijlstra <[email protected]> wrote:
> Since we now have infrastructure to analyze module text, disallow
> modules that write to CRn and DRn registers.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
> return false;
> }
>
> +static bool insn_is_mov_CRn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
> + return true;
> +
> + return false;
> +}
> +
> +static bool insn_is_mov_DRn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x23)
> + return true;
> +
> + return false;
> +}
> +
> static int decode_module(struct module *mod, void *text, void *text_end, bool sld_safe)
> {
> bool allow_vmx = sld_safe || !split_lock_enabled();
> @@ -285,6 +301,11 @@ static int decode_module(struct module *
> return -ENOEXEC;
> }
>
> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
> + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
> + return -ENOEXEC;
> + }
Hmm, wont this break jailhouse?
-- Steve
> +
> text += insn.length;
> }
>
>
On 07/04/20 23:27, Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 02:22:11PM -0700, Nadav Amit wrote:
>> Anyhow, I do not think it is the only use-case which is not covered by your
>> patches (even considering CRs/DRs alone). For example, there is no kernel
>> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
> That needs an exported function; there is no way we'll allow random
> writes to CR4, there's too much dodgy stuff in there.
native_write_cr4 and pv_ops (through which you can do write_cr4) are
both exported, and so is cpu_tlbstate which is used by __cr4_set_bits
and friends. Am I missing something glaringly obvious?
Paolo
On 07/04/2020 22:22, Nadav Amit wrote:
>> On Apr 7, 2020, at 1:50 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> On Tue, Apr 07, 2020 at 01:27:45PM -0700, Nadav Amit wrote:
>>>> On Apr 7, 2020, at 12:38 PM, Peter Zijlstra <[email protected]> wrote:
>>>>
>>>> On Tue, Apr 07, 2020 at 11:55:21AM -0700, Nadav Amit wrote:
>>>>>> On Apr 7, 2020, at 4:02 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>>
>>>>>> Since we now have infrastructure to analyze module text, disallow
>>>>>> modules that write to CRn and DRn registers.
>>>>> Assuming the kernel is built without CONFIG_PARAVIRT, what is the right way
>>>>> for out-of-tree modules to write to CRs? Let’s say CR2?
>>>> Most of them there is no real justification for ever writing to. CR2 I
>>>> suppose we can have an exception for given a sane rationale for why
>>>> you'd need to rewrite the fault address.
>>> For the same reason that KVM writes to CR2 - to restore CR2 before entering
>>> a guest, since CR2 not architecturally loaded from the VMCS. I suspect there
>>> are additional use-cases which are not covered by the kernel interfaces.
>> So I'm not much of a virt guy (clearly), and *groan*, that's horrible.
>> I'll go make an exception for CR2.
> Clearly you are not a virt guy if you think that this is the horrible part
> in x86 virtualization ;-)
Very definitely seconded :)
> Anyhow, I do not think it is the only use-case which is not covered by your
> patches (even considering CRs/DRs alone). For example, there is no kernel
> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
How about taking this opportunity to see if there is a way to improve on
the status quo for co-existing hypervisor modules?
ISTR there are a large number of hoops to jump through if you're not
sure if anything else in the system is using VMX, going as far as
needing to do a full VMXON/VMXOFF cycle each context.
Enabling CR4.VMXE might want to be done proactively by the kernel.
Amongst other things, it gives protection against stray INIT IPIs in the
system, although the interaction with SMX (tboot / trenchboot) wants
considering carefully.
~Andrew
> On Apr 7, 2020, at 3:12 PM, Paolo Bonzini <[email protected]> wrote:
>
> On 07/04/20 23:27, Peter Zijlstra wrote:
>> On Tue, Apr 07, 2020 at 02:22:11PM -0700, Nadav Amit wrote:
>>> Anyhow, I do not think it is the only use-case which is not covered by your
>>> patches (even considering CRs/DRs alone). For example, there is no kernel
>>> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
>> That needs an exported function; there is no way we'll allow random
>> writes to CR4, there's too much dodgy stuff in there.
>
> native_write_cr4 and pv_ops (through which you can do write_cr4) are
> both exported, and so is cpu_tlbstate which is used by __cr4_set_bits
> and friends. Am I missing something glaringly obvious?
No, I was the one who missed the obvious thing.
Having said that, I still think there are additional cases that need to be
handled. For instance, I see that is_erratum_383() in KVM (AMD) flushes the
local TLB by writing to CR3 the previous value. I am not familiar with the
erratum. Maybe I am missing something again, but I do not see an appropriate
exported alternative in the kernel.
On 08/04/20 01:15, Andrew Cooper wrote:
>> Anyhow, I do not think it is the only use-case which is not covered by your
>> patches (even considering CRs/DRs alone). For example, there is no kernel
>> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
> How about taking this opportunity to see if there is a way to improve on
> the status quo for co-existing hypervisor modules?
Almost serious question: why? I can understand VMware, but why can't at
least VirtualBox use KVM on Linux? I am not sure if they are still
running device emulation in ring zero, but if so do you really want to
do that these days?
Paolo
On Tue, Apr 07, 2020 at 11:27:54PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 02:22:11PM -0700, Nadav Amit wrote:
> > Anyhow, I do not think it is the only use-case which is not covered by your
> > patches (even considering CRs/DRs alone). For example, there is no kernel
> > function to turn on CR4.VMXE, which is required to run hypervisors on x86.
>
> That needs an exported function; there is no way we'll allow random
> writes to CR4, there's too much dodgy stuff in there.
And this clearly shows while trying to cater to anyone doing hardware
virt out of tree is just a disaster and we need to stick to our
traditional line that out of tree modules don't matter, if you care
about your module bring it upstream. Especially as we have a perfectly
fine upstream module for just about every variant of hardware
virtualization that can be extended for the needs of other hypervisors.
On 07.04.20 23:48, Steven Rostedt wrote:
> On Tue, 07 Apr 2020 13:02:40 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>> Since we now have infrastructure to analyze module text, disallow
>> modules that write to CRn and DRn registers.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> ---
>> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>>
>> --- a/arch/x86/kernel/module.c
>> +++ b/arch/x86/kernel/module.c
>> @@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
>> return false;
>> }
>>
>> +static bool insn_is_mov_CRn(struct insn *insn)
>> +{
>> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> +static bool insn_is_mov_DRn(struct insn *insn)
>> +{
>> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x23)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> static int decode_module(struct module *mod, void *text, void *text_end, bool sld_safe)
>> {
>> bool allow_vmx = sld_safe || !split_lock_enabled();
>> @@ -285,6 +301,11 @@ static int decode_module(struct module *
>> return -ENOEXEC;
>> }
>>
>> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
>> + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
>> + return -ENOEXEC;
>> + }
>
> Hmm, wont this break jailhouse?
Yes, possibly. We load the hypervisor binary via request_firmware into
executable memory and then jump into it. So most of the "suspicious"
code is there - except two cr4_init_shadow() calls to propagate the
non-transparent update of VMXE into that shadow. We could hide that CR4
flag, but that could mislead root Linux to try to use VMX while in jail.
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On 08/04/20 07:58, Jan Kiszka wrote:
>>>
>>> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
>>> + pr_err("Module writes to CRn or DRn, please use the
>>> proper accessors: %s\n", mod->name);
>>> + return -ENOEXEC;
>>> + }
>>
>> Hmm, wont this break jailhouse?
>
> Yes, possibly. We load the hypervisor binary via request_firmware into
> executable memory and then jump into it. So most of the "suspicious"
> code is there - except two cr4_init_shadow() calls to propagate the
> non-transparent update of VMXE into that shadow. We could hide that CR4
> flag, but that could mislead root Linux to try to use VMX while in jail.
Why not contribute the Jailhouse loader into Linux?
Paolo
On Wed, Apr 08, 2020 at 02:22:45AM +0200, Paolo Bonzini wrote:
> On 08/04/20 01:15, Andrew Cooper wrote:
> >> Anyhow, I do not think it is the only use-case which is not covered by your
> >> patches (even considering CRs/DRs alone). For example, there is no kernel
> >> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
> > How about taking this opportunity to see if there is a way to improve on
> > the status quo for co-existing hypervisor modules?
>
> Almost serious question: why? I can understand VMware, but why can't at
> least VirtualBox use KVM on Linux? I am not sure if they are still
> running device emulation in ring zero, but if so do you really want to
> do that these days?
Having had the 'joy' of looking at the virtual-puke^Wbox code recently,
nobody with half a hair of sense on their head will want to ever touch
that thing again.
It's a security nightmare, and that's the best part of it.
On Wed, Apr 08, 2020 at 12:12:14AM +0200, Paolo Bonzini wrote:
> On 07/04/20 23:27, Peter Zijlstra wrote:
> > On Tue, Apr 07, 2020 at 02:22:11PM -0700, Nadav Amit wrote:
> >> Anyhow, I do not think it is the only use-case which is not covered by your
> >> patches (even considering CRs/DRs alone). For example, there is no kernel
> >> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
> > That needs an exported function; there is no way we'll allow random
> > writes to CR4, there's too much dodgy stuff in there.
>
> native_write_cr4 and pv_ops (through which you can do write_cr4) are
> both exported, and so is cpu_tlbstate which is used by __cr4_set_bits
> and friends. Am I missing something glaringly obvious?
cpu_tlbstate is going away, but yes, native_write_cr4() is the right
interface to use, or rather the cr4_{set,clear,toggle}_bits() things
are.
That gives us control over which CR4 bits are available, and, a possible
means of arbitrating that VMX bit.
On Wed, Apr 08, 2020 at 07:58:53AM +0200, Jan Kiszka wrote:
> On 07.04.20 23:48, Steven Rostedt wrote:
> > Hmm, wont this break jailhouse?
Breaking it isn't a problem, it's out of tree and it should be fixable.
> Yes, possibly. We load the hypervisor binary via request_firmware into
> executable memory and then jump into it. So most of the "suspicious" code is
W.T.H. does the firmware loader have the ability to give executable
memory? We need to kill that too. /me goes find.
On 08/04/20 10:58, Jan Kiszka wrote:
>> Why not contribute the Jailhouse loader into Linux?
>
> Definitely planned. But right now it would add the burden of managing
> the interface between loader and hypervisor carefully. Currently it is
> internal to Jailhouse and maintained in lock-step, without any backward
> compatibility.
How often does that change?
Paolo
On 08.04.20 10:03, Paolo Bonzini wrote:
> On 08/04/20 07:58, Jan Kiszka wrote:
>>>>
>>>> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
>>>> + pr_err("Module writes to CRn or DRn, please use the
>>>> proper accessors: %s\n", mod->name);
>>>> + return -ENOEXEC;
>>>> + }
>>>
>>> Hmm, wont this break jailhouse?
>>
>> Yes, possibly. We load the hypervisor binary via request_firmware into
>> executable memory and then jump into it. So most of the "suspicious"
>> code is there - except two cr4_init_shadow() calls to propagate the
>> non-transparent update of VMXE into that shadow. We could hide that CR4
>> flag, but that could mislead root Linux to try to use VMX while in jail.
>
> Why not contribute the Jailhouse loader into Linux?
>
Definitely planned. But right now it would add the burden of managing
the interface between loader and hypervisor carefully. Currently it is
internal to Jailhouse and maintained in lock-step, without any backward
compatibility.
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On 08.04.20 10:51, Peter Zijlstra wrote:
> On Wed, Apr 08, 2020 at 07:58:53AM +0200, Jan Kiszka wrote:
>> On 07.04.20 23:48, Steven Rostedt wrote:
>
>>> Hmm, wont this break jailhouse?
>
> Breaking it isn't a problem, it's out of tree and it should be fixable.
>
>> Yes, possibly. We load the hypervisor binary via request_firmware into
>> executable memory and then jump into it. So most of the "suspicious" code is
>
> W.T.H. does the firmware loader have the ability to give executable
> memory? We need to kill that too. /me goes find.
>
At the risk of cutting our own branch off: That's not the firmware
loader, it's ioremap with PAGE_KERNEL_EXEC.
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On Wed, Apr 08, 2020 at 10:51:38AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 08, 2020 at 07:58:53AM +0200, Jan Kiszka wrote:
> > On 07.04.20 23:48, Steven Rostedt wrote:
>
> > > Hmm, wont this break jailhouse?
>
> Breaking it isn't a problem, it's out of tree and it should be fixable.
>
> > Yes, possibly. We load the hypervisor binary via request_firmware into
> > executable memory and then jump into it. So most of the "suspicious" code is
>
> W.T.H. does the firmware loader have the ability to give executable
> memory? We need to kill that too. /me goes find.
AFAICT the firmware loader only provides PAGE_KERNEL_RO, so how do you
get it executable?
I'm thinking the patches Christoph has lined up will take care of this.
On 08.04.20 11:04, Paolo Bonzini wrote:
> On 08/04/20 10:58, Jan Kiszka wrote:
>>> Why not contribute the Jailhouse loader into Linux?
>>
>> Definitely planned. But right now it would add the burden of managing
>> the interface between loader and hypervisor carefully. Currently it is
>> internal to Jailhouse and maintained in lock-step, without any backward
>> compatibility.
>
> How often does that change?
Not often, but right now we are at a critical point, starting to explore
booting Jailhouse before Linux [1]. That may actually help to settle the
interface and move things forward.
Another to-do is overcoming the need for having to map the hypervisor at
a fixed location into the kernel address space. Not needed on arm64,
still required on 32-bit ARM (well...) and x86 (more important). I would
dislike pushing such legacy to upstream.
Jan
[1]
https://www.mail-archive.com/[email protected]/msg08389.html
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
From: Jan Kiszka
> Sent: 08 April 2020 09:59
...
> At the risk of cutting our own branch off: That's not the firmware
> loader, it's ioremap with PAGE_KERNEL_EXEC.
You could link the 'blob' into the .text part of a normal
kernel module and then load that.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 08/04/2020 01:22, Paolo Bonzini wrote:
> On 08/04/20 01:15, Andrew Cooper wrote:
>>> Anyhow, I do not think it is the only use-case which is not covered by your
>>> patches (even considering CRs/DRs alone). For example, there is no kernel
>>> function to turn on CR4.VMXE, which is required to run hypervisors on x86.
>> How about taking this opportunity to see if there is a way to improve on
>> the status quo for co-existing hypervisor modules?
> Almost serious question: why? I can understand VMware, but why can't at
> least VirtualBox use KVM on Linux? I am not sure if they are still
> running device emulation in ring zero, but if so do you really want to
> do that these days?
I see a lot of good reasons not to use the VirtualBox out-of-tree module
specifically, but there are plenty of other out-of-tree hypervisors,
including Jailhouse and Bareflank which come to mind.
I'm not suggesting bending over backwards for them, but at the point
you're already breaking all of them anyway, it seems silly not to try
and address some of the other robustness issues.
~Andrew
From: Jan Kiszka
> Sent: 08 April 2020 12:13
>
> On 08.04.20 11:25, David Laight wrote:
> > From: Jan Kiszka
> >> Sent: 08 April 2020 09:59
> > ...
> >> At the risk of cutting our own branch off: That's not the firmware
> >> loader, it's ioremap with PAGE_KERNEL_EXEC.
> >
> > You could link the 'blob' into the .text part of a normal
> > kernel module and then load that.
>
> Sure, also possible. Was just more convenient so far to replace the
> hypervisor binary without having to recompile the driver module.
I was thinking of a separate module that contained nothing else.
If it depended on the driver it's 'init' function could call back
into the driver code to pass an entrypoint array (etc).
You can easily to a version check at that point as well.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 08.04.20 11:13, Peter Zijlstra wrote:
> On Wed, Apr 08, 2020 at 10:51:38AM +0200, Peter Zijlstra wrote:
>> On Wed, Apr 08, 2020 at 07:58:53AM +0200, Jan Kiszka wrote:
>>> On 07.04.20 23:48, Steven Rostedt wrote:
>>
>>>> Hmm, wont this break jailhouse?
>>
>> Breaking it isn't a problem, it's out of tree and it should be fixable.
>>
>>> Yes, possibly. We load the hypervisor binary via request_firmware into
>>> executable memory and then jump into it. So most of the "suspicious" code is
>>
>> W.T.H. does the firmware loader have the ability to give executable
>> memory? We need to kill that too. /me goes find.
>
> AFAICT the firmware loader only provides PAGE_KERNEL_RO, so how do you
> get it executable?
memcpy(ioremapped_exec_region, firmware_image)
We only use the loader for getting the blob, not for running it. It has
to be put at a location that Linux will lose control over anyway.
>
> I'm thinking the patches Christoph has lined up will take care of this.
>
It would make sense from a certain POV...
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On 08.04.20 11:25, David Laight wrote:
> From: Jan Kiszka
>> Sent: 08 April 2020 09:59
> ...
>> At the risk of cutting our own branch off: That's not the firmware
>> loader, it's ioremap with PAGE_KERNEL_EXEC.
>
> You could link the 'blob' into the .text part of a normal
> kernel module and then load that.
Sure, also possible. Was just more convenient so far to replace the
hypervisor binary without having to recompile the driver module.
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On Tue, 07 Apr 2020 13:02:40 +0200
Peter Zijlstra <[email protected]> wrote:
> Since we now have infrastructure to analyze module text, disallow
> modules that write to CRn and DRn registers.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -266,6 +266,22 @@ static bool insn_is_vmx(struct insn *ins
> return false;
> }
>
> +static bool insn_is_mov_CRn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x22)
> + return true;
> +
> + return false;
> +}
> +
> +static bool insn_is_mov_DRn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f && insn->opcode.bytes[1] == 0x23)
> + return true;
> +
> + return false;
> +}
> +
> static int decode_module(struct module *mod, void *text, void *text_end, bool sld_safe)
> {
> bool allow_vmx = sld_safe || !split_lock_enabled();
> @@ -285,6 +301,11 @@ static int decode_module(struct module *
> return -ENOEXEC;
> }
>
> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
> + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
> + return -ENOEXEC;
> + }
> +
Something like this should be done for all modules, not just out of tree
modules.
-- Steve
On Wed, Apr 08, 2020 at 09:27:26AM -0400, Steven Rostedt wrote:
> On Tue, 07 Apr 2020 13:02:40 +0200
> Peter Zijlstra <[email protected]> wrote:
> > + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
> > + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
> > + return -ENOEXEC;
> > + }
> > +
>
> Something like this should be done for all modules, not just out of tree
> modules.
I'm all for it; but people were worried scanning all modules was too
expensive (I don't really believe it is, module loading just can't be a
hot-path). Also, in-tree modules are audited a lot more than out of tree
magic voodoo crap.
On Wed, Apr 08, 2020 at 05:44:19PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 08, 2020 at 09:27:26AM -0400, Steven Rostedt wrote:
> > On Tue, 07 Apr 2020 13:02:40 +0200
> > Peter Zijlstra <[email protected]> wrote:
>
> > > + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
> > > + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
> > > + return -ENOEXEC;
> > > + }
> > > +
> >
> > Something like this should be done for all modules, not just out of tree
> > modules.
>
> I'm all for it; but people were worried scanning all modules was too
> expensive (I don't really believe it is, module loading just can't be a
> hot-path). Also, in-tree modules are audited a lot more than out of tree
> magic voodoo crap.
Scanning all modules seems safer. While we're at it - can be move the
kvm bits using VMX to be always in the core kernel and just forbid
modules from using those instructions entirely?
+++ Peter Zijlstra [08/04/20 17:44 +0200]:
>On Wed, Apr 08, 2020 at 09:27:26AM -0400, Steven Rostedt wrote:
>> On Tue, 07 Apr 2020 13:02:40 +0200
>> Peter Zijlstra <[email protected]> wrote:
>
>> > + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
>> > + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
>> > + return -ENOEXEC;
>> > + }
>> > +
>>
>> Something like this should be done for all modules, not just out of tree
>> modules.
>
>I'm all for it; but people were worried scanning all modules was too
>expensive (I don't really believe it is, module loading just can't be a
>hot-path). Also, in-tree modules are audited a lot more than out of tree
>magic voodoo crap.
The intention of the original patches was to do the text scan to catch
a handful of out-of-tree hypervisor modules - but now that
decode_module() is being generalized to more cases, I don't mind
scanning all modules.
Thanks,
Jessica
On Wed, Apr 08, 2020 at 08:46:02AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 08, 2020 at 05:44:19PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 08, 2020 at 09:27:26AM -0400, Steven Rostedt wrote:
> > > On Tue, 07 Apr 2020 13:02:40 +0200
> > > Peter Zijlstra <[email protected]> wrote:
> >
> > > > + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
> > > > + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
> > > > + return -ENOEXEC;
> > > > + }
> > > > +
> > >
> > > Something like this should be done for all modules, not just out of tree
> > > modules.
> >
> > I'm all for it; but people were worried scanning all modules was too
> > expensive (I don't really believe it is, module loading just can't be a
> > hot-path). Also, in-tree modules are audited a lot more than out of tree
> > magic voodoo crap.
>
> Scanning all modules seems safer. While we're at it - can be move the
> kvm bits using VMX to be always in the core kernel and just forbid
> modules from using those instructions entirely?
Practically speaking, no. Turning VMX on and off (literally VMXON/VMXOFF)
could be moved to helpers in the kernel, but KVM relies on inlining all
post-VMXON instructions (except for VMLAUNCH/VMRESUME) for performance.
VMLAUNCH/VMRESUME have their own caveats, moving them out of KVM would be
messy, to say the least.
On 08/04/20 17:46, Christoph Hellwig wrote:
> On Wed, Apr 08, 2020 at 05:44:19PM +0200, Peter Zijlstra wrote:
>> On Wed, Apr 08, 2020 at 09:27:26AM -0400, Steven Rostedt wrote:
>>> On Tue, 07 Apr 2020 13:02:40 +0200
>>> Peter Zijlstra <[email protected]> wrote:
>>
>>>> + if (insn_is_mov_CRn(&insn) || insn_is_mov_DRn(&insn)) {
>>>> + pr_err("Module writes to CRn or DRn, please use the proper accessors: %s\n", mod->name);
>>>> + return -ENOEXEC;
>>>> + }
>>>> +
>>>
>>> Something like this should be done for all modules, not just out of tree
>>> modules.
>>
>> I'm all for it; but people were worried scanning all modules was too
>> expensive (I don't really believe it is, module loading just can't be a
>> hot-path). Also, in-tree modules are audited a lot more than out of tree
>> magic voodoo crap.
>
> Scanning all modules seems safer. While we're at it - can be move the
> kvm bits using VMX to be always in the core kernel and just forbid
> modules from using those instructions entirely?
I suppose we could use PVOPS-style patching for the more
performance-critical cases, but VMREAD/VMWRITE does not seem like a
particularly bad thing to allow modules and VMLAUNCH/VMRESUME have very
peculiar calling conventions around them.
However, I wouldn't mind it if VMCLEAR/VMPTRLD and the associated kdump
cleanup code were moved to core kernel code.
Paolo
On Wed, Apr 08, 2020 at 06:15:48PM +0200, Paolo Bonzini wrote:
> On 08/04/20 17:46, Christoph Hellwig wrote:
> > Scanning all modules seems safer. While we're at it - can be move the
> > kvm bits using VMX to be always in the core kernel and just forbid
> > modules from using those instructions entirely?
>
> I suppose we could use PVOPS-style patching for the more
> performance-critical cases, but VMREAD/VMWRITE does not seem like a
> particularly bad thing to allow modules and VMLAUNCH/VMRESUME have very
> peculiar calling conventions around them.
>
> However, I wouldn't mind it if VMCLEAR/VMPTRLD and the associated kdump
> cleanup code were moved to core kernel code.
Speaking with my virt ignorance hat on, how impossible is it to provide
generic/useful VMLAUNCH/VMRESUME wrappers?
Because a lot of what happens around VMEXIT/VMENTER is very much like
the userspace entry crud, as per that series from Thomas that fixes all
that. And surely we don't need various broken copies of that in all the
out-of-tree hypervisors.
Also, I suppose if you have this, we no longer need to excempt CR2.
> On Apr 9, 2020, at 1:56 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Apr 08, 2020 at 06:15:48PM +0200, Paolo Bonzini wrote:
>> On 08/04/20 17:46, Christoph Hellwig wrote:
>
>>> Scanning all modules seems safer. While we're at it - can be move the
>>> kvm bits using VMX to be always in the core kernel and just forbid
>>> modules from using those instructions entirely?
>>
>> I suppose we could use PVOPS-style patching for the more
>> performance-critical cases, but VMREAD/VMWRITE does not seem like a
>> particularly bad thing to allow modules and VMLAUNCH/VMRESUME have very
>> peculiar calling conventions around them.
>>
>> However, I wouldn't mind it if VMCLEAR/VMPTRLD and the associated kdump
>> cleanup code were moved to core kernel code.
>
> Speaking with my virt ignorance hat on, how impossible is it to provide
> generic/useful VMLAUNCH/VMRESUME wrappers?
>
> Because a lot of what happens around VMEXIT/VMENTER is very much like
> the userspace entry crud, as per that series from Thomas that fixes all
> that. And surely we don't need various broken copies of that in all the
> out-of-tree hypervisors.
>
> Also, I suppose if you have this, we no longer need to excempt CR2.
It depends on what you mean by “VMLAUNCH/VMRESUME”. If you only consider the
instructions themselves, as Sean did in vmx_vmenter() and vmx_vmexit(),
there is no problem. Even if you consider saving the general purpose
registers as done in __vmx_vcpu_run() - that’s relatively easy.
I think that anything that is greater than that, for instance
vmx_vcpu_run(), would require more thought and effort. KVM data-structures,
specifically kvm_vcpu and friends, would need to be broken into general and
KVM specific structures. I am really not sure that the end result would be
much better than using KVM user-space interfaces.
I can ask someone from the VMware hypervisor developers to provide VMware
point-of-view, but it would help to know when do you plan to make such a
change take, and whether there would be some transition stage. Adapting a
hypervisor to use different low-level interfaces would require quite some
development & testing effort.
Nadav Amit <[email protected]> writes:
>> On Apr 9, 2020, at 1:56 AM, Peter Zijlstra <[email protected]> wrote:
>> Speaking with my virt ignorance hat on, how impossible is it to provide
>> generic/useful VMLAUNCH/VMRESUME wrappers?
>>
>> Because a lot of what happens around VMEXIT/VMENTER is very much like
>> the userspace entry crud, as per that series from Thomas that fixes all
>> that. And surely we don't need various broken copies of that in all the
>> out-of-tree hypervisors.
>>
>> Also, I suppose if you have this, we no longer need to excempt CR2.
>
> It depends on what you mean by “VMLAUNCH/VMRESUME”. If you only consider the
> instructions themselves, as Sean did in vmx_vmenter() and vmx_vmexit(),
> there is no problem. Even if you consider saving the general purpose
> registers as done in __vmx_vcpu_run() - that’s relatively easy.
__vmx_vcpu_run() is roughly the scope, but that wont work.
Looking at the vmmon source:
Task_Switch()
1) Mask all APIC LVTs which have NMI delivery mode enabled, e.g. PERF
2) Disable interrupts
3) Disable PEBS
4) Disable PT
5) Load a magic IDT
According to comments these are stubs to catch any exception which
happens while switching over.
6) Write CR0 and CR4 directly which is "safe" as the the IDT is
redirected to the monitor stubs.
7) VMXON()
8) Invoke monitor on some magic page which switches CR3 and GDT and
clears CR4.PCIDE (at least thats what the comments claim)
The monitor code is loaded from a binary only blob and that does
the actual vmlaunch/vmresume ...
And as this runs with a completely different CR3 sharing that
code is impossible.
When returning the above is undone in reverse order and any catched
exceptions / interrupts are replayed via "int $NR".
So it's pretty much the same mess as with vbox just different and
binary. Oh well...
The "good" news is that it's not involved in any of the context tracking
stuff so RCU wont ever be affected when a vmware vCPU runs. It's not
pretty, but TBH I don't care.
Thanks,
tglx
On Thu, 09 Apr 2020 23:13:22 +0200
Thomas Gleixner <[email protected]> wrote:
> Nadav Amit <[email protected]> writes:
> >> On Apr 9, 2020, at 1:56 AM, Peter Zijlstra <[email protected]> wrote:
> >> Speaking with my virt ignorance hat on, how impossible is it to provide
> >> generic/useful VMLAUNCH/VMRESUME wrappers?
> >>
> >> Because a lot of what happens around VMEXIT/VMENTER is very much like
> >> the userspace entry crud, as per that series from Thomas that fixes all
> >> that. And surely we don't need various broken copies of that in all the
> >> out-of-tree hypervisors.
> >>
> >> Also, I suppose if you have this, we no longer need to excempt CR2.
> >
> > It depends on what you mean by “VMLAUNCH/VMRESUME”. If you only consider the
> > instructions themselves, as Sean did in vmx_vmenter() and vmx_vmexit(),
> > there is no problem. Even if you consider saving the general purpose
> > registers as done in __vmx_vcpu_run() - that’s relatively easy.
>
> __vmx_vcpu_run() is roughly the scope, but that wont work.
>
> Looking at the vmmon source:
>
> Task_Switch()
>
> 1) Mask all APIC LVTs which have NMI delivery mode enabled, e.g. PERF
>
> 2) Disable interrupts
>
> 3) Disable PEBS
>
> 4) Disable PT
>
> 5) Load a magic IDT
>
> According to comments these are stubs to catch any exception which
> happens while switching over.
>
> 6) Write CR0 and CR4 directly which is "safe" as the the IDT is
> redirected to the monitor stubs.
>
> 7) VMXON()
>
> 8) Invoke monitor on some magic page which switches CR3 and GDT and
> clears CR4.PCIDE (at least thats what the comments claim)
>
> The monitor code is loaded from a binary only blob and that does
> the actual vmlaunch/vmresume ...
From what I understand (never looked at the code), is that this binary blob
is the same for Windows and Apple. It's basically its own operating system
that does all the work and vmmon is the way to switch to and from it. When
this blob gets an interrupt that it doesn't know about, it assumes it
belongs to the operating system its sharing the machine with and exits back
to it, whether that's Linux, Windows or OSX.
It's not too unlike what jailhouse does with its hypervisor, to take over
the machine and place the running Linux into its own "cell", except that it
will switch full control of the machine back to Linux.
-- Steve
>
> And as this runs with a completely different CR3 sharing that
> code is impossible.
>
> When returning the above is undone in reverse order and any catched
> exceptions / interrupts are replayed via "int $NR".
>
> So it's pretty much the same mess as with vbox just different and
> binary. Oh well...
>
> The "good" news is that it's not involved in any of the context tracking
> stuff so RCU wont ever be affected when a vmware vCPU runs. It's not
> pretty, but TBH I don't care.
>
> Thanks,
>
> tglx
>
> On Apr 9, 2020, at 3:18 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 09 Apr 2020 23:13:22 +0200
> Thomas Gleixner <[email protected]> wrote:
>>
>> __vmx_vcpu_run() is roughly the scope, but that wont work.
>>
>> Looking at the vmmon source:
>>
>>
[ Snip ]
>>
>> 8) Invoke monitor on some magic page which switches CR3 and GDT and
>> clears CR4.PCIDE (at least thats what the comments claim)
>>
>> The monitor code is loaded from a binary only blob and that does
>> the actual vmlaunch/vmresume ...
>
> From what I understand (never looked at the code), is that this binary blob
> is the same for Windows and Apple. It's basically its own operating system
> that does all the work and vmmon is the way to switch to and from it. When
> this blob gets an interrupt that it doesn't know about, it assumes it
> belongs to the operating system its sharing the machine with and exits back
> to it, whether that's Linux, Windows or OSX.
>
> It's not too unlike what jailhouse does with its hypervisor, to take over
> the machine and place the running Linux into its own "cell", except that it
> will switch full control of the machine back to Linux.
>
> -- Steve
>
>
>> And as this runs with a completely different CR3 sharing that
>> code is impossible.
To complement Steven’s response, we are fully aware in VMware about this
issue since the patches were sent, and the hypervisor developers consider
how to address it. Indeed, vmmon does something similar to VirtualBox and
Jailhouse. The planned restriction on the creation of executable mappings by
modules would require changes in all of these hypervisors.
Actually, IIUC, even Mircosoft Hyper-V might be broken by the proposed
changes that prevent the creation of executable mappings. It may appear as
if there is no issue since the __vmalloc() that hyperv_init() uses in the
kernel to setup the hypercall pages will be changed into vmalloc_exec() [1].
But, AFAICT Microsoft can also setup the hypercall page in a module as part
of its Linux Integration Services. This code uses
__vmalloc(…PAGE_KERNEL_RX), and would therefore be broken. [2]
It seems to me that the proposed restrictions on which instructions a module
is allowed to run are completely new, have significant implications, and
therefore would hopefully not be rushed in. Alternatively, a transition
period in which the checks only trigger warnings would be beneficial.
I am not a developer of the VMware hypervisor, but my understanding is that
developers have every intention to fully comply with the new restrictions on
memory mappings and instructions that a module is allowed to run. They can
consider sharing VM-entry/exit code with KVM even if requires code changes
for VMware. They just ask for sufficient time to make the required
adaptations. It would help to have a reasonable timeline for when each
of the proposed changes (__vmalloc(), MOV-CR/DR, other instructions)
is expected to be merged.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://github.com/LIS/lis-next/blob/master/hv-rhel7.x/hv/arch/x86/hyperv/hv_init.c#L187