2024-02-06 12:04:52

by Sumit Gupta

Subject: [Patch] memory: tegra: Skip SID override from Guest VM

MC SID register access is restricted for Guest VM.
So, skip the SID override programming from the Guest VM.

Signed-off-by: Sumit Gupta <[email protected]>
---
drivers/memory/tegra/tegra186.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
index 1b3183951bfe..df441896b69d 100644
--- a/drivers/memory/tegra/tegra186.c
+++ b/drivers/memory/tegra/tegra186.c
@@ -10,6 +10,7 @@
#include <linux/of.h>
#include <linux/of_platform.h>
#include <linux/platform_device.h>
+#include <asm/virt.h>

#include <soc/tegra/mc.h>

@@ -118,6 +119,11 @@ static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
unsigned int i, index = 0;
u32 sid;

+ if (!is_kernel_in_hyp_mode()) {
+ dev_dbg(mc->dev, "Register access not allowed\n");
+ return 0;
+ }
+
if (!tegra_dev_iommu_get_stream_id(dev, &sid))
return 0;

@@ -146,6 +152,11 @@ static int tegra186_mc_resume(struct tegra_mc *mc)
#if IS_ENABLED(CONFIG_IOMMU_API)
unsigned int i;

+ if (!is_kernel_in_hyp_mode()) {
+ dev_dbg(mc->dev, "Register access not allowed\n");
+ return 0;
+ }
+
for (i = 0; i < mc->soc->num_clients; i++) {
const struct tegra_mc_client *client = &mc->soc->clients[i];

--
2.17.1



2024-02-06 12:07:02

by Krzysztof Kozlowski

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On 06/02/2024 12:48, Sumit Gupta wrote:
> MC SID register access is restricted for Guest VM.
> So, skip the SID override programming from the Guest VM.
>
> Signed-off-by: Sumit Gupta <[email protected]>
> ---
> drivers/memory/tegra/tegra186.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
> index 1b3183951bfe..df441896b69d 100644
> --- a/drivers/memory/tegra/tegra186.c
> +++ b/drivers/memory/tegra/tegra186.c
> @@ -10,6 +10,7 @@
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/platform_device.h>
> +#include <asm/virt.h>

Are you sure this still compile tests?

>
> #include <soc/tegra/mc.h>
>
> @@ -118,6 +119,11 @@ static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
> unsigned int i, index = 0;
> u32 sid;
>
> + if (!is_kernel_in_hyp_mode()) {
> + dev_dbg(mc->dev, "Register access not allowed\n");

Doesn't this depend on hypervisor?

> + return 0;
> + }
> +
> if (!tegra_dev_iommu_get_stream_id(dev, &sid))
> return 0;

Best regards,
Krzysztof


2024-02-06 12:19:41

by Mark Rutland

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On Tue, Feb 06, 2024 at 05:18:52PM +0530, Sumit Gupta wrote:
> MC SID register access is restricted for Guest VM.
> So, skip the SID override programming from the Guest VM.
>
> Signed-off-by: Sumit Gupta <[email protected]>

Surely this is probed from the DT?

Why would the hypervisor put this in the guest's DT if that hypervisor isn't
exposing it to the guest?

Mark.

> ---
> drivers/memory/tegra/tegra186.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
> index 1b3183951bfe..df441896b69d 100644
> --- a/drivers/memory/tegra/tegra186.c
> +++ b/drivers/memory/tegra/tegra186.c
> @@ -10,6 +10,7 @@
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/platform_device.h>
> +#include <asm/virt.h>
>
> #include <soc/tegra/mc.h>
>
> @@ -118,6 +119,11 @@ static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
> unsigned int i, index = 0;
> u32 sid;
>
> + if (!is_kernel_in_hyp_mode()) {
> + dev_dbg(mc->dev, "Register access not allowed\n");
> + return 0;
> + }
> +
> if (!tegra_dev_iommu_get_stream_id(dev, &sid))
> return 0;
>
> @@ -146,6 +152,11 @@ static int tegra186_mc_resume(struct tegra_mc *mc)
> #if IS_ENABLED(CONFIG_IOMMU_API)
> unsigned int i;
>
> + if (!is_kernel_in_hyp_mode()) {
> + dev_dbg(mc->dev, "Register access not allowed\n");
> + return 0;
> + }
> +
> for (i = 0; i < mc->soc->num_clients; i++) {
> const struct tegra_mc_client *client = &mc->soc->clients[i];
>
> --
> 2.17.1
>

2024-02-06 12:31:57

by Marc Zyngier

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

Hi Sumit,

On Tue, 06 Feb 2024 11:48:52 +0000,
Sumit Gupta <[email protected]> wrote:
>
> MC SID register access is restricted for Guest VM.
> So, skip the SID override programming from the Guest VM.
>
> Signed-off-by: Sumit Gupta <[email protected]>
> ---
> drivers/memory/tegra/tegra186.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
> index 1b3183951bfe..df441896b69d 100644
> --- a/drivers/memory/tegra/tegra186.c
> +++ b/drivers/memory/tegra/tegra186.c
> @@ -10,6 +10,7 @@
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/platform_device.h>
> +#include <asm/virt.h>
>
> #include <soc/tegra/mc.h>
>
> @@ -118,6 +119,11 @@ static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
> unsigned int i, index = 0;
> u32 sid;
>
> + if (!is_kernel_in_hyp_mode()) {
> + dev_dbg(mc->dev, "Register access not allowed\n");
> + return 0;
> + }
> +
> if (!tegra_dev_iommu_get_stream_id(dev, &sid))
> return 0;
>
> @@ -146,6 +152,11 @@ static int tegra186_mc_resume(struct tegra_mc *mc)
> #if IS_ENABLED(CONFIG_IOMMU_API)
> unsigned int i;
>
> + if (!is_kernel_in_hyp_mode()) {
> + dev_dbg(mc->dev, "Register access not allowed\n");
> + return 0;
> + }
> +
> for (i = 0; i < mc->soc->num_clients; i++) {
> const struct tegra_mc_client *client = &mc->soc->clients[i];
>

This doesn't look right. Multiple reasons:

- This helper really has nothing to do in a driver. This is
architectural stuff that is not intended for use outside of arch
core code.

- My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
helper will always return 'false'. How could this result in
something that still works? Can I get a free CPU upgrade?

- If you assign this device to a VM and that the hypervisor doesn't
correctly virtualise it, then it is a different device and you
should simply advertise it something else. Or even better, fix your
hypervisor.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2024-02-06 12:34:18

by Jon Hunter

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

Hi Marc,

On 06/02/2024 12:17, Marc Zyngier wrote:
> Hi Sumit,
>
> On Tue, 06 Feb 2024 11:48:52 +0000,
> Sumit Gupta <[email protected]> wrote:
>>
>> MC SID register access is restricted for Guest VM.
>> So, skip the SID override programming from the Guest VM.
>>
>> Signed-off-by: Sumit Gupta <[email protected]>
>> ---
>> drivers/memory/tegra/tegra186.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
>> index 1b3183951bfe..df441896b69d 100644
>> --- a/drivers/memory/tegra/tegra186.c
>> +++ b/drivers/memory/tegra/tegra186.c
>> @@ -10,6 +10,7 @@
>> #include <linux/of.h>
>> #include <linux/of_platform.h>
>> #include <linux/platform_device.h>
>> +#include <asm/virt.h>
>>
>> #include <soc/tegra/mc.h>
>>
>> @@ -118,6 +119,11 @@ static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
>> unsigned int i, index = 0;
>> u32 sid;
>>
>> + if (!is_kernel_in_hyp_mode()) {
>> + dev_dbg(mc->dev, "Register access not allowed\n");
>> + return 0;
>> + }
>> +
>> if (!tegra_dev_iommu_get_stream_id(dev, &sid))
>> return 0;
>>
>> @@ -146,6 +152,11 @@ static int tegra186_mc_resume(struct tegra_mc *mc)
>> #if IS_ENABLED(CONFIG_IOMMU_API)
>> unsigned int i;
>>
>> + if (!is_kernel_in_hyp_mode()) {
>> + dev_dbg(mc->dev, "Register access not allowed\n");
>> + return 0;
>> + }
>> +
>> for (i = 0; i < mc->soc->num_clients; i++) {
>> const struct tegra_mc_client *client = &mc->soc->clients[i];
>>
>
> This doesn't look right. Multiple reasons:
>
> - This helper really has nothing to do in a driver. This is
> architectural stuff that is not intended for use outside of arch
> core code.

We see a few other drivers using this ...

drivers/perf/arm_pmuv3.c
drivers/clocksource/arm_arch_timer.c
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
drivers/hwtracing/coresight/coresight-etm4x-core.c
drivers/hwtracing/coresight/coresight-etm4x-core.c
drivers/irqchip/irq-apple-aic.c

We were looking for a way to determine if the OS is a guest OS or not.
However, I can see that this is a ARM64 specific API and so probably the
above are only compiled for ARM64. Interestingly, the AMD driver
implements the following ...

static inline bool is_virtual_machine(void)
{
#if defined(CONFIG_X86)
return boot_cpu_has(X86_FEATURE_HYPERVISOR);
#elif defined(CONFIG_ARM64)
return !is_kernel_in_hyp_mode();
#else
return false;
#endif
}

> - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> helper will always return 'false'. How could this result in
> something that still works? Can I get a free CPU upgrade?

I thought this API just checks to see if we are in EL2?

> - If you assign this device to a VM and that the hypervisor doesn't
> correctly virtualise it, then it is a different device and you
> should simply advertise it something else. Or even better, fix your
> hypervisor.

Sumit can add some more details on why we don't completely disable the
device for guest OSs.

Jon

--
nvpublic

2024-02-06 12:53:12

by Marc Zyngier

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

Hi Jon,

On Tue, 06 Feb 2024 12:28:27 +0000,
Jon Hunter <[email protected]> wrote:
>
> Hi Marc,
>
> On 06/02/2024 12:17, Marc Zyngier wrote:
> > Hi Sumit,
> >
> > On Tue, 06 Feb 2024 11:48:52 +0000,
> > Sumit Gupta <[email protected]> wrote:
> >>
> >> MC SID register access is restricted for Guest VM.
> >> So, skip the SID override programming from the Guest VM.
> >>
> >> Signed-off-by: Sumit Gupta <[email protected]>
> >> ---
> >> drivers/memory/tegra/tegra186.c | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >>
> >> diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
> >> index 1b3183951bfe..df441896b69d 100644
> >> --- a/drivers/memory/tegra/tegra186.c
> >> +++ b/drivers/memory/tegra/tegra186.c
> >> @@ -10,6 +10,7 @@
> >> #include <linux/of.h>
> >> #include <linux/of_platform.h>
> >> #include <linux/platform_device.h>
> >> +#include <asm/virt.h>
> >> #include <soc/tegra/mc.h>
> >> @@ -118,6 +119,11 @@ static int tegra186_mc_probe_device(struct
> >> tegra_mc *mc, struct device *dev)
> >> unsigned int i, index = 0;
> >> u32 sid;
> >> + if (!is_kernel_in_hyp_mode()) {
> >> + dev_dbg(mc->dev, "Register access not allowed\n");
> >> + return 0;
> >> + }
> >> +
> >> if (!tegra_dev_iommu_get_stream_id(dev, &sid))
> >> return 0;
> >> @@ -146,6 +152,11 @@ static int tegra186_mc_resume(struct
> >> tegra_mc *mc)
> >> #if IS_ENABLED(CONFIG_IOMMU_API)
> >> unsigned int i;
> >> + if (!is_kernel_in_hyp_mode()) {
> >> + dev_dbg(mc->dev, "Register access not allowed\n");
> >> + return 0;
> >> + }
> >> +
> >> for (i = 0; i < mc->soc->num_clients; i++) {
> >> const struct tegra_mc_client *client = &mc->soc->clients[i];
> >>
> >
> > This doesn't look right. Multiple reasons:
> >
> > - This helper really has nothing to do in a driver. This is
> > architectural stuff that is not intended for use outside of arch
> > core code.
>
> We see a few other drivers using this ...
>
> drivers/perf/arm_pmuv3.c
> drivers/clocksource/arm_arch_timer.c

These two are definitely part of the CPU architecture.

> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h

This is just a bug. Please don't copy this stuff.

> drivers/hwtracing/coresight/coresight-etm4x-core.c
> drivers/hwtracing/coresight/coresight-etm4x-core.c
> drivers/irqchip/irq-apple-aic.c

These are also part of the CPU architecture.

>
> We were looking for a way to determine if the OS is a guest OS or
> not. However, I can see that this is a ARM64 specific API and so
> probably the above are only compiled for ARM64. Interestingly, the AMD
> driver implements the following ...
>
> static inline bool is_virtual_machine(void)
> {
> #if defined(CONFIG_X86)
> return boot_cpu_has(X86_FEATURE_HYPERVISOR);
> #elif defined(CONFIG_ARM64)
> return !is_kernel_in_hyp_mode();
> #else
> return false;
> #endif
> }

This stuff should simply be ripped out and burned. Whoever wrote it
didn't understand how this works.

>
> > - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> > helper will always return 'false'. How could this result in
> > something that still works? Can I get a free CPU upgrade?
>
> I thought this API just checks to see if we are in EL2?

It does. And that's the problem. On ARMv8.0, we run the Linux kernel
at EL1. Tegra186 is ARMv8.0 (Denver + A57). So as written, this change
breaks the very platform it intends to support.

>
> > - If you assign this device to a VM and that the hypervisor doesn't
> > correctly virtualise it, then it is a different device and you
> > should simply advertise it something else. Or even better, fix your
> > hypervisor.
>
> Sumit can add some more details on why we don't completely disable the
> device for guest OSs.

It's not about disabling it. It is about correctly supporting it
(providing full emulation for it), or advertising it as something
different so that SW can handle it differently.

Poking into the internals of how the kernel is booted for a driver
that isn't tied to the core architecture (because it would need to
access system registers, for example) is not an acceptable outcome.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2024-02-06 13:03:38

by Marc Zyngier

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On Tue, 06 Feb 2024 12:51:35 +0000,
Jon Hunter <[email protected]> wrote:
>
>
> On 06/02/2024 12:28, Jon Hunter wrote:
>
> ...
>
> >> - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> >>    helper will always return 'false'. How could this result in
> >>    something that still works? Can I get a free CPU upgrade?
> >
> > I thought this API just checks to see if we are in EL2?
>
>
> Sorry to add a bit more info, I see EL2 is used for hypervisor [0],
> but on my Tegra186 with no hypervisor I see ...
>
> CPU: All CPU(s) started at EL2

Yes, and yet the kernel runs at EL1 (on ARMv8.0, we can't run the
kernel at EL2 at all).

M.

--
Without deviation from the norm, progress is not possible.
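
To illustrate the point made in the messages above, here is a minimal user-space sketch (not kernel code) of why "kernel not at EL2" cannot stand in for "running as a guest". The type and function names are hypothetical and only model the behaviour described in the thread: the check is true only when the kernel itself runs at EL2, and on ARMv8.0 parts such as Tegra186 the kernel runs at EL1 even when the CPUs started at EL2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Exception level the kernel itself executes at (hypothetical model type). */
enum kernel_el { KERNEL_AT_EL1 = 1, KERNEL_AT_EL2 = 2 };

struct boot_config {
	const char *name;
	enum kernel_el el;	/* where the kernel runs, not where the CPUs started */
	bool is_guest;		/* ground truth the patch is trying to detect */
};

/* Model of the helper as described in the thread: true only at EL2. */
bool model_kernel_in_hyp_mode(const struct boot_config *cfg)
{
	assert(cfg != NULL);
	return cfg->el == KERNEL_AT_EL2;
}

/* The heuristic used by the patch: "not in hyp mode" is read as "guest". */
bool patch_thinks_guest(const struct boot_config *cfg)
{
	return !model_kernel_in_hyp_mode(cfg);
}

/* True when the heuristic agrees with the ground truth for this config. */
bool heuristic_is_correct(const struct boot_config *cfg)
{
	return patch_thinks_guest(cfg) == cfg->is_guest;
}

/* ARMv8.0 without VHE: CPUs start at EL2, but the kernel runs at EL1. */
const struct boot_config tegra186_bare_metal = {
	"Tegra186 (ARMv8.0), no hypervisor", KERNEL_AT_EL1, false
};

/* VHE-capable host with the kernel running at EL2. */
const struct boot_config vhe_host = {
	"VHE host, kernel at EL2", KERNEL_AT_EL2, false
};

/* Guest VM: the kernel runs at EL1 under a hypervisor. */
const struct boot_config guest_vm = {
	"Guest VM", KERNEL_AT_EL1, true
};
```

In this model the heuristic is right for the VHE host and for the guest, but it classifies bare-metal Tegra186 as a guest, so the SID override programming would be skipped on the very platform the driver is named after.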

2024-02-06 14:07:23

by Thierry Reding

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On Tue Feb 6, 2024 at 1:53 PM CET, Marc Zyngier wrote:
> On Tue, 06 Feb 2024 12:28:27 +0000, Jon Hunter <[email protected]> wrote:
> > On 06/02/2024 12:17, Marc Zyngier wrote:
[...]
> > > - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> > > helper will always return 'false'. How could this result in
> > > something that still works? Can I get a free CPU upgrade?
> >
> > I thought this API just checks to see if we are in EL2?
>
> It does. And that's the problem. On ARMv8.0, we run the Linux kernel
> at EL1. Tegra186 is ARMv8.0 (Denver + A57). So as written, this change
> breaks the very platform it intends to support.

To clarify, the code that accesses these registers is shared across
Tegra186 and later chips. Tegra194 and later do support ARMv8.1 VHE.

Granted, if it always returns false on Tegra186 that's not what we
want.

> > > - If you assign this device to a VM and that the hypervisor doesn't
> > > correctly virtualise it, then it is a different device and you
> > > should simply advertise it something else. Or even better, fix your
> > > hypervisor.
> >
> > Sumit can add some more details on why we don't completely disable the
> > device for guest OSs.
>
> It's not about disabling it. It is about correctly supporting it
> (providing full emulation for it), or advertising it as something
> different so that SW can handle it differently.

It's really not a different device. It's exactly the same device except
that accessing some registers isn't permitted. We also can't easily
remove parts of the register region from device tree because these are
intermixed with other registers that we do want access to.

> Poking into the internals of how the kernel is booted for a driver
> that isn't tied to the core architecture (because it would need to
> access system registers, for example) is not an acceptable outcome.

So what would be the better option? Use a different compatible string to
make the driver handle the device differently? Or adding a custom
property to the device tree node to mark this as running in a
virtualized environment?

Perhaps we can reuse the top-level hypervisor node? That seems to only
ever have been used for Xen on 32-bit ARM, so not sure if that'd still
be appropriate.

Thierry



2024-02-06 15:28:31

by Marc Zyngier

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On Tue, 06 Feb 2024 14:07:10 +0000,
"Thierry Reding" <[email protected]> wrote:
>
> On Tue Feb 6, 2024 at 1:53 PM CET, Marc Zyngier wrote:
> > On Tue, 06 Feb 2024 12:28:27 +0000, Jon Hunter <[email protected]> wrote:
> > > On 06/02/2024 12:17, Marc Zyngier wrote:
> [...]
> > > > - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> > > > helper will always return 'false'. How could this result in
> > > > something that still works? Can I get a free CPU upgrade?
> > >
> > > I thought this API just checks to see if we are in EL2?
> >
> > It does. And that's the problem. On ARMv8.0, we run the Linux kernel
> > at EL1. Tegra186 is ARMv8.0 (Denver + A57). So as written, this change
> > breaks the very platform it intends to support.
>
> To clarify, the code that accesses these registers is shared across
> Tegra186 and later chips. Tegra194 and later do support ARMv8.1 VHE.

But even on these machines that are VHE-capable, not running at EL2
doesn't mean we're running as a guest. The user can force the kernel
to stick to EL1, using a command-line option such as kvm-arm.mode=nvhe
which will force the kernel to stay at EL1 while deploying KVM at EL2.

> Granted, if it always returns false on Tegra186 that's not what we
> want.

I'm glad we agree here.

> > > > - If you assign this device to a VM and that the hypervisor doesn't
> > > > correctly virtualise it, then it is a different device and you
> > > > should simply advertise it something else. Or even better, fix your
> > > > hypervisor.
> > >
> > > Sumit can add some more details on why we don't completely disable the
> > > device for guest OSs.
> >
> > It's not about disabling it. It is about correctly supporting it
> > (providing full emulation for it), or advertising it as something
> > different so that SW can handle it differently.
>
> It's really not a different device. It's exactly the same device except
> that accessing some registers isn't permitted. We also can't easily
> remove parts of the register region from device tree because these are
> intermixed with other registers that we do want access to.

But that's the definition of being a different device. It has a
different programming interface, hence it is different. The fact that
it is the same HW block mediated by a hypervisor doesn't really change
that.

> > Poking into the internals of how the kernel is booted for a driver
> > that isn't tied to the core architecture (because it would need to
> > access system registers, for example) is not an acceptable outcome.
>
> So what would be the better option? Use a different compatible string to
> make the driver handle the device differently? Or adding a custom
> property to the device tree node to mark this as running in a
> virtualized environment?

A different compatible string would be my preferred option. An extra
property would work as well. As far as I am concerned, these two
options are the right way to express the fact that you have something
that isn't quite like the real thing.

> Perhaps we can reuse the top-level hypervisor node? That seems to only
> ever have been used for Xen on 32-bit ARM, so not sure if that'd still
> be appropriate.

I'd shy away from this. You would be deriving properties from a
hypervisor implementation, instead of expressing those properties
directly. In my experience, the direct method is always preferable.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.
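
As a rough illustration of the "different compatible string" option preferred in the message above, the following user-space sketch models how a driver could key its behaviour on per-variant match data instead of inspecting how the kernel was booted. This is an assumption-laden sketch, not a proposed binding: the "-guest" compatible string, the `mc_variant` structure and both functions are hypothetical, and a real driver would attach such data through its device tree match table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Per-variant data, in the spirit of the data attached to a match entry. */
struct mc_variant {
	const char *compatible;
	bool can_program_sid_override;
};

const struct mc_variant mc_variants[] = {
	{ "nvidia,tegra186-mc", true },
	/* Hypothetical string for the restricted, hypervisor-mediated view. */
	{ "nvidia,tegra186-mc-guest", false },
};

/* Look up the variant for a compatible string; NULL if it is unknown. */
const struct mc_variant *mc_match(const char *compatible)
{
	size_t i;

	assert(compatible != NULL);

	for (i = 0; i < sizeof(mc_variants) / sizeof(mc_variants[0]); i++) {
		if (strcmp(mc_variants[i].compatible, compatible) == 0)
			return &mc_variants[i];
	}

	return NULL;
}

/*
 * Model of the probe path: returns how many SID override registers would
 * be written. The restricted variant skips the programming entirely.
 */
unsigned int mc_probe_device(const struct mc_variant *variant,
			     unsigned int num_overrides)
{
	assert(variant != NULL);

	if (!variant->can_program_sid_override)
		return 0;

	return num_overrides;
}
```

The same shape works for the "extra property" option: the boolean would then be filled in from a property on the device node rather than from the matched string. Either way, the restriction is described by firmware and does not depend on the exception level the kernel happens to run at.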

2024-02-06 17:09:14

by Thierry Reding

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On Tue Feb 6, 2024 at 3:54 PM CET, Marc Zyngier wrote:
> On Tue, 06 Feb 2024 14:07:10 +0000,
> "Thierry Reding" <[email protected]> wrote:
> >
> > On Tue Feb 6, 2024 at 1:53 PM CET, Marc Zyngier wrote:
> > > On Tue, 06 Feb 2024 12:28:27 +0000, Jon Hunter <[email protected]> wrote:
> > > > On 06/02/2024 12:17, Marc Zyngier wrote:
> > [...]
> > > > > - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> > > > > helper will always return 'false'. How could this result in
> > > > > something that still works? Can I get a free CPU upgrade?
> > > >
> > > > I thought this API just checks to see if we are in EL2?
> > >
> > > It does. And that's the problem. On ARMv8.0, we run the Linux kernel
> > > at EL1. Tegra186 is ARMv8.0 (Denver + A57). So as written, this change
> > > breaks the very platform it intends to support.
> >
> > To clarify, the code that accesses these registers is shared across
> > Tegra186 and later chips. Tegra194 and later do support ARMv8.1 VHE.
>
> But even on these machines that are VHE-capable, not running at EL2
> doesn't mean we're running as a guest. The user can force the kernel
> to stick to EL1, using a command-line option such as kvm-arm.mode=nvhe
> which will force the kernel to stay at EL1 while deploying KVM at EL2.
>
> > Granted, if it always returns false on Tegra186 that's not what we
> > want.
>
> I'm glad we agree here.
>
> > > > > - If you assign this device to a VM and that the hypervisor doesn't
> > > > > correctly virtualise it, then it is a different device and you
> > > > > should simply advertise it something else. Or even better, fix your
> > > > > hypervisor.
> > > >
> > > > Sumit can add some more details on why we don't completely disable the
> > > > device for guest OSs.
> > >
> > > It's not about disabling it. It is about correctly supporting it
> > > (providing full emulation for it), or advertising it as something
> > > different so that SW can handle it differently.
> >
> > It's really not a different device. It's exactly the same device except
> > that accessing some registers isn't permitted. We also can't easily
> > remove parts of the register region from device tree because these are
> > intermixed with other registers that we do want access to.
>
> But that's the definition of being a different device. It has a
> different programming interface, hence it is different. The fact that
> it is the same HW block mediated by a hypervisor doesn't really change
> that.

The programming model isn't really different in these cases, but rather
restricted. I think a compatible string is a suboptimal way to describe
this.

> > > Poking into the internals of how the kernel is booted for a driver
> > > that isn't tied to the core architecture (because it would need to
> > > access system registers, for example) is not an acceptable outcome.
> >
> > So what would be the better option? Use a different compatible string to
> > make the driver handle the device differently? Or adding a custom
> > property to the device tree node to mark this as running in a
> > virtualized environment?
>
> A different compatible string would be my preferred option. An extra
> property would work as well. As far as I am concerned, these two
> options are the right way to express the fact that you have something
> that isn't quite like the real thing.

Coincidentally there's another discussion with a lot of similarities
regarding simulated platforms. For these it's usually less about the
register set being restricted and more about certain quirks that are
needed which will not ultimately be necessary for silicon.

This could be a timeout that's longer in simulation, or it could be
certain programming that would be needed in silicon but isn't necessary
or functional in simulation (think I/O calibration, that sort of thing).
One could argue that these are also different devices when in simulation
but they really aren't. They're more like an approximation of the actual
device that will be in silicon chips.

Another problem that both of the cases have in common is that they are
parameters that usually apply to the entire system. For some devices it
is easier to parameterize via DT (for example for certain devices we
have bindings with special register regions that are only available in
host OS mode), but for others this may not be true. Adding extra
compatible strings for virtualization/simulation is going to get quite
complex very quickly if we need to differentiate between all of these
scenarios.

> > Perhaps we can reuse the top-level hypervisor node? That seems to only
> > ever have been used for Xen on 32-bit ARM, so not sure if that'd still
> > be appropriate.
>
> I'd shy away from this. You would be deriving properties from a
> hypervisor implementation, instead of expressing those properties
> directly. In my experience, the direct method is always preferable.

I would generally agree. However, I think especially the compatible
string solution could turn very ugly for this. If we express these
properties via compatible strings we may very well end up with many
different compatible strings to cover all cases.

Say you've got one hypervisor that changes the programming model in a
certain way and a second hypervisor that constrains in a different way.
Do we now need one compatible string for each hypervisor? Do we add
compatible strings for each restriction and have potentially very long
compatible string lists? Separate properties would work slightly better
for this.

There are some cases where we can use register contents to determine
what the OS is allowed to do, but these registers don't exist for all HW
blocks. We may be able to get more added to new chips, but we obviously
can't retroactively add them for existing ones.

A central node or property would at least allow broad parameterization.
I would hope that at least hypervisor implementations don't vary too
much in terms of what they restrict and what they don't, so perhaps it
wouldn't be that bad. Perhaps that's also overly optimistic.

Thierry



2024-02-07 12:03:54

by Marc Zyngier

Subject: Re: [Patch] memory: tegra: Skip SID override from Guest VM

On Tue, 06 Feb 2024 17:08:42 +0000,
"Thierry Reding" <[email protected]> wrote:
>
> On Tue Feb 6, 2024 at 3:54 PM CET, Marc Zyngier wrote:
> > On Tue, 06 Feb 2024 14:07:10 +0000,
> > "Thierry Reding" <[email protected]> wrote:
> > >
> > > On Tue Feb 6, 2024 at 1:53 PM CET, Marc Zyngier wrote:
> > > > On Tue, 06 Feb 2024 12:28:27 +0000, Jon Hunter <[email protected]> wrote:
> > > > > On 06/02/2024 12:17, Marc Zyngier wrote:
> > > [...]
> > > > > > - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> > > > > > helper will always return 'false'. How could this result in
> > > > > > something that still works? Can I get a free CPU upgrade?
> > > > >
> > > > > I thought this API just checks to see if we are in EL2?
> > > >
> > > > It does. And that's the problem. On ARMv8.0, we run the Linux kernel
> > > > at EL1. Tegra186 is ARMv8.0 (Denver + A57). So as written, this change
> > > > breaks the very platform it intends to support.
> > >
> > > To clarify, the code that accesses these registers is shared across
> > > Tegra186 and later chips. Tegra194 and later do support ARMv8.1 VHE.
> >
> > But even on these machines that are VHE-capable, not running at EL2
> > doesn't mean we're running as a guest. The user can force the kernel
> > to stick to EL1, using a command-line option such as kvm-arm.mode=nvhe
> > which will force the kernel to stay at EL1 while deploying KVM at EL2.
> >
> > > Granted, if it always returns false on Tegra186 that's not what we
> > > want.
> >
> > I'm glad we agree here.
> >
> > > > > > - If you assign this device to a VM and that the hypervisor doesn't
> > > > > > correctly virtualise it, then it is a different device and you
> > > > > > should simply advertise it something else. Or even better, fix your
> > > > > > hypervisor.
> > > > >
> > > > > Sumit can add some more details on why we don't completely disable the
> > > > > device for guest OSs.
> > > >
> > > > It's not about disabling it. It is about correctly supporting it
> > > > (providing full emulation for it), or advertising it as something
> > > > different so that SW can handle it differently.
> > >
> > > It's really not a different device. It's exactly the same device except
> > > that accessing some registers isn't permitted. We also can't easily
> > > remove parts of the register region from device tree because these are
> > > intermixed with other registers that we do want access to.
> >
> > But that's the definition of being a different device. It has a
> > different programming interface, hence it is different. The fact that
> > it is the same HW block mediated by a hypervisor doesn't really change
> > that.
>
> The programming model isn't really different in these cases, but rather
> restricted. I think a compatible string is a suboptimal way to describe
> this.

It *is* different. If it wasn't different, you wouldn't need this
patch. I'm puzzled that we have to argue on *that*. You can call it
restricted, I call it broken. In both cases, it is a *different*
programming interface as you can't use existing SW for it.

>
> > > > Poking into the internals of how the kernel is booted for a driver
> > > > that isn't tied to the core architecture (because it would need to
> > > > access system registers, for example) is not an acceptable outcome.
> > >
> > > So what would be the better option? Use a different compatible string to
> > > make the driver handle the device differently? Or adding a custom
> > > property to the device tree node to mark this as running in a
> > > virtualized environment?
> >
> > A different compatible string would be my preferred option. An extra
> > property would work as well. As far as I am concerned, these two
> > options are the right way to express the fact that you have something
> > that isn't quite like the real thing.
>
> Coincidentally there's another discussion with a lot of similarities
> regarding simulated platforms. For these it's usually less about the
> register set being restricted and more about certain quirks that are
> needed which will not ultimately be necessary for silicon.
>
> This could be a timeout that's longer in simulation, or it could be
> certain programming that would be needed in silicon but isn't necessary
> or functional in simulation (think I/O calibration, that sort of thing).
> One could argue that these are also different devices when in simulation
> but they really aren't. They're more like an approximation of the actual
> device that will be in silicon chips.

Simulation/DV environments are a very different kettle of fish. You
generally treat passing time with a scaling factor, and you are likely
to run a very hacked-up SW stack anyway.

In any case, this is not relevant to upstream stuff, unless you plan
to ship your emulation environment.

> Another problem that both of the cases have in common is that they are
> parameters that usually apply to the entire system. For some devices it
> is easier to parameterize via DT (for example for certain devices we
> have bindings with special register regions that are only available in
> host OS mode), but for others this may not be true. Adding extra
> compatible strings for virtualization/simulation is going to get quite
> complex very quickly if we need to differentiate between all of these
> scenarios.

That's the price you pay for these inconsistencies. If your "HW" has a
lot of variability and you can't discover its capabilities from
SW, then it is either badly designed, badly implemented, badly emulated,
or any combination thereof.

In any case, you get to keep the pieces.

>
> > > Perhaps we can reuse the top-level hypervisor node? That seems to only
> > > ever have been used for Xen on 32-bit ARM, so not sure if that'd still
> > > be appropriate.
> >
> > I'd shy away from this. You would be deriving properties from a
> > hypervisor implementation, instead of expressing those properties
> > directly. In my experience, the direct method is always preferable.
>
> I would generally agree. However, I think especially the compatible
> string solution could turn very ugly for this. If we express these
> properties via compatible strings we may very well end up with many
> different compatible strings to cover all cases.
>
> Say you've got one hypervisor that changes the programming model in a
> certain way and a second hypervisor that constrains in a different way.
> Do we now need one compatible string for each hypervisor? Do we add
> compatible strings for each restriction and have potentially very long
> compatible string lists? Separate properties would work slightly better
> for this.

Again, the job of a hypervisor is to offer an architecturally correct
view of some HW, emulated or not. If your hypervisors are implementing
a large variety of diverging behaviours, SW needs to be able to
distinguish between those. You can either add properties, use compat
strings, or use a discovery protocol implemented by the device.

In any case, each deviation needs to be uniquely identifiable, and be
described either in FW or by the device itself, if only because Linux
isn't the only game in town.

> There are some cases where we can use register contents to determine
> what the OS is allowed to do, but these registers don't exist for all HW
> blocks. We may be able to get more added to new chips, but we obviously
> can't retroactively add them for existing ones.
>
> A central node or property would at least allow broad parameterization.
> I would hope that at least hypervisor implementations don't vary too
> much in terms of what they restrict and what they don't, so perhaps it
> wouldn't be that bad. Perhaps that's also overly optimistic.

Top level properties are no good unless what they express is forever
immutable and described upfront. Identifying a hypervisor doesn't do
that, and most of the time there will be all sorts of *variable*
properties that need to be further discovered by a mechanism or
another. In my (surely very limited) experience at writing hypervisors
for some time, this eventually becomes an unmaintainable mess.

You are of course free to do that in the drivers you maintain as long
as you don't break my own toys, but I'd urge you to reconsider this
and explore other possibilities.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.