Hi folks,
[tl;dr a change from ~18 months ago breaks Android userspace and I don't
know what to do about it]
As part of the development work for next year's Android, we've recently
been bringing up a 5.15 KVM host and have observed that vsock no longer
works for communicating with a guest because crosvm gets an unexpected
-EFAULT back from the VHOST_VSOCK_SET_RUNNING ioctl():
| E crosvm : vpipe worker thread exited with error: VhostVsockStart(IoctlError(Os { code: 14, kind: Uncategorized, message: "Bad address" }))
The same guest payload works correctly on a 5.10 KVM host kernel.
After some digging, we narrowed this change in behaviour down to
e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
blame. Indeed, our tests once again pass if we revert that patch (there's
a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
but otherwise it reverts cleanly).
On Android, KVM runs in a mode where the host kernel is, by default,
unable to access guest memory [1] and so memory used for virtio (e.g.
the rings and descriptor data) needs to be shared explicitly with the
host using hypercalls issued by the guest. We implement this on top of
restricted DMA [2], whereby the guest allocates a pool of shared memory
during boot and bounces all virtio transfers through this window. In
order to get the guest to use the DMA API for virtio devices (which is
required for the SWIOTLB code to kick in and do the aforementioned
bouncing), crosvm sets the VIRTIO_F_ACCESS_PLATFORM feature flag on its
emulated devices and this is picked up by virtio_has_dma_quirk() in the
guest kernel. This has been working well for us so far.
With e13a6915a03f, the vhost backend for vsock now advertises
VIRTIO_F_ACCESS_PLATFORM in its response to the VHOST_GET_FEATURES
ioctl() and accepts it in the VHOST_SET_FEATURES as a mechanism to
enable the IOTLB feature (note: this is nothing to do with SWIOTLB!).
This feature is used for emulation of a virtual IOMMU and requires
explicit support for issuing IOTLB messages (see VHOST_IOTLB_MSG and
VHOST_IOTLB_MSG_V2) from userspace to manage address translations for
the virtio device.
Looking at how crosvm uses these vhost ioctl()s, we can see:
let avail_features = self
.vhost_handle
.get_features()
.map_err(Error::VhostGetFeatures)?;
let features: c_ulonglong = self.acked_features & avail_features;
self.vhost_handle
.set_features(features)
.map_err(Error::VhostSetFeatures)?;
where 'acked_features' is the feature set negotiated between crosvm and
the guest, while 'avail_features' is the supported feature set
advertised by vhost. The intersection of these now includes
VIRTIO_F_ACCESS_PLATFORM in the 5.15 kernel and so we quietly start
enabling IOTLB, despite not having any of the necessary support in
crosvm and therefore the vsock thread effectively grinds to a halt and
we cannot communicate with the guest.
The fundamental issue is, I think, that VIRTIO_F_ACCESS_PLATFORM is
being used for two very different things within the same device; for the
guest it basically means "use the DMA API, it knows what to do" but for
vhost it very specifically means "enable IOTLB". We've recently had
other problems with this flag [3] but in this case it used to work
reliably and now it doesn't anymore.
So how should we fix this? One possibility is for us to hack crosvm to
clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
but others here have reasonably pointed out that they didn't expect a
kernel change to break userspace. On the flip side, the offending commit
in the kernel isn't exactly new (it's from the end of 2020!) and so it's
likely that others (e.g. QEMU) are using this feature. I also strongly
suspect that vhost net suffers from exactly the same issue, we just
don't happen to be using that (yet) in Android.
Thanks,
Will
[1] https://lwn.net/Articles/836693/
[2] https://lwn.net/Articles/841916/
[3] https://lore.kernel.org/lkml/[email protected]/
On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <[email protected]> wrote:
>
> [tl;dr a change from ~18 months ago breaks Android userspace and I don't
> know what to do about it]
Augh.
I had hoped that android being "closer" to upstream would have meant
that somebody actually tests android with upstream kernels. People
occasionally talk about it, but apparently it's not actually done.
Or maybe it's done onl;y with a very limited android user space.
The whole "we notice that something that happened 18 months ago broke
our environment" is kind of broken.
> After some digging, we narrowed this change in behaviour down to
> e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
> reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
> blame. Indeed, our tests once again pass if we revert that patch (there's
> a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
> but otherwise it reverts cleanly).
I have to say, this smells for *so* many reasons.
Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?
That seems insane, but seems fundamental in that commit e13a6915a03f
("vhost/vsock: add IOTLB API support")
This code
if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
if (vhost_init_device_iotlb(&vsock->dev, true))
goto err;
}
just makes me go "What?" It makes no sense. Why isn't that feature
called something-something-IOTLB?
Can we please just split that flag into two, and have that odd
"platform access" be one bit, and the "enable iommu" be an entirely
different bit?
Now, since clearly nobody runs Android on newer kernels, I do think
that the actual bit number choice should probably be one that makes
the non-android use case binaries continue to work. And then the
android system binaries that use this could maybe be compiled to know
about *both* bits,. and work regardless?
I'm also hoping that maybe Google android people could actually do
some *testing*? I know, that sounds like a lot to ask, but humor me.
Even if the product team runs stuff that is 18 months old, how about
the dev team have a machine or two that actually tests current
kernels, so that it's not a "oh, a few years have passed, and now we
notice that a change doesn't work for us" situation any more.
Is that really too much to ask for a big company like google?
And hey, it's possible that the bit encoding is *so* incestuous that
it's really hard to split it into two. But it really sounds to me like
somebody mindlessly re-used a feature bit for a *completely* different
thing. Why?
Why have feature bits at all, when you then re-use the same bit for
two different features? It kind of seems to defeat the whole purpose.
Linus
Hi Will,
On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
>Hi folks,
>
>[tl;dr a change from ~18 months ago breaks Android userspace and I don't
> know what to do about it]
>
>As part of the development work for next year's Android, we've recently
>been bringing up a 5.15 KVM host and have observed that vsock no longer
>works for communicating with a guest because crosvm gets an unexpected
>-EFAULT back from the VHOST_VSOCK_SET_RUNNING ioctl():
>
> | E crosvm : vpipe worker thread exited with error: VhostVsockStart(IoctlError(Os { code: 14, kind: Uncategorized, message: "Bad address" }))
>
>The same guest payload works correctly on a 5.10 KVM host kernel.
>
>After some digging, we narrowed this change in behaviour down to
>e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
>reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
>blame. Indeed, our tests once again pass if we revert that patch (there's
>a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
>but otherwise it reverts cleanly).
>
>On Android, KVM runs in a mode where the host kernel is, by default,
>unable to access guest memory [1] and so memory used for virtio (e.g.
>the rings and descriptor data) needs to be shared explicitly with the
>host using hypercalls issued by the guest. We implement this on top of
>restricted DMA [2], whereby the guest allocates a pool of shared memory
>during boot and bounces all virtio transfers through this window. In
>order to get the guest to use the DMA API for virtio devices (which is
>required for the SWIOTLB code to kick in and do the aforementioned
>bouncing), crosvm sets the VIRTIO_F_ACCESS_PLATFORM feature flag on its
>emulated devices and this is picked up by virtio_has_dma_quirk() in the
>guest kernel. This has been working well for us so far.
>
>With e13a6915a03f, the vhost backend for vsock now advertises
>VIRTIO_F_ACCESS_PLATFORM in its response to the VHOST_GET_FEATURES
>ioctl() and accepts it in the VHOST_SET_FEATURES as a mechanism to
>enable the IOTLB feature (note: this is nothing to do with SWIOTLB!).
>This feature is used for emulation of a virtual IOMMU and requires
>explicit support for issuing IOTLB messages (see VHOST_IOTLB_MSG and
>VHOST_IOTLB_MSG_V2) from userspace to manage address translations for
>the virtio device.
>
>Looking at how crosvm uses these vhost ioctl()s, we can see:
>
> let avail_features = self
> .vhost_handle
> .get_features()
> .map_err(Error::VhostGetFeatures)?;
>
> let features: c_ulonglong = self.acked_features & avail_features;
> self.vhost_handle
> .set_features(features)
> .map_err(Error::VhostSetFeatures)?;
>
>where 'acked_features' is the feature set negotiated between crosvm and
>the guest, while 'avail_features' is the supported feature set
>advertised by vhost. The intersection of these now includes
>VIRTIO_F_ACCESS_PLATFORM in the 5.15 kernel and so we quietly start
>enabling IOTLB, despite not having any of the necessary support in
>crosvm and therefore the vsock thread effectively grinds to a halt and
>we cannot communicate with the guest.
>
>The fundamental issue is, I think, that VIRTIO_F_ACCESS_PLATFORM is
>being used for two very different things within the same device; for the
>guest it basically means "use the DMA API, it knows what to do" but for
>vhost it very specifically means "enable IOTLB". We've recently had
>other problems with this flag [3] but in this case it used to work
>reliably and now it doesn't anymore.
>
>So how should we fix this? One possibility is for us to hack crosvm to
>clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost
Why do you consider this a hack?
If the VMM implements the translation feature, it is right in my opinion
that it does not enable the feature for the vhost device. Otherwise, if
it wants the vhost device to do the translation, enable the feature and
send the IOTLB messages to set the translation.
QEMU for example masks features when not required or supported.
crosvm should negotiate only the features it supports.
@Michael and @Jason can correct me, but if a vhost device negotiates
VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages
to set the translation.
Indeed that patch was to fix
https://bugzilla.redhat.com/show_bug.cgi?id=1894101
>features,
>but others here have reasonably pointed out that they didn't expect a
>kernel change to break userspace. On the flip side, the offending commit
>in the kernel isn't exactly new (it's from the end of 2020!) and so it's
>likely that others (e.g. QEMU) are using this feature.
Yep, QEMU can use it.
>I also strongly
>suspect that vhost net suffers from exactly the same issue, we just
>don't happen to be using that (yet) in Android.
I think so too, the implementation in vsock was done following
vhost-net.
Thanks,
Stefano
Hi Linus,
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
>On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <[email protected]> wrote:
>>
>> [tl;dr a change from ~18 months ago breaks Android userspace and I don't
>> know what to do about it]
>
>Augh.
>
>I had hoped that android being "closer" to upstream would have meant
>that somebody actually tests android with upstream kernels. People
>occasionally talk about it, but apparently it's not actually done.
>
>Or maybe it's done onl;y with a very limited android user space.
>
>The whole "we notice that something that happened 18 months ago broke
>our environment" is kind of broken.
>
>> After some digging, we narrowed this change in behaviour down to
>> e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
>> reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
>> blame. Indeed, our tests once again pass if we revert that patch (there's
>> a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
>> but otherwise it reverts cleanly).
>
>I have to say, this smells for *so* many reasons.
>
>Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?
>
>That seems insane, but seems fundamental in that commit e13a6915a03f
>("vhost/vsock: add IOTLB API support")
>
>This code
>
> if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
> if (vhost_init_device_iotlb(&vsock->dev, true))
> goto err;
> }
>
>just makes me go "What?" It makes no sense. Why isn't that feature
>called something-something-IOTLB?
I honestly don't know the reason for the name but
VIRTIO_F_ACCESS_PLATFORM comes from the virtio specification:
https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-6600006
VIRTIO_F_ACCESS_PLATFORM(33)
This feature indicates that the device can be used on a platform
where device access to data in memory is limited and/or translated.
E.g. this is the case if the device can be located behind an IOMMU
that translates bus addresses from the device into physical
addresses in memory, if the device can be limited to only access
certain memory addresses or if special commands such as a cache
flush can be needed to synchronise data in memory with the device.
Whether accesses are actually limited or translated is described by
platform-specific means. If this feature bit is set to 0, then the
device has same access to memory addresses supplied to it as the
driver has. In particular, the device will always use physical
addresses matching addresses used by the driver (typically meaning
physical addresses used by the CPU) and not translated further, and
can access any address supplied to it by the driver. When clear,
this overrides any platform-specific description of whether device
access is limited or translated in any way, e.g. whether an IOMMU
may be present.
>
>Can we please just split that flag into two, and have that odd
>"platform access" be one bit, and the "enable iommu" be an entirely
>different bit?
IIUC the problem here is that the VMM does the translation and then for
the device there is actually no need to translate, so this feature
should not be negotiated by crosvm and vhost-vsock, but just between
guest's driver and crosvm.
Perhaps the confusion is that we use VIRTIO_F_ACCESS_PLATFORM both
between guest and VMM and between VMM and vhost device.
In fact, prior to commit e13a6915a03f ("vhost/vsock: add IOTLB API
support"), vhost-vsock did not work when a VMM (e.g., QEMU) tried to
negotiate translation with the device:
https://bugzilla.redhat.com/show_bug.cgi?id=1894101
The simplest solution is that crosvm doesn't negotiate
VIRTIO_F_ACCESS_PLATFORM with the vhost-vsock device if it doesn't want
to use translation and send messages to set it.
In fact before commit e13a6915a03f ("vhost/vsock: add IOTLB API
support") this feature was not exposed by the vhost-vsock device, so it
was never negotiated. Now crosvm is enabling a new feature (not masking
guest-negotiated features) so I don't think it's a break in user space,
if the user space enable it.
I tried to explain what I understood when I made the change, Michael and
Jason surely can add more information.
Thanks,
Stefano
Hi Stefano,
On Sat, Aug 06, 2022 at 09:48:28AM +0200, Stefano Garzarella wrote:
> On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > The fundamental issue is, I think, that VIRTIO_F_ACCESS_PLATFORM is
> > being used for two very different things within the same device; for the
> > guest it basically means "use the DMA API, it knows what to do" but for
> > vhost it very specifically means "enable IOTLB". We've recently had
> > other problems with this flag [3] but in this case it used to work
> > reliably and now it doesn't anymore.
> >
> > So how should we fix this? One possibility is for us to hack crosvm to
> > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost
>
> Why do you consider this a hack?
I think it's a hack for two reasons:
(1) We're changing userspace to avoid a breaking change in kernel behaviour
(2) I think that crosvm's approach is actually pretty reasonable
To elaborate on (2), crosvm has a set of device features that it has
negotiated with the guest. It then takes the intersection of these features
with those advertised by VHOST_GET_FEATURES and calls VHOST_SET_FEATURES
with the result. If there was a common interpretation of what these features
do, then this would work and would mean we wouldn't have to opt-in on a
per-flag basis for vhost. Since VIRTIO_F_ACCESS_PLATFORM is being overloaded
to mean two completely different things, then it breaks and I think masking
out that specific flag is a hack because it's basically crosvm saying "yeah,
I may have negotiated this with the driver but vhost _actually_ means
'IOTLB' when it says it supports this flag so I'll mask it out because I
know better".
> If the VMM implements the translation feature, it is right in my opinion
> that it does not enable the feature for the vhost device. Otherwise, if it
> wants the vhost device to do the translation, enable the feature and send
> the IOTLB messages to set the translation.
>
> QEMU for example masks features when not required or supported.
> crosvm should negotiate only the features it supports.
>
> @Michael and @Jason can correct me, but if a vhost device negotiates
> VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages to
> set the translation.
As above, the issue is that vhost now unconditionally advertises this in
VHOST_GET_FEATURES and so a VMM with no knowledge of IOTLB can end up
enabling it by accident.
Will
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
> On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <[email protected]> wrote:
> >
> > [tl;dr a change from ~18 months ago breaks Android userspace and I don't
> > know what to do about it]
>
> Augh.
>
> I had hoped that android being "closer" to upstream would have meant
> that somebody actually tests android with upstream kernels. People
> occasionally talk about it, but apparently it's not actually done.
>
> Or maybe it's done onl;y with a very limited android user space.
We do actually test every -rc with Android (and run a whole bunch of
regression tests), this is largely using x86 builds for convenience
but we've been bringing up arm64 recently and are getting increasingly
more coverage there. So this _will_ improve and relatively soon.
The kicker in this case is that we'd only catch it on systems using pKVM
(arm64 host only; upstreaming ongoing) with restricted DMA (requires
device-tree) and so it slipped through. This is made more challenging
for CI because arm64 devices don't tend to have support for nested
virtualisation and so we have to run bare-metal but, as I say, we're
getting there.
> > After some digging, we narrowed this change in behaviour down to
> > e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
> > reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
> > blame. Indeed, our tests once again pass if we revert that patch (there's
> > a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
> > but otherwise it reverts cleanly).
>
> I have to say, this smells for *so* many reasons.
>
> Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?
It was already renamed once (!) It used to be VIRTIO_F_IOMMU_PLATFORM...
> That seems insane, but seems fundamental in that commit e13a6915a03f
> ("vhost/vsock: add IOTLB API support")
>
> This code
>
> if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
> if (vhost_init_device_iotlb(&vsock->dev, true))
> goto err;
> }
>
> just makes me go "What?" It makes no sense. Why isn't that feature
> called something-something-IOTLB?
>
> Can we please just split that flag into two, and have that odd
> "platform access" be one bit, and the "enable iommu" be an entirely
> different bit?
Something along those lines makes sense to me, but it's fiddly because
the bits being used here are part of the virtio spec and we can't freely
allocate them in Linux. I reckon it would probably be better to have a
separate mechanism to enable IOTLB and not repurpose this flag for it.
Hindsight is a wonderful thing.
> And hey, it's possible that the bit encoding is *so* incestuous that
> it's really hard to split it into two. But it really sounds to me like
> somebody mindlessly re-used a feature bit for a *completely* different
> thing. Why?
>
> Why have feature bits at all, when you then re-use the same bit for
> two different features? It kind of seems to defeat the whole purpose.
No argument here, and it's a big part of the reason I made the effort to
write this up. Yes, we hit this in Android. Yes, we should've hit it
sooner. But is it specific to Android? No. Anybody wanting a guest to
use the DMA API for its virtio devices is going to be setting this flag
and if they implement the same algorithm as crosvm then they're going to
hit exactly the same problem that we did.
Will
On Sat, Aug 6, 2022 at 5:50 AM Will Deacon <[email protected]> wrote:
> On Sat, Aug 06, 2022 at 09:48:28AM +0200, Stefano Garzarella wrote:
> > On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > If the VMM implements the translation feature, it is right in my opinion
> > that it does not enable the feature for the vhost device. Otherwise, if it
> > wants the vhost device to do the translation, enable the feature and send
> > the IOTLB messages to set the translation.
> >
> > QEMU for example masks features when not required or supported.
> > crosvm should negotiate only the features it supports.
> >
> > @Michael and @Jason can correct me, but if a vhost device negotiates
> > VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages to
> > set the translation.
>
> As above, the issue is that vhost now unconditionally advertises this in
> VHOST_GET_FEATURES and so a VMM with no knowledge of IOTLB can end up
> enabling it by accident.
Unconditionally exposing all vhost feature bits to the guest is
incorrect. The emulator must filter out only the feature bits that it
supports.
For example, see QEMU's vhost-net device's vhost feature bit allowlist:
https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/vhost_net.c#L40
The reason why the emulator (crosvm/QEMU/etc) must filter out feature
bits is that vhost devices are full VIRTIO devices. They are a subset
of a VIRTIO device and the emulator is responsible for the rest of the
device. Some features will require both vhost and emulator support.
Therefore it is incorrect to expect the device to work correctly if
the vhost feature bits are passed through to the guest.
Stefan
On Sat, Aug 06, 2022 at 10:42:40AM +0100, Will Deacon wrote:
>Hi Stefano,
>
>On Sat, Aug 06, 2022 at 09:48:28AM +0200, Stefano Garzarella wrote:
>> On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
>> > The fundamental issue is, I think, that VIRTIO_F_ACCESS_PLATFORM is
>> > being used for two very different things within the same device; for the
>> > guest it basically means "use the DMA API, it knows what to do" but for
>> > vhost it very specifically means "enable IOTLB". We've recently had
>> > other problems with this flag [3] but in this case it used to work
>> > reliably and now it doesn't anymore.
>> >
>> > So how should we fix this? One possibility is for us to hack crosvm to
>> > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost
>>
>> Why do you consider this a hack?
>
>I think it's a hack for two reasons:
>
> (1) We're changing userspace to avoid a breaking change in kernel behaviour
> (2) I think that crosvm's approach is actually pretty reasonable
>
>To elaborate on (2), crosvm has a set of device features that it has
>negotiated with the guest. It then takes the intersection of these features
>with those advertised by VHOST_GET_FEATURES and calls VHOST_SET_FEATURES
>with the result. If there was a common interpretation of what these features
>do, then this would work and would mean we wouldn't have to opt-in on a
>per-flag basis for vhost. Since VIRTIO_F_ACCESS_PLATFORM is being overloaded
>to mean two completely different things, then it breaks and I think masking
>out that specific flag is a hack because it's basically crosvm saying "yeah,
>I may have negotiated this with the driver but vhost _actually_ means
>'IOTLB' when it says it supports this flag so I'll mask it out because I
>know better".
Thanks for elaborating, now I think I get your point!
If I understand you correctly, what you would like is that GET_FEATURES
should return only the data path features (thus exposed to the guest)
and not the features for the VMM, right?
In that case, since we also negotiate backend features (with
SET|GET_BACKEND_FEATURES ioctls) for IOTLB messages to work, maybe we
could only expose that feature if VHOST_BACKEND_F_IOTLB_MSG_V2 has been
negotiated
@Michael, @Jason, do you think this could be doable?
>
>> If the VMM implements the translation feature, it is right in my opinion
>> that it does not enable the feature for the vhost device. Otherwise, if it
>> wants the vhost device to do the translation, enable the feature and send
>> the IOTLB messages to set the translation.
>>
>> QEMU for example masks features when not required or supported.
>> crosvm should negotiate only the features it supports.
>>
>> @Michael and @Jason can correct me, but if a vhost device negotiates
>> VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages to
>> set the translation.
>
>As above, the issue is that vhost now unconditionally advertises this in
>VHOST_GET_FEATURES and so a VMM with no knowledge of IOTLB can end up
>enabling it by accident.
I honestly don't know what the initial design was, though, from what
I've seen in QEMU, it only enables the known features, in fact for
example when we added F_SEQPACKET for vhost-vsock, we had to update QEMU
to pass the feature to the guest, so I think the initial idea was to not
unconditionally accept all the features exposed by the vhost device.
Maybe this part should be clarified.
Thanks,
Stefano
s/vhost devices are full VIRTIO devices/vhost devices are not full
VIRTIO devices/
On Sat, Aug 06, 2022 at 06:52:15AM -0400, Stefan Hajnoczi wrote:
> On Sat, Aug 6, 2022 at 5:50 AM Will Deacon <[email protected]> wrote:
> > On Sat, Aug 06, 2022 at 09:48:28AM +0200, Stefano Garzarella wrote:
> > > On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > > If the VMM implements the translation feature, it is right in my opinion
> > > that it does not enable the feature for the vhost device. Otherwise, if it
> > > wants the vhost device to do the translation, enable the feature and send
> > > the IOTLB messages to set the translation.
> > >
> > > QEMU for example masks features when not required or supported.
> > > crosvm should negotiate only the features it supports.
> > >
> > > @Michael and @Jason can correct me, but if a vhost device negotiates
> > > VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages to
> > > set the translation.
> >
> > As above, the issue is that vhost now unconditionally advertises this in
> > VHOST_GET_FEATURES and so a VMM with no knowledge of IOTLB can end up
> > enabling it by accident.
>
> Unconditionally exposing all vhost feature bits to the guest is
> incorrect. The emulator must filter out only the feature bits that it
> supports.
I've evidently done a bad job of explaining this, sorry.
crosvm _does_ filter the feature bits which it passes to vhost. It takes the
feature set which it has negotiated with the guest and then takes the
intersection of this set with the set of features which vhost advertises.
The result is what is passed to VHOST_SET_FEATURES. I included the rust
for this in my initial mail, but in C it might look something like:
u64 features = negotiate_features_with_guest(dev);
ioctl(vhost_fd, VHOST_GET_FEATURES, &vhost_features);
vhost_features &= features; /* Mask out unsupported features */
ioctl(vhost_fd, VHOST_SET_FEATURES, &vhost_features);
The problem is that crosvm has negotiated VIRTIO_F_ACCESS_PLATFORM with
the guest so that restricted DMA is used for the virtio devices. With
e13a6915a03f, VIRTIO_F_ACCESS_PLATFORM is now advertised by
VHOST_GET_FEATURES and so IOTLB is enabled by the sequence above.
> For example, see QEMU's vhost-net device's vhost feature bit allowlist:
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/vhost_net.c#L40
I agree that changing crosvm to use an allowlist would fix the problem,
I'm just questioning whether we should be changing userspace at all to
resolve this issue.
> The reason why the emulator (crosvm/QEMU/etc) must filter out feature
> bits is that vhost devices are full VIRTIO devices. They are a subset
> of a VIRTIO device and the emulator is responsible for the rest of the
> device. Some features will require both vhost and emulator support.
> Therefore it is incorrect to expect the device to work correctly if
> the vhost feature bits are passed through to the guest.
I think crosvm is trying to cater for this by masking out the features
it doesn't know about.
Will
On Sat, Aug 6, 2022 at 10:35 AM Will Deacon <[email protected]> wrote:
>
> On Sat, Aug 06, 2022 at 06:52:15AM -0400, Stefan Hajnoczi wrote:
> > On Sat, Aug 6, 2022 at 5:50 AM Will Deacon <[email protected]> wrote:
> > > On Sat, Aug 06, 2022 at 09:48:28AM +0200, Stefano Garzarella wrote:
> > > > On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > > > If the VMM implements the translation feature, it is right in my opinion
> > > > that it does not enable the feature for the vhost device. Otherwise, if it
> > > > wants the vhost device to do the translation, enable the feature and send
> > > > the IOTLB messages to set the translation.
> > > >
> > > > QEMU for example masks features when not required or supported.
> > > > crosvm should negotiate only the features it supports.
> > > >
> > > > @Michael and @Jason can correct me, but if a vhost device negotiates
> > > > VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages to
> > > > set the translation.
> > >
> > > As above, the issue is that vhost now unconditionally advertises this in
> > > VHOST_GET_FEATURES and so a VMM with no knowledge of IOTLB can end up
> > > enabling it by accident.
> >
> > Unconditionally exposing all vhost feature bits to the guest is
> > incorrect. The emulator must filter out only the feature bits that it
> > supports.
>
> I've evidently done a bad job of explaining this, sorry.
>
> crosvm _does_ filter the feature bits which it passes to vhost. It takes the
> feature set which it has negotiated with the guest and then takes the
> intersection of this set with the set of features which vhost advertises.
> The result is what is passed to VHOST_SET_FEATURES. I included the rust
> for this in my initial mail, but in C it might look something like:
>
> u64 features = negotiate_features_with_guest(dev);
>
> ioctl(vhost_fd, VHOST_GET_FEATURES, &vhost_features);
> vhost_features &= features; /* Mask out unsupported features */
> ioctl(vhost_fd, VHOST_SET_FEATURES, &vhost_features);
This is unrelated to the current issue, but this code looks wrong.
VHOST_GET_FEATURES must be called before negotiating with the guest.
The device features must be restricted by vhost before advertising
them to the guest. For example, if a new crosvm binary runs on an old
kernel then feature bits crosvm negotiated with the guest may not be
supported by the vhost kernel module and the device is broken.
> The problem is that crosvm has negotiated VIRTIO_F_ACCESS_PLATFORM with
> the guest so that restricted DMA is used for the virtio devices. With
> e13a6915a03f, VIRTIO_F_ACCESS_PLATFORM is now advertised by
> VHOST_GET_FEATURES and so IOTLB is enabled by the sequence above.
>
> > For example, see QEMU's vhost-net device's vhost feature bit allowlist:
> > https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/vhost_net.c#L40
>
> I agree that changing crosvm to use an allowlist would fix the problem,
> I'm just questioning whether we should be changing userspace at all to
> resolve this issue.
>
> > The reason why the emulator (crosvm/QEMU/etc) must filter out feature
> > bits is that vhost devices are full VIRTIO devices. They are a subset
> > of a VIRTIO device and the emulator is responsible for the rest of the
> > device. Some features will require both vhost and emulator support.
> > Therefore it is incorrect to expect the device to work correctly if
> > the vhost feature bits are passed through to the guest.
>
> I think crosvm is trying to cater for this by masking out the features
> it doesn't know about.
Can you point to the guest driver code for restricted DMA? It's
unclear to me what the guest drivers are doing and whether that is
VIRTIO spec compliant. Is the driver compliant with VIRTIO 1.2 "6.1
Driver Requirements: Reserved Feature Bits":
A driver SHOULD accept VIRTIO_F_ACCESS_PLATFORM if it is offered,
and it MUST then either disable the IOMMU or configure the IOMMU to
translate bus addresses passed to the device into physical addresses
in memory. If VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver
MUST pass only physical addresses to the device.
Stefan
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
> Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?
Because, as far as the virtio spec and virtio "guest" implementation is
concerned it is not about IOMMU support at all. It is about treating
virtio DMA as real DMA by the platform, which lets the platform let
whatever method of DMA mapping it needs to the virtio device. This
is needed to make sure harware virtio device are treated like actual
hardware and not like a magic thing bypassing the normal PCIe rules.
Using an IOMMU if one is present for bus is just one thing, others
are using offets of DMAs that are very common on non-x86 platforms,
or doing the horrible cache flushing needed on devices where PCIe
is not cache coherent.
It really is vhost that seems to abuse it so that if the guest
claims it can handle VIRTIO_F_ACCESS_PLATFORM (which every modern
guest should) it enables magic behavior, which I don't think is what
the virtio spec intended.
Will, thanks very much for the analysis and the writeup!
On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> So how should we fix this? One possibility is for us to hack crosvm to
> clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
> but others here have reasonably pointed out that they didn't expect a
> kernel change to break userspace. On the flip side, the offending commit
> in the kernel isn't exactly new (it's from the end of 2020!) and so it's
> likely that others (e.g. QEMU) are using this feature.
Exactly, that's the problem.
vhost is reusing the virtio bits and it's only natural that
what you are doing would happen.
To be precise, this is what we expected people to do (and what QEMU does):
#define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) |
(1 << VIRTIO_NET_F_RX_MRG) | .... )
VHOST_GET_FEATURES(... &host_features);
host_features &= QEMU_VHOST_FEATURES
VHOST_SET_FEATURES(host_features & guest_features)
Here QEMU_VHOST_FEATURES are the bits userspace knows about.
Our assumption was that whatever userspace enables, it
knows what the effect on vhost is going to be.
But yes, I understand absolutely how someone would instead just use the
guest features. It is unfortunate that we did not catch this in time.
In hindsight, we should have just created vhost level macros
instead of reusing virtio ones. Would address the concern
about naming: PLATFORM_ACCESS makes sense for the
guest since there it means "whatever access rules platform has",
but for vhost a better name would be VHOST_F_IOTLB.
We should have also taken greater pains to document what
we expect userspace to do. I remember now how I thought about something
like this but after coding this up in QEMU I forgot to document this :(
Also, I suspect given the history the GET/SET features ioctl and burned
wrt extending it and we have to use a new when we add new features.
All this we can do going forward.
But what can we do about the specific issue?
I am not 100% sure since as Will points out, QEMU and other
userspace already rely on the current behaviour.
Looking at QEMU specifically, it always sends some translations at
startup, this in order to handle device rings.
So, *maybe* we can get away with assuming that if no IOTLB ioctl was
ever invoked then this userspace does not know about IOTLB and
translation should ignore IOTLB completely.
I am a bit nervous about breaking some *other* userspace which actually
wants device to be blocked from accessing memory until IOTLB
has been setup. If we get it wrong we are making guest
and possibly even host vulnerable.
And of course just revering is not an option either since there
are now whole stacks depending on the feature.
Will I'd like your input on whether you feel a hack in the kernel
is justified here.
Also yes, I think it's a good idea to change crosvm anyway. While the
work around I describe might make sense upstream I don't think it's a
reasonable thing to do in stable kernels.
I think I'll prepare a patch documenting the legal vhost features
as a 1st step even though crosvm is rust so it's not importing
the header directly, right?
--
MST
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
> And hey, it's possible that the bit encoding is *so* incestuous that
> it's really hard to split it into two. But it really sounds to me like
> somebody mindlessly re-used a feature bit for a *completely* different
> thing. Why?
>
> Why have feature bits at all, when you then re-use the same bit for
> two different features? It kind of seems to defeat the whole purpose.
What can I say? Hindsight is 20/20. The two things are
*related* in that IOTLB in vhost is a way for userspace
(the platform) to limit device access to guest memory.
So we reused the feature bits (it's not the only one,
just the one we changed most recently).
It bothered me a bit but everyone seemed happy and
was able to refer to virtio spec for documentation so there
was less documentation to write for Linux.
It's not that it's hard to split it generally, it's just that
it's been there like this for a while so it's hard to change
now - we need to find a way that does not break existing userspace.
--
MST
On Sat, Aug 06, 2022 at 11:52:13PM -0700, Christoph Hellwig wrote:
> It really is vhost that seems to abuse it so that if the guest
> claims it can handle VIRTIO_F_ACCESS_PLATFORM (which every modern
> guest should) it enables magic behavior, which I don't think is what
> the virtio spec intended.
Well the magic behavour happens to be used by QEMU to
implement a virtual IOMMU. And when you have a virtual
IOMMU you generally want VIRTIO_F_ACCESS_PLATFORM.
This is how it came to be reused for that.
And since QEMU never passed guest features to vhost
unfiltered we never saw the issue even with old QEMU
versions on new kernels.
It seems natural to pass features unfiltered and we never even said
userspace should not do it, so it's quite understandable that this is
what corsvm did.
--
MST
Hi Michael,
On Sun, Aug 07, 2022 at 09:14:43AM -0400, Michael S. Tsirkin wrote:
> Will, thanks very much for the analysis and the writeup!
No problem, and thanks for following up.
> On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > So how should we fix this? One possibility is for us to hack crosvm to
> > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
> > but others here have reasonably pointed out that they didn't expect a
> > kernel change to break userspace. On the flip side, the offending commit
> > in the kernel isn't exactly new (it's from the end of 2020!) and so it's
> > likely that others (e.g. QEMU) are using this feature.
>
> Exactly, that's the problem.
>
> vhost is reusing the virtio bits and it's only natural that
> what you are doing would happen.
>
> To be precise, this is what we expected people to do (and what QEMU does):
>
>
> #define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) |
> (1 << VIRTIO_NET_F_RX_MRG) | .... )
>
> VHOST_GET_FEATURES(... &host_features);
> host_features &= QEMU_VHOST_FEATURES
> VHOST_SET_FEATURES(host_features & guest_features)
>
>
> Here QEMU_VHOST_FEATURES are the bits userspace knows about.
>
> Our assumption was that whatever userspace enables, it
> knows what the effect on vhost is going to be.
>
> But yes, I understand absolutely how someone would instead just use the
> guest features. It is unfortunate that we did not catch this in time.
>
>
> In hindsight, we should have just created vhost level macros
> instead of reusing virtio ones. Would address the concern
> about naming: PLATFORM_ACCESS makes sense for the
> guest since there it means "whatever access rules platform has",
> but for vhost a better name would be VHOST_F_IOTLB.
> We should have also taken greater pains to document what
> we expect userspace to do. I remember now how I thought about something
> like this but after coding this up in QEMU I forgot to document this :(
> Also, I suspect given the history the GET/SET features ioctl and burned
> wrt extending it and we have to use a new when we add new features.
> All this we can do going forward.
Makes sense. The crosvm developers are also pretty friendly in my
experience, so I'm sure they wouldn't mind being involved in discussions
around any future ABI extensions. Just be aware that they _very_ recently
moved their mailing lists, so I think it lives here now:
https://groups.google.com/a/chromium.org/g/crosvm-dev
> But what can we do about the specific issue?
> I am not 100% sure since as Will points out, QEMU and other
> userspace already rely on the current behaviour.
>
> Looking at QEMU specifically, it always sends some translations at
> startup, this in order to handle device rings.
>
> So, *maybe* we can get away with assuming that if no IOTLB ioctl was
> ever invoked then this userspace does not know about IOTLB and
> translation should ignore IOTLB completely.
There was a similar suggestion from Stefano:
https://lore.kernel.org/r/20220806105225.crkui6nw53kbm5ge@sgarzare-redhat
about spotting the backend ioctl for IOTLB and using that to enable
the negotiation of F_ACCESS_PLATFORM. Would that work for qemu?
> I am a bit nervous about breaking some *other* userspace which actually
> wants device to be blocked from accessing memory until IOTLB
> has been setup. If we get it wrong we are making guest
> and possibly even host vulnerable.
> And of course just revering is not an option either since there
> are now whole stacks depending on the feature.
Absolutely, I'm not seriously suggesting the revert. I just did it locally
to confirm the issue I was seeing.
> Will I'd like your input on whether you feel a hack in the kernel
> is justified here.
If we can come up with something that we have confidence in and won't be a
pig to maintain, then I think we should do it, but otherwise we can go ahead
and change crosvm to mask out this feature flag on the vhost side for now.
We mainly wanted to raise the issue to illustrate that this flag continues
to attract problems in the hope that it might inform further usage and/or
spec work in this area.
In any case, I'm happy to test any kernel patches with our setup if you
want to give it a shot.
> Also yes, I think it's a good idea to change crosvm anyway. While the
> work around I describe might make sense upstream I don't think it's a
> reasonable thing to do in stable kernels.
> I think I'll prepare a patch documenting the legal vhost features
> as a 1st step even though crosvm is rust so it's not importing
> the header directly, right?
Documentation is a good idea regardless, so thanks for that. Even though
crosvm has its own bindings for the vhost ioctl()s, the documentation
can be reference or duplicated once it's available in the kernel headers.
Will
On Mon, Aug 08, 2022 at 11:18:50AM +0100, Will Deacon wrote:
> Hi Michael,
>
> On Sun, Aug 07, 2022 at 09:14:43AM -0400, Michael S. Tsirkin wrote:
> > Will, thanks very much for the analysis and the writeup!
>
> No problem, and thanks for following up.
>
> > On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > > So how should we fix this? One possibility is for us to hack crosvm to
> > > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
> > > but others here have reasonably pointed out that they didn't expect a
> > > kernel change to break userspace. On the flip side, the offending commit
> > > in the kernel isn't exactly new (it's from the end of 2020!) and so it's
> > > likely that others (e.g. QEMU) are using this feature.
> >
> > Exactly, that's the problem.
> >
> > vhost is reusing the virtio bits and it's only natural that
> > what you are doing would happen.
> >
> > To be precise, this is what we expected people to do (and what QEMU does):
> >
> >
> > #define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) |
> > (1 << VIRTIO_NET_F_RX_MRG) | .... )
> >
> > VHOST_GET_FEATURES(... &host_features);
> > host_features &= QEMU_VHOST_FEATURES
> > VHOST_SET_FEATURES(host_features & guest_features)
> >
> >
> > Here QEMU_VHOST_FEATURES are the bits userspace knows about.
> >
> > Our assumption was that whatever userspace enables, it
> > knows what the effect on vhost is going to be.
> >
> > But yes, I understand absolutely how someone would instead just use the
> > guest features. It is unfortunate that we did not catch this in time.
> >
> >
> > In hindsight, we should have just created vhost level macros
> > instead of reusing virtio ones. Would address the concern
> > about naming: PLATFORM_ACCESS makes sense for the
> > guest since there it means "whatever access rules platform has",
> > but for vhost a better name would be VHOST_F_IOTLB.
> > We should have also taken greater pains to document what
> > we expect userspace to do. I remember now how I thought about something
> > like this but after coding this up in QEMU I forgot to document this :(
> > Also, I suspect given the history the GET/SET features ioctl and burned
> > wrt extending it and we have to use a new when we add new features.
> > All this we can do going forward.
>
> Makes sense. The crosvm developers are also pretty friendly in my
> experience, so I'm sure they wouldn't mind being involved in discussions
> around any future ABI extensions. Just be aware that they _very_ recently
> moved their mailing lists, so I think it lives here now:
>
> https://groups.google.com/a/chromium.org/g/crosvm-dev
>
> > But what can we do about the specific issue?
> > I am not 100% sure since as Will points out, QEMU and other
> > userspace already rely on the current behaviour.
> >
> > Looking at QEMU specifically, it always sends some translations at
> > startup, this in order to handle device rings.
> >
> > So, *maybe* we can get away with assuming that if no IOTLB ioctl was
> > ever invoked then this userspace does not know about IOTLB and
> > translation should ignore IOTLB completely.
>
> There was a similar suggestion from Stefano:
>
> https://lore.kernel.org/r/20220806105225.crkui6nw53kbm5ge@sgarzare-redhat
>
> about spotting the backend ioctl for IOTLB and using that to enable
> the negotiation of F_ACCESS_PLATFORM. Would that work for qemu?
Hmm I would worry that this disables the feature for old QEMU :(
> > I am a bit nervous about breaking some *other* userspace which actually
> > wants device to be blocked from accessing memory until IOTLB
> > has been setup. If we get it wrong we are making guest
> > and possibly even host vulnerable.
> > And of course just revering is not an option either since there
> > are now whole stacks depending on the feature.
>
> Absolutely, I'm not seriously suggesting the revert. I just did it locally
> to confirm the issue I was seeing.
>
> > Will I'd like your input on whether you feel a hack in the kernel
> > is justified here.
>
> If we can come up with something that we have confidence in and won't be a
> pig to maintain, then I think we should do it, but otherwise we can go ahead
> and change crosvm to mask out this feature flag on the vhost side for now.
> We mainly wanted to raise the issue to illustrate that this flag continues
> to attract problems in the hope that it might inform further usage and/or
> spec work in this area.
>
> In any case, I'm happy to test any kernel patches with our setup if you
> want to give it a shot.
Thanks!
I'm a bit concerned that the trick I proposed changes the configuration
where iotlb was not set up from "access to memory not allowed" to
"access to all memory allowed". This just might have security
implications if some application assumed the former.
And the one Stefano proposed disables IOTLB for old QEMU versions.
> > Also yes, I think it's a good idea to change crosvm anyway. While the
> > work around I describe might make sense upstream I don't think it's a
> > reasonable thing to do in stable kernels.
> > I think I'll prepare a patch documenting the legal vhost features
> > as a 1st step even though crosvm is rust so it's not importing
> > the header directly, right?
>
> Documentation is a good idea regardless, so thanks for that. Even though
> crosvm has its own bindings for the vhost ioctl()s, the documentation
> can be reference or duplicated once it's available in the kernel headers.
>
> Will
So for crosvm change, I will post the documentation change and
you guys can discuss?
--
MST
On Mon, Aug 8, 2022 at 8:46 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Mon, Aug 08, 2022 at 11:18:50AM +0100, Will Deacon wrote:
> > Hi Michael,
> >
> > On Sun, Aug 07, 2022 at 09:14:43AM -0400, Michael S. Tsirkin wrote:
> > > Will, thanks very much for the analysis and the writeup!
> >
> > No problem, and thanks for following up.
> >
> > > On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > > > So how should we fix this? One possibility is for us to hack crosvm to
> > > > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
> > > > but others here have reasonably pointed out that they didn't expect a
> > > > kernel change to break userspace. On the flip side, the offending commit
> > > > in the kernel isn't exactly new (it's from the end of 2020!) and so it's
> > > > likely that others (e.g. QEMU) are using this feature.
> > >
> > > Exactly, that's the problem.
> > >
> > > vhost is reusing the virtio bits and it's only natural that
> > > what you are doing would happen.
> > >
> > > To be precise, this is what we expected people to do (and what QEMU does):
> > >
> > >
> > > #define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) |
> > > (1 << VIRTIO_NET_F_RX_MRG) | .... )
> > >
> > > VHOST_GET_FEATURES(... &host_features);
> > > host_features &= QEMU_VHOST_FEATURES
> > > VHOST_SET_FEATURES(host_features & guest_features)
> > >
> > >
> > > Here QEMU_VHOST_FEATURES are the bits userspace knows about.
> > >
> > > Our assumption was that whatever userspace enables, it
> > > knows what the effect on vhost is going to be.
> > >
> > > But yes, I understand absolutely how someone would instead just use the
> > > guest features. It is unfortunate that we did not catch this in time.
> > >
> > >
> > > In hindsight, we should have just created vhost level macros
> > > instead of reusing virtio ones. Would address the concern
> > > about naming: PLATFORM_ACCESS makes sense for the
> > > guest since there it means "whatever access rules platform has",
> > > but for vhost a better name would be VHOST_F_IOTLB.
> > > We should have also taken greater pains to document what
> > > we expect userspace to do. I remember now how I thought about something
> > > like this but after coding this up in QEMU I forgot to document this :(
> > > Also, I suspect given the history the GET/SET features ioctl and burned
> > > wrt extending it and we have to use a new when we add new features.
> > > All this we can do going forward.
> >
> > Makes sense. The crosvm developers are also pretty friendly in my
> > experience, so I'm sure they wouldn't mind being involved in discussions
> > around any future ABI extensions. Just be aware that they _very_ recently
> > moved their mailing lists, so I think it lives here now:
> >
> > https://groups.google.com/a/chromium.org/g/crosvm-dev
> >
> > > But what can we do about the specific issue?
> > > I am not 100% sure since as Will points out, QEMU and other
> > > userspace already rely on the current behaviour.
> > >
> > > Looking at QEMU specifically, it always sends some translations at
> > > startup, this in order to handle device rings.
> > >
> > > So, *maybe* we can get away with assuming that if no IOTLB ioctl was
> > > ever invoked then this userspace does not know about IOTLB and
> > > translation should ignore IOTLB completely.
> >
> > There was a similar suggestion from Stefano:
> >
> > https://lore.kernel.org/r/20220806105225.crkui6nw53kbm5ge@sgarzare-redhat
> >
> > about spotting the backend ioctl for IOTLB and using that to enable
> > the negotiation of F_ACCESS_PLATFORM. Would that work for qemu?
>
> Hmm I would worry that this disables the feature for old QEMU :(
>
>
> > > I am a bit nervous about breaking some *other* userspace which actually
> > > wants device to be blocked from accessing memory until IOTLB
> > > has been setup. If we get it wrong we are making guest
> > > and possibly even host vulnerable.
> > > And of course just revering is not an option either since there
> > > are now whole stacks depending on the feature.
> >
> > Absolutely, I'm not seriously suggesting the revert. I just did it locally
> > to confirm the issue I was seeing.
> >
> > > Will I'd like your input on whether you feel a hack in the kernel
> > > is justified here.
> >
> > If we can come up with something that we have confidence in and won't be a
> > pig to maintain, then I think we should do it, but otherwise we can go ahead
> > and change crosvm to mask out this feature flag on the vhost side for now.
> > We mainly wanted to raise the issue to illustrate that this flag continues
> > to attract problems in the hope that it might inform further usage and/or
> > spec work in this area.
> >
> > In any case, I'm happy to test any kernel patches with our setup if you
> > want to give it a shot.
>
> Thanks!
> I'm a bit concerned that the trick I proposed changes the configuration
> where iotlb was not set up from "access to memory not allowed" to
> "access to all memory allowed". This just might have security
> implications if some application assumed the former.
> And the one Stefano proposed disables IOTLB for old QEMU versions.
Adding hacks to vhost in order to work around userspace applications
that misunderstand the vhost model seems like a it will lead to
problems.
Userspace applications need to follow the vhost model: vhost is
designed for virtqueue passthrough, but the rest of the vhost
interface is not suitable for pass through. It's similar to how VFIO
PCI passthrough needs to do a significant amount of stuff in userspace
to emulate a PCI configuration space and it won't work properly if you
pass through the physical PCI device's PCI configuration space.
The emulator has to mediate between the guest device and vhost device
because it still emulates the VIRTIO transport, configuration space,
device lifecycle, etc even when all virtqueues are passed through.
Let's document this for vhost and vDPA because it is not obvious.
Stefan
On Sun, Aug 7, 2022 at 9:14 PM Michael S. Tsirkin <[email protected]> wrote:
>
> Will, thanks very much for the analysis and the writeup!
>
> On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > So how should we fix this? One possibility is for us to hack crosvm to
> > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
> > but others here have reasonably pointed out that they didn't expect a
> > kernel change to break userspace. On the flip side, the offending commit
> > in the kernel isn't exactly new (it's from the end of 2020!) and so it's
> > likely that others (e.g. QEMU) are using this feature.
>
> Exactly, that's the problem.
>
> vhost is reusing the virtio bits and it's only natural that
> what you are doing would happen.
>
> To be precise, this is what we expected people to do (and what QEMU does):
>
>
> #define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) |
> (1 << VIRTIO_NET_F_RX_MRG) | .... )
>
> VHOST_GET_FEATURES(... &host_features);
> host_features &= QEMU_VHOST_FEATURES
> VHOST_SET_FEATURES(host_features & guest_features)
>
>
> Here QEMU_VHOST_FEATURES are the bits userspace knows about.
>
> Our assumption was that whatever userspace enables, it
> knows what the effect on vhost is going to be.
>
> But yes, I understand absolutely how someone would instead just use the
> guest features. It is unfortunate that we did not catch this in time.
>
>
> In hindsight, we should have just created vhost level macros
> instead of reusing virtio ones. Would address the concern
> about naming: PLATFORM_ACCESS makes sense for the
> guest since there it means "whatever access rules platform has",
> but for vhost a better name would be VHOST_F_IOTLB.
Yes, in the original patch it is called VHOST_F_DEVICE_IOTLB actually
to make it differ from virtio like VHOST_F_LOG_ALL etc. And I remember
I tried to post patch to avoid the bit duplication but the conclusion
is that it's too late for the changes.
> We should have also taken greater pains to document what
> we expect userspace to do. I remember now how I thought about something
> like this but after coding this up in QEMU I forgot to document this :(
> Also, I suspect given the history the GET/SET features ioctl and burned
> wrt extending it and we have to use a new when we add new features.
> All this we can do going forward.
>
>
> But what can we do about the specific issue?
> I am not 100% sure since as Will points out, QEMU and other
> userspace already rely on the current behaviour.
>
> Looking at QEMU specifically, it always sends some translations at
> startup, this in order to handle device rings.
>
> So, *maybe* we can get away with assuming that if no IOTLB ioctl was
> ever invoked then this userspace does not know about IOTLB and
> translation should ignore IOTLB completely.
I think this breaks the security assumptions of some setups.
>
> I am a bit nervous about breaking some *other* userspace which actually
> wants device to be blocked from accessing memory until IOTLB
> has been setup. If we get it wrong we are making guest
> and possibly even host vulnerable.
Yes.
> And of course just revering is not an option either since there
> are now whole stacks depending on the feature.
>
> Will I'd like your input on whether you feel a hack in the kernel
> is justified here.
>
> Also yes, I think it's a good idea to change crosvm anyway. While the
> work around I describe might make sense upstream I don't think it's a
> reasonable thing to do in stable kernels.
+1
Thanks
> I think I'll prepare a patch documenting the legal vhost features
> as a 1st step even though crosvm is rust so it's not importing
> the header directly, right?
>
> --
> MST
>
On Mon, Aug 8, 2022 at 8:45 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Mon, Aug 08, 2022 at 11:18:50AM +0100, Will Deacon wrote:
> > Hi Michael,
> >
> > On Sun, Aug 07, 2022 at 09:14:43AM -0400, Michael S. Tsirkin wrote:
> > > Will, thanks very much for the analysis and the writeup!
> >
> > No problem, and thanks for following up.
> >
> > > On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:
> > > > So how should we fix this? One possibility is for us to hack crosvm to
> > > > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features,
> > > > but others here have reasonably pointed out that they didn't expect a
> > > > kernel change to break userspace. On the flip side, the offending commit
> > > > in the kernel isn't exactly new (it's from the end of 2020!) and so it's
> > > > likely that others (e.g. QEMU) are using this feature.
> > >
> > > Exactly, that's the problem.
> > >
> > > vhost is reusing the virtio bits and it's only natural that
> > > what you are doing would happen.
> > >
> > > To be precise, this is what we expected people to do (and what QEMU does):
> > >
> > >
> > > #define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) |
> > > (1 << VIRTIO_NET_F_RX_MRG) | .... )
> > >
> > > VHOST_GET_FEATURES(... &host_features);
> > > host_features &= QEMU_VHOST_FEATURES
> > > VHOST_SET_FEATURES(host_features & guest_features)
> > >
> > >
> > > Here QEMU_VHOST_FEATURES are the bits userspace knows about.
> > >
> > > Our assumption was that whatever userspace enables, it
> > > knows what the effect on vhost is going to be.
> > >
> > > But yes, I understand absolutely how someone would instead just use the
> > > guest features. It is unfortunate that we did not catch this in time.
> > >
> > >
> > > In hindsight, we should have just created vhost level macros
> > > instead of reusing virtio ones. Would address the concern
> > > about naming: PLATFORM_ACCESS makes sense for the
> > > guest since there it means "whatever access rules platform has",
> > > but for vhost a better name would be VHOST_F_IOTLB.
> > > We should have also taken greater pains to document what
> > > we expect userspace to do. I remember now how I thought about something
> > > like this but after coding this up in QEMU I forgot to document this :(
> > > Also, I suspect given the history the GET/SET features ioctl and burned
> > > wrt extending it and we have to use a new when we add new features.
> > > All this we can do going forward.
> >
> > Makes sense. The crosvm developers are also pretty friendly in my
> > experience, so I'm sure they wouldn't mind being involved in discussions
> > around any future ABI extensions. Just be aware that they _very_ recently
> > moved their mailing lists, so I think it lives here now:
> >
> > https://groups.google.com/a/chromium.org/g/crosvm-dev
> >
> > > But what can we do about the specific issue?
> > > I am not 100% sure since as Will points out, QEMU and other
> > > userspace already rely on the current behaviour.
> > >
> > > Looking at QEMU specifically, it always sends some translations at
> > > startup, this in order to handle device rings.
> > >
> > > So, *maybe* we can get away with assuming that if no IOTLB ioctl was
> > > ever invoked then this userspace does not know about IOTLB and
> > > translation should ignore IOTLB completely.
> >
> > There was a similar suggestion from Stefano:
> >
> > https://lore.kernel.org/r/20220806105225.crkui6nw53kbm5ge@sgarzare-redhat
> >
> > about spotting the backend ioctl for IOTLB and using that to enable
> > the negotiation of F_ACCESS_PLATFORM. Would that work for qemu?
>
> Hmm I would worry that this disables the feature for old QEMU :(
>
>
> > > I am a bit nervous about breaking some *other* userspace which actually
> > > wants device to be blocked from accessing memory until IOTLB
> > > has been setup. If we get it wrong we are making guest
> > > and possibly even host vulnerable.
> > > And of course just revering is not an option either since there
> > > are now whole stacks depending on the feature.
> >
> > Absolutely, I'm not seriously suggesting the revert. I just did it locally
> > to confirm the issue I was seeing.
> >
> > > Will I'd like your input on whether you feel a hack in the kernel
> > > is justified here.
> >
> > If we can come up with something that we have confidence in and won't be a
> > pig to maintain, then I think we should do it, but otherwise we can go ahead
> > and change crosvm to mask out this feature flag on the vhost side for now.
> > We mainly wanted to raise the issue to illustrate that this flag continues
> > to attract problems in the hope that it might inform further usage and/or
> > spec work in this area.
> >
> > In any case, I'm happy to test any kernel patches with our setup if you
> > want to give it a shot.
>
> Thanks!
> I'm a bit concerned that the trick I proposed changes the configuration
> where iotlb was not set up from "access to memory not allowed" to
> "access to all memory allowed". This just might have security
> implications if some application assumed the former.
> And the one Stefano proposed disables IOTLB for old QEMU versions.
Yes, if there is no choice, having some known cases that are broken is
better than silently breaking unknown setups.
Thanks
>
>
>
> > > Also yes, I think it's a good idea to change crosvm anyway. While the
> > > work around I describe might make sense upstream I don't think it's a
> > > reasonable thing to do in stable kernels.
> > > I think I'll prepare a patch documenting the legal vhost features
> > > as a 1st step even though crosvm is rust so it's not importing
> > > the header directly, right?
> >
> > Documentation is a good idea regardless, so thanks for that. Even though
> > crosvm has its own bindings for the vhost ioctl()s, the documentation
> > can be reference or duplicated once it's available in the kernel headers.
> >
> > Will
>
> So for crosvm change, I will post the documentation change and
> you guys can discuss?
>
> --
> MST
>
On Mon, Aug 08, 2022 at 08:45:48AM -0400, Michael S. Tsirkin wrote:
> > > Also yes, I think it's a good idea to change crosvm anyway. While the
> > > work around I describe might make sense upstream I don't think it's a
> > > reasonable thing to do in stable kernels.
> > > I think I'll prepare a patch documenting the legal vhost features
> > > as a 1st step even though crosvm is rust so it's not importing
> > > the header directly, right?
> >
> > Documentation is a good idea regardless, so thanks for that. Even though
> > crosvm has its own bindings for the vhost ioctl()s, the documentation
> > can be reference or duplicated once it's available in the kernel headers.
> >
> So for crosvm change, I will post the documentation change and
> you guys can discuss?
FYI, the crosvm patch is merged here:
https://github.com/google/crosvm/commit/4e7d00be2e135b0a2d964320ea4276e5d896f426
Will
On Wed, Aug 17, 2022 at 02:48:22PM +0100, Will Deacon wrote:
> On Mon, Aug 08, 2022 at 08:45:48AM -0400, Michael S. Tsirkin wrote:
> > > > Also yes, I think it's a good idea to change crosvm anyway. While the
> > > > work around I describe might make sense upstream I don't think it's a
> > > > reasonable thing to do in stable kernels.
> > > > I think I'll prepare a patch documenting the legal vhost features
> > > > as a 1st step even though crosvm is rust so it's not importing
> > > > the header directly, right?
> > >
> > > Documentation is a good idea regardless, so thanks for that. Even though
> > > crosvm has its own bindings for the vhost ioctl()s, the documentation
> > > can be reference or duplicated once it's available in the kernel headers.
> > >
> > So for crosvm change, I will post the documentation change and
> > you guys can discuss?
>
> FYI, the crosvm patch is merged here:
>
> https://github.com/google/crosvm/commit/4e7d00be2e135b0a2d964320ea4276e5d896f426
>
> Will
Great thanks a lot!
I'm on vacation next week but will work on it afterwards.
--
MST