2020-08-07 14:15:32

by Vitaly Kuznetsov

Subject: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Changes since v1:
- Better KVM_SET_USER_MEMORY_REGION flags description, minor tweaks to
the code [Drew Jones]
- BUG_ON() condition in __gfn_to_hva_memslot() adjusted.

This is a continuation of "[PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES
memory" work:
https://lore.kernel.org/kvm/[email protected]/
and pairs with Julia's "x86/PCI: Use MMCONFIG by default for KVM guests":
https://lore.kernel.org/linux-pci/[email protected]/

PCIe config space can (depending on the configuration) be quite big but
is usually sparsely populated. A guest may scan it by accessing each
individual device's page which, when the device is missing, is supposed
to have 'pci hole' semantics: reads return '0xff' and writes get discarded.
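
For illustration, here is a minimal sketch of what such a probe looks like
from the guest side, assuming the standard ECAM/MMCONFIG layout (this is
not code from the series, just an example of the access pattern):

/*
 * Sketch only: how a guest probes for a device through the PCIe
 * ECAM/MMCONFIG window.  Reads that land in a 'PCI hole' (no device
 * behind the address) return all-ones, so a vendor ID of 0xffff means
 * "nothing here".
 */
#include <stdint.h>
#include <stdbool.h>

static volatile uint8_t *ecam_base;	/* mapped MMCONFIG window (assumed) */

static inline volatile void *ecam_addr(uint8_t bus, uint8_t dev,
				       uint8_t fn, uint16_t off)
{
	/* Standard ECAM layout: 4 KiB of config space per function. */
	return ecam_base + (((uint32_t)bus << 20) |
			    ((uint32_t)dev << 15) |
			    ((uint32_t)fn  << 12) | off);
}

static bool device_present(uint8_t bus, uint8_t dev, uint8_t fn)
{
	uint16_t vendor = *(volatile uint16_t *)ecam_addr(bus, dev, fn, 0x00);

	return vendor != 0xffff;	/* 0xffff: the read hit a 'pci hole' */
}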

When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
I observed 8193 accesses to PCI hole memory. When such exit is handled
in KVM without exiting to userspace, it takes roughly 0.000001 sec.
Handling the same exit in userspace is six times slower (0.000006 sec) so
the overall difference is 0.04 sec. This may be significant for 'microvm'
ideas.

Note, the same speed can already be achieved by using KVM_MEM_READONLY,
but doing this would require allocating real memory for all missing
devices; e.g. 8192 pages gives us 32 MB. This would have to be allocated
for each guest separately, and for 'microvm' use-cases this is likely
a no-go.
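
For comparison, a sketch of that existing alternative with the current
UAPI (illustrative only; error handling trimmed):

/*
 * Back the hole with a real, 0xff-filled, read-only slot.  This works
 * today, but the mmap() below is what costs ~32 MB per guest for 8192
 * pages.
 */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static int add_allones_readonly_slot(int vm_fd, __u32 slot,
				     __u64 gpa, __u64 size)
{
	void *backing = mmap(NULL, size, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (backing == MAP_FAILED)
		return -1;
	memset(backing, 0xff, size);		/* emulate 'pci hole' reads */

	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = KVM_MEM_READONLY,	/* writes still exit */
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (__u64)backing,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}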

Introduce special KVM_MEM_PCI_HOLE memory: userspace doesn't need to
back it with real memory, all reads from it are handled inside KVM and
return '0xff'. Writes still go to userspace but these should be extremely
rare.
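
A minimal sketch of how a VMM could use the new flag (the flag name is
from this series; treat the exact ABI details, e.g. whether
userspace_addr must be zero, as illustrative rather than final):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int add_pci_hole_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = KVM_MEM_PCI_HOLE,	/* added by this series */
		.guest_phys_addr = gpa,
		.memory_size     = size,
		/*
		 * No backing memory: reads are answered with 0xff inside
		 * KVM, writes still exit to userspace.
		 */
		.userspace_addr  = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}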

The original 'KVM_MEM_ALLONES' idea had additional optimizations: KVM
was mapping all 'PCI hole' pages to a single read-only page stuffed with
0xff. This is omitted in this submission as the benefits are unclear:
KVM will have to allocate SPTEs (either on demand or aggressively) and
this also consumes time/memory. We can always take a look at possible
optimizations later.

Vitaly Kuznetsov (3):
KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
KVM: x86: introduce KVM_MEM_PCI_HOLE memory
KVM: selftests: add KVM_MEM_PCI_HOLE test

Documentation/virt/kvm/api.rst                |  18 ++-
arch/x86/include/uapi/asm/kvm.h               |   1 +
arch/x86/kvm/mmu/mmu.c                        |  19 +--
arch/x86/kvm/mmu/paging_tmpl.h                |  10 +-
arch/x86/kvm/x86.c                            |  10 +-
include/linux/kvm_host.h                      |   3 +
include/uapi/linux/kvm.h                      |   2 +
tools/testing/selftests/kvm/Makefile          |   1 +
.../testing/selftests/kvm/include/kvm_util.h  |   1 +
tools/testing/selftests/kvm/lib/kvm_util.c    |  81 +++++++------
.../kvm/x86_64/memory_slot_pci_hole.c         | 112 ++++++++++++++++++
virt/kvm/kvm_main.c                           |  39 ++++--
12 files changed, 239 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/memory_slot_pci_hole.c

--
2.25.4


2020-08-25 21:26:56

by Peter Xu

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
> I observed 8193 accesses to PCI hole memory. When such exit is handled
> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
> Handling the same exit in userspace is six times slower (0.000006 sec) so
> the overall difference is 0.04 sec. This may be significant for 'microvm'
> ideas.

Sorry to comment so late, but just curious... have you looked at what's those
8000+ accesses to PCI holes and what they're used for? What I can think of are
some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
attached. Though those should be far less than 8000+, and those should also be
pio rather than mmio.

If this is only an overhead for virt (since baremetal mmios should be fast),
I'm also thinking whether we can make it even better to skip those pci hole
reads. Because we know we're virt, so it also gives us possibility that we may
provide those information in a better way than reading PCI holes in the guest?

Thanks,

--
Peter Xu

2020-09-01 14:47:52

by Vitaly Kuznetsov

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Peter Xu <[email protected]> writes:

> On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
>> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
>> I observed 8193 accesses to PCI hole memory. When such exit is handled
>> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
>> Handling the same exit in userspace is six times slower (0.000006 sec) so
>> the overall difference is 0.04 sec. This may be significant for 'microvm'
>> ideas.
>
> Sorry to comment so late, but just curious... have you looked at what's those
> 8000+ accesses to PCI holes and what they're used for? What I can think of are
> some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
> attached. Though those should be far less than 8000+, and those should also be
> pio rather than mmio.

And sorry for replying late)

We explicitly want MMIO instead of PIO to speed things up, afaiu PIO
requires two exits per device (and we exit all the way to
QEMU). Julia/Michael know better about the size of the space.
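
For reference, a sketch of the legacy 0xcf8/0xcfc access pattern that
makes PIO config reads cost two exits (illustrative only, assuming a
userspace test with port access granted via ioperm()):

#include <stdint.h>
#include <sys/io.h>

#define PCI_CONF_ADDR	0xcf8
#define PCI_CONF_DATA	0xcfc

static uint32_t pci_pio_read32(uint8_t bus, uint8_t dev, uint8_t fn,
			       uint8_t off)
{
	uint32_t addr = 0x80000000u | (bus << 16) | (dev << 11) |
			(fn << 8) | (off & 0xfc);

	outl(addr, PCI_CONF_ADDR);	/* exit #1: address port write */
	return inl(PCI_CONF_DATA);	/* exit #2: data port read */
}

/*
 * With MMCONFIG the same read is a single MMIO load, i.e. one exit,
 * and with KVM_MEM_PCI_HOLE that exit never reaches userspace at all.
 */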

>
> If this is only an overhead for virt (since baremetal mmios should be fast),
> I'm also thinking whether we can make it even better to skip those pci hole
> reads. Because we know we're virt, so it also gives us possibility that we may
> provide those information in a better way than reading PCI holes in the guest?

This means let's invent a PV interface and if we decide to go down this
road, I'd even argue for abandoning PCI completely. E.g. we can do
something similar to Hyper-V's Vmbus.

--
Vitaly

2020-09-01 20:01:57

by Peter Xu

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Tue, Sep 01, 2020 at 04:43:25PM +0200, Vitaly Kuznetsov wrote:
> Peter Xu <[email protected]> writes:
>
> > On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
> >> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
> >> I observed 8193 accesses to PCI hole memory. When such exit is handled
> >> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
> >> Handling the same exit in userspace is six times slower (0.000006 sec) so
> >> the overall difference is 0.04 sec. This may be significant for 'microvm'
> >> ideas.
> >
> > Sorry to comment so late, but just curious... have you looked at what's those
> > 8000+ accesses to PCI holes and what they're used for? What I can think of are
> > some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
> > attached. Though those should be far less than 8000+, and those should also be
> > pio rather than mmio.
>
> And sorry for replying late)
>
> We explicitly want MMIO instead of PIO to speed things up, afaiu PIO
> requires two exits per device (and we exit all the way to
> QEMU). Julia/Michael know better about the size of the space.
>
> >
> > If this is only an overhead for virt (since baremetal mmios should be fast),
> > I'm also thinking whether we can make it even better to skip those pci hole
> > reads. Because we know we're virt, so it also gives us possibility that we may
> > provide those information in a better way than reading PCI holes in the guest?
>
> This means let's invent a PV interface and if we decide to go down this
> road, I'd even argue for abandoning PCI completely. E.g. we can do
> something similar to Hyper-V's Vmbus.

My whole point was more about trying to understand the problem behind.
Providing a fast path for reading pci holes seems to be reasonable as is,
however it's just that I'm confused on why there're so many reads on the pci
holes after all. Another important question is I'm wondering how this series
will finally help the use case of microvm. I'm not sure I get the whole point
of it, but... if microvm is the major use case of this, it would be good to
provide some quick numbers on those if possible.

For example, IIUC microvm uses qboot (as a better alternative than seabios) for
fast boot, and qboot has:

https://github.com/bonzini/qboot/blob/master/pci.c#L20

I'm kind of curious whether qboot will still be used when this series is used
with microvm VMs? Since those are still at least PIO based.

Thanks,

--
Peter Xu

2020-09-02 09:00:49

by Vitaly Kuznetsov

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Peter Xu <[email protected]> writes:

> On Tue, Sep 01, 2020 at 04:43:25PM +0200, Vitaly Kuznetsov wrote:
>> Peter Xu <[email protected]> writes:
>>
>> > On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
>> >> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
>> >> I observed 8193 accesses to PCI hole memory. When such exit is handled
>> >> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
>> >> Handling the same exit in userspace is six times slower (0.000006 sec) so
>> >> the overall difference is 0.04 sec. This may be significant for 'microvm'
>> >> ideas.
>> >
>> > Sorry to comment so late, but just curious... have you looked at what's those
>> > 8000+ accesses to PCI holes and what they're used for? What I can think of are
>> > some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
>> > attached. Though those should be far less than 8000+, and those should also be
>> > pio rather than mmio.
>>
>> And sorry for replying late)
>>
>> We explicitly want MMIO instead of PIO to speed things up, afaiu PIO
>> requires two exits per device (and we exit all the way to
>> QEMU). Julia/Michael know better about the size of the space.
>>
>> >
>> > If this is only an overhead for virt (since baremetal mmios should be fast),
>> > I'm also thinking whether we can make it even better to skip those pci hole
>> > reads. Because we know we're virt, so it also gives us possibility that we may
>> > provide those information in a better way than reading PCI holes in the guest?
>>
>> This means let's invent a PV interface and if we decide to go down this
>> road, I'd even argue for abandoning PCI completely. E.g. we can do
>> something similar to Hyper-V's Vmbus.
>
> My whole point was more about trying to understand the problem behind.
> Providing a fast path for reading pci holes seems to be reasonable as is,
> however it's just that I'm confused on why there're so many reads on the pci
> holes after all. Another important question is I'm wondering how this series
> will finally help the use case of microvm. I'm not sure I get the whole point
> of it, but... if microvm is the major use case of this, it would be good to
> provide some quick numbers on those if possible.
>
> For example, IIUC microvm uses qboot (as a better alternative than seabios) for
> fast boot, and qboot has:
>
> https://github.com/bonzini/qboot/blob/master/pci.c#L20
>
> I'm kind of curious whether qboot will still be used when this series is used
> with microvm VMs? Since those are still at least PIO based.

I'm afraid there is no 'grand plan' for everything at this moment :-(
For traditional VMs 0.04 sec per boot is negligible and definitely not
worth adding a feature, memory requirements are also very
different. When it comes to microvm-style usage things change.

'8193' PCI hole accesses I mention in the PATCH0 blurb are just from
Linux as I was doing direct kernel boot, we can't get better than that
(if PCI is in the game of course). Firmware (qboot, seabios,...) can
only add more. I *think* the plan is to eventually switch them all to
MMCFG, at least for KVM guests, by default but we need something to put
to the advertisement.

We can, in theory, short-circuit PIO in KVM instead but:
- We will need a completely different API
- We will never be able to reach the speed of the exit-less 'single 0xff
page' solution (see my RFC).

--
Vitaly

2020-09-04 06:16:13

by Sean Christopherson

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Wed, Sep 02, 2020 at 10:59:20AM +0200, Vitaly Kuznetsov wrote:
> Peter Xu <[email protected]> writes:
> > My whole point was more about trying to understand the problem behind.
> > Providing a fast path for reading pci holes seems to be reasonable as is,
> > however it's just that I'm confused on why there're so many reads on the pci
> > holes after all. Another important question is I'm wondering how this series
> > will finally help the use case of microvm. I'm not sure I get the whole point
> > of it, but... if microvm is the major use case of this, it would be good to
> > provide some quick numbers on those if possible.
> >
> > For example, IIUC microvm uses qboot (as a better alternative than seabios) for
> > fast boot, and qboot has:
> >
> > https://github.com/bonzini/qboot/blob/master/pci.c#L20
> >
> > I'm kind of curious whether qboot will still be used when this series is used
> > with microvm VMs? Since those are still at least PIO based.
>
> I'm afraid there is no 'grand plan' for everything at this moment :-(
> For traditional VMs 0.04 sec per boot is negligible and definitely not
> worth adding a feature, memory requirements are also very
> different. When it comes to microvm-style usage things change.
>
> '8193' PCI hole accesses I mention in the PATCH0 blurb are just from
> Linux as I was doing direct kernel boot, we can't get better than that
> (if PCI is in the game of course). Firmware (qboot, seabios,...) can
> only add more. I *think* the plan is to eventually switch them all to
> MMCFG, at least for KVM guests, by default but we need something to put
> to the advertisement.

I see a similar ~8k PCI hole reads with a -kernel boot w/ OVMF. All but 60
of those are from pcibios_fixup_peer_bridges(), and all are from the kernel.
My understanding is that pcibios_fixup_peer_bridges() is useful if and only
if there are multiple root buses. And AFAICT, when running under QEMU, the only
way for there to be multiple buses is if there is an explicit bridge
created ("pxb" or "pxb-pcie"). Based on the cover letter from those[*], the
main reason for creating a bridge is to handle pinned CPUs on a NUMA system
with pass-through devices. That use case seems highly unlikely to cross
paths with micro VMs, i.e. micro VMs will only ever have a single bus.
Unless I'm mistaken, microvm doesn't even support PCI, does it?

If all of the above is true, this can be handled by adding "pci=lastbus=0"
as a guest kernel param to override its scanning of buses. And couldn't
that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
to the end user?

[*] https://www.redhat.com/archives/libvir-list/2016-March/msg01213.html

2020-09-04 07:22:51

by Gerd Hoffmann

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Hi,

> attached. Though those should be far less than 8000+, and those should also be
> pio rather than mmio.

Well, seabios 1.14 (qemu 5.1) added mmio support (to speed up boot a
little bit, mmio is one and pio is two vmexits).

Depends on q35 obviously, and a few pio accesses remain b/c seabios has
to first setup mmconfig.

take care,
Gerd

2020-09-04 07:30:28

by Gerd Hoffmann

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Hi,

> Unless I'm mistaken, microvm doesn't even support PCI, does it?

Correct, no pci support right now.

We could probably wire up ecam (arm/virt style) for pcie support, once
the acpi support for microvm finally landed (we need acpi for that
because otherwise the kernel wouldn't find the pcie bus).

Question is whether there is a good reason to do so. Why would someone
prefer microvm with pcie support over q35?

> If all of the above is true, this can be handled by adding "pci=lastbus=0"
> as a guest kernel param to override its scanning of buses. And couldn't
> that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> to the end user?

microvm_fix_kernel_cmdline() is a hack, not a solution.

Beside that I doubt this has much of an effect on microvm because
it doesn't support pcie in the first place.

take care,
Gerd

2020-09-04 16:01:22

by Sean Christopherson

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> Hi,
>
> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
>
> Correct, no pci support right now.
>
> We could probably wire up ecam (arm/virt style) for pcie support, once
> the acpi support for microvm finally landed (we need acpi for that
> because otherwise the kernel wouldn't find the pcie bus).
>
> Question is whether there is a good reason to do so. Why would someone
> prefer microvm with pcie support over q35?
>
> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > as a guest kernel param to override its scanning of buses. And couldn't
> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > to the end user?
>
> microvm_fix_kernel_cmdline() is a hack, not a solution.
>
> Beside that I doubt this has much of an effect on microvm because
> it doesn't support pcie in the first place.

I am so confused. Vitaly, can you clarify exactly what QEMU VM type this
series is intended to help? If this is for microvm, then why is the guest
doing PCI scanning in the first place? If it's for q35, why is the
justification for microvm-like workloads?

Either way, I think it makes sense explore other options before throwing
something into KVM, e.g. modifying guest command line, adding a KVM hint,
"fixing" QEMU, etc...

2020-09-07 08:38:47

by Vitaly Kuznetsov

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Sean Christopherson <[email protected]> writes:

> On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
>> Hi,
>>
>> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
>>
>> Correct, no pci support right now.
>>
>> We could probably wire up ecam (arm/virt style) for pcie support, once
>> the acpi support for microvm finally landed (we need acpi for that
>> because otherwise the kernel wouldn't find the pcie bus).
>>
>> Question is whether there is a good reason to do so. Why would someone
>> prefer microvm with pcie support over q35?
>>
>> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
>> > as a guest kernel param to override its scanning of buses. And couldn't
>> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
>> > to the end user?
>>
>> microvm_fix_kernel_cmdline() is a hack, not a solution.
>>
>> Beside that I doubt this has much of an effect on microvm because
>> it doesn't support pcie in the first place.
>
> I am so confused. Vitaly, can you clarify exactly what QEMU VM type this
> series is intended to help? If this is for microvm, then why is the guest
> doing PCI scanning in the first place? If it's for q35, why is the
> justification for microvm-like workloads?

I'm not exactly sure about the plans for particular machine types, the
intention was to use this for pcie in QEMU in general so whatever
machine type uses pcie will benefit.

Now, it seems that we have a more sophisticated landscape. The
optimization will only make sense to speed up boot so all 'traditional'
VM types with 'traditional' firmware are out of
question. 'Container-like' VMs seem to avoid PCI for now, I'm not sure
if it's because they're in early stages of their development, because
they can get away without PCI or, actually, because of slowness at boot
(which we're trying to tackle with this feature). I'd definitely like to
hear more what people think about this.

>
> Either way, I think it makes sense explore other options before throwing
> something into KVM, e.g. modifying guest command line, adding a KVM hint,
> "fixing" QEMU, etc...
>

Initially, this feature looked like a small and straightforward
(micro-)optimization to me: memory regions with 'PCI hole' semantics do
exist and we can speed up access to them. Ideally, I'd like to find
other 'constant memory' regions requiring fast access and come up with
an interface to create them in KVM but so far nothing interesting came
up...

--
Vitaly

2020-09-07 10:53:52

by Michael S. Tsirkin

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> Hi,
>
> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
>
> Correct, no pci support right now.
>
> We could probably wire up ecam (arm/virt style) for pcie support, once
> the acpi support for microvm finally landed (we need acpi for that
> because otherwise the kernel wouldn't find the pcie bus).
>
> Question is whether there is a good reason to do so. Why would someone
> prefer microvm with pcie support over q35?

The usual reasons to use pcie apply to microvm just the same.
E.g.: pass through of pcie devices?

> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > as a guest kernel param to override its scanning of buses. And couldn't
> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > to the end user?
>
> microvm_fix_kernel_cmdline() is a hack, not a solution.
>
> Beside that I doubt this has much of an effect on microvm because
> it doesn't support pcie in the first place.
>
> take care,
> Gerd

2020-09-07 10:56:55

by Michael S. Tsirkin

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Thu, Sep 03, 2020 at 11:12:12PM -0700, Sean Christopherson wrote:
> On Wed, Sep 02, 2020 at 10:59:20AM +0200, Vitaly Kuznetsov wrote:
> > Peter Xu <[email protected]> writes:
> > > My whole point was more about trying to understand the problem behind.
> > > Providing a fast path for reading pci holes seems to be reasonable as is,
> > > however it's just that I'm confused on why there're so many reads on the pci
> > > holes after all. Another important question is I'm wondering how this series
> > > will finally help the use case of microvm. I'm not sure I get the whole point
> > > of it, but... if microvm is the major use case of this, it would be good to
> > > provide some quick numbers on those if possible.
> > >
> > > For example, IIUC microvm uses qboot (as a better alternative than seabios) for
> > > fast boot, and qboot has:
> > >
> > > https://github.com/bonzini/qboot/blob/master/pci.c#L20
> > >
> > > I'm kind of curious whether qboot will still be used when this series is used
> > > with microvm VMs? Since those are still at least PIO based.
> >
> > I'm afraid there is no 'grand plan' for everything at this moment :-(
> > For traditional VMs 0.04 sec per boot is negligible and definitely not
> > worth adding a feature, memory requirements are also very
> > different. When it comes to microvm-style usage things change.
> >
> > '8193' PCI hole accesses I mention in the PATCH0 blurb are just from
> > Linux as I was doing direct kernel boot, we can't get better than that
> > (if PCI is in the game of course). Firmware (qboot, seabios,...) can
> > only add more. I *think* the plan is to eventually switch them all to
> > MMCFG, at least for KVM guests, by default but we need something to put
> > to the advertisement.
>
> I see a similar ~8k PCI hole reads with a -kernel boot w/ OVMF. All but 60
> of those are from pcibios_fixup_peer_bridges(), and all are from the kernel.
> My understanding is that pcibios_fixup_peer_bridges() is useful if and only
> if there are multiple root buses. And AFAICT, when running under QEMU, the only
> way for there to be multiple buses is if there is an explicit bridge
> created ("pxb" or "pxb-pcie"). Based on the cover letter from those[*], the
> main reason for creating a bridge is to handle pinned CPUs on a NUMA system
> with pass-through devices. That use case seems highly unlikely to cross
> paths with micro VMs, i.e. micro VMs will only ever have a single bus.

My position is it's not all black and white, workloads do not
cleanly partition to these that care about boot speed and those
that don't. So IMHO we care about boot speed with pcie even if
microvm does not use it at the moment.


> Unless I'm mistaken, microvm doesn't even support PCI, does it?
>
> If all of the above is true, this can be handled by adding "pci=lastbus=0"
> as a guest kernel param to override its scanning of buses. And couldn't
> that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> to the end user?
>
> [*] https://www.redhat.com/archives/libvir-list/2016-March/msg01213.html

2020-09-07 11:37:37

by Michael S. Tsirkin

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Mon, Sep 07, 2020 at 10:37:39AM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> >> Hi,
> >>
> >> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> >>
> >> Correct, no pci support right now.
> >>
> >> We could probably wire up ecam (arm/virt style) for pcie support, once
> >> the acpi support for microvm finally landed (we need acpi for that
> >> because otherwise the kernel wouldn't find the pcie bus).
> >>
> >> Question is whether there is a good reason to do so. Why would someone
> >> prefer microvm with pcie support over q35?
> >>
> >> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> >> > as a guest kernel param to override its scanning of buses. And couldn't
> >> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> >> > to the end user?
> >>
> >> microvm_fix_kernel_cmdline() is a hack, not a solution.
> >>
> >> Beside that I doubt this has much of an effect on microvm because
> >> it doesn't support pcie in the first place.
> >
> > I am so confused. Vitaly, can you clarify exactly what QEMU VM type this
> > series is intended to help? If this is for microvm, then why is the guest
> > doing PCI scanning in the first place? If it's for q35, why is the
> > justification for microvm-like workloads?
>
> I'm not exactly sure about the plans for particular machine types, the
> intention was to use this for pcie in QEMU in general so whatever
> machine type uses pcie will benefit.
>
> Now, it seems that we have a more sophisticated landscape. The
> optimization will only make sense to speed up boot so all 'traditional'
> VM types with 'traditional' firmware are out of
> question. 'Container-like' VMs seem to avoid PCI for now, I'm not sure
> if it's because they're in early stages of their development, because
> they can get away without PCI or, actually, because of slowness at boot
> (which we're trying to tackle with this feature). I'd definitely like to
> hear more what people think about this.

I suspect microvms will need pci eventually. I would much rather KVM
had an exit-less discovery mechanism in place by then because
learning from history if it doesn't they will do some kind of
hack on the kernel command line, and everyone will be stuck
supporting that for years ...

> >
> > Either way, I think it makes sense explore other options before throwing
> > something into KVM, e.g. modifying guest command line, adding a KVM hint,
> > "fixing" QEMU, etc...
> >
>
> Initially, this feature looked like a small and straightforward
> (micro-)optimization to me: memory regions with 'PCI hole' semantics do
> exist and we can speed up access to them. Ideally, I'd like to find
> other 'constant memory' regions requiring fast access and come up with
> an interface to create them in KVM but so far nothing interesting came
> up...

True, me neither.

> --
> Vitaly

2020-09-11 17:01:49

by Sean Christopherson

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Mon, Sep 07, 2020 at 07:32:23AM -0400, Michael S. Tsirkin wrote:
> On Mon, Sep 07, 2020 at 10:37:39AM +0200, Vitaly Kuznetsov wrote:
> > Sean Christopherson <[email protected]> writes:
> >
> > > On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> > >> Hi,
> > >>
> > >> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> > >>
> > >> Correct, no pci support right now.
> > >>
> > >> We could probably wire up ecam (arm/virt style) for pcie support, once
> > >> the acpi support for microvm finally landed (we need acpi for that
> > >> because otherwise the kernel wouldn't find the pcie bus).
> > >>
> > >> Question is whether there is a good reason to do so. Why would someone
> > >> prefer microvm with pcie support over q35?
> > >>
> > >> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > >> > as a guest kernel param to override its scanning of buses. And couldn't
> > >> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > >> > to the end user?
> > >>
> > >> microvm_fix_kernel_cmdline() is a hack, not a solution.
> > >>
> > >> Beside that I doubt this has much of an effect on microvm because
> > >> it doesn't support pcie in the first place.
> > >
> > > I am so confused. Vitaly, can you clarify exactly what QEMU VM type this
> > > series is intended to help? If this is for microvm, then why is the guest
> > > doing PCI scanning in the first place? If it's for q35, why is the
> > > justification for microvm-like workloads?
> >
> > I'm not exactly sure about the plans for particular machine types, the
> > intention was to use this for pcie in QEMU in general so whatever
> > machine type uses pcie will benefit.
> >
> > Now, it seems that we have a more sophisticated landscape. The
> > optimization will only make sense to speed up boot so all 'traditional'
> > VM types with 'traditional' firmware are out of
> > question. 'Container-like' VMs seem to avoid PCI for now, I'm not sure
> > if it's because they're in early stages of their development, because
> > they can get away without PCI or, actually, because of slowness at boot
> > (which we're trying to tackle with this feature). I'd definitely like to
> > hear more what people think about this.
>
> I suspect microvms will need pci eventually. I would much rather KVM
> had an exit-less discovery mechanism in place by then because
> learning from history if it doesn't they will do some kind of
> hack on the kernel command line, and everyone will be stuck
> supporting that for years ...

Is it not an option for the VMM to "accurately" enumerate the number of buses?
E.g. if the VMM has devices on only bus 0, then enumerate that there is one
bus so that the guest doesn't try and probe devices that can't possibly exist.
Or is that completely non-sensical and/or violate PCIe spec?

2020-09-18 08:54:35

by Gerd Hoffmann

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Hi,

> I see a similar ~8k PCI hole reads with a -kernel boot w/ OVMF. All but 60
> of those are from pcibios_fixup_peer_bridges(), and all are from the kernel.

pcibios_fixup_peer_bridges() looks at pcibios_last_bus, and that in turn
seems to be set according to the mmconfig size (in
arch/x86/pci/mmconfig-shared.c).

So, maybe we just need to declare a smaller mmconfig window in the acpi
tables, depending on the number of pci busses actually used ...
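
For reference, the MCFG allocation entry involved looks roughly like this
(layout per the ACPI spec; shown only as a sketch):

#include <stdint.h>

struct mcfg_allocation {
	uint64_t base_address;		/* ECAM base for this segment */
	uint16_t pci_segment;		/* PCI segment group number */
	uint8_t  start_bus_number;	/* first bus covered by the window */
	uint8_t  end_bus_number;	/* last bus; shrinking this to the
					 * number of buses actually present
					 * limits the guest's bus scan */
	uint32_t reserved;
} __attribute__((packed));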

> If all of the above is true, this can be handled by adding "pci=lastbus=0"

... so we don't need manual quirks like this?

take care,
Gerd

2020-09-18 09:37:54

by Gerd Hoffmann

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

Hi,

> > We could probably wire up ecam (arm/virt style) for pcie support, once
> > the acpi support for microvm finally landed (we need acpi for that
> > because otherwise the kernel wouldn't find the pcie bus).
> >
> > Question is whether there is a good reason to do so. Why would someone
> > prefer microvm with pcie support over q35?
>
> The usual reasons to use pcie apply to microvm just the same.
> E.g.: pass through of pcie devices?

Playground:
https://git.kraxel.org/cgit/qemu/log/?h=sirius/microvm-usb

Adds support for usb and pcie (use -machine microvm,usb=on,pcie=on
to enable). Reuses the gpex used on arm/aarch64. Seems to work ok
on a quick test.

Not fully sure how to deal correctly with ioports. The gpex device
has a mmio window for the io address space. Will that approach work
on x86 too? Anyway, just not having an ioport range seems to be a
valid configuration, so I've just disabled them for now ...

take care,
Gerd

2020-09-18 12:36:37

by Michael S. Tsirkin

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Fri, Sep 11, 2020 at 10:00:31AM -0700, Sean Christopherson wrote:
> On Mon, Sep 07, 2020 at 07:32:23AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Sep 07, 2020 at 10:37:39AM +0200, Vitaly Kuznetsov wrote:
> > > Sean Christopherson <[email protected]> writes:
> > >
> > > > On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> > > >> Hi,
> > > >>
> > > >> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> > > >>
> > > >> Correct, no pci support right now.
> > > >>
> > > >> We could probably wire up ecam (arm/virt style) for pcie support, once
> > > >> the acpi support for microvm finally landed (we need acpi for that
> > > >> because otherwise the kernel wouldn't find the pcie bus).
> > > >>
> > > >> Question is whether there is a good reason to do so. Why would someone
> > > >> prefer microvm with pcie support over q35?
> > > >>
> > > >> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > > >> > as a guest kernel param to override its scanning of buses. And couldn't
> > > >> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > > >> > to the end user?
> > > >>
> > > >> microvm_fix_kernel_cmdline() is a hack, not a solution.
> > > >>
> > > >> Beside that I doubt this has much of an effect on microvm because
> > > >> it doesn't support pcie in the first place.
> > > >
> > > > I am so confused. Vitaly, can you clarify exactly what QEMU VM type this
> > > > series is intended to help? If this is for microvm, then why is the guest
> > > > doing PCI scanning in the first place? If it's for q35, why is the
> > > > justification for microvm-like workloads?
> > >
> > > I'm not exactly sure about the plans for particular machine types, the
> > > intention was to use this for pcie in QEMU in general so whatever
> > > machine type uses pcie will benefit.
> > >
> > > Now, it seems that we have a more sophisticated landscape. The
> > > optimization will only make sense to speed up boot so all 'traditional'
> > > VM types with 'traditional' firmware are out of
> > > question. 'Container-like' VMs seem to avoid PCI for now, I'm not sure
> > > if it's because they're in early stages of their development, because
> > > they can get away without PCI or, actually, because of slowness at boot
> > > (which we're trying to tackle with this feature). I'd definitely like to
> > > hear more what people think about this.
> >
> > I suspect microvms will need pci eventually. I would much rather KVM
> > had an exit-less discovery mechanism in place by then because
> > learning from history if it doesn't they will do some kind of
> > hack on the kernel command line, and everyone will be stuck
> > supporting that for years ...
>
> Is it not an option for the VMM to "accurately" enumerate the number of buses?
> E.g. if the VMM has devices on only bus 0, then enumerate that there is one
> bus so that the guest doesn't try and probe devices that can't possibly exist.
> Or is that completely non-sensical and/or violate PCIe spec?


There is some tension here, in that one way to make guest boot faster
is to defer hotplug of devices until after it booted.

--
MST

2020-09-21 17:22:58

by Sean Christopherson

Subject: Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

On Fri, Sep 18, 2020 at 08:34:37AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 11, 2020 at 10:00:31AM -0700, Sean Christopherson wrote:
> > On Mon, Sep 07, 2020 at 07:32:23AM -0400, Michael S. Tsirkin wrote:
> > > I suspect microvms will need pci eventually. I would much rather KVM
> > > had an exit-less discovery mechanism in place by then because
> > > learning from history if it doesn't they will do some kind of
> > > hack on the kernel command line, and everyone will be stuck
> > > supporting that for years ...
> >
> > Is it not an option for the VMM to "accurately" enumerate the number of buses?
> > E.g. if the VMM has devices on only bus 0, then enumerate that there is one
> > bus so that the guest doesn't try and probe devices that can't possibly exist.
> > Or is that completely non-sensical and/or violate PCIe spec?
>
>
> There is some tension here, in that one way to make guest boot faster
> is to defer hotplug of devices until after it booted.

Sorry, I didn't follow that, probably because my PCI knowledge is lacking.
What does device hotplug have to do with the number of buses enumerated to
the guest?