I have a question about a problem introduced by this commit:
c275a57f5ec3056f732843b11659d892235faff7
"xen/balloon: Set balloon's initial state to number of existing RAM pages"
It changed the xen balloon current_pages calculation to start with the
number of physical pages in the system, instead of max_pfn. Since
get_num_physpages() does not include holes, it's always less than the
e820 map's max_pfn.
However, the problem that commit introduced is, if the hypervisor sets
the balloon target to equal to the e820 map's max_pfn, then the
balloon target will *always* be higher than the initial current pages.
Even if the hypervisor sets the target to (e820 max_pfn - holes), if
the OS adds any holes, the balloon target will be higher than the
current pages. This is the situation, for example, for Amazon AWS
instances. The result is, the xen balloon will always immediately
hotplug some memory at boot, but then make only (max_pfn -
get_num_physpages()) available to the system.
This balloon-hotplugged memory can cause problems, if the hypervisor
wasn't expecting it; specifically, the system's physical page
addresses now will exceed the e820 map's max_pfn, due to the
balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
DMA to/from those physical pages above the e820 max_pfn, it causes
problems. For example:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
The additional small amount of balloon memory can cause other problems
as well, for example:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
Anyway, I'd like to ask, was the original commit added because
hypervisors are supposed to set their balloon target to the guest
system's number of phys pages (max_pfn - holes)? The mailing list
discussion and commit description seem to indicate that. However I'm
not sure how that is possible, because the kernel reserves its own
holes, regardless of any predefined holes in the e820 map; for
example, the kernel reserves 64k (by default) at phys addr 0 (the
amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
the hypervisor really has no way to know what the "right" target to
specify is; unless it knows the exact guest OS and kernel version, and
kernel config values, it will never be able to correctly specify its
target to be exactly (e820 max_pfn - all holes).
Should this commit be reverted? Should the xen balloon target be
adjusted based on kernel-added e820 holes? Should something else be
done?
For context, Amazon Linux has simply disabled Xen ballooning
completely. Likewise, we're planning to disable Xen ballooning in the
Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
Ubuntu kernels). However, if reverting this patch makes sense in a
bigger context (i.e. Xen users besides AWS), that would allow more
Ubuntu kernels to work correctly in AWS instances.
On 03/22/2017 05:16 PM, Dan Streetman wrote:
> I have a question about a problem introduced by this commit:
> c275a57f5ec3056f732843b11659d892235faff7
> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
>
> It changed the xen balloon current_pages calculation to start with the
> number of physical pages in the system, instead of max_pfn. Since
> get_num_physpages() does not include holes, it's always less than the
> e820 map's max_pfn.
>
> However, the problem that commit introduced is, if the hypervisor sets
> the balloon target to equal to the e820 map's max_pfn, then the
> balloon target will *always* be higher than the initial current pages.
> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
> the OS adds any holes, the balloon target will be higher than the
> current pages. This is the situation, for example, for Amazon AWS
> instances. The result is, the xen balloon will always immediately
> hotplug some memory at boot, but then make only (max_pfn -
> get_num_physpages()) available to the system.
>
> This balloon-hotplugged memory can cause problems, if the hypervisor
> wasn't expecting it; specifically, the system's physical page
> addresses now will exceed the e820 map's max_pfn, due to the
> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
> DMA to/from those physical pages above the e820 max_pfn, it causes
> problems. For example:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>
> The additional small amount of balloon memory can cause other problems
> as well, for example:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>
> Anyway, I'd like to ask, was the original commit added because
> hypervisors are supposed to set their balloon target to the guest
> system's number of phys pages (max_pfn - holes)? The mailing list
> discussion and commit description seem to indicate that.
IIRC the problem that this was trying to fix was that since max_pfn
includes holes, upon booting we'd immediately balloon down by the
(typically, MMIO) hole size.
If you boot a guest with ~4+GB memory you should see this.
> However I'm
> not sure how that is possible, because the kernel reserves its own
> holes, regardless of any predefined holes in the e820 map; for
> example, the kernel reserves 64k (by default) at phys addr 0 (the
> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
> the hypervisor really has no way to know what the "right" target to
> specify is; unless it knows the exact guest OS and kernel version, and
> kernel config values, it will never be able to correctly specify its
> target to be exactly (e820 max_pfn - all holes).
>
> Should this commit be reverted? Should the xen balloon target be
> adjusted based on kernel-added e820 holes?
I think the second one but shouldn't current_pages be updated, and not
the target? The latter is set by Xen (toolstack, via xenstore usually).
Also, the bugs above (at least one of them) talk about NVMe and I wonder
whether the memory that they add is of RAM type --- I believe it has its
own type and so perhaps that introduces additional inconsistencies. AWS
may have added their own support for that, which we don't have upstream yet.
-boris
> Should something else be
> done?
>
> For context, Amazon Linux has simply disabled Xen ballooning
> completely. Likewise, we're planning to disable Xen ballooning in the
> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
> Ubuntu kernels). However, if reverting this patch makes sense in a
> bigger context (i.e. Xen users besides AWS), that would allow more
> Ubuntu kernels to work correctly in AWS instances.
>
On 23/03/17 03:13, Boris Ostrovsky wrote:
>
>
> On 03/22/2017 05:16 PM, Dan Streetman wrote:
>> I have a question about a problem introduced by this commit:
>> c275a57f5ec3056f732843b11659d892235faff7
>> "xen/balloon: Set balloon's initial state to number of existing RAM
>> pages"
>>
>> It changed the xen balloon current_pages calculation to start with the
>> number of physical pages in the system, instead of max_pfn. Since
>> get_num_physpages() does not include holes, it's always less than the
>> e820 map's max_pfn.
>>
>> However, the problem that commit introduced is, if the hypervisor sets
>> the balloon target to equal to the e820 map's max_pfn, then the
>> balloon target will *always* be higher than the initial current pages.
>> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>> the OS adds any holes, the balloon target will be higher than the
>> current pages. This is the situation, for example, for Amazon AWS
>> instances. The result is, the xen balloon will always immediately
>> hotplug some memory at boot, but then make only (max_pfn -
>> get_num_physpages()) available to the system.
>>
>> This balloon-hotplugged memory can cause problems, if the hypervisor
>> wasn't expecting it; specifically, the system's physical page
>> addresses now will exceed the e820 map's max_pfn, due to the
>> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
>> DMA to/from those physical pages above the e820 max_pfn, it causes
>> problems. For example:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>>
>> The additional small amount of balloon memory can cause other problems
>> as well, for example:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>>
>> Anyway, I'd like to ask, was the original commit added because
>> hypervisors are supposed to set their balloon target to the guest
>> system's number of phys pages (max_pfn - holes)? The mailing list
>> discussion and commit description seem to indicate that.
>
>
> IIRC the problem that this was trying to fix was that since max_pfn
> includes holes, upon booting we'd immediately balloon down by the
> (typically, MMIO) hole size.
>
> If you boot a guest with ~4+GB memory you should see this.
>
>
>> However I'm
>> not sure how that is possible, because the kernel reserves its own
>> holes, regardless of any predefined holes in the e820 map; for
>> example, the kernel reserves 64k (by default) at phys addr 0 (the
>> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
>> the hypervisor really has no way to know what the "right" target to
>> specify is; unless it knows the exact guest OS and kernel version, and
>> kernel config values, it will never be able to correctly specify its
>> target to be exactly (e820 max_pfn - all holes).
>>
>> Should this commit be reverted? Should the xen balloon target be
>> adjusted based on kernel-added e820 holes?
>
> I think the second one but shouldn't current_pages be updated, and not
> the target? The latter is set by Xen (toolstack, via xenstore usually).
Right.
Looking into a HVM domU I can't see any problem related to
CONFIG_X86_RESERVE_LOW: it is set to 64 on my system. The domU is
configured with 2048 MB of RAM, 8MB being video RAM. Looking into
/sys/devices/system/xen_memory/xen_memory0 I can see the current
size and target size do match: both are 2088960 kB (2 GB - 8 MB).
Ballooning down and up to 2048 MB again doesn't change the picture.
So which additional holes are added by the kernel on AWS via which
functions?
Juergen
On Thu, Mar 23, 2017 at 3:56 AM, Juergen Gross <[email protected]> wrote:
> On 23/03/17 03:13, Boris Ostrovsky wrote:
>>
>>
>> On 03/22/2017 05:16 PM, Dan Streetman wrote:
>>> I have a question about a problem introduced by this commit:
>>> c275a57f5ec3056f732843b11659d892235faff7
>>> "xen/balloon: Set balloon's initial state to number of existing RAM
>>> pages"
>>>
>>> It changed the xen balloon current_pages calculation to start with the
>>> number of physical pages in the system, instead of max_pfn. Since
>>> get_num_physpages() does not include holes, it's always less than the
>>> e820 map's max_pfn.
>>>
>>> However, the problem that commit introduced is, if the hypervisor sets
>>> the balloon target to equal to the e820 map's max_pfn, then the
>>> balloon target will *always* be higher than the initial current pages.
>>> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>>> the OS adds any holes, the balloon target will be higher than the
>>> current pages. This is the situation, for example, for Amazon AWS
>>> instances. The result is, the xen balloon will always immediately
>>> hotplug some memory at boot, but then make only (max_pfn -
>>> get_num_physpages()) available to the system.
>>>
>>> This balloon-hotplugged memory can cause problems, if the hypervisor
>>> wasn't expecting it; specifically, the system's physical page
>>> addresses now will exceed the e820 map's max_pfn, due to the
>>> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
>>> DMA to/from those physical pages above the e820 max_pfn, it causes
>>> problems. For example:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>>>
>>> The additional small amount of balloon memory can cause other problems
>>> as well, for example:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>>>
>>> Anyway, I'd like to ask, was the original commit added because
>>> hypervisors are supposed to set their balloon target to the guest
>>> system's number of phys pages (max_pfn - holes)? The mailing list
>>> discussion and commit description seem to indicate that.
>>
>>
>> IIRC the problem that this was trying to fix was that since max_pfn
>> includes holes, upon booting we'd immediately balloon down by the
>> (typically, MMIO) hole size.
>>
>> If you boot a guest with ~4+GB memory you should see this.
>>
>>
>>> However I'm
>>> not sure how that is possible, because the kernel reserves its own
>>> holes, regardless of any predefined holes in the e820 map; for
>>> example, the kernel reserves 64k (by default) at phys addr 0 (the
>>> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
>>> the hypervisor really has no way to know what the "right" target to
>>> specify is; unless it knows the exact guest OS and kernel version, and
>>> kernel config values, it will never be able to correctly specify its
>>> target to be exactly (e820 max_pfn - all holes).
>>>
>>> Should this commit be reverted? Should the xen balloon target be
>>> adjusted based on kernel-added e820 holes?
>>
>> I think the second one but shouldn't current_pages be updated, and not
>> the target? The latter is set by Xen (toolstack, via xenstore usually).
>
> Right.
>
> Looking into a HVM domU I can't see any problem related to
> CONFIG_X86_RESERVE_LOW: it is set to 64 on my system. The domU is
sorry I brought that up; I was only giving an example. It's not
directly relevant to this and may have distracted from the actual
problem; in fact on closer inspection, the X86_RESERVE_LOW is using
memblock_reserve(), which removes it from managed memory but not the
e820 map (and thus doesn't remove it from get_num_physpages()). Only
phys page 0 is actually reserved in the e820 map.
> configured with 2048 MB of RAM, 8MB being video RAM. Looking into
> /sys/devices/system/xen_memory/xen_memory0 I can see the current
> size and target size do match: both are 2088960 kB (2 GB - 8 MB).
>
> Ballooning down and up to 2048 MB again doesn't change the picture.
>
> So which additional holes are added by the kernel on AWS via which
> functions?
I'll use two AWS types as examples, t2.micro (1G mem) and t2.large (8G mem).
In the micro, the results of ballooning are obvious, because the
hotplugged memory always goes into the Normal zone; but since the base
memory is only 1g, it's contained entirely in the DMA32/DMA zones. So
we get:
$ grep -E '(start_pfn|present|spanned|managed)' /proc/zoneinfo
spanned 4095
present 3997
managed 3976
start_pfn: 1
spanned 258048
present 258048
managed 249606
start_pfn: 4096
spanned 32768
present 32768
managed 11
start_pfn: 262144
As you can see, none of the e820 memory went into the Normal zone; the
balloon driver hotpluged 128m (32k pages), but only made 11 pages
available. Having a memory zone with only 11 pages really screwed
with kswapd, since the zone's memory watermarks were all 0. That was
the second bug I referenced in my initial email.
Anyway, if we look at the large instance, you don't really notice the
additional balloon memory:
$ grep -E '(start_pfn|present|spanned|managed)' /proc/zoneinfo
spanned 4095
present 3997
managed 3976
start_pfn: 1
spanned 1044480
present 978944
managed 958778
start_pfn: 4096
spanned 1146880
present 1146880
managed 1080666
start_pfn: 1048576
but, doing the actual math shows the problem:
$ printf "%x\n" $[ 1048576 + 1146880 ]
218000
$ printf "%x\n" $[ 1048576 + 1080666 ]
207d5a
$ dmesg|grep e820
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000
[ 0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000
[ 0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices
[ 0.595083] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
so, we can see the balloon driver hotplugged those extra 0x8000 pages,
and made some of them available.
The target has been set to:
$ printf "%x\n" $( cat /sys/devices/system/xen_memory/xen_memory0/target )
200000000
while the e820 map provides:
$ printf "%x\n" $[ 0x210000000 - 0x100000000 + 0xf0000000 - 0x100000 +
0x9e000 - 0x1000 ]
1fff9d000
and current memory is:
/sys/devices/system/xen_memory/xen_memory0$ printf "%x\n" $[ $( cat
info/current_kb ) * 1024 ]
1fffa8000
so the balloon driver has added...
$ echo $[ ( 0x1fffa8000 - 0x1fff9d000 ) / 4096 ]
11
exactly 11 pages, just like the micro instance type. I'm not sure
where the balloon driver gets that 11 page calculation, nor am I sure
why the current_kb is actually less than the balloon target.
On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky
<[email protected]> wrote:
>
>
> On 03/22/2017 05:16 PM, Dan Streetman wrote:
>>
>> I have a question about a problem introduced by this commit:
>> c275a57f5ec3056f732843b11659d892235faff7
>> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
>>
>> It changed the xen balloon current_pages calculation to start with the
>> number of physical pages in the system, instead of max_pfn. Since
>> get_num_physpages() does not include holes, it's always less than the
>> e820 map's max_pfn.
>>
>> However, the problem that commit introduced is, if the hypervisor sets
>> the balloon target to equal to the e820 map's max_pfn, then the
>> balloon target will *always* be higher than the initial current pages.
>> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>> the OS adds any holes, the balloon target will be higher than the
>> current pages. This is the situation, for example, for Amazon AWS
>> instances. The result is, the xen balloon will always immediately
>> hotplug some memory at boot, but then make only (max_pfn -
>> get_num_physpages()) available to the system.
>>
>> This balloon-hotplugged memory can cause problems, if the hypervisor
>> wasn't expecting it; specifically, the system's physical page
>> addresses now will exceed the e820 map's max_pfn, due to the
>> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
>> DMA to/from those physical pages above the e820 max_pfn, it causes
>> problems. For example:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>>
>> The additional small amount of balloon memory can cause other problems
>> as well, for example:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>>
>> Anyway, I'd like to ask, was the original commit added because
>> hypervisors are supposed to set their balloon target to the guest
>> system's number of phys pages (max_pfn - holes)? The mailing list
>> discussion and commit description seem to indicate that.
>
>
>
> IIRC the problem that this was trying to fix was that since max_pfn includes
> holes, upon booting we'd immediately balloon down by the (typically, MMIO)
> hole size.
>
> If you boot a guest with ~4+GB memory you should see this.
>
>
>> However I'm
>> not sure how that is possible, because the kernel reserves its own
>> holes, regardless of any predefined holes in the e820 map; for
>> example, the kernel reserves 64k (by default) at phys addr 0 (the
>> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
>> the hypervisor really has no way to know what the "right" target to
>> specify is; unless it knows the exact guest OS and kernel version, and
>> kernel config values, it will never be able to correctly specify its
>> target to be exactly (e820 max_pfn - all holes).
>>
>> Should this commit be reverted? Should the xen balloon target be
>> adjusted based on kernel-added e820 holes?
>
>
> I think the second one but shouldn't current_pages be updated, and not the
> target? The latter is set by Xen (toolstack, via xenstore usually).
>
> Also, the bugs above (at least one of them) talk about NVMe and I wonder
> whether the memory that they add is of RAM type --- I believe it has its own
> type and so perhaps that introduces additional inconsistencies. AWS may have
> added their own support for that, which we don't have upstream yet.
The type of memory doesn't have anything to do with it.
The problem with NVMe is it's a passthrough device, so the guest talks
directly to the NVMe controller and does DMA with it. But the
hypervisor does swiotlb translation between the guest physical memory,
and the host physical memory, so that the NVMe device can correctly
DMA to the right memory in the host.
However, the hypervisor only has the guest's physical memory up to the
max e820 pfn mapped; it didn't expect the balloon driver to hotplug
any additional memory above the e820 max pfn, so when the NVMe driver
in the guest tries to tell the NVMe controller to DMA to that
balloon-hotplugged memory, the hypervisor fails the NVMe request,
because it can't do the guest-to-host phys mem mapping, since the
guest phys address is outside the expected max range.
>
> -boris
>
>
>
>> Should something else be
>> done?
>>
>> For context, Amazon Linux has simply disabled Xen ballooning
>> completely. Likewise, we're planning to disable Xen ballooning in the
>> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
>> Ubuntu kernels). However, if reverting this patch makes sense in a
>> bigger context (i.e. Xen users besides AWS), that would allow more
>> Ubuntu kernels to work correctly in AWS instances.
>>
>
On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote:
> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky
> <[email protected]> wrote:
> >
> >
> > On 03/22/2017 05:16 PM, Dan Streetman wrote:
> >>
> >> I have a question about a problem introduced by this commit:
> >> c275a57f5ec3056f732843b11659d892235faff7
> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
> >>
> >> It changed the xen balloon current_pages calculation to start with the
> >> number of physical pages in the system, instead of max_pfn. Since
> >> get_num_physpages() does not include holes, it's always less than the
> >> e820 map's max_pfn.
> >>
> >> However, the problem that commit introduced is, if the hypervisor sets
> >> the balloon target to equal to the e820 map's max_pfn, then the
> >> balloon target will *always* be higher than the initial current pages.
> >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
> >> the OS adds any holes, the balloon target will be higher than the
> >> current pages. This is the situation, for example, for Amazon AWS
> >> instances. The result is, the xen balloon will always immediately
> >> hotplug some memory at boot, but then make only (max_pfn -
> >> get_num_physpages()) available to the system.
> >>
> >> This balloon-hotplugged memory can cause problems, if the hypervisor
> >> wasn't expecting it; specifically, the system's physical page
> >> addresses now will exceed the e820 map's max_pfn, due to the
> >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
> >> DMA to/from those physical pages above the e820 max_pfn, it causes
> >> problems. For example:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
> >>
> >> The additional small amount of balloon memory can cause other problems
> >> as well, for example:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
> >>
> >> Anyway, I'd like to ask, was the original commit added because
> >> hypervisors are supposed to set their balloon target to the guest
> >> system's number of phys pages (max_pfn - holes)? The mailing list
> >> discussion and commit description seem to indicate that.
> >
> >
> >
> > IIRC the problem that this was trying to fix was that since max_pfn includes
> > holes, upon booting we'd immediately balloon down by the (typically, MMIO)
> > hole size.
> >
> > If you boot a guest with ~4+GB memory you should see this.
> >
> >
> >> However I'm
> >> not sure how that is possible, because the kernel reserves its own
> >> holes, regardless of any predefined holes in the e820 map; for
> >> example, the kernel reserves 64k (by default) at phys addr 0 (the
> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
> >> the hypervisor really has no way to know what the "right" target to
> >> specify is; unless it knows the exact guest OS and kernel version, and
> >> kernel config values, it will never be able to correctly specify its
> >> target to be exactly (e820 max_pfn - all holes).
> >>
> >> Should this commit be reverted? Should the xen balloon target be
> >> adjusted based on kernel-added e820 holes?
> >
> >
> > I think the second one but shouldn't current_pages be updated, and not the
> > target? The latter is set by Xen (toolstack, via xenstore usually).
> >
> > Also, the bugs above (at least one of them) talk about NVMe and I wonder
> > whether the memory that they add is of RAM type --- I believe it has its own
> > type and so perhaps that introduces additional inconsistencies. AWS may have
> > added their own support for that, which we don't have upstream yet.
>
> The type of memory doesn't have anything to do with it.
>
> The problem with NVMe is it's a passthrough device, so the guest talks
> directly to the NVMe controller and does DMA with it. But the
> hypervisor does swiotlb translation between the guest physical memory,
Um, the hypervisor does not have SWIOTLB support, only IOMMU support.
> and the host physical memory, so that the NVMe device can correctly
> DMA to the right memory in the host.
>
> However, the hypervisor only has the guest's physical memory up to the
> max e820 pfn mapped; it didn't expect the balloon driver to hotplug
> any additional memory above the e820 max pfn, so when the NVMe driver
> in the guest tries to tell the NVMe controller to DMA to that
> balloon-hotplugged memory, the hypervisor fails the NVMe request,
But when the memory hotplug happens the hypercalls are done to
raise the max pfn.
> because it can't do the guest-to-host phys mem mapping, since the
> guest phys address is outside the expected max range.
>
>
>
> >
> > -boris
> >
> >
> >
> >> Should something else be
> >> done?
> >>
> >> For context, Amazon Linux has simply disabled Xen ballooning
> >> completely. Likewise, we're planning to disable Xen ballooning in the
> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
> >> Ubuntu kernels). However, if reverting this patch makes sense in a
> >> bigger context (i.e. Xen users besides AWS), that would allow more
> >> Ubuntu kernels to work correctly in AWS instances.
> >>
> >
On Fri, Mar 24, 2017 at 5:10 PM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote:
>> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky
>> <[email protected]> wrote:
>> >
>> >
>> > On 03/22/2017 05:16 PM, Dan Streetman wrote:
>> >>
>> >> I have a question about a problem introduced by this commit:
>> >> c275a57f5ec3056f732843b11659d892235faff7
>> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
>> >>
>> >> It changed the xen balloon current_pages calculation to start with the
>> >> number of physical pages in the system, instead of max_pfn. Since
>> >> get_num_physpages() does not include holes, it's always less than the
>> >> e820 map's max_pfn.
>> >>
>> >> However, the problem that commit introduced is, if the hypervisor sets
>> >> the balloon target to equal to the e820 map's max_pfn, then the
>> >> balloon target will *always* be higher than the initial current pages.
>> >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>> >> the OS adds any holes, the balloon target will be higher than the
>> >> current pages. This is the situation, for example, for Amazon AWS
>> >> instances. The result is, the xen balloon will always immediately
>> >> hotplug some memory at boot, but then make only (max_pfn -
>> >> get_num_physpages()) available to the system.
>> >>
>> >> This balloon-hotplugged memory can cause problems, if the hypervisor
>> >> wasn't expecting it; specifically, the system's physical page
>> >> addresses now will exceed the e820 map's max_pfn, due to the
>> >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
>> >> DMA to/from those physical pages above the e820 max_pfn, it causes
>> >> problems. For example:
>> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>> >>
>> >> The additional small amount of balloon memory can cause other problems
>> >> as well, for example:
>> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>> >>
>> >> Anyway, I'd like to ask, was the original commit added because
>> >> hypervisors are supposed to set their balloon target to the guest
>> >> system's number of phys pages (max_pfn - holes)? The mailing list
>> >> discussion and commit description seem to indicate that.
>> >
>> >
>> >
>> > IIRC the problem that this was trying to fix was that since max_pfn includes
>> > holes, upon booting we'd immediately balloon down by the (typically, MMIO)
>> > hole size.
>> >
>> > If you boot a guest with ~4+GB memory you should see this.
>> >
>> >
>> >> However I'm
>> >> not sure how that is possible, because the kernel reserves its own
>> >> holes, regardless of any predefined holes in the e820 map; for
>> >> example, the kernel reserves 64k (by default) at phys addr 0 (the
>> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
>> >> the hypervisor really has no way to know what the "right" target to
>> >> specify is; unless it knows the exact guest OS and kernel version, and
>> >> kernel config values, it will never be able to correctly specify its
>> >> target to be exactly (e820 max_pfn - all holes).
>> >>
>> >> Should this commit be reverted? Should the xen balloon target be
>> >> adjusted based on kernel-added e820 holes?
>> >
>> >
>> > I think the second one but shouldn't current_pages be updated, and not the
>> > target? The latter is set by Xen (toolstack, via xenstore usually).
>> >
>> > Also, the bugs above (at least one of them) talk about NVMe and I wonder
>> > whether the memory that they add is of RAM type --- I believe it has its own
>> > type and so perhaps that introduces additional inconsistencies. AWS may have
>> > added their own support for that, which we don't have upstream yet.
>>
>> The type of memory doesn't have anything to do with it.
>>
>> The problem with NVMe is it's a passthrough device, so the guest talks
>> directly to the NVMe controller and does DMA with it. But the
>> hypervisor does swiotlb translation between the guest physical memory,
>
> Um, the hypervisor does not have SWIOTLB support, only IOMMU support.
heh, well I have no special insight into Amazon's hypervisor, so I
have no idea what underlying memory remapping mechanism it uses :)
>
>> and the host physical memory, so that the NVMe device can correctly
>> DMA to the right memory in the host.
>>
>> However, the hypervisor only has the guest's physical memory up to the
>> max e820 pfn mapped; it didn't expect the balloon driver to hotplug
>> any additional memory above the e820 max pfn, so when the NVMe driver
>> in the guest tries to tell the NVMe controller to DMA to that
>> balloon-hotplugged memory, the hypervisor fails the NVMe request,
>
> But when the memory hotplug happens the hypercalls are done to
> raise the max pfn.
well...all I can say is it rejects DMA above the e820 range. so this
very well may be a hypervisor bug, where it should add the balloon
memory region to whatever does the NVMe passthrough device iommu
mapping.
I think we can all agree that the *ideal* situation would be, for the
balloon driver to not immediately hotplug memory so it can add 11 more
pages, so maybe I just need to figure out why the balloon driver
thinks it needs 11 more pages, and fix that.
>
>> because it can't do the guest-to-host phys mem mapping, since the
>> guest phys address is outside the expected max range.
>>
>>
>>
>> >
>> > -boris
>> >
>> >
>> >
>> >> Should something else be
>> >> done?
>> >>
>> >> For context, Amazon Linux has simply disabled Xen ballooning
>> >> completely. Likewise, we're planning to disable Xen ballooning in the
>> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
>> >> Ubuntu kernels). However, if reverting this patch makes sense in a
>> >> bigger context (i.e. Xen users besides AWS), that would allow more
>> >> Ubuntu kernels to work correctly in AWS instances.
>> >>
>> >
>
> I think we can all agree that the *ideal* situation would be, for the
> balloon driver to not immediately hotplug memory so it can add 11 more
> pages, so maybe I just need to figure out why the balloon driver
> thinks it needs 11 more pages, and fix that.
How does the new memory appear in the guest? Via online_pages()?
Or is ballooning triggered from watch_target()?
-boris
On Fri, Mar 24, 2017 at 9:33 PM, Boris Ostrovsky
<[email protected]> wrote:
>
>>
>> I think we can all agree that the *ideal* situation would be, for the
>> balloon driver to not immediately hotplug memory so it can add 11 more
>> pages, so maybe I just need to figure out why the balloon driver
>> thinks it needs 11 more pages, and fix that.
>
>
>
> How does the new memory appear in the guest? Via online_pages()?
>
> Or is ballooning triggered from watch_target()?
yes, it's triggered from watch_target() which then calls
online_pages() with the new memory. I added some debug (all numbers
are in hex):
[ 0.500080] xen:balloon: Initialising balloon driver
[ 0.503027] xen:balloon: balloon_init: current/target pages 1fff9d
[ 0.504044] xen_balloon: Initialising balloon driver
[ 0.508046] xen_balloon: watch_target: new target 800000 kb
[ 0.508046] xen:balloon: balloon_set_new_target: target 200000
[ 0.524024] xen:balloon: current_credit: target pages 200000
current pages 1fff9d credit 63
[ 0.567055] xen:balloon: balloon_process: current_credit 63
[ 0.568005] xen:balloon: reserve_additional_memory: adding memory
resource for 8000 pages
[ 3.694443] online_pages: pfn 210000 nr_pages 8000 type 0
[ 3.701072] xen:balloon: current_credit: target pages 200000
current pages 1fff9d credit 63
[ 3.701074] xen:balloon: balloon_process: current_credit 63
[ 3.701075] xen:balloon: increase_reservation: nr_pages 63
[ 3.701170] xen:balloon: increase_reservation: done, current_pages 1fffa8
[ 3.701172] xen:balloon: current_credit: target pages 200000
current pages 1fffa8 credit 58
[ 3.701173] xen:balloon: balloon_process: current_credit 58
[ 3.701173] xen:balloon: increase_reservation: nr_pages 58
[ 3.701180] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0
[ 5.708085] xen:balloon: current_credit: target pages 200000
current pages 1fffa8 credit 58
[ 5.708088] xen:balloon: balloon_process: current_credit 58
[ 5.708089] xen:balloon: increase_reservation: nr_pages 58
[ 5.708106] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0
[ 9.716065] xen:balloon: current_credit: target pages 200000
current pages 1fffa8 credit 58
[ 9.716068] xen:balloon: balloon_process: current_credit 58
[ 9.716069] xen:balloon: increase_reservation: nr_pages 58
[ 9.716087] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0
and that continues forever at the max interval (32), since
max_retry_count is unlimited. So I think I understand things now;
first, the current_pages is set properly based on the e820 map:
$ dmesg|grep -i e820
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000
[ 0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000
[ 0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices
[ 0.528007] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
ubuntu@ip-172-31-60-112:~$ printf "%x\n" $[ 0x210000 - 0x100000 +
0xf0000 - 0x100 + 0x9e - 1 ]
1fff9d
then, the xen balloon notices its target has been set to 200000 by the
hypervisor. That target does account for the hole at 0xf0000 to
0x100000, but it doesn't account for the hole at 0xe0 to 0x100 ( 0x20
pages), nor the hole at 0x9e to 0xa0 ( 2 pages ), nor the unlisted
hole (that the kernel removes) at 0xa0 to 0xe0 ( 0x40 pages). That's
0x62 pages, plus the 1-page hole at addr 0 that the kernel always
reserves, is 0x63 pages of holes, which aren't accounted for in the
hypervisor's target.
so the balloon driver hotplugs the memory, and tries to increase its
reservation to provide the needed pages to get the current_pages up to
the target. However, when it calls the hypervisor to populate the
physmap, the hypervisor only allows 11 (0xb) pages to be populated;
all calls after that get back 0 from the hypervisor.
Do you think the hypervisor's balloon target should account for the
e820 holes (and for the kernel's added hole at addr 0)?
Alternately/additionally, if the hypervisor doesn't want to support
ballooning, should it just return error from the call to populate the
physmap, and not allow those 11 pages?
At this point, it doesn't seem to me like the kernel is doing anything
wrong, correct?
On 03/27/2017 03:57 PM, Dan Streetman wrote:
> On Fri, Mar 24, 2017 at 9:33 PM, Boris Ostrovsky
> <[email protected]> wrote:
>>
>>>
>>> I think we can all agree that the *ideal* situation would be, for the
>>> balloon driver to not immediately hotplug memory so it can add 11 more
>>> pages, so maybe I just need to figure out why the balloon driver
>>> thinks it needs 11 more pages, and fix that.
>>
>>
>>
>> How does the new memory appear in the guest? Via online_pages()?
>>
>> Or is ballooning triggered from watch_target()?
>
> yes, it's triggered from watch_target() which then calls
> online_pages() with the new memory. I added some debug (all numbers
> are in hex):
>
> [ 0.500080] xen:balloon: Initialising balloon driver
> [ 0.503027] xen:balloon: balloon_init: current/target pages 1fff9d
> [ 0.504044] xen_balloon: Initialising balloon driver
> [ 0.508046] xen_balloon: watch_target: new target 800000 kb
> [ 0.508046] xen:balloon: balloon_set_new_target: target 200000
> [ 0.524024] xen:balloon: current_credit: target pages 200000
> current pages 1fff9d credit 63
> [ 0.567055] xen:balloon: balloon_process: current_credit 63
> [ 0.568005] xen:balloon: reserve_additional_memory: adding memory
> resource for 8000 pages
> [ 3.694443] online_pages: pfn 210000 nr_pages 8000 type 0
> [ 3.701072] xen:balloon: current_credit: target pages 200000
> current pages 1fff9d credit 63
> [ 3.701074] xen:balloon: balloon_process: current_credit 63
> [ 3.701075] xen:balloon: increase_reservation: nr_pages 63
> [ 3.701170] xen:balloon: increase_reservation: done, current_pages 1fffa8
> [ 3.701172] xen:balloon: current_credit: target pages 200000
> current pages 1fffa8 credit 58
> [ 3.701173] xen:balloon: balloon_process: current_credit 58
> [ 3.701173] xen:balloon: increase_reservation: nr_pages 58
> [ 3.701180] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0
> [ 5.708085] xen:balloon: current_credit: target pages 200000
> current pages 1fffa8 credit 58
> [ 5.708088] xen:balloon: balloon_process: current_credit 58
> [ 5.708089] xen:balloon: increase_reservation: nr_pages 58
> [ 5.708106] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0
> [ 9.716065] xen:balloon: current_credit: target pages 200000
> current pages 1fffa8 credit 58
> [ 9.716068] xen:balloon: balloon_process: current_credit 58
> [ 9.716069] xen:balloon: increase_reservation: nr_pages 58
> [ 9.716087] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0
>
>
> and that continues forever at the max interval (32), since
> max_retry_count is unlimited. So I think I understand things now;
> first, the current_pages is set properly based on the e820 map:
>
> $ dmesg|grep -i e820
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
> [ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
> [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [ 0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000
> [ 0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000
> [ 0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices
> [ 0.528007] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
> ubuntu@ip-172-31-60-112:~$ printf "%x\n" $[ 0x210000 - 0x100000 +
> 0xf0000 - 0x100 + 0x9e - 1 ]
> 1fff9d
>
>
> then, the xen balloon notices its target has been set to 200000 by the
> hypervisor. That target does account for the hole at 0xf0000 to
> 0x100000, but it doesn't account for the hole at 0xe0 to 0x100 ( 0x20
> pages), nor the hole at 0x9e to 0xa0 ( 2 pages ), nor the unlisted
> hole (that the kernel removes) at 0xa0 to 0xe0 ( 0x40 pages). That's
> 0x62 pages, plus the 1-page hole at addr 0 that the kernel always
> reserves, is 0x63 pages of holes, which aren't accounted for in the
> hypervisor's target.
>
> so the balloon driver hotplugs the memory, and tries to increase its
> reservation to provide the needed pages to get the current_pages up to
> the target. However, when it calls the hypervisor to populate the
> physmap, the hypervisor only allows 11 (0xb) pages to be populated;
> all calls after that get back 0 from the hypervisor.
>
> Do you think the hypervisor's balloon target should account for the
> e820 holes (and for the kernel's added hole at addr 0)?
> Alternately/additionally, if the hypervisor doesn't want to support
> ballooning, should it just return error from the call to populate the
> physmap, and not allow those 11 pages?
>
> At this point, it doesn't seem to me like the kernel is doing anything
> wrong, correct?
>
I think there is indeed a disconnect between target memory (provided by
the toolstack) and current memory (i.e actual pages available to the guest).
For example
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
reserved
are missed in target calculation. The hvmloader marks them as RESERVED
(in build_e820_table()) but target value is not aware of this action.
And then the same problem repeats when kernel removes
0x000a0000-0x000fffff chunk.
(BTW, this is all happening before the new 0x8000 pages are onlined,
which takes places much later and is a separate and what looks to me an
unrelated event).
-boris
>>> On 28.03.17 at 03:57, <[email protected]> wrote:
> I think there is indeed a disconnect between target memory (provided by
> the toolstack) and current memory (i.e actual pages available to the guest).
>
> For example
>
> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
> reserved
>
> are missed in target calculation. The hvmloader marks them as RESERVED
> (in build_e820_table()) but target value is not aware of this action.
>
> And then the same problem repeats when kernel removes
> 0x000a0000-0x000fffff chunk.
But this is all in-guest behavior, i.e. nothing an entity outside the
guest (tool stack or hypervisor) should need to be aware of. That
said, there is still room for improvement in the tools I think:
Regions which architecturally aren't RAM (namely the
0xa0000-0xfffff range) would probably better not be accounted
for as RAM as far as ballooning is concerned. In the hypervisor,
otoh, all memory assigned to the guest (i.e. including such backing
ROMs) needs to be accounted.
Jan
On 03/28/2017 04:08 AM, Jan Beulich wrote:
>>>> On 28.03.17 at 03:57, <[email protected]> wrote:
>> I think there is indeed a disconnect between target memory (provided by
>> the toolstack) and current memory (i.e actual pages available to the guest).
>>
>> For example
>>
>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
>> reserved
>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
>> reserved
>>
>> are missed in target calculation. The hvmloader marks them as RESERVED
>> (in build_e820_table()) but target value is not aware of this action.
>>
>> And then the same problem repeats when kernel removes
>> 0x000a0000-0x000fffff chunk.
> But this is all in-guest behavior, i.e. nothing an entity outside the
> guest (tool stack or hypervisor) should need to be aware of. That
> said, there is still room for improvement in the tools I think:
> Regions which architecturally aren't RAM (namely the
> 0xa0000-0xfffff range) would probably better not be accounted
> for as RAM as far as ballooning is concerned. In the hypervisor,
> otoh, all memory assigned to the guest (i.e. including such backing
> ROMs) needs to be accounted.
On the Linux side we should not include in balloon calculations pages
reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1.
Which leaves hvmloader's special pages (and possibly memory under
0xA0000 which may get reserved). Can we pass this info to guests via
xenstore?
-boris
>>> On 28.03.17 at 16:27, <[email protected]> wrote:
> Which leaves hvmloader's special pages (and possibly memory under
> 0xA0000 which may get reserved). Can we pass this info to guests via
> xenstore?
I'm perhaps the wrong one to ask regarding xenstore, but for
in-guest communication this seems an at least strange approach
to me.
Jan
On 28/03/17 16:27, Boris Ostrovsky wrote:
> On 03/28/2017 04:08 AM, Jan Beulich wrote:
>>>>> On 28.03.17 at 03:57, <[email protected]> wrote:
>>> I think there is indeed a disconnect between target memory (provided by
>>> the toolstack) and current memory (i.e actual pages available to the guest).
>>>
>>> For example
>>>
>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
>>> reserved
>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
>>> reserved
>>>
>>> are missed in target calculation. The hvmloader marks them as RESERVED
>>> (in build_e820_table()) but target value is not aware of this action.
>>>
>>> And then the same problem repeats when kernel removes
>>> 0x000a0000-0x000fffff chunk.
>> But this is all in-guest behavior, i.e. nothing an entity outside the
>> guest (tool stack or hypervisor) should need to be aware of. That
>> said, there is still room for improvement in the tools I think:
>> Regions which architecturally aren't RAM (namely the
>> 0xa0000-0xfffff range) would probably better not be accounted
>> for as RAM as far as ballooning is concerned. In the hypervisor,
>> otoh, all memory assigned to the guest (i.e. including such backing
>> ROMs) needs to be accounted.
>
> On the Linux side we should not include in balloon calculations pages
> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1.
>
> Which leaves hvmloader's special pages (and possibly memory under
> 0xA0000 which may get reserved). Can we pass this info to guests via
> xenstore?
I'd rather keep an internal difference between online pages and E820-map
count value in the balloon driver. This should work always.
Juergen
On 03/28/2017 11:30 AM, Juergen Gross wrote:
> On 28/03/17 16:27, Boris Ostrovsky wrote:
>> On 03/28/2017 04:08 AM, Jan Beulich wrote:
>>>>>> On 28.03.17 at 03:57, <[email protected]> wrote:
>>>> I think there is indeed a disconnect between target memory (provided by
>>>> the toolstack) and current memory (i.e actual pages available to the guest).
>>>>
>>>> For example
>>>>
>>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
>>>> reserved
>>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
>>>> reserved
>>>>
>>>> are missed in target calculation. The hvmloader marks them as RESERVED
>>>> (in build_e820_table()) but target value is not aware of this action.
>>>>
>>>> And then the same problem repeats when kernel removes
>>>> 0x000a0000-0x000fffff chunk.
>>> But this is all in-guest behavior, i.e. nothing an entity outside the
>>> guest (tool stack or hypervisor) should need to be aware of. That
>>> said, there is still room for improvement in the tools I think:
>>> Regions which architecturally aren't RAM (namely the
>>> 0xa0000-0xfffff range) would probably better not be accounted
>>> for as RAM as far as ballooning is concerned. In the hypervisor,
>>> otoh, all memory assigned to the guest (i.e. including such backing
>>> ROMs) needs to be accounted.
>> On the Linux side we should not include in balloon calculations pages
>> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1.
>>
>> Which leaves hvmloader's special pages (and possibly memory under
>> 0xA0000 which may get reserved). Can we pass this info to guests via
>> xenstore?
> I'd rather keep an internal difference between online pages and E820-map
> count value in the balloon driver. This should work always.
We could indeed base calculation on initial state of e820 and not count
the holes toward ballooning needs. I am not sure this will work for
memory unplug though, where a hole can be created in the map and we will
be supposed to handle disappearing memory via ballooning.
Or am I creating a problem where none exists?
-boris
On 28/03/17 18:32, Boris Ostrovsky wrote:
> On 03/28/2017 11:30 AM, Juergen Gross wrote:
>> On 28/03/17 16:27, Boris Ostrovsky wrote:
>>> On 03/28/2017 04:08 AM, Jan Beulich wrote:
>>>>>>> On 28.03.17 at 03:57, <[email protected]> wrote:
>>>>> I think there is indeed a disconnect between target memory (provided by
>>>>> the toolstack) and current memory (i.e actual pages available to the guest).
>>>>>
>>>>> For example
>>>>>
>>>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
>>>>> reserved
>>>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
>>>>> reserved
>>>>>
>>>>> are missed in target calculation. The hvmloader marks them as RESERVED
>>>>> (in build_e820_table()) but target value is not aware of this action.
>>>>>
>>>>> And then the same problem repeats when kernel removes
>>>>> 0x000a0000-0x000fffff chunk.
>>>> But this is all in-guest behavior, i.e. nothing an entity outside the
>>>> guest (tool stack or hypervisor) should need to be aware of. That
>>>> said, there is still room for improvement in the tools I think:
>>>> Regions which architecturally aren't RAM (namely the
>>>> 0xa0000-0xfffff range) would probably better not be accounted
>>>> for as RAM as far as ballooning is concerned. In the hypervisor,
>>>> otoh, all memory assigned to the guest (i.e. including such backing
>>>> ROMs) needs to be accounted.
>>> On the Linux side we should not include in balloon calculations pages
>>> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1.
>>>
>>> Which leaves hvmloader's special pages (and possibly memory under
>>> 0xA0000 which may get reserved). Can we pass this info to guests via
>>> xenstore?
>> I'd rather keep an internal difference between online pages and E820-map
>> count value in the balloon driver. This should work always.
>
> We could indeed base calculation on initial state of e820 and not count
> the holes toward ballooning needs. I am not sure this will work for
> memory unplug though, where a hole can be created in the map and we will
> be supposed to handle disappearing memory via ballooning.
>
> Or am I creating a problem where none exists?
I'm rather sure memory has to be offlined before being deleted from
the E820 map.
Juergen
On Tue, Mar 28, 2017 at 05:30:24PM +0200, Juergen Gross wrote:
> On 28/03/17 16:27, Boris Ostrovsky wrote:
> > On 03/28/2017 04:08 AM, Jan Beulich wrote:
> >>>>> On 28.03.17 at 03:57, <[email protected]> wrote:
> >>> I think there is indeed a disconnect between target memory (provided by
> >>> the toolstack) and current memory (i.e actual pages available to the guest).
> >>>
> >>> For example
> >>>
> >>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
> >>> reserved
> >>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
> >>> reserved
> >>>
> >>> are missed in target calculation. The hvmloader marks them as RESERVED
> >>> (in build_e820_table()) but target value is not aware of this action.
> >>>
> >>> And then the same problem repeats when kernel removes
> >>> 0x000a0000-0x000fffff chunk.
> >> But this is all in-guest behavior, i.e. nothing an entity outside the
> >> guest (tool stack or hypervisor) should need to be aware of. That
> >> said, there is still room for improvement in the tools I think:
> >> Regions which architecturally aren't RAM (namely the
> >> 0xa0000-0xfffff range) would probably better not be accounted
> >> for as RAM as far as ballooning is concerned. In the hypervisor,
> >> otoh, all memory assigned to the guest (i.e. including such backing
> >> ROMs) needs to be accounted.
> >
> > On the Linux side we should not include in balloon calculations pages
> > reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1.
> >
> > Which leaves hvmloader's special pages (and possibly memory under
> > 0xA0000 which may get reserved). Can we pass this info to guests via
> > xenstore?
>
> I'd rather keep an internal difference between online pages and E820-map
> count value in the balloon driver. This should work always.
Did we ever come with a patch for this?
On 08/07/17 02:59, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 28, 2017 at 05:30:24PM +0200, Juergen Gross wrote:
>> On 28/03/17 16:27, Boris Ostrovsky wrote:
>>> On 03/28/2017 04:08 AM, Jan Beulich wrote:
>>>>>>> On 28.03.17 at 03:57, <[email protected]> wrote:
>>>>> I think there is indeed a disconnect between target memory (provided by
>>>>> the toolstack) and current memory (i.e actual pages available to the guest).
>>>>>
>>>>> For example
>>>>>
>>>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff]
>>>>> reserved
>>>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff]
>>>>> reserved
>>>>>
>>>>> are missed in target calculation. The hvmloader marks them as RESERVED
>>>>> (in build_e820_table()) but target value is not aware of this action.
>>>>>
>>>>> And then the same problem repeats when kernel removes
>>>>> 0x000a0000-0x000fffff chunk.
>>>> But this is all in-guest behavior, i.e. nothing an entity outside the
>>>> guest (tool stack or hypervisor) should need to be aware of. That
>>>> said, there is still room for improvement in the tools I think:
>>>> Regions which architecturally aren't RAM (namely the
>>>> 0xa0000-0xfffff range) would probably better not be accounted
>>>> for as RAM as far as ballooning is concerned. In the hypervisor,
>>>> otoh, all memory assigned to the guest (i.e. including such backing
>>>> ROMs) needs to be accounted.
>>>
>>> On the Linux side we should not include in balloon calculations pages
>>> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1.
>>>
>>> Which leaves hvmloader's special pages (and possibly memory under
>>> 0xA0000 which may get reserved). Can we pass this info to guests via
>>> xenstore?
>>
>> I'd rather keep an internal difference between online pages and E820-map
>> count value in the balloon driver. This should work always.
>
> Did we ever come with a patch for this?
Yes, I've sent V2 recently:
https://lists.xen.org/archives/html/xen-devel/2017-07/msg00530.html
Juergen