From: Dan Streetman
Date: Fri, 24 Mar 2017 16:34:23 -0400
Subject: Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages"
To: Boris Ostrovsky
Cc: Konrad Rzeszutek Wilk, David Vrabel, Juergen Gross, xen-devel@lists.xenproject.org, linux-kernel@vger.kernel.org

On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky wrote:
>
> On 03/22/2017 05:16 PM, Dan Streetman wrote:
>>
>> I have a question about a problem introduced by this commit:
>> c275a57f5ec3056f732843b11659d892235faff7
>> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
>>
>> It changed the xen balloon current_pages calculation to start with the
>> number of physical pages in the system, instead of max_pfn. Since
>> get_num_physpages() does not include holes, it is always less than the
>> e820 map's max_pfn.
>>
>> However, the problem that commit introduced is: if the hypervisor sets
>> the balloon target equal to the e820 map's max_pfn, the balloon target
>> will *always* be higher than the initial current_pages. Even if the
>> hypervisor sets the target to (e820 max_pfn - holes), if the OS adds
>> any holes of its own, the balloon target will still be higher than
>> current_pages. This is the situation, for example, on Amazon AWS
>> instances. The result is that the xen balloon always hotplugs some
>> memory immediately at boot, but then makes only (max_pfn -
>> get_num_physpages()) of it available to the system.
>>
>> This balloon-hotplugged memory can cause problems if the hypervisor
>> wasn't expecting it; specifically, the system's physical page
>> addresses now exceed the e820 map's max_pfn due to the
>> balloon-hotplugged pages, and if the hypervisor isn't expecting
>> pt-device DMA to/from those physical pages above the e820 max_pfn, it
>> causes problems. For example:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>>
>> The additional small amount of balloon memory can cause other problems
>> as well, for example:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>>
>> Anyway, I'd like to ask: was the original commit added because
>> hypervisors are supposed to set their balloon target to the guest
>> system's number of physical pages (max_pfn - holes)? The mailing list
>> discussion and commit description seem to indicate that.
>
> IIRC, the problem that this was trying to fix was that since max_pfn
> includes holes, upon booting we'd immediately balloon down by the
> (typically, MMIO) hole size.
>
> If you boot a guest with ~4+GB memory you should see this.
>
>> However, I'm not sure how that is possible, because the kernel
>> reserves its own holes, regardless of any predefined holes in the e820
>> map; for example, the kernel reserves 64k (by default) at phys addr 0
>> (the amount of reservation is configurable via CONFIG_X86_RESERVE_LOW).
>> So the hypervisor really has no way to know what the "right" target to
>> specify is; unless it knows the exact guest OS and kernel version, and
>> the kernel config values, it will never be able to correctly specify
>> its target to be exactly (e820 max_pfn - all holes).
>>
>> Should this commit be reverted? Should the xen balloon target be
>> adjusted based on kernel-added e820 holes?
>
> I think the second one, but shouldn't current_pages be updated, and not
> the target? The latter is set by Xen (the toolstack, usually via
> xenstore).
>
> Also, the bugs above (at least one of them) talk about NVMe, and I
> wonder whether the memory that they add is of RAM type --- I believe it
> has its own type, and so perhaps that introduces additional
> inconsistencies. AWS may have added their own support for that, which
> we don't have upstream yet.

The type of memory doesn't have anything to do with it. The problem
with NVMe is that it's a passthrough device, so the guest talks
directly to the NVMe controller and does DMA with it. But the
hypervisor does swiotlb translation between guest physical memory and
host physical memory, so that the NVMe device can DMA to the right
memory in the host. However, the hypervisor only has the guest's
physical memory mapped up to the e820 max pfn; it didn't expect the
balloon driver to hotplug any additional memory above the e820 max
pfn, so when the NVMe driver in the guest tells the NVMe controller
to DMA to that balloon-hotplugged memory, the hypervisor fails the
NVMe request, because it can't do the guest-to-host phys mem mapping:
the guest phys address is outside the expected max range.
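To make the mismatch concrete, the initialization after that commit
looks roughly like this (paraphrased from drivers/xen/balloon.c from
memory, so treat it as a sketch rather than the literal code):

    balloon_stats.current_pages = xen_pv_domain()
        ? min(xen_start_info->nr_pages - xen_released_pages, max_pfn)
        : get_num_physpages();  /* on HVM: max_pfn minus all holes */

So on an HVM guest, current_pages starts at (max_pfn - holes). If the
toolstack then writes a memory target derived from the e820 max_pfn
into xenstore, balloon_set_new_target() sees target > current_pages,
and the balloon hotplugs the difference above max_pfn -- exactly the
pages the hypervisor never set up for pt-device DMA.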
> -boris
>
>> Should something else be done?
>>
>> For context, Amazon Linux has simply disabled Xen ballooning
>> completely. Likewise, we're planning to disable Xen ballooning in the
>> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
>> Ubuntu kernels). However, if reverting this patch makes sense in a
>> bigger context (i.e. for Xen users besides AWS), that would allow more
>> Ubuntu kernels to work correctly on AWS instances.
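If the consensus is to fix up current_pages rather than the target,
per Boris's suggestion above, I'd imagine something along these lines.
This is purely a hypothetical, untested sketch with an invented helper
name, just to show the direction:

    /*
     * Hypothetical sketch only; this helper does not exist.  Count
     * pages that the firmware e820 map reported as RAM but that the
     * kernel itself reserved (e.g. the low 64k kept back by
     * CONFIG_X86_RESERVE_LOW), by comparing the firmware-provided
     * e820 table against the kernel's final memory map.
     */
    static unsigned long __init kernel_reserved_ram_pages(void)
    {
        /* details elided */
        return 0;
    }

    /*
     * Treat kernel-reserved RAM as still "present", so that a
     * toolstack target derived from the e820 map no longer looks
     * like an inflate request at boot.
     */
    balloon_stats.current_pages = get_num_physpages() +
                                  kernel_reserved_ram_pages();

That way the toolstack could keep deriving the target from the e820
map without needing to know guest kernel config details.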