From: Dan Streetman
Date: Fri, 24 Mar 2017 17:26:40 -0400
Subject: Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages"
To: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky, David Vrabel, Juergen Gross, xen-devel@lists.xenproject.org, linux-kernel@vger.kernel.org
In-Reply-To: <20170324211016.GG9755@char.us.oracle.com>
References: <0628e2af-f7e7-056a-82ec-68860f9c4f29@oracle.com> <20170324211016.GG9755@char.us.oracle.com>

On Fri, Mar 24, 2017 at 5:10 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote:
>> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky wrote:
>> >
>> >
>> > On 03/22/2017 05:16 PM, Dan Streetman wrote:
>> >>
>> >> I have a question about a problem introduced by this commit:
>> >> c275a57f5ec3056f732843b11659d892235faff7
>> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
>> >>
>> >> It changed the xen balloon current_pages calculation to start with the
>> >> number of physical pages in the system, instead of max_pfn.  Since
>> >> get_num_physpages() does not include holes, it's always less than the
>> >> e820 map's max_pfn.
>> >>
>> >> However, the problem that commit introduced is that if the hypervisor
>> >> sets the balloon target equal to the e820 map's max_pfn, then the
>> >> balloon target will *always* be higher than the initial current pages.
>> >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>> >> the OS adds any holes, the balloon target will be higher than the
>> >> current pages.  This is the situation, for example, for Amazon AWS
>> >> instances.  The result is that the xen balloon will always immediately
>> >> hotplug some memory at boot, but then make only (max_pfn -
>> >> get_num_physpages()) of it available to the system.
>> >>
>> >> This balloon-hotplugged memory can cause problems if the hypervisor
>> >> wasn't expecting it; specifically, the system's physical page
>> >> addresses now exceed the e820 map's max_pfn, due to the
>> >> balloon-hotplugged pages, and if the hypervisor isn't expecting pt-device
>> >> DMA to/from those physical pages above the e820 max_pfn, it causes
>> >> problems.  For example:
>> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>> >>
>> >> The additional small amount of balloon memory can cause other problems
>> >> as well, for example:
>> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>> >>
>> >> Anyway, I'd like to ask: was the original commit added because
>> >> hypervisors are supposed to set their balloon target to the guest
>> >> system's number of phys pages (max_pfn - holes)?  The mailing list
>> >> discussion and commit description seem to indicate that.
>> >
>> >
>> > IIRC the problem that this was trying to fix was that since max_pfn includes
>> > holes, upon booting we'd immediately balloon down by the (typically, MMIO)
>> > hole size.
>> >
>> > If you boot a guest with ~4+GB memory you should see this.
>> >
>> >
>> >> However I'm
>> >> not sure how that is possible, because the kernel reserves its own
>> >> holes, regardless of any predefined holes in the e820 map; for
>> >> example, the kernel reserves 64k (by default) at phys addr 0 (the
>> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW).  So
>> >> the hypervisor really has no way to know what the "right" target to
>> >> specify is; unless it knows the exact guest OS and kernel version, and
>> >> kernel config values, it will never be able to correctly specify its
>> >> target to be exactly (e820 max_pfn - all holes).
>> >>
>> >> Should this commit be reverted?  Should the xen balloon target be
>> >> adjusted based on kernel-added e820 holes?
>> >
>> >
>> > I think the second one, but shouldn't current_pages be updated, and not the
>> > target?  The latter is set by Xen (toolstack, via xenstore usually).
>> >
>> > Also, the bugs above (at least one of them) talk about NVMe and I wonder
>> > whether the memory that they add is of RAM type --- I believe it has its own
>> > type and so perhaps that introduces additional inconsistencies.  AWS may have
>> > added their own support for that, which we don't have upstream yet.
>>
>> The type of memory doesn't have anything to do with it.
>>
>> The problem with NVMe is that it's a passthrough device, so the guest talks
>> directly to the NVMe controller and does DMA with it.  But the
>> hypervisor does swiotlb translation between the guest physical memory,
>
> Um, the hypervisor does not have SWIOTLB support, only IOMMU support.

heh, well I have no special insight into Amazon's hypervisor, so I have
no idea what underlying memory remapping mechanism it uses :)

>
>> and the host physical memory, so that the NVMe device can correctly
>> DMA to the right memory in the host.
>>
>> However, the hypervisor only has the guest's physical memory up to the
>> max e820 pfn mapped; it didn't expect the balloon driver to hotplug
>> any additional memory above the e820 max pfn, so when the NVMe driver
>> in the guest tries to tell the NVMe controller to DMA to that
>> balloon-hotplugged memory, the hypervisor fails the NVMe request,
>
> But when the memory hotplug happens the hypercalls are done to
> raise the max pfn.

Well, all I can say is that it rejects DMA above the e820 range, so this
very well may be a hypervisor bug, where it should add the balloon memory
region to whatever does the IOMMU mapping for the NVMe passthrough device.

I think we can all agree that the *ideal* situation would be for the
balloon driver not to immediately hotplug memory just so it can add 11
more pages, so maybe I just need to figure out why the balloon driver
thinks it needs 11 more pages, and fix that.

>
>> because it can't do the guest-to-host phys mem mapping, since the
>> guest phys address is outside the expected max range.
>>
>>
>>
>> >
>> > -boris
>> >
>> >
>> >
>> >> Should something else be
>> >> done?
>> >>
>> >> For context, Amazon Linux has simply disabled Xen ballooning
>> >> completely.  Likewise, we're planning to disable Xen ballooning in the
>> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
>> >> Ubuntu kernels).  However, if reverting this patch makes sense in a
>> >> bigger context (i.e. Xen users besides AWS), that would allow more
>> >> Ubuntu kernels to work correctly in AWS instances.
>> >>
>> >
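
To make the arithmetic concrete, here's a quick userspace sketch of what
we're discussing.  It is not the actual balloon.c code, and the page counts
are made-up example values (not AWS's real layout), but it shows why the
pre-c275a57f5ec3 initialization ballooned down by the e820 hole size, and
why the current initialization hotplugs memory above the e820 max_pfn
whenever the toolstack sets the target to max_pfn:

/*
 * Illustrative only: balloon target vs. initial current_pages under the
 * old (max_pfn) and new (get_num_physpages()) initialization schemes.
 */
#include <stdio.h>

int main(void)
{
	/* made-up page counts (4 KiB pages) for a guest with an MMIO hole */
	long max_pfn      = 1310720;  /* e820 max_pfn */
	long e820_holes   =   65536;  /* MMIO hole present in the e820 map */
	long kernel_holes =      27;  /* extra pages the kernel reserves itself,
	                                 e.g. via CONFIG_X86_RESERVE_LOW */
	long physpages    = max_pfn - e820_holes - kernel_holes;
	                              /* roughly what get_num_physpages() returns */

	/* old scheme: current_pages started at max_pfn; a target of
	 * (max_pfn - holes) means we balloon *down* at boot */
	long old_current = max_pfn;
	long target_ram  = max_pfn - e820_holes;
	printf("old: balloon down %ld pages at boot\n", old_current - target_ram);

	/* new scheme: current_pages starts at get_num_physpages(); a target
	 * of max_pfn means we hotplug memory above the e820 max_pfn at boot */
	long new_current = physpages;
	long target_pfn  = max_pfn;
	printf("new: hotplug %ld pages above max_pfn at boot\n",
	       target_pfn - new_current);

	return 0;
}

With those numbers the old scheme balloons down by 65536 pages and the new
scheme hotplugs 65563 pages, i.e. the "always immediately hotplug some
memory at boot" behaviour described above; the small kernel-reserved
remainder (27 pages in this example) is also why even a target of
(e820 max_pfn - e820 holes) still ends up above the initial current_pages.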