Subject: Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste
To: Martin Schwidefsky
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Heiko Carstens, Thomas Huth, Christian Borntraeger
References: <20170529163202.13077-1-david@redhat.com> <20170601124651.3e7969ab@mschwideX1>
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Thu, 1 Jun 2017 13:27:28 +0200
In-Reply-To: <20170601124651.3e7969ab@mschwideX1>

Hi Martin,

thanks for having a look at this!

>> However, we
>> might be able to let 2k and 4k page tables co-exist. We only need
>> 4k page tables whenever we want to expose such memory to a guest. So
>> turning on 4k page table allocation at one point and only allowing such
>> memory to go into our gmap (guest mapping) might be a solution.
>> User space tools like QEMU that create the VM before mmap-ing any memory
>> that will belong to the guest can simply use the new VM type. Proper 4k
>> page tables will be created for any memory mmap-ed afterwards. And these
>> can be used in the gmap without problems. Existing user space tools
>> will work as before - having to enable vm.alloc_pgste explicitly.
>
> I can not say that I like this approach. Right now a process either uses
> 2K page tables or 4K page tables. With your patch it is basically per page
> table page. Memory areas that existed before the switch to allocate
> 4K page tables can not be mapped to the guests gmap anymore. There might
> be hidden pitfalls e.g. with guest migration.

As long as QEMU knows what it is doing, I don't see problems with migration:
even for migration, all memory is mmap-ed after the VM has been created. But
yes, we have to think about this carefully. (That's why QEMU has to switch
this behavior on explicitly, so that existing user space that mmaps its
memory differently does not break.)
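Just to illustrate the ordering I have in mind on the QEMU side, here is a
rough sketch (error handling omitted; KVM_VM_S390_NEW_TYPE is only a
placeholder, the new VM type and its value obviously don't exist yet):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Placeholder only -- name and value of the new VM type are made up. */
#define KVM_VM_S390_NEW_TYPE 42

int example(void)
{
        int kvm = open("/dev/kvm", O_RDWR);

        /* Create the VM before any guest memory is mmap-ed ... */
        int vm = ioctl(kvm, KVM_CREATE_VM, KVM_VM_S390_NEW_TYPE);

        /* ... so this mapping already gets 4k page tables with pgstes ... */
        void *mem = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* ... and can go into the gmap without problems. */
        struct kvm_userspace_memory_region region = {
                .slot = 0,
                .guest_phys_addr = 0,
                .memory_size = 1UL << 20,
                .userspace_addr = (unsigned long) mem,
        };
        return ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);
}

Anything that was mmap-ed before KVM_CREATE_VM would keep its 2k page tables
and simply could not be put into the gmap.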
>
>> This should play fine with vSIE, as vSIE code works completely on the gmap.
>> So if only page tables with pgste go into our gmap, we should be fine.
>>
>> Not sure if this breaks important concepts, has some serious performance
>> problems or I am missing important cases. If so, I guess there is really
>> no way to avoid setting vm.alloc_pgste.
>>
>> Possible modifications:
>> - Enable this option via an ioctl (like KVM_S390_ENABLE_SIE) instead of
>>   a new VM type
>> - Remember if we have mixed pgtables. If !mixed, we can maybe make faster
>>   decisions (if that is really a problem).
>
> What I do not like in particular is this function:
>
> static inline int pgtable_has_pgste(struct mm_struct *mm, unsigned long addr)
> {
>         struct page *page;
>
>         if (!mm_has_pgste(mm))
>                 return 0;
>
>         page = pfn_to_page(addr >> PAGE_SHIFT);
>         return atomic_read(&page->_mapcount) & 0x4U;
> }
>
> The check for pgstes got more complicated, it used to be a test-under-mask
> of a bit in the mm struct and a branch. Now we have an additional pfn_to_page,
> an atomic_read and a bit test. That is done multiple times for every ptep_xxx
> operation.

The "heavy overhead" only applies to processes using KVM; ordinary processes
still pay just a single bit test (the !mm_has_pgste(mm) check, as before).
Can you judge how much overhead this amounts to in practice (especially:
is it noticeable)?

We could of course "tune" the functions to reduce the number of
pgtable_has_pgste() calls, but we won't get rid of at least one such
pfn_to_page. And the question is whether even one such test degrades
performance noticeably. Unfortunately, the z/VM test system I have is not
really suited for performance measurements.

>
> Is the operational simplification of not having to set vm.allocate_pgste really
> that important ?
>

The problem is that once we have a package that installs QEMU, we have to set
this automatically if we don't want to confuse the user (i.e. make them care
about vm.allocate_pgste). See e.g.

https://bugzilla.redhat.com/show_bug.cgi?id=1290589

a) install a config file that will set it on every boot (like Fedora does)
b) set vm.allocate_pgste=1 right away, so QEMU can be used without a reboot

Now, this is possible, but as you know it implicitly influences other
workloads (double the page table size for everything), especially on systems
that are not pure KVM hosts. So by simply installing the QEMU package, we
change system behavior, even if QEMU is only used sometimes on that host.

So what we actually want is: PGSTEs only for the processes that actually need
them, with the details handled internally. I don't really see too many ways
to do this differently. The more we can hide such "internal specialties", the
better.

E.g. think about guests that are some day based completely on huge pages: the
concept of pgstes does not apply there. Still, PGSTEs would have to be enabled
for the complete system, and there isn't much we can do about it
(s390_enable_sie() can't know that QEMU will only use huge pages, so we would
need yet another VM type in QEMU to handle guests that can only map huge
pages).

An alternative: have some process that enables PGSTE for all of its future
children and then fork()+execv() QEMU. However, such a process in between will
most likely confuse tooling like libvirt when it comes to process ids.

Can you think of any other way to handle this? Setting vm.allocate_pgste=1
should really be avoided if possible.

Thanks a lot again!

-- 

Thanks,

David