Date: Thu, 1 Jun 2017 12:46:51 +0200
From: Martin Schwidefsky
To: David Hildenbrand
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Heiko Carstens, Thomas Huth, Christian Borntraeger
Subject: Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste
In-Reply-To: <20170529163202.13077-1-david@redhat.com>
References: <20170529163202.13077-1-david@redhat.com>
Message-Id: <20170601124651.3e7969ab@mschwideX1>

Hi David,

it is nice to see that you are still working on s390 related topics.

On Mon, 29 May 2017 18:32:00 +0200
David Hildenbrand wrote:

> Having to enable vm.alloc_pgste globally might not be the best solution.
> 4k page tables are created for all processes and running QEMU KVM guests
> is more complicated than it should be.

To run KVM guests you need to issue a single sysctl to set
vm.allocate_pgste; this is the best solution we have found so far.
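For illustration, this is all a management tool or launcher has to do
before creating the VM (a minimal sketch, assuming the sysctl is exposed
as /proc/sys/vm/allocate_pgste; this is not code from the patch or the
kernel):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Enable 4K pgste page table allocation system-wide. */
	int fd = open("/proc/sys/vm/allocate_pgste", O_WRONLY);

	if (fd < 0 || write(fd, "1", 1) != 1) {
		perror("allocate_pgste");
		return 1;
	}
	close(fd);

	/* ... now fork/exec the QEMU binary as usual ... */
	return 0;
}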
> Unfortunately, converting all page tables to 4k pgste page tables is
> not possible without provoking various race conditions.

That is one approach we tried, and it was found to be buggy. The point
is that you are not allowed to reallocate a page table while a VMA
exists that is in the address range of that page table.

Another approach we tried is to use an ELF flag on the qemu executable.
That does not work either because fs/exec.c allocates and populates the
new mm struct for the argument pages before fs/binfmt_elf.c comes into
play.

> However, we
> might be able to let 2k and 4k page tables co-exist. We only need
> 4k page tables whenever we want to expose such memory to a guest. So
> turning on 4k page table allocation at one point and only allowing such
> memory to go into our gmap (guest mapping) might be a solution.
> User space tools like QEMU that create the VM before mmap-ing any memory
> that will belong to the guest can simply use the new VM type. Proper 4k
> page tables will be created for any memory mmap-ed afterwards. And these
> can be used in the gmap without problems. Existing user space tools
> will work as before - having to enable vm.alloc_pgste explicitly.

I cannot say that I like this approach. Right now a process either uses
2K page tables or 4K page tables. With your patch it is basically per
page table page. Memory areas that existed before the switch to
allocating 4K page tables cannot be mapped into the guest's gmap
anymore. There might be hidden pitfalls, e.g. with guest migration.

> This should play fine with vSIE, as vSIE code works completely on the gmap.
> So if only page tables with pgste go into our gmap, we should be fine.
>
> Not sure if this breaks important concepts, has some serious performance
> problems or I am missing important cases. If so, I guess there is really
> no way to avoid setting vm.alloc_pgste.
>
> Possible modifications:
> - Enable this option via an ioctl (like KVM_S390_ENABLE_SIE) instead of
>   a new VM type
> - Remember if we have mixed pgtables. If !mixed, we can make maybe faster
>   decisions (if that is really a problem).

What I do not like in particular is this function:

static inline int pgtable_has_pgste(struct mm_struct *mm,
				    unsigned long addr)
{
	struct page *page;

	if (!mm_has_pgste(mm))
		return 0;

	page = pfn_to_page(addr >> PAGE_SHIFT);
	return atomic_read(&page->_mapcount) & 0x4U;
}

The check for pgstes got more complicated: it used to be a
test-under-mask of a bit in the mm struct and a branch. Now we have an
additional pfn_to_page, an atomic_read and a bit test, and that is done
multiple times for every ptep_xxx operation. Is the operational
simplification of not having to set vm.allocate_pgste really that
important?

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.