Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751456AbdH2UtH (ORCPT ); Tue, 29 Aug 2017 16:49:07 -0400 Received: from mx1.redhat.com ([209.132.183.28]:61066 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750909AbdH2UtF (ORCPT ); Tue, 29 Aug 2017 16:49:05 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 9DEC5C047B79 Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=aarcange@redhat.com Date: Tue, 29 Aug 2017 22:49:01 +0200 From: Andrea Arcangeli To: Linus Torvalds Cc: Jerome Glisse , Adam Borowski , Takashi Iwai , Bernhard Held , Nadav Amit , Paolo Bonzini , Wanpeng Li , Radim =?utf-8?B?S3LEjW3DocWZ?= , Joerg Roedel , "Kirill A. Shutemov" , Andrew Morton , kvm , "linux-kernel@vger.kernel.org" , Michal Hocko Subject: Re: kvm splat in mmu_spte_clear_track_bits Message-ID: <20170829204901.GC13559@redhat.com> References: <0dcca3a4-8ecd-0d05-489c-7f6d1ddb49a6@gmx.de> <79BC5306-4ED4-41E4-B2C1-12197D9D1709@gmail.com> <20170829125923.g3tp22bzsrcuruks@angband.pl> <20170829140924.GB21615@redhat.com> <20170829183405.GB7546@redhat.com> <20170829191351.GD7546@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.8.3 (2017-05-23) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.31]); Tue, 29 Aug 2017 20:49:05 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5330 Lines: 120 Hello Linus, On Tue, Aug 29, 2017 at 12:38:43PM -0700, Linus Torvalds wrote: > On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse wrote: > > > > Yes and i am fine with page traversal being under spinlock and not > > being able to sleep during that. I agree doing otherwise would be > > insane. It is just that the existing behavior of try_to_unmap_one() > > and page_mkclean_one() have been broken and that no mmu_notifier > > calls were added around the lock section. > > Yeah, I'm actually surprised that ever worked. I'm surprised that > try_to_unmap_one didn't hold any locks earlier. > > In fact, I think at least some of them *did* already hold the page > table locks: ptep_clear_flush_young_notify() and friends very much > should have always held them. > > So it's literally just that mmu_notifier_invalidate_page() call that > used to be outside all the locks, but honestly, I think that was > always a bug. It means that you got notified of the page removal > *after* the page was already gone and all locks had been released, so > a completely *different* page could already have been mapped to that > address. > > So I think the old code was always broken exactly because the callback > wasn't serialized with the actual action. That's a very interesting point indeed. I thought the old code was safe because even if the secondary MMU view keeps crunching on the old page for a little while after the primary MMU got a new page mapped, you cannot measure it. If you cannot measure it, it doesn't exist? I.e. undefined. The same runtime you'd get by letting the guest mode crunch on the old page for a little while after the primary MMU is already computing on a brand new page, could materialize if the guest mode CPU just got a turbo boost and computed faster before the ->invalidate_page was run inside the PT lock. If the page is swapped-in and is the same page that was unmapped by try_to_unmap_one, then the primary MMU will return using the same page that the secondary MMU was still using and it'll even be coherent (swapin is probably prevented by the page lock, but still even if it happens it doesn't look an issue). If the page is suddenly replaced while vcpu is in guest mode, no coherency could be provided anyway if the guest mode wasn't serialize by something other than the PT lock. I'll keep thinking about it, this is a quick answer. On a side note, the other major requirement for the code not to have been always broken before is that the caller must hold a refcount: the put_page executed by try_to_unmap_one before calling mmu_notifier_invalidate_page cannot free the page or it's unsafe. This requirement also goes away if using range_start/end. > > I sent a patch that properly compute the range to invalidate and move > > to invalidate_range() but is lacking the invalidate_range_start()/ > > end() so i am gonna respin that with range_start/end bracketing and > > assume the worse for the range of address. > > So surrounding it with start/end _should_ make KVM happy. > > KVM people, can you confirm? Yes that always works as far as I can tell. For our current problem using start/end like you suggested would definitely make KVM happy and it is obviously safe. It would also work to teach page_vma_mapped_walk(&pvmw) to restart after dropping the lock with page_vma_mapped_walk_done() to call mmu_notifier_invalidate_page outside the lock. If such an alternative would be safe however, entirely depends if it was always broken before which again is a very good point to think about. The common case for such function is a single pte being unmapped. > > But I do note that there's a number of other users of that > "invalidate_page" callback. > > I think ib_umem_notifier_invalidate_page() the exact same blocking > issue, but changing to range_start/end should be good there too. > > amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy > being replaced with start/end. > > In fact, I'm wondering if this actually means that we could get rid of > mmu_notifier_invalidate_page() entirely. There's only a couple of > callers, and the other one seems to be fs/dax.c, and it actually seems > to have the exact same issue that the try_to_unmap_one() code had: it > tried to invalidate an address too late - by the time it was called, > the page gad already been cleaned and locks had been released. > > So the more I look at that "turn mmu_notifier_invalidate_page() into > invalidate_range_start/end()" the more I think that's fundamentally > the right thing to do. mmu_notifier_invalidate_page exists purely as an optimized version of mmu_notifier_invalidate_range_start/end. Furthermore when it was always run inside the PT lock it was quite handy to replace a ptep_clear_flush into a ptep_clear_flush_notify and be done with it. It also requires a single branch when the MMU notifier is unarmed. mmu_notifier_invalidate_range_start has to first block all secondary MMU page faults, and invalidate the secondary MMU _before_ the pages are freed. mmu_notifier_invalidate_range_end unblocks the secondary MMU page faults after the primary MMU has been invalidated too (i.e. after zapping the ptes). mmu_notifier_invalidate_page has the advantage that it takes the secondary MMU KVM srcu and spinlock a single time. Thanks! Andrea