Date: Tue, 29 Aug 2017 15:13:51 -0400
From: Jerome Glisse <jglisse@redhat.com>
To: Linus Torvalds
Cc: Andrea Arcangeli, Adam Borowski, Takashi Iwai, Bernhard Held,
	Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel,
	"Kirill A. Shutemov", Andrew Morton, kvm, linux-kernel@vger.kernel.org,
	Michal Hocko
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Message-ID: <20170829191351.GD7546@redhat.com>
References: <0a85df4b-ca0a-7e70-51dc-90bd1c460c85@redhat.com>
 <20170827123505.u4kb24kigjqwa2t2@angband.pl>
 <0dcca3a4-8ecd-0d05-489c-7f6d1ddb49a6@gmx.de>
 <79BC5306-4ED4-41E4-B2C1-12197D9D1709@gmail.com>
 <20170829125923.g3tp22bzsrcuruks@angband.pl>
 <20170829140924.GB21615@redhat.com>
 <20170829183405.GB7546@redhat.com>

On Tue, Aug 29, 2017 at 12:06:42PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse wrote:
> >
> > Kirill did regress invalidate_page, as it used to be called outside
> > the spinlock and is now called inside the spinlock, so reverting
> > will reintroduce a regression.
>
> Honestly, this MMU notifier thing has been nothing but a badly
> designed mistake from beginning to end, and bad rules for what can
> sleep and what cannot are one fundamental problem.
>
> There are fundamentally two levels of VM locking, and those two levels
> are not going to go away, and we're not budging on them:
>
>  - there's the "virtual address" level, which can block. We have a
> nice mmap_semaphore, and we guarantee that it's held for writing for
> all changes to the virtual memory layout.
>
> This is the "mmap/munmap" kind of granularity. The mmu callbacks at
> *this* level are fine to block.
>
>  - then there is the "page level" VM handling, and honestly, that
> *fundamentally* uses a spinlock. If we look at a particular page, that
> page is meaningless without the lock. Really.
>
> I honestly believe that any MMU callback at this level needs to be
> atomic. Some of them absolutely *have* to be (that "change_pte", for
> example).
>
> In that second case, we might have a "begin/end" surrounding the
> actual page table walk. And that might sleep, but then it
> *fundamentally* cannot actually be about some particular single page
> or stable range, because without the page table spinlock no such
> stability exists. It's purely a "we are not going to start looking at
> this range" kind of thing.
>
> I really don't understand why the nVidia crap cannot follow those
> simple rules. Because either
>
>  (a) you're working with virtual addresses, and you should be able to
> work on that virtual layer, or
>
>  (b) you're actually working with physical pages, and you can just
> hold on to those physical pages yourself.
>
> I really detest our MMU callbacks. I shouldn't have allowed them to be
> merged. And I definitely shouldn't allow them to screw up our VM
> layer.
>
> But we have them, and we should work at making sure people do sane
> things.
>
> And yes, those sane things may include
>
>  (a) guaranteeing that the start/end range calls are always done
> around the actual locked region, and
>
>  (b) adding a ton of validation so that people *see* when they break
> the rules, even when they don't use some random notifier crud.
>
> That (b) may involve adding a number of "might_sleep()" calls (not
> deep in the notifiers themselves, but in the actual wrapper functions,
> even when notifiers are compiled out entirely!), but also adding
> checks that validate rules like "you can't call
> mmu_notifier_invalidate_page() without having first called
> mmu_notifier_invalidate_range_start() in a sleepable context".
>
> But (b) definitely should also be a very real onus on the mmu
> notifiers themselves. No way can we sleep when we're traversing page
> tables. We hold a page table lock. We can sleep before and after, but
> not during actual page traversal.

Yes, and I am fine with the page traversal being under the spinlock and
not being able to sleep during it; I agree doing otherwise would be
insane. It is just that the existing behavior of try_to_unmap_one() and
page_mkclean_one() has been broken: no mmu_notifier calls were added
around the locked section. I sent a patch that properly computes the
range to invalidate and moves to invalidate_range(), but it is lacking
the invalidate_range_start()/end() calls, so I am going to respin it
with range_start/end bracketing and assume the worst for the range of
addresses, roughly along the lines of the sketch below.

Cheers,
Jérôme
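
To make the bracketing concrete, here is a rough sketch only, not the
actual patch: it assumes the usual try_to_unmap_one() arguments (page,
vma, address), the 4.13-era notifier API that takes the mm plus a
start/end address, and a placeholder worst-case range computation:

  /*
   * Sketch: bracket the page-table-locked walk in try_to_unmap_one()
   * with the sleepable start/end notifier calls, assuming the worst
   * case for the invalidated range (the whole, possibly compound,
   * page as mapped by this vma).
   */
  struct mm_struct *mm = vma->vm_mm;
  unsigned long start = address;
  unsigned long end = min(vma->vm_end,
                          start + (PAGE_SIZE << compound_order(page)));

  /* Still sleepable here: the page table lock is not yet held. */
  mmu_notifier_invalidate_range_start(mm, start, end);

  /*
   * The page_vma_mapped_walk() loop goes here. It runs under the page
   * table spinlock, so nothing in this section may sleep; secondary
   * TLB invalidation is deferred to the range_end() call below.
   */

  /* Sleepable again once the walk has dropped the lock. */
  mmu_notifier_invalidate_range_end(mm, start, end);

The end computation above is just the "assume the worst" placeholder;
the real patch has to pick the range based on how huge pages are
actually handled in that path.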