Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751415AbdH2TGq (ORCPT ); Tue, 29 Aug 2017 15:06:46 -0400 Received: from mail-io0-f172.google.com ([209.85.223.172]:36372 "EHLO mail-io0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751236AbdH2TGo (ORCPT ); Tue, 29 Aug 2017 15:06:44 -0400 MIME-Version: 1.0 In-Reply-To: <20170829183405.GB7546@redhat.com> References: <20170825131419.r5lzm6oluauu65nx@angband.pl> <0a85df4b-ca0a-7e70-51dc-90bd1c460c85@redhat.com> <20170827123505.u4kb24kigjqwa2t2@angband.pl> <0dcca3a4-8ecd-0d05-489c-7f6d1ddb49a6@gmx.de> <79BC5306-4ED4-41E4-B2C1-12197D9D1709@gmail.com> <20170829125923.g3tp22bzsrcuruks@angband.pl> <20170829140924.GB21615@redhat.com> <20170829183405.GB7546@redhat.com> From: Linus Torvalds Date: Tue, 29 Aug 2017 12:06:42 -0700 X-Google-Sender-Auth: KklqrhuJ7eXnGYwQmgE13rcc6Jw Message-ID: Subject: Re: kvm splat in mmu_spte_clear_track_bits To: Jerome Glisse Cc: Andrea Arcangeli , Adam Borowski , Takashi Iwai , Bernhard Held , Nadav Amit , Paolo Bonzini , Wanpeng Li , =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Joerg Roedel , "Kirill A. Shutemov" , Andrew Morton , kvm , "linux-kernel@vger.kernel.org" , Michal Hocko Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2958 Lines: 71 On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse wrote: > > Kirill did regress invalidate_page as it use to be call outside the > spinlock and now it is call inside the spinlock thus reverting will > introduce back a regression. Honestly, this MMU notifier thing has been nothing but a badly designed mistake from beginning to end, and bad rules for what can sleep and what can not are one fundamental problem. There are fundamentally two levels of VM locking, and those two levels are not going to go away, and we're not budging on them: - there's the "virtual address" level, which can block. We have a nice mmap_semaphore, and we guarantee that it's held for writing for all changes to the virtual memory layout This is the "mmap/munmap" kind of granularity. The mmu callbacks at *this* level are fine to block. - then there is the "page level" VM handling, and honestly, that *fundamentally* uses a spinlock. If we look at a particular page, that page is meaningless without the lock. Really. I honestly believe that any MMU callback at this level needs to be atomic. Some of the absolutely *have* to be (that "change_pte", for example). In that second case, we might have a "begin/end" surrounding the actual page table walk. And that might sleep, but then it *fundamentally* cannot actually be able some particular single page or stable range. Because without the page table spinlock, no such stability exists. It's purely a "we are not going to start looking at this range" kind of thing. I really don't understand why the nVidia crap cannot follow those simple rules. Because either (a) you're working with virtual addresses, and you should be able to work on that virtual layer (b) you're actually working with physical pages, and you can just hold on to those physical pages yourself. I really detest our MMU callbacks. I shouldn't have allowed them to be merged. And I definitely shoul.dn't allow them to screw up our VM layer. But we have them, and we should work at making sure people do sane things. And yes, those sane things may include (a) guaranteeing that the start/end range calls are always done around the actual locked region. (b) adding a ton of validation so that people *see* then they break the rules. Even when they don't use some random notifier crud. That (b) may involve adding a number of "might_sleep()" calls (not deep in the notifiers themselves, but in the actual wrapper functions even when notifiers are compiled out entirely!), but also adding calls to validate those kinds of "you can't call mmu_notifier_invalidate_page() without having first called mmu_notifier_invalidate_range_start() in a sleepable context". But (b) definitely should also be a very real onus on the mmu notifiers themselves. No way can we sleep when we're traversing page tables. We hold a page table lock. We can sleep before and after, but not during actual page traversal. Linus