From: Linus Torvalds
Date: Tue, 29 Aug 2017 12:38:43 -0700
Subject: Re: kvm splat in mmu_spte_clear_track_bits
To: Jerome Glisse
Cc: Andrea Arcangeli, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, "Kirill A. Shutemov", Andrew Morton, kvm, linux-kernel@vger.kernel.org, Michal Hocko
In-Reply-To: <20170829191351.GD7546@redhat.com>

On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse wrote:
>
> Yes and i am fine with page traversal being under spinlock and not
> being able to sleep during that. I agree doing otherwise would be
> insane. It is just that the existing behavior of try_to_unmap_one()
> and page_mkclean_one() have been broken and that no mmu_notifier
> calls were added around the lock section.

Yeah, I'm actually surprised that ever worked. I'm surprised that try_to_unmap_one() didn't hold any locks earlier.
In fact, I think at least some of them *did* already hold the page table locks: ptep_clear_flush_young_notify() and friends very much should have always held them.

So it's literally just that mmu_notifier_invalidate_page() call that used to be outside all the locks, but honestly, I think that was always a bug. It means that you got notified of the page removal *after* the page was already gone and all locks had been released, so a completely *different* page could already have been mapped to that address.

So I think the old code was always broken exactly because the callback wasn't serialized with the actual action.

> I sent a patch that properly compute the range to invalidate and move
> to invalidate_range() but is lacking the invalidate_range_start()/
> end() so i am gonna respin that with range_start/end bracketing and
> assume the worse for the range of address.

So surrounding it with start/end _should_ make KVM happy. KVM people, can you confirm?

But I do note that there are a number of other users of that "invalidate_page" callback. I think ib_umem_notifier_invalidate_page() has the exact same blocking issue, but changing to range_start/end should be good there too. amdgpu_mn_invalidate_page() and the xen/gntdev one also seem to be happy being replaced with start/end.

In fact, I'm wondering if this actually means that we could get rid of mmu_notifier_invalidate_page() entirely. There are only a couple of callers, and the other one seems to be fs/dax.c, and it actually seems to have the exact same issue that the try_to_unmap_one() code had: it tried to invalidate an address too late - by the time it was called, the page had already been cleaned and locks had been released.

So the more I look at that "turn mmu_notifier_invalidate_page() into invalidate_range_start/end()" the more I think that's fundamentally the right thing to do.

Linus