Date: Tue, 29 Aug 2017 16:09:24 +0200
From: Andrea Arcangeli <aarcange@redhat.com>
To: Adam Borowski
Cc: Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li,
	Radim Krčmář, Joerg Roedel, "Kirill A. Shutemov", Andrew Morton,
	Linus Torvalds, kvm, linux-kernel@vger.kernel.org, Michal Hocko
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Message-ID: <20170829140924.GB21615@redhat.com>
In-Reply-To: <20170829125923.g3tp22bzsrcuruks@angband.pl>

Hello,

On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> > [Put more people on Cc, sorry for growing too much...]
>
> We're all interested in 4.13.0 not crashing on us, so that's ok.
>
> > On Tue, 29 Aug 2017 11:19:13 +0200, Bernhard Held wrote:
> > >
> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a.
> > >
> > > Amit, thanks for your courage to expose your patch!
> > >
> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe
> > > it just accelerates the triggering of the splat. To be more sure, the
> > > kernel needs to be tested for a couple of days. It would be great if
> > > others could assist in testing aac2fea94f7a.
> >
> > I've been testing with the revert for a while and it seems to be working.
>
> With nothing but aac2fea94f7a reverted, no explosions for me either.

The aforementioned commit has 3 bugs (sketches of the fixes follow
below).

1) mmu_notifier_invalidate_range cannot be used as a replacement for
mmu_notifier_invalidate_range_start/end. For KVM,
mmu_notifier_invalidate_range is a noop and rightfully so. An MMU
notifier implementation has to implement either the ->invalidate_range
method or the invalidate_range_start/end methods, not both. And if you
implement invalidate_range_start/end, as KVM is forced to do, calling
mmu_notifier_invalidate_range in common code is a noop for KVM. For
those MMU notifiers that can get away with only implementing
->invalidate_range, ->invalidate_range is implicitly called by
mmu_notifier_invalidate_range_end(). And only those secondary MMUs
that share the same pagetable with the primary MMU (like AMD iommuv2)
can get away with only implementing ->invalidate_range.

So all cases (THP on/off) are broken right now. To fix this it is
enough to replace mmu_notifier_invalidate_range with a
mmu_notifier_invalidate_range_start/mmu_notifier_invalidate_range_end
pair. Either that, or call mmu_notifier_invalidate_page multiple times
like before.

2) address + (1UL << compound_order(page)) is buggy: it should be
PAGE_SIZE << compound_order(page). The range is in bytes, not pages:
2M, not 512.

3) The whole invalidate_range thing was an attempt to issue a single
invalidate while walking multiple 4k ptes that map the same THP (after
a pmd virtual split without a physical compound page THP split). It's
unclear if rmap_walk will always provide an address that is 2M aligned
as the parameter to try_to_unmap_one in the presence of THP. I think
it also needs an address &= ~((PAGE_SIZE << compound_order(page)) - 1)
to be safe.

The other bug where you can reproduce the same corruption with OOM is
unrelated and caused by the OOM reaper. The OOM reaper was even
corrupting data if a task was writing to disk and got stuck in OOM in
a write() syscall or an async io write. To fix the KVM corruption, the
OOM reaper needs to call mmu_notifier_invalidate_range_start/end
around oom_kill.c:unmap_page_range (see the second sketch below).
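First, a minimal sketch of the combined fix for the three bugs above.
The helper name and context are hypothetical; the real change has to
be folded into the try_to_unmap_one() path in mm/rmap.c:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Sketch only, hypothetical helper: compute the byte range covering
 * the (possibly THP) page and notify the secondary MMUs around the
 * PTE clearing.  For a non-compound page compound_order() is 0 and
 * the range degrades to a single PAGE_SIZE.
 */
static void unmap_and_notify(struct mm_struct *mm, struct page *page,
			     unsigned long address)
{
	/* bug 3: align down to the compound page boundary */
	unsigned long start = address &
		~((PAGE_SIZE << compound_order(page)) - 1);
	/* bug 2: bytes, not pages: PAGE_SIZE << order, not 1UL << order */
	unsigned long end = start + (PAGE_SIZE << compound_order(page));

	/*
	 * bug 1: use the start/end pair that KVM implements; the MMU
	 * notifiers that only implement ->invalidate_range still get
	 * it called implicitly from mmu_notifier_invalidate_range_end().
	 */
	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... ptep_clear_flush() and the rest of the unmap work ... */
	mmu_notifier_invalidate_range_end(mm, start, end);
}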
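Second, a sketch of the suggested OOM reaper change, again with a
hypothetical helper: the real loop lives in
mm/oom_kill.c:__oom_reap_task_mm(), which also sets up the struct
mmu_gather, and unmap_page_range is declared in mm/internal.h:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include "internal.h"	/* for unmap_page_range() */

/*
 * Sketch only: bracket the reaping of one vma with the
 * invalidate_range_start/end pair so that secondary MMUs like KVM's
 * drop their mappings before the pages are freed.  The downside
 * discussed below is that range_start can block, which is exactly
 * what the OOM reaper must avoid.
 */
static void reap_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
	mmu_notifier_invalidate_range_start(vma->vm_mm, vma->vm_start,
					    vma->vm_end);
	unmap_page_range(tlb, vma, vma->vm_start, vma->vm_end, NULL);
	mmu_notifier_invalidate_range_end(vma->vm_mm, vma->vm_start,
					  vma->vm_end);
}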
This additional mmu_notifier_invalidate_range_start will not be good
for the OOM reaper, because it's yet another case (like the mmap_sem
taken for writing) that will prevent the OOM reaper from running,
hindering its ability to hide XFS OOM deadlocks and making those
resurface. Not in the KVM case, because KVM serializes against the
secondary MMU activity with a spinlock and the KVM critical section
under spinlock isn't going to allocate memory; but range_start can
schedule or block on slow hardware where the secondary MMU is
accessed through PCI (not the KVM case).

My preference is still to make the OOM reaper a config option and let
it grow into the VM at zero cost if disabled, while at the same time
having the option to keep the VM simpler and spend the time fixing the
filesystem bugs instead (while still being able to reproduce them more
easily with the OOM reaper disabled).

Thanks,
Andrea