From: Nadav Amit
To: Bernhard Held
Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm,
    "linux-kernel@vger.kernel.org"
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Date: Tue, 29 Aug 2017 08:53:46 -0700
References: <20170820231302.s732zclznrqxwr46@angband.pl>
    <20170821191203.jospdwqpnixlotx3@angband.pl>
    <20170821195833.GA696@flask>
    <20170821223228.edc6jrm7bpybtqlj@angband.pl>
    <1c270e76-05be-6f5f-29c6-9cb31f37f71d@redhat.com>
    <20170825131419.r5lzm6oluauu65nx@angband.pl>
    <0a85df4b-ca0a-7e70-51dc-90bd1c460c85@redhat.com>
    <20170827123505.u4kb24kigjqwa2t2@angband.pl>
    <0dcca3a4-8ecd-0d05-489c-7f6d1ddb49a6@gmx.de>
    <79BC5306-4ED4-41E4-B2C1-12197D9D1709@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Bernhard Held wrote:

> On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
>> Bernhard Held wrote:
>>> On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
>>>> 4.13-rc5 retested fails
>>>> Crashed only after two hours or so of testing.
>>>> 4.13-rc4 apparently works
>>>> It survived several hours of varied tests (like 5 debian-installer runs, a
>>>> win10 point release upgrade, some hurd package building, openbsd, etc),
>>>> all while the host was likewise busy.
>>>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
>>>> but I wouldn't bet my life on it.
>>>
>>> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP
>>> seems to accelerate the crash, but that's not 100% sure.
>>>
>>> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are
>>> 21 commits in this merge, 10 are mm-related:
>>>
>>> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
>>> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
>>> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
>>> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
>>> d041353dc98a mm: fix list corruptions on shmem shrinklist
>>> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
>>> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
>>> b3a81d0841a9 mm: fix KSM data corruption
>>> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
>>> 0a2dd266dd6b mm: make tlb_flush_pending global
>>> 56236a59556c mm: refactor TLB gathering API
>>> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
>>> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
>>> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
>>> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
>>> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
>>> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
>>> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
>>> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
>>> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
>>> 75dddef32514 mm: ratelimit PFNs busy info message
>>> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>> Don’t blame me for the TLB stuff...
>> My money is on aac2fea94f7a.
>
> Amit, thanks for your courage to expose your patch!

Just for the record, aac2fea94f7a is not mine (some others are).

>
> I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just
> accelerates the triggering of the splash. To be more sure the kernel needs
> to be tested for a couple of days. It would be great if others could assist
> in testing aac2fea94f7a.
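
The tally quoted in this thread (21 commits in the merge, 10 of them mm-related) can be re-derived from the quoted `git log` output. The following is only a minimal sketch: it assumes "mm-related" means a subject line that starts with `mm:` or `mm/`, which is one plausible reading and not something the thread states. The helper name `merge_log` is hypothetical; the commit list is copied from the quoted output, not regenerated from a kernel tree.

```shell
#!/bin/sh
# Hypothetical helper: prints the oneline log exactly as quoted in the thread.
merge_log() {
cat <<'EOF'
e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
d041353dc98a mm: fix list corruptions on shmem shrinklist
af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
b3a81d0841a9 mm: fix KSM data corruption
99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
0a2dd266dd6b mm: make tlb_flush_pending global
56236a59556c mm: refactor TLB gathering API
a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
434b06ae23ba test_kmod: fix bug which allows negative values on two config options
a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
75dddef32514 mm: ratelimit PFNs busy info message
d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
EOF
}

# Count every commit line in the quoted log.
total=$(merge_log | grep -c '')

# Assumption: "mm-related" = subject begins with "mm:" or "mm/" right after the hash.
mm=$(merge_log | grep -cE '^[0-9a-f]+ mm[:/]')

echo "total=$total mm=$mm"
```

Under that assumed pattern, neither aac2fea94f7a (prefix `rmap:`) nor the `Revert "mm: ..."` commit is counted, so the commit under suspicion would sit outside the group of 10.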