Date: Wed, 26 Apr 2017 09:17:51 -0700
From: "Paul E. McKenney"
To: Suzuki K Poulose
Cc: Radim Krčmář, pbonzini@redhat.com, christoffer.dall@linaro.org,
	linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org,
	marc.zyngier@arm.com, mark.rutland@arm.com, andreyknvl@google.com,
	Will Deacon
Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race
Reply-To: paulmck@linux.vnet.ibm.com
References: <1493028624-29837-1-git-send-email-suzuki.poulose@arm.com>
	<1493028624-29837-2-git-send-email-suzuki.poulose@arm.com>
	<20170425184904.GI5713@potion>
Message-Id: <20170426161751.GV3956@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 26, 2017 at 05:03:44PM +0100, Suzuki K Poulose wrote:
> On 25/04/17 19:49, Radim Krčmář wrote:
> >2017-04-24 11:10+0100, Suzuki K Poulose:
> >>The KVM uses mmu_notifier (wherever available) to keep track
> >>of the changes to the mm of the guest. The guest shadow page
> >>tables are released when the VM exits via mmu_notifier->ops.release().
> >>There is a rare chance that the mmu_notifier->release could be
> >>called more than once via two different paths, which could end
> >>up in use-after-free of kvm instance (such as [0]).
> >>
> >>e.g:
> >>
> >>thread A                                thread B
> >>-------                                 --------------
> >>
> >> get_signal->                           kvm_destroy_vm()->
> >> do_exit->                                mmu_notifier_unregister->
> >> exit_mm->                                kvm_arch_flush_shadow_all()->
> >> exit_mmap->                              spin_lock(&kvm->mmu_lock)
> >> mmu_notifier_release->                   ....
> >>  kvm_arch_flush_shadow_all()->           .....
> >>  ... spin_lock(&kvm->mmu_lock)           .....
> >>                                          spin_unlock(&kvm->mmu_lock)
> >>                                          kvm_arch_free_kvm()
> >>  *** use after free of kvm ***
> >
> >I don't understand this race ...
> >a piece of code in mmu_notifier_unregister() says:
> >
> >    /*
> >     * Wait for any running method to finish, of course including
> >     * ->release if it was run by mmu_notifier_release instead of us.
> >     */
> >    synchronize_srcu(&srcu);
> >
> >and code before that removes the notifier from the list, so it cannot be
> >called after we pass this point. mmu_notifier_release() does roughly
> >the same and explains it as:
> >
> >    /*
> >     * synchronize_srcu here prevents mmu_notifier_release from returning to
> >     * exit_mmap (which would proceed with freeing all pages in the mm)
> >     * until the ->release method returns, if it was invoked by
> >     * mmu_notifier_unregister.
> >     *
> >     * The mmu_notifier_mm can't go away from under us because one mm_count
> >     * is held by exit_mmap.
> >     */
> >    synchronize_srcu(&srcu);
> >
> >The call of mmu_notifier->release is protected by srcu in both cases and
> >while it seems possible that mmu_notifier->release would be called
> >twice, I don't see a combination that could result in use-after-free
> >from mmu_notifier_release after mmu_notifier_unregister() has returned.
>
> Thanks for bringing it up. Even I am wondering why this is triggered ! (But it
> does get triggered for sure !!)
>
> The only difference I can spot with _unregister & _release paths are the way
> we use src_read_lock across the deletion of the entry from the list.
>
> In mmu_notifier_unregister() we do :
>
>         id = srcu_read_lock(&srcu);
>         /*
>          * exit_mmap will block in mmu_notifier_release to guarantee
>          * that ->release is called before freeing the pages.
>          */
>         if (mn->ops->release)
>                 mn->ops->release(mn, mm);
>         srcu_read_unlock(&srcu, id);
>
> ## Releases the srcu lock here and then goes on to grab the spin_lock.
>
>         spin_lock(&mm->mmu_notifier_mm->lock);
>         /*
>          * Can not use list_del_rcu() since __mmu_notifier_release
>          * can delete it before we hold the lock.
>          */
>         hlist_del_init_rcu(&mn->hlist);
>         spin_unlock(&mm->mmu_notifier_mm->lock);
>
> While in mmu_notifier_release() we hold it until the node(s) are deleted from the
> list :
>         /*
>          * SRCU here will block mmu_notifier_unregister until
>          * ->release returns.
>          */
>         id = srcu_read_lock(&srcu);
>         hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
>                 /*
>                  * If ->release runs before mmu_notifier_unregister it must be
>                  * handled, as it's the only way for the driver to flush all
>                  * existing sptes and stop the driver from establishing any more
>                  * sptes before all the pages in the mm are freed.
>                  */
>                 if (mn->ops->release)
>                         mn->ops->release(mn, mm);
>
>         spin_lock(&mm->mmu_notifier_mm->lock);
>         while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
>                 mn = hlist_entry(mm->mmu_notifier_mm->list.first,
>                                  struct mmu_notifier,
>                                  hlist);
>                 /*
>                  * We arrived before mmu_notifier_unregister so
>                  * mmu_notifier_unregister will do nothing other than to wait
>                  * for ->release to finish and for mmu_notifier_unregister to
>                  * return.
>                  */
>                 hlist_del_init_rcu(&mn->hlist);
>         }
>         spin_unlock(&mm->mmu_notifier_mm->lock);
>         srcu_read_unlock(&srcu, id);
>
> ## The lock is release only after the deletion of the node.
>
> Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister path
> could potentially miss SRCU read lock held in _release() path and go onto finish the
> synchronize_srcu before the item is deleted ? May be we should do the read_unlock
> after the deletion of the node in _unregister (like we do in the _release()) ?
>
> >
> >Doesn't [2/2] solve the exact same issue (that the release method cannot
> >be called twice in parallel)?
>
> Not really. This could be a race between a release() and one of the other notifier
> callbacks. e.g, In [0], we were hitting a use-after-free in kvm_unmap_hva() where,
> the unregister could have succeeded and released the KVM.
>
>
> [0] http://lkml.kernel.org/r/febea966-3767-21ff-3c40-1a76d1399138@suse.de
>
> In effect this all could be due to the same reason, the synchronize in unregister
> missing another reader.

If this is at all reproducible, I suggest use of ftrace or event tracing
to work out exactly what is happening.

							Thanx, Paul
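
For readers following the quoted mmu_notifier_unregister() comment ("Can not use list_del_rcu() since __mmu_notifier_release can delete it before we hold the lock"), here is a minimal userspace sketch of why a del_init-style helper is safe to call from both paths on the same node. It is an illustration only, under stated assumptions: the struct and helper names mirror the kernel's hlist API but are simplified stand-ins with no RCU, SRCU, or locking, and double_delete_demo() is a hypothetical name. It does not model the SRCU question raised in the thread and says nothing about the root cause of the reported use-after-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's hlist types (simplified). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* A node with a NULL pprev is treated as "not on any list". */
static bool hlist_unhashed(const struct hlist_node *n)
{
	return n->pprev == NULL;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Same shape as hlist_del_init_rcu(): unlink only if still linked, then
 * mark the node unhashed. A second call on the same node is a no-op.
 * Like the RCU variant, n->next is left untouched.
 */
static void hlist_del_init(struct hlist_node *n)
{
	if (hlist_unhashed(n))
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->pprev = NULL;
}

/*
 * Hypothetical demo: two paths (think "release" and "unregister") both
 * try to delete the same node. Returns 1 if the list is still consistent.
 */
static int double_delete_demo(void)
{
	struct hlist_head list = { NULL };
	struct hlist_node other = { NULL, NULL };
	struct hlist_node mn = { NULL, NULL };

	hlist_add_head(&other, &list);
	hlist_add_head(&mn, &list);	/* list: mn -> other */

	hlist_del_init(&mn);		/* first path wins the race */
	hlist_del_init(&mn);		/* second path: harmless no-op */

	assert(hlist_unhashed(&mn));
	return list.first == &other && other.pprev == &list.first;
}
```

Calling double_delete_demo() from a small main() or a test harness exercises the double deletion; the sketch shows only the list bookkeeping, not the ordering of srcu_read_unlock() relative to the deletion.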