Date: Mon, 30 Sep 2013 10:52:43 +0200
From: Andrea Arcangeli
To: Ingo Molnar
Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Andrew Morton,
    Linux Kernel Mailing List, Rik van Riel, Peter Hurley,
    Davidlohr Bueso, Alex Shi, Tim Chen, Peter Zijlstra,
    Matthew R Wilcox, Dave Hansen, Michel Lespinasse, Andi Kleen,
    "Chandramouleeswaran, Aswin", "Norton, Scott J", Haggai Eran,
    Sagi Grimberg, Or Gerlitz, Jerome Glisse
Subject: Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t
Message-ID: <20130930085243.GA25685@redhat.com>
References: <1380308424-31011-1-git-send-email-Waiman.Long@hp.com>
    <20130928074144.GA17773@gmail.com>
    <20130928192123.GA8228@gmail.com>
    <20130928193739.GA8642@gmail.com>
In-Reply-To: <20130928193739.GA8642@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi everyone,

On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar wrote:
>
> > If we do that then I suspect the next step will be queued rwlocks :-/
> > The current rwlock_t implementation is rather primitive by modern
> > standards. (We'd probably have killed rwlock_t long ago if not for the
> > tasklist_lock.)
> >
> > But yeah, it would work and conceptually a hard spinlock fits something
> > as lowlevel as the anon-vma lock.
> >
> > I did a quick review pass and it appears nothing obvious is scheduling
> > with the anon-vma lock held. If it did in a non-obvious way it's likely
> > a bug anyway.
> > The hugepage code grew a lot of logic running under the
> > anon-vma lock, but it all seems atomic.
> >
> > So a conversion to rwlock_t could be attempted. (It should be relatively
> > easy patch as well, because the locking operation is now nicely
> > abstracted out.)
>
> Here's a totally untested patch to convert the anon vma lock to an
> rwlock_t.

Sorry to break up the party, but sleepable locks for anon_vma and
i_mmap_mutex are now a requirement for the "pageable RDMA" effort
recently achieved upstream by Mellanox with the MMU notifier.

As far as I can tell that's the only good reason why those locks
shouldn't be spinlocks (otherwise I would have also complained at the
time of that conversion; the original regression was known, ask Andi).

After the lock conversion it took a while to fix all the other minor
bits to make the mmu notifier methods fully sleepable.

The problem with spinlocks is that in the rmap code (like
try_to_unmap) we need to call mmu_notifier_invalidate_page with an
"mm" as parameter, and the callee assumes the "mm" won't go away under
it. The second requirement is that the page cannot be freed until we
call mmu_notifier_invalidate_page (it is ok for the secondary MMU to
still access the page after the linux pte has been dropped and the TLB
flushed).

In the rmap code the only thing that keeps things afloat is either the
anon_vma lock or the i_mmap_mutex, so it is quite tricky to drop that
lock while keeping "mm" and "page" both afloat for the invalidate
post-anon-vma-unlock.

Maybe there are ways to make that safe (reference counting, trylocking
the mmap_sem)? But there isn't a very straightforward way to do that.

It isn't just the Mellanox drivers: originally the SGI XPMEM driver also
needed to schedule in those methods (then they figured out how to get
away with only scheduling in the mmu_notifier range calls, but I suppose
they would prefer to be able to schedule in all invalidate methods,
including mmu_notifier_invalidate_page).

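To make the constraint above a bit more concrete, here is a minimal
userspace sketch in C. It is not kernel code: the toy_* names, the
return codes and the atomic_depth counter are all hypothetical, invented
for illustration, and the TLB flush is left out. It only models two
points: an invalidate method that may sleep can run under a sleepable
rmap lock but not under a spinlock, and the page must not be freed until
the secondary mapping is gone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, NOT kernel code. */
enum toy_lock_kind { TOY_SPINLOCK, TOY_SLEEPING_LOCK };

struct toy_mm {
    enum toy_lock_kind rmap_lock_kind; /* stands in for the anon_vma lock type */
    bool rmap_locked;
    bool alive;                        /* false once the "mm" has gone away */
};

struct toy_page {
    bool pte_mapped;       /* stands in for the linux pte */
    bool secondary_mapped; /* stands in for a secondary MMU mapping */
    bool freed;
};

/* > 0 while a toy spinlock is held, i.e. while sleeping would be a bug. */
static int atomic_depth;

static void toy_rmap_lock(struct toy_mm *mm)
{
    mm->rmap_locked = true;
    if (mm->rmap_lock_kind == TOY_SPINLOCK)
        atomic_depth++;
}

static void toy_rmap_unlock(struct toy_mm *mm)
{
    assert(mm->rmap_locked);
    if (mm->rmap_lock_kind == TOY_SPINLOCK)
        atomic_depth--;
    mm->rmap_locked = false;
}

/* Toy invalidate method that needs to sleep (e.g. to wait for a device). */
static int toy_invalidate_page(struct toy_mm *mm, struct toy_page *page)
{
    if (!mm->alive || page->freed)
        return -1; /* the callee's assumptions were violated */
    if (atomic_depth > 0)
        return -2; /* would be "scheduling while atomic" */
    page->secondary_mapped = false;
    return 0;
}

/* Toy rmap walk: the lock keeps "mm" and "page" afloat for the callee. */
static int toy_try_to_unmap(struct toy_mm *mm, struct toy_page *page)
{
    int ret;

    toy_rmap_lock(mm);
    page->pte_mapped = false; /* drop the pte (TLB flush elided) */
    ret = toy_invalidate_page(mm, page);
    toy_rmap_unlock(mm);
    return ret;
}

/* The page may only be freed once no MMU, primary or secondary, maps it. */
static int toy_free_page(struct toy_page *page)
{
    if (page->pte_mapped || page->secondary_mapped)
        return -1;
    page->freed = true;
    return 0;
}
```

With rmap_lock_kind set to TOY_SLEEPING_LOCK, toy_try_to_unmap()
succeeds and the page can then be freed. With TOY_SPINLOCK the
invalidate is refused, the secondary mapping stays in place, and
toy_free_page() keeps failing.
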
Nvidia is now also going to use the mmu notifier to allow the GPU (acting
as a secondary MMU with pagetables) to access main memory without
requiring RAM pinning and without disabling all VM features on the
graphics memory (and without GART physical). Probably they don't need
to schedule, but I CC'ed Jerome just in case they do (as that would
add one more relevant user of the feature).

As far as KVM is concerned, there is no benefit in scheduling in the
methods. The KVM mmu notifier invalidate consists of zeroing out one
or more sptes and sending an IPI to flush the guest TLBs if other
vcpus are running on other cpus. It's a mechanism pretty much identical
to the one used by the primary MMU and it only requires irqs enabled
to avoid deadlocks in case of cross-IPIs.