From: Andy Lutomirski
Date: Wed, 23 Oct 2013 14:42:34 -0700
Subject: Re: [PATCH 0/3] mm,vdso: preallocate new vmas
To: Michel Lespinasse
Cc: Davidlohr Bueso, Andrew Morton, Linus Torvalds, Ingo Molnar,
    Peter Zijlstra, Rik van Riel, Tim Chen, aswin@hp.com, linux-mm,
    linux-kernel@vger.kernel.org

On Wed, Oct 23, 2013 at 3:13 AM, Michel Lespinasse wrote:
> On Tue, Oct 22, 2013 at 10:54 AM, Andy Lutomirski wrote:
>> On 10/22/2013 08:48 AM, walken@google.com wrote:
>>> Generally the problems I see with mmap_sem are related to long-latency
>>> operations. Specifically, the mmap_sem write side is currently held
>>> during the entire munmap operation, which iterates over user pages to
>>> free them and can take hundreds of milliseconds for large VMAs.
>>
>> This is the leading cause of my "egads, something that should have been
>> fast got delayed for several ms" detector firing.
>
> Yes, I'm seeing such issues relatively frequently as well.
>
>> I've been wondering:
>>
>> Could we replace mmap_sem with some kind of efficient range lock?
>> The operations would be:
>>
>> - mm_lock_all_write (drop-in replacement for down_write(&...->mmap_sem))
>> - mm_lock_all_read (same for down_read)
>> - mm_lock_write_range(mm, start, end)
>> - mm_lock_read_range(mm, start, end)
>>
>> and corresponding unlock functions (that maybe take a cookie that the
>> lock functions return, or that take a pointer to some small on-stack
>> data structure).
>
> That seems doable; however, I believe we can get rid of the latencies
> in the first place, which seems like a better direction. As I briefly
> mentioned, I would like to tackle the munmap problem sometime. Jan
> Kara also has a project to remove places where blocking FS functions
> are called with mmap_sem held (he's doing it for lock-ordering
> purposes, so that FS can call into MM functions that take mmap_sem,
> but there are latency benefits as well if we can avoid blocking in FS
> with mmap_sem held).

There will still be scalability issues if there are enough threads, but
maybe this isn't so bad. (My workload may also have priority-inversion
problems -- there's a thread that runs on its own core and needs the
mmap_sem read lock, and a thread that runs on a highly contended core
that needs the write lock.)

>> The easiest way to implement this that I can think of is a doubly-linked
>> list or even just an array, which should be fine for a handful of
>> threads. Beyond that, I don't really know. Creating a whole trie for
>> these things would be expensive, and fine-grained locking on rbtree-like
>> things isn't so easy.
>
> Jan also had an implementation of range locks using interval trees. To
> take a range lock, you'd add the range you want to the interval tree,
> count the conflicting range-lock requests that were there before you,
> and (if nonzero) block until that count goes to 0. When releasing the
> range lock, you look for any conflicting requests in the interval tree
> and decrement their conflict count, waking them up if the count goes
> to 0.

Yuck.
Now we're taking a per-mm lock on the rbtree, doing some
cacheline-bouncing rbtree operations, and dropping the lock, all to
serialize access to something that probably has only a small handful of
accessors at a time. I bet that an O(num locks) array or linked list
will end up being faster in practice.

I think the ideal solution would be to shove these things into the page
tables somehow, but that seems impossibly complicated.

--Andy

> But as I said earlier, I would prefer if we could avoid holding
> mmap_sem during long-latency operations rather than working around
> this issue with range locks.
>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.

--
Andy Lutomirski
AMA Capital Management, LLC
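P.S. For anyone following along, the conflict-count scheme Michel describes
can be sketched in plain user-space C. This is a hypothetical,
single-threaded illustration of the bookkeeping only: a simple linked list
stands in for the interval tree (and for the waitqueue -- nothing actually
blocks or wakes here), and all names are made up for the example. acquire
just reports how many earlier overlapping requests the caller would have
to wait on; a count reaching 0 in release is where a real implementation
would issue a wakeup.

```c
/* Hypothetical sketch of a conflict-count range lock.  Each request
 * records its range and how many earlier conflicting requests it must
 * wait for.  A plain singly-linked list replaces the interval tree. */

struct range_lock {
    unsigned long start, end;   /* locked range is [start, end) */
    int conflicts;              /* earlier conflicting requests to wait on */
    struct range_lock *next;
};

static struct range_lock *lockers;  /* all queued/held range locks */

static int overlaps(const struct range_lock *a, const struct range_lock *b)
{
    return a->start < b->end && b->start < a->end;
}

/* Queue a range request; the return value is the number of earlier
 * conflicting requests.  0 means the lock is granted immediately;
 * nonzero means the caller would block until release drops it to 0. */
int range_lock_acquire(struct range_lock *rl,
                       unsigned long start, unsigned long end)
{
    struct range_lock *p;

    rl->start = start;
    rl->end = end;
    rl->conflicts = 0;
    for (p = lockers; p; p = p->next)
        if (overlaps(rl, p))
            rl->conflicts++;        /* count requests that were here first */
    rl->next = lockers;
    lockers = rl;
    return rl->conflicts;
}

/* Release a granted range lock: unlink it, then decrement the conflict
 * count of every later overlapping request (each of them counted us at
 * acquire time).  A count hitting 0 would trigger a wakeup. */
void range_lock_release(struct range_lock *rl)
{
    struct range_lock **pp, *p;

    for (pp = &lockers; *pp; pp = &(*pp)->next) {
        if (*pp == rl) {
            *pp = rl->next;
            break;
        }
    }
    for (p = lockers; p; p = p->next)
        if (overlaps(rl, p) && p->conflicts > 0)
            p->conflicts--;
}
```

Note this list walk is exactly the O(num locks) structure argued for
above: with a handful of concurrent lockers, scanning a short list is
likely cheaper than rebalancing a shared tree under a spinlock.
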