Received: by 2002:ac0:a5a6:0:0:0:0:0 with SMTP id m35-v6csp30048imm; Fri, 21 Sep 2018 17:26:19 -0700 (PDT) X-Google-Smtp-Source: ACcGV60SalFO7SZ6NDtOBswWISC3P+80k14/eIHNDjU1r33L4Y9+lUarSYYit8B1mby2Sitv6jAR X-Received: by 2002:a17:902:b03:: with SMTP id 3-v6mr136944plq.156.1537575979387; Fri, 21 Sep 2018 17:26:19 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1537575979; cv=none; d=google.com; s=arc-20160816; b=hwUByiHXsukbz/GHOQo5g0KKnS0H/HvkcSUqwPA5MXG9G8YIBM+PvdPSkN8sn5eKLo YO3YlQKB4FQy5XL4RA520gF3garlYeJVTMsO38HjMo4Qeq7v6t26MXurSStT8Xvei+DW e41i/xOJp1zNsOImyI1I4Fj2MzFoNYcbd/Bjc1ZUejKCxJw9zAZ9o1c3r+GbxYE4ueiL OpjLSN4Di4d3nnAmtn+DnzzuMtQBl5kjcTvUL0Wah3HtfDR/wnMaptA02L2JJcf/XVw8 5Qc5rzy/qSgElJ4z3PRpQpWu70dzhDPaOfWivYa4rcP1mjU5izvo5Prdka4ZHxe6yQr/ HFfw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:subject:message-id:date:cc:to :from:mime-version:content-transfer-encoding:content-disposition; bh=Gpv1OLecVP8hXRqj2dd2sy320CFoVpQvt7CGEQ5R0pg=; b=Wi9WwhJ7oox+ZHHvX42O7dUaHKi8BqdQ95V8zxFYWefcWTSCEUidJAAo24GrgGbJIv ZoTepot2eqzufh/S1tzW1MT6A2yBDlKH57gRt+mCizKz0GTQT7jaNmbrJInR4L2VBItC b2J+DQ0cPbvjUMdwSBgQ+77COFjd2Y82Ran1mmsS9vfa32Ej6hYQJNBXy+6s+xasGeT4 c3XW8zJL+S/hLEbSHGWtV/5c8pz7h37OdFeB7nU4Zw6I3cLz1TaBLZrBfQQD1w+gk7KJ Nc0dS8i1rLXDDxQrXQ7UXUemidkT59Pmf1ujkSsSY1iPhubcPRPi3Q+IzzsFqgWhalPg 511A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id v30-v6si2066574pga.621.2018.09.21.17.26.04; Fri, 21 Sep 2018 17:26:19 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2403760AbeIVGQI (ORCPT + 99 others); Sat, 22 Sep 2018 02:16:08 -0400 Received: from shadbolt.e.decadent.org.uk ([88.96.1.126]:44124 "EHLO shadbolt.e.decadent.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2391744AbeIVGKp (ORCPT ); Sat, 22 Sep 2018 02:10:45 -0400 Received: from [2a02:8011:400e:2:cbab:f00:c93f:614] (helo=deadeye) by shadbolt.decadent.org.uk with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1g3Vds-0008BY-Lt; Sat, 22 Sep 2018 01:19:24 +0100 Received: from ben by deadeye with local (Exim 4.91) (envelope-from ) id 1g3Vdn-0000qb-EG; Sat, 22 Sep 2018 01:19:19 +0100 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit MIME-Version: 1.0 From: Ben Hutchings To: linux-kernel@vger.kernel.org, stable@vger.kernel.org CC: akpm@linux-foundation.org, "Mel Gorman" , "Sebastian Andrzej Siewior" , "Hugh Dickins" , dave@stgolabs.net, "Greg Kroah-Hartman" , "Chris Mason" , "Peter Zijlstra" , "Thomas Gleixner" , "Darren Hart" , "Linus Torvalds" , "Mel Gorman" , "Chenbo Feng" , "Ingo Molnar" , "Davidlohr Bueso" Date: Sat, 22 Sep 2018 01:15:42 +0100 Message-ID: X-Mailer: LinuxStableQueue (scripts by bwh) Subject: [PATCH 3.16 12/63] futex: Remove requirement for lock_page() in get_futex_key() In-Reply-To: X-SA-Exim-Connect-IP: 2a02:8011:400e:2:cbab:f00:c93f:614 X-SA-Exim-Mail-From: ben@decadent.org.uk X-SA-Exim-Scanned: No (on shadbolt.decadent.org.uk); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 3.16.58-rc1 review patch. If anyone has any objections, please let me know. ------------------ From: Mel Gorman commit 65d8fc777f6dcfee12785c057a6b57f679641c90 upstream. When dealing with key handling for shared futexes, we can drastically reduce the usage/need of the page lock. 1) For anonymous pages, the associated futex object is the mm_struct which does not require the page lock. 2) For inode based, keys, we can check under RCU read lock if the page mapping is still valid and take reference to the inode. This just leaves one rare race that requires the page lock in the slow path when examining the swapcache. Additionally realtime users currently have a problem with the page lock being contended for unbounded periods of time during futex operations. Task A get_futex_key() lock_page() ---> preempted Now any other task trying to lock that page will have to wait until task A gets scheduled back in, which is an unbound time. With this patch, we pretty much have a lockless futex_get_key(). Experiments show that this patch can boost/speedup the hashing of shared futexes with the perf futex benchmarks (which is good for measuring such change) by up to 45% when there are high (> 100) thread counts on a 60 core Westmere. Lower counts are pretty much in the noise range or less than 10%, but mid range can be seen at over 30% overall throughput (hash ops/sec). This makes anon-mem shared futexes much closer to its private counterpart. Signed-off-by: Mel Gorman [ Ported on top of thp refcount rework, changelog, comments, fixes. ] Signed-off-by: Davidlohr Bueso Reviewed-by: Thomas Gleixner Cc: Chris Mason Cc: Darren Hart Cc: Hugh Dickins Cc: Linus Torvalds Cc: Mel Gorman Cc: Peter Zijlstra Cc: Sebastian Andrzej Siewior Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar Signed-off-by: Chenbo Feng Signed-off-by: Greg Kroah-Hartman [bwh: Backported to 3.16: s/READ_ONCE/ACCESS_ONCE/] Signed-off-by: Ben Hutchings --- kernel/futex.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 91 insertions(+), 7 deletions(-) --- a/kernel/futex.c +++ b/kernel/futex.c @@ -394,6 +394,7 @@ get_futex_key(u32 __user *uaddr, int fsh unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; struct page *page, *page_head; + struct address_space *mapping; int err, ro = 0; /* @@ -472,7 +473,19 @@ again: } #endif - lock_page(page_head); + /* + * The treatment of mapping from this point on is critical. The page + * lock protects many things but in this context the page lock + * stabilizes mapping, prevents inode freeing in the shared + * file-backed region case and guards against movement to swap cache. + * + * Strictly speaking the page lock is not needed in all cases being + * considered here and page lock forces unnecessarily serialization + * From this point on, mapping will be re-verified if necessary and + * page lock will be acquired only if it is unavoidable + */ + + mapping = ACCESS_ONCE(page_head->mapping); /* * If page_head->mapping is NULL, then it cannot be a PageAnon @@ -489,18 +502,31 @@ again: * shmem_writepage move it from filecache to swapcache beneath us: * an unlikely race, but we do need to retry for page_head->mapping. */ - if (!page_head->mapping) { - int shmem_swizzled = PageSwapCache(page_head); + if (unlikely(!mapping)) { + int shmem_swizzled; + + /* + * Page lock is required to identify which special case above + * applies. If this is really a shmem page then the page lock + * will prevent unexpected transitions. + */ + lock_page(page); + shmem_swizzled = PageSwapCache(page) || page->mapping; unlock_page(page_head); put_page(page_head); + if (shmem_swizzled) goto again; + return -EFAULT; } /* * Private mappings are handled in a simple way. * + * If the futex key is stored on an anonymous page, then the associated + * object is the mm which is implicitly pinned by the calling process. + * * NOTE: When userspace waits on a MAP_SHARED mapping, even if * it's a read-only handle, it's expected that futexes attach to * the object not the particular process. @@ -518,16 +544,74 @@ again: key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */ key->private.mm = mm; key->private.address = address; + + get_futex_key_refs(key); /* implies smp_mb(); (B) */ + } else { + struct inode *inode; + + /* + * The associated futex object in this case is the inode and + * the page->mapping must be traversed. Ordinarily this should + * be stabilised under page lock but it's not strictly + * necessary in this case as we just want to pin the inode, not + * update the radix tree or anything like that. + * + * The RCU read lock is taken as the inode is finally freed + * under RCU. If the mapping still matches expectations then the + * mapping->host can be safely accessed as being a valid inode. + */ + rcu_read_lock(); + + if (ACCESS_ONCE(page_head->mapping) != mapping) { + rcu_read_unlock(); + put_page(page_head); + + goto again; + } + + inode = ACCESS_ONCE(mapping->host); + if (!inode) { + rcu_read_unlock(); + put_page(page_head); + + goto again; + } + + /* + * Take a reference unless it is about to be freed. Previously + * this reference was taken by ihold under the page lock + * pinning the inode in place so i_lock was unnecessary. The + * only way for this check to fail is if the inode was + * truncated in parallel so warn for now if this happens. + * + * We are not calling into get_futex_key_refs() in file-backed + * cases, therefore a successful atomic_inc return below will + * guarantee that get_futex_key() will still imply smp_mb(); (B). + */ + if (WARN_ON_ONCE(!atomic_inc_not_zero(&inode->i_count))) { + rcu_read_unlock(); + put_page(page_head); + + goto again; + } + + /* Should be impossible but lets be paranoid for now */ + if (WARN_ON_ONCE(inode->i_mapping != mapping)) { + err = -EFAULT; + rcu_read_unlock(); + iput(inode); + + goto out; + } + key->both.offset |= FUT_OFF_INODE; /* inode-based key */ - key->shared.inode = page_head->mapping->host; + key->shared.inode = inode; key->shared.pgoff = basepage_index(page); + rcu_read_unlock(); } - get_futex_key_refs(key); /* implies MB (B) */ - out: - unlock_page(page_head); put_page(page_head); return err; }