Date: Thu, 19 Apr 2018 12:44:32 +0200
From: Jan Kara
To: Dan Williams
Cc: Jan Kara, linux-nvdimm, Jeff Moyer, Dave Chinner, Matthew Wilcox,
    Alexander Viro, "Darrick J. Wong", Ross Zwisler, Dave Hansen,
    Andrew Morton, Christoph Hellwig, linux-fsdevel, linux-xfs,
    Linux Kernel Mailing List, Mike Snitzer, Paul McKenney, Josh Triplett
Subject: Re: [PATCH v8 15/18] mm, fs, dax: handle layout changes to pinned dax mappings
Message-ID: <20180419104432.7lzk7nbjmwav6ojl@quack2.suse.cz>
References: <152246892890.36038.18436540150980653229.stgit@dwillia2-desk3.amr.corp.intel.com>
    <152246901060.36038.4487158506830998280.stgit@dwillia2-desk3.amr.corp.intel.com>
    <20180404094656.dssixqvvdcp5jff2@quack2.suse.cz>
    <20180409164944.6u7i4wgbp6yihvin@quack2.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 13-04-18 15:03:51, Dan Williams wrote:
> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams wrote:
> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara wrote:
> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
> > [..]
> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
> >>> need to wait for a global quiescent state, just a
> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
> >>> srcu api?
> >>
> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
> >> on implementation properties of SRCU which is a data structure protection
> >> method. And the overhead of percpu rwsemaphore for your use case should be
> >> about the same as that of SRCU.
> >
> > I was just about to ask that. Yes, it seems they would share similar
> > properties and it would be better to use the explicit implementation
> > rather than a side effect of srcu.
>
> ...unfortunately:
>
>  BUG: sleeping function called from invalid context at
> ./include/linux/percpu-rwsem.h:34
>  [..]
>  Call Trace:
>   dump_stack+0x85/0xcb
>   ___might_sleep+0x15b/0x240
>   dax_layout_lock+0x18/0x80
>   get_user_pages_fast+0xf8/0x140
>
> ...and thinking about it more srcu is a better fit. We don't need the
> 100% exclusion provided by an rwsem we only need the guarantee that
> all cpus that might have been running get_user_pages_fast() have
> finished it at least once.
>
> In my tests synchronize_srcu is a bit slower than unpatched for the
> trivial 100 truncate test, but certainly not the 200x latency you were
> seeing with synchronize_rcu.
>
> Elapsed time:
> 0.006149178 unpatched
> 0.009426360 srcu

Hum, right. Yesterday I was looking into KSM for a different reason and I
noticed that it also write-protects pages and deals with races with GUP.
What KSM relies on is:

write_protect_page()
	...
	entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	/*
	 * Check that no O_DIRECT or similar I/O is in progress on the
	 * page
	 */
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		page used -> bail
	}

And this really works because gup_pte_range() does:

	page = pte_page(pte);
	head = compound_head(page);

	if (!page_cache_get_speculative(head))
		goto pte_unmap;

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		bail
	}

So either write_protect_page() sees the elevated reference or
gup_pte_range() bails because it sees that the pte has changed.

In the truncate path things are a bit different, but in principle the same
should work: once truncate blocks page faults and unmaps pages from page
tables, we can be sure that either GUP will not grab the page anymore or
we'll see an elevated page count. So IMO there's no need for any additional
locking against the GUP path (but a comment explaining this is highly
desirable, I guess).

								Honza
-- 
Jan Kara
SUSE Labs, CR
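
To make the ordering argument above concrete, here is a minimal userspace
sketch in C11 atomics. It is not kernel code: every toy_* name, the pte
value 0x1000 and the "mapcount + 1" baseline are hypothetical stand-ins
chosen for illustration, and the default sequentially consistent atomics
only approximate the barriers implied by ptep_clear_flush() and
page_cache_get_speculative(). The GUP side is split into steps so that the
two interesting interleavings can be replayed deterministically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the kernel's struct page or pte_t. */
struct toy_page {
	atomic_int refcount;	/* models page_count() */
	int mapcount;		/* models page_mapcount(), fixed in this toy */
};

struct toy_pte {
	_Atomic uintptr_t val;	/* 0 means "not present" */
};

struct toy_gup {
	uintptr_t seen;		/* pte value sampled in step 1 */
};

/* GUP step 1: sample the pte; false if the page is already unmapped. */
static bool toy_gup_read_pte(struct toy_gup *g, struct toy_pte *ptep)
{
	g->seen = atomic_load(&ptep->val);
	return g->seen != 0;
}

/* GUP step 2: take the speculative reference. */
static void toy_gup_get_ref(struct toy_page *page)
{
	atomic_fetch_add(&page->refcount, 1);
}

/* GUP step 3: re-check the pte; drop the reference and bail if it changed. */
static bool toy_gup_recheck(struct toy_gup *g, struct toy_pte *ptep,
			    struct toy_page *page)
{
	if (atomic_load(&ptep->val) != g->seen) {
		int old = atomic_fetch_sub(&page->refcount, 1);
		assert(old > 0);
		return false;
	}
	return true;
}

/* Unmapper: clear the pte first, then look for unexpected references. */
static bool toy_unmap_and_check(struct toy_pte *ptep, struct toy_page *page)
{
	uintptr_t old = atomic_exchange(&ptep->val, 0);

	/* Baseline assumed here: one reference per mapping plus our own. */
	if (atomic_load(&page->refcount) != page->mapcount + 1) {
		atomic_store(&ptep->val, old);	/* page in use: restore, bail */
		return false;
	}
	return true;
}

static void toy_setup(struct toy_page *page, struct toy_pte *pte)
{
	page->mapcount = 1;
	atomic_init(&page->refcount, 2);
	atomic_init(&pte->val, (uintptr_t)0x1000);
}

/* Unmapper runs between GUP's pte sample and its reference grab. */
static void toy_unmapper_first(bool *unmap_ok, bool *gup_ok)
{
	struct toy_page page;
	struct toy_pte pte;
	struct toy_gup g;

	toy_setup(&page, &pte);
	toy_gup_read_pte(&g, &pte);
	*unmap_ok = toy_unmap_and_check(&pte, &page);
	toy_gup_get_ref(&page);
	*gup_ok = toy_gup_recheck(&g, &pte, &page);
}

/* GUP takes its reference before the unmapper clears the pte. */
static void toy_gup_ref_first(bool *unmap_ok, bool *gup_ok)
{
	struct toy_page page;
	struct toy_pte pte;
	struct toy_gup g;

	toy_setup(&page, &pte);
	toy_gup_read_pte(&g, &pte);
	toy_gup_get_ref(&page);
	*unmap_ok = toy_unmap_and_check(&pte, &page);
	*gup_ok = toy_gup_recheck(&g, &pte, &page);
}
```

In the first interleaving the unmapper succeeds and the GUP side bails on
the changed pte; in the second the unmapper sees the extra reference and
bails while GUP keeps the page. In neither replay do both sides succeed,
which is the property the argument above relies on.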