Date: Mon, 11 May 2020 11:06:48 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Hugh Dickins
Cc: Joonsoo Kim, Alex Shi, Shakeel Butt, Michal Hocko, "Kirill A. Shutemov",
Shutemov" , Roman Gushchin , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, Andrew Morton Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API Message-ID: <20200511150648.GA306292@cmpxchg.org> References: <20200420221126.341272-1-hannes@cmpxchg.org> <20200420221126.341272-6-hannes@cmpxchg.org> <20200422064041.GE6780@js1304-desktop> <20200422120946.GA358439@cmpxchg.org> <20200423052450.GA12538@js1304-desktop> <20200508160122.GB181181@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 11, 2020 at 12:38:04AM -0700, Hugh Dickins wrote: > On Fri, 8 May 2020, Johannes Weiner wrote: > > > > I looked at this some more, as well as compared it to non-shmem > > swapping. My conclusion is - and Hugh may correct me on this - that > > the deletion looks mandatory but is actually an optimization. Page > > reclaim will ultimately pick these pages up. > > > > When non-shmem pages are swapped in by readahead (locked until IO > > completes) and their page tables are simultaneously unmapped, the > > zap_pte_range() code calls free_swap_and_cache() and the locked pages > > are stranded in the swap cache with no page table references. We rely > > on page reclaim to pick them up later on. > > > > The same appears to be true for shmem. If the references to the swap > > page are zapped while we're trying to swap in, we can strand the page > > in the swap cache. But it's not up to swapin to detect this reliably, > > it just frees the page more quickly than having to wait for reclaim. > > I think you've got all that exactly right, thanks for working it out. > It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp() > VM_BUG_ON") - in which I also had to thank you. I should have looked where it actually came from - I had forgotten about that patch! > I think I chose to do the delete_from_swap_cache() right there, partly > because of following shmem_unuse_inode() code which already did that, > partly on the basis that while we have to observe the case then it's > better to clean it up, and partly out of guilt that our page lock here > is what had prevented shmem_undo_range() from completing its job; but > I believe you're right that unused swapcache reclaim would sort it out > eventually. That makes sense to me. > > diff --git a/mm/shmem.c b/mm/shmem.c > > index e80167927dce..236642775f89 100644 > > --- a/mm/shmem.c > > +++ b/mm/shmem.c > > @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page, > > xas_lock_irq(&xas); > > entry = xas_find_conflict(&xas); > > if (entry != expected) > > - xas_set_err(&xas, -EEXIST); > > + xas_set_err(&xas, expected ? -ENOENT : -EEXIST); > > Two things on this. > > Minor matter of taste, I'd prefer that as > xas_set_err(&xas, entry ? -EEXIST : -ENOENT); > which would be more general and more understandable - > but what you have written should be fine for the actual callers. Yes, checking `expected' was to differentiate the behavior depending on the callsite. But testing `entry' is more obvious in that location. > Except... I think returning -ENOENT there will not work correctly, > in the case of a punched hole. 
> Not something that needs fixing in a hurry (it took trinity to
> generate this racy case in the first place), I'll take another look
> once I've pulled it into a tree (or collected next mmotm) - unless
> you've already changed it around by then.

Attaching a delta fix based on your observations.

Andrew, barring any objections to this, could you please fold it into
the version you have in your tree already?

---
From 33d03ceebce0a6261d472ddc9c5a07940f44714c Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 11 May 2020 10:45:14 -0400
Subject: [PATCH] mm: memcontrol: convert page cache to a new
 mem_cgroup_charge() API fix

Incorporate Hugh's feedback:

- shmem_getpage_gfp() needs to handle the new -ENOENT that was
  previously implied in the -EEXIST when a swap entry changed under
  us in any way. Otherwise, hole punching could cause a racing fault
  to SIGBUS instead of allocating a new page.

- It is indeed page reclaim that picks up any swapcache we leave
  stranded when free_swap_and_cache() runs on a page locked by
  somebody else. Document that our delete_from_swap_cache() is an
  optimization, not something we rely on for correctness.

- Style cleanup: testing `expected' to decide on -EEXIST vs -ENOENT
  differentiates the callsites, but is a bit awkward to read. Test
  `entry' instead.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/shmem.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index afd5a057ebb7..00fb001e8f3e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -638,7 +638,7 @@ static int shmem_add_to_page_cache(struct page *page,
 		xas_lock_irq(&xas);
 		entry = xas_find_conflict(&xas);
 		if (entry != expected)
-			xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
+			xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
 		xas_create_range(&xas);
 		if (xas_error(&xas))
 			goto unlock;
@@ -1686,10 +1686,13 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
 		 * We already confirmed swap under page lock, but
 		 * free_swap_and_cache() only trylocks a page, so it
 		 * is just possible that the entry has been truncated
-		 * or holepunched since swap was confirmed.
-		 * shmem_undo_range() will have done some of the
-		 * unaccounting, now delete_from_swap_cache() will do
-		 * the rest.
+		 * or holepunched since swap was confirmed. This could
+		 * occur at any time while the page is locked, and
+		 * usually page reclaim will take care of the stranded
+		 * swapcache page. But when we catch it, we may as
+		 * well clean up after ourselves: shmem_undo_range()
+		 * will have done some of the unaccounting, now
+		 * delete_from_swap_cache() will do the rest.
 		 */
 		if (error == -ENOENT)
 			delete_from_swap_cache(page);
@@ -1765,7 +1768,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	if (xa_is_value(page)) {
 		error = shmem_swapin_page(inode, index, &page, sgp,
 					  gfp, vma, fault_type);
-		if (error == -EEXIST)
+		if (error == -EEXIST || error == -ENOENT)
 			goto repeat;

 		*pagep = page;
-- 
2.26.2