From: Kairui Song
Date: Tue, 28 Nov 2023 19:22:10 +0800
Subject: Re: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
To: Chris Li
Cc: linux-mm@kvack.org, Andrew Morton, "Huang, Ying", David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel@vger.kernel.org
References: <20231119194740.94101-1-ryncsn@gmail.com> <20231119194740.94101-19-ryncsn@gmail.com>
Chris Li wrote on Wed, 22 Nov 2023 at 12:41:
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song wrote:
> >
> > From: Kairui Song
> >
> > There are two places where swapin is not directly caused by a page fault:
> > shmem swapin is invoked through the shmem mapping, and swapoff causes
> > swapin by walking the page table. Both used to construct a pseudo vmfault
> > struct for the swapin function.
> >
> > Shmem dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The
> > swapoff path is still using a pseudo vmfault.
> >
> > Introduce a helper for them both. This helps save stack usage for the
> > swapoff path, and helps apply a unified swapin cache and readahead
> > policy check.
> >
> > Also prepare for follow-up commits.
> >
> > Signed-off-by: Kairui Song
> > ---
> >  mm/shmem.c      | 51 ++++++++++++++++---------------------------------
> >  mm/swap.h       | 11 +++++++++++
> >  mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c   | 23 +++++++++++-----------
> >  4 files changed, 76 insertions(+), 47 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index f9ce4067c742..81d129aa66d1 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> >  static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
> >                                                 pgoff_t index, unsigned int order, pgoff_t *ilx);
> >
> > -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> > -               struct shmem_inode_info *info, pgoff_t index)
> > -{
> > -       struct mempolicy *mpol;
> > -       pgoff_t ilx;
> > -       struct page *page;
> > -
> > -       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > -       page = swap_cluster_readahead(swap, gfp, mpol, ilx);
> > -       mpol_cond_put(mpol);
> > -
> > -       if (!page)
> > -               return NULL;
> > -       return page_folio(page);
> > -}
> > -
>
> Nice. Thank you.
>
> >  /*
> >   * Make sure huge_gfp is always more limited than limit_gfp.
> >   * Some of the flags set permissions, while others set limitations.
> > @@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >  {
> >         struct address_space *mapping = inode->i_mapping;
> >         struct shmem_inode_info *info = SHMEM_I(inode);
> > -       struct swap_info_struct *si;
> > +       enum swap_cache_result result;
> >         struct folio *folio = NULL;
> > +       struct mempolicy *mpol;
> > +       struct page *page;
> >         swp_entry_t swap;
> > +       pgoff_t ilx;
> >         int error;
> >
> >         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > @@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >         if (is_poisoned_swp_entry(swap))
> >                 return -EIO;
> >
> > -       si = get_swap_device(swap);
> > -       if (!si) {
> > +       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > +       page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);

Hi Chris,

I've been trying to address these issues in V2. Most issues in the other
patches have a straightforward solution, and some could be discussed in a
separate series, but I came up with some thoughts here:

> Notice this "result" CAN be outdated. e.g. after this call, the swap
> cache can be changed by another thread generating the swap page fault
> and installing the folio into the swap cache or removing it.

This is true, and it seems a potential race also exists before this series
for the direct (no swapcache) swapin path (do_swap_page), if I understand
it correctly:

In the do_swap_page path, multiple processes could swap in the page at the
same time (a page mapped once can still be shared by sub-threads), and they
could get different folios. The later pte lock and pte_same check are not
enough, because while one process is not holding the pte lock, another
process could read the page in, swap_free the entry, then swap the page out
again reusing the same entry: an ABA problem. The race is not likely to
happen in reality, but it is possible in theory.
The same issue exists for shmem here: there are shmem_confirm_swap /
shmem_add_to_page_cache checks later to prevent re-installing into the
shmem mapping for direct swapin, but they are also not enough. Another
process could read the page in and swap it out again using the same entry,
so the mapping entry appears unchanged during the time window. Again, very
unlikely to happen in reality, but not impossible.

When the swapcache is used there is no such issue, since the swap lock and
swap_map are used to sync all readers, and while one reader is still
holding the folio, the entry stays locked through the swapcache; if a folio
is removed from the swapcache, folio_test_swapcache will fail, and the
reader can retry.

I'm trying to come up with better locking for direct swapin. Am I missing
anything here? Correct me if I got it wrong...