Date: Mon, 19 Jun 2023 17:23:30 +0900
From: Naoya Horiguchi
To: Mike Kravetz
Cc: Jiaqi Yan, HORIGUCHI NAOYA (堀口 直也), songmuchun@bytedance.com,
    shy828301@gmail.com, linmiaohe@huawei.com, akpm@linux-foundation.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, duenwen@google.com,
    axelrasmussen@google.com, jthoughton@google.com
Subject: Re: [PATCH v1 1/3] mm/hwpoison: find subpage in hugetlb HWPOISON list
Message-ID: <20230619082330.GA1612447@ik1-406-35019.vs.sakura.ne.jp>
References: <20230522044557.GA845371@hori.linux.bs1.fc.nec.co.jp>
 <20230523024305.GA920098@hori.linux.bs1.fc.nec.co.jp>
 <20230612041901.GA3083591@ik1-406-35019.vs.sakura.ne.jp>
 <20230616233447.GB7371@monkey>
 <20230617225927.GA3540@monkey>
In-Reply-To: <20230617225927.GA3540@monkey>

On Sat, Jun 17, 2023 at 03:59:27PM -0700, Mike Kravetz wrote:
> On 06/16/23 19:18, Jiaqi Yan wrote:
> > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz wrote:
> > > On 06/16/23 14:19, Jiaqi Yan wrote:
> > > >
> > > > Now looking at this again, I think concurrent adding and deleting are
> > > > fine with each other and with themselves, because raw_hwp_list is a
> > > > lock-less llist.
> > >
> > > Correct.
> > >
> > > > As for synchronizing traversal with adding and deleting, I wonder if it
> > > > is a good idea to make __update_and_free_hugetlb_folio hold
> > > > hugetlb_lock before it calls folio_clear_hugetlb_hwpoison (which
> > > > traverses and deletes from raw_hwp_list)? In hugetlb,
> > > > get_huge_page_for_hwpoison already takes hugetlb_lock; it seems to me
> > > > __update_and_free_hugetlb_folio is missing the lock.
> > >
> > > I do not think the lock is needed. However, while looking more closely
> > > at this I think I discovered another issue. This is VERY subtle.
> > > Perhaps Naoya can help verify if my reasoning below is correct.
> > >
> > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb
> > > page. Why is this? Before calling update_and_free_hugetlb_folio we call
> > > remove_hugetlb_folio. The purpose of remove_hugetlb_folio is to remove
> > > the huge page from the list AND to change the compound page destructor
> > > that marks this as a hugetlb page. This is all done while holding the
> > > hugetlb lock. So, the test for folio_test_hugetlb(folio) is false.
> > >
> > > We technically have a compound non-hugetlb page with a non-null
> > > raw_hwp_list.
> > >
> > > Important note: at this time we have not reallocated vmemmap pages if
> > > the hugetlb page was vmemmap optimized. That is done later in
> > > __update_and_free_hugetlb_folio.
> > >
> > > The 'good news' is that after this point get_huge_page_for_hwpoison
> > > will not recognize this as a hugetlb page, so nothing will be added to
> > > the list. There is no need to worry about entries being added to the
> > > list during traversal.
> > >
> > > The 'bad news' is that if we get a memory error at this time we will
> > > treat it as a memory error on a regular compound page. So,
> > > TestSetPageHWPoison(p) in memory_failure() may try to write a
> > > read-only struct page. :(
> >
> > At least I think this is an issue.
> >
> > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock
> > until update_and_free_hugetlb_folio is done, or basically until
> > dissolve_free_huge_page is done?
>
> Unfortunately, update_and_free_hugetlb_folio is designed to be called
> without locks held. This is because we can not hold any locks while
> allocating vmemmap pages.
>
> I'll try to think of some way to restructure the code. IIUC, this is a
> potential general issue, not just isolated to memory error handling.

Considering this issue as one specific to memory error handling, checking
HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to
detect the race. Then, an idea like the diff below (not tested) can make
try_memory_failure_hugetlb() retry (retaking hugetlb_lock) and wait for the
allocation of vmemmap pages to complete.

@@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	int ret = 2;	/* fallback to normal page handling */
 	bool count_increased = false;
 
-	if (!folio_test_hugetlb(folio))
+	if (!folio_test_hugetlb(folio)) {
+		if (folio_test_hugetlb_vmemmap_optimized(folio))
+			ret = -EBUSY;
 		goto out;
+	}
 
 	if (flags & MF_COUNT_INCREASED) {
 		ret = 1;

Thanks,
Naoya Horiguchi

> --
> Mike Kravetz
>
> >
> > TestSetPageHWPoison in memory_failure is called after
> > try_memory_failure_hugetlb, and folio_test_hugetlb is tested within
> > __get_huge_page_for_hwpoison, which is wrapped by the hugetlb_lock.
> > So by the time dissolve_free_huge_page returns, subpages already go
> > through hugetlb_vmemmap_restore and __destroy_compound_gigantic_folio
> > and become non-compound raw pages (folios). Now
> > folio_test_hugetlb(p)=false will be correct for memory_failure, and it
> > can recover p as a dissolved non-hugetlb page.
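
A minimal sketch of the lock-less pattern the thread is relying on. The names
mirror mm/memory-failure.c (struct raw_hwp_page, the llist-backed
raw_hwp_list), but this is an illustration under those assumptions, not the
upstream code: writers push entries with llist_add() without any lock, and the
consumer detaches the whole list with llist_del_all() before walking it, which
is why concurrent adds and the final traversal/free need no extra locking
between themselves.

#include <linux/llist.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Mirrors struct raw_hwp_page in mm/memory-failure.c (illustrative). */
struct raw_hwp_page {
	struct llist_node node;
	struct page *page;		/* raw (sub)page that saw the error */
};

/*
 * Writer side: several CPUs handling errors on the same hugetlb folio may
 * push entries concurrently; llist_add() is a lock-less cmpxchg loop.
 */
static int record_raw_hwp_page(struct llist_head *head, struct page *subpage)
{
	struct raw_hwp_page *raw_hwp = kmalloc(sizeof(*raw_hwp), GFP_ATOMIC);

	if (!raw_hwp)
		return -ENOMEM;
	raw_hwp->page = subpage;
	llist_add(&raw_hwp->node, head);
	return 0;
}

/*
 * Reader/free side: detach the whole list first, then walk the now-private
 * copy, so no lock is needed against concurrent llist_add() either.
 */
static void free_raw_hwp_list(struct llist_head *head)
{
	struct raw_hwp_page *p, *next;
	struct llist_node *first = llist_del_all(head);

	llist_for_each_entry_safe(p, next, first, node) {
		SetPageHWPoison(p->page);	/* or skip, depending on the caller */
		kfree(p);
	}
}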
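
A rough caller-side sketch of how the -EBUSY from the diff above could be
consumed. The MF_NO_RETRY flag and the try_memory_failure_hugetlb() name
mirror mm/memory-failure.c, but the body and the hypothetical
hwpoison_check_hugetlb() helper below are assumptions for illustration only,
not the upstream implementation:

#include <linux/mm.h>	/* MF_NO_RETRY */

/*
 * Hypothetical stand-in for get_huge_page_for_hwpoison(): the real helper
 * takes hugetlb_lock, calls __get_huge_page_for_hwpoison(), and drops the
 * lock before returning.  Here it just pretends we raced with the vmemmap
 * restore in __update_and_free_hugetlb_folio().
 */
static int hwpoison_check_hugetlb(unsigned long pfn, int flags)
{
	return -EBUSY;
}

/* Caller-side retry, modelled on try_memory_failure_hugetlb(). */
static int try_memory_failure_hugetlb_sketch(unsigned long pfn, int flags)
{
	int res;

retry:
	res = hwpoison_check_hugetlb(pfn, flags);
	if (res == -EBUSY && !(flags & MF_NO_RETRY)) {
		/*
		 * Not (or no longer) a hugetlb folio, but HPG_vmemmap_optimized
		 * is still set: __update_and_free_hugetlb_folio() has not yet
		 * restored the (possibly read-only) tail struct pages.  Back
		 * off and look again instead of poisoning a subpage right now.
		 */
		flags |= MF_NO_RETRY;
		goto retry;
	}

	/* ... handle the remaining return values as memory_failure() does ... */
	return res;
}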