Message-ID: <9e5c199a-9b4d-4d1b-97d4-dd2b776ac85f@huawei.com>
Date: Fri, 1 Dec 2023 19:09:29 +0800
Subject: Re: [PATCH 1/4] mm: pagewalk: assert write mmap lock only for
 walking the user page tables
From: Kefeng Wang
To: Muchun Song
In-Reply-To: <20231127084645.27017-2-songmuchun@bytedance.com>
References: <20231127084645.27017-1-songmuchun@bytedance.com>
 <20231127084645.27017-2-songmuchun@bytedance.com>

On 2023/11/27 16:46, Muchun Song wrote:
> Commit 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
> introduces an assertion in walk_page_range_novma() to ensure that all users
> of the page table walker are safe. However, the race only exists when walking
> the user page tables, and it is unreasonable to hold a particular user's mmap
> write lock against changes to the kernel page tables.
> So only assert that at least the mmap read lock is held when walking the
> kernel page tables. Some users matching this case can then downgrade to a
> mmap read lock to relieve the contention on the mmap lock of init_mm; the
> next patch makes use of this in hugetlb (holding only the mmap read lock).
>
> Signed-off-by: Muchun Song
> ---
>  mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index b7d7e4fcfad7a..f46c80b18ce4f 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -539,6 +539,11 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * not backed by VMAs. Because 'unusual' entries may be walked this function
>   * will also not lock the PTEs for the pte_entry() callback. This is useful for
>   * walking the kernel pages tables or page tables for firmware.
> + *
> + * Note: Be careful when walking the kernel page tables: the caller may need
> + * to take other effective approaches (the mmap lock may be insufficient) to
> + * prevent the intermediate kernel page tables belonging to the specified
> + * address range from being freed (e.g. by memory hot-remove).
>   */
>  int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  			  unsigned long end, const struct mm_walk_ops *ops,
> @@ -556,7 +561,29 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
>  
> -	mmap_assert_write_locked(walk.mm);
> +	/*
> +	 * 1) For walking the user virtual address space:
> +	 *
> +	 *    The mmap lock protects the page walker from changes to the page
> +	 *    tables during the walk. However, a read lock is insufficient to
> +	 *    protect those areas which don't have a VMA, as munmap() detaches
> +	 *    the VMAs before downgrading to a read lock and actually tearing
> +	 *    down PTEs/page tables. In that case, the mmap write lock must
> +	 *    be held.
> +	 *
> +	 * 2) For walking the kernel virtual address space:
> +	 *
> +	 *    The intermediate kernel page tables are usually not freed, so
> +	 *    the mmap read lock is sufficient. But there are exceptions,
> +	 *    e.g. memory hot-remove, where the mmap lock is insufficient to
> +	 *    prevent the intermediate kernel page tables belonging to the
> +	 *    specified address range from being freed. The caller must take
> +	 *    other actions to prevent this race.
> +	 */
> +	if (mm == &init_mm)
> +		mmap_assert_locked(walk.mm);
> +	else
> +		mmap_assert_write_locked(walk.mm);

Maybe just use process_mm_walk_lock() and set the correct page_walk_lock in
struct mm_walk_ops?

> 
>  	return walk_pgd_range(start, end, &walk);
>  }
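
Something like the following, as a rough sketch (untested, assuming the
page_walk_lock enum and the process_mm_walk_lock() helper that already exist
in mm/pagewalk.c for the VMA-based walkers): the open-coded init_mm check
above would collapse to

	/* reuse the existing helper; callers declare the lock they hold */
	process_mm_walk_lock(walk.mm, ops->walk_lock);

and each novma user would declare its lock in its ops, e.g.

	static const struct mm_walk_ops kernel_walk_ops = {
		.pte_entry = kernel_pte_entry,	/* hypothetical callback */
		.walk_lock = PGWALK_RDLOCK,	/* init_mm walk: read lock */
	};

Since process_mm_walk_lock() maps PGWALK_RDLOCK to mmap_assert_locked() and
the PGWALK_WRLOCK* values to mmap_assert_write_locked(), user novma walkers
would keep PGWALK_WRLOCK while kernel walkers could use PGWALK_RDLOCK.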
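
For reference, the caller-side pattern this patch enables for kernel-address
walks would look roughly like this (illustrative only; kernel_walk_ops is the
hypothetical ops instance from above):

	mmap_read_lock(&init_mm);
	ret = walk_page_range_novma(&init_mm, start, end, &kernel_walk_ops,
				    NULL, NULL);
	mmap_read_unlock(&init_mm);

whereas before this change such callers had to take mmap_write_lock(&init_mm)
purely to satisfy the assertion.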