Message-ID: <68f43b57-32b6-1844-a0a6-d22fb0e089aa@bytedance.com>
Date: Mon, 29 Aug 2022 22:00:47 +0800
Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages
To: David Hildenbrand, akpm@linux-foundation.org,
    kirill.shutemov@linux.intel.com, jgg@nvidia.com, tglx@linutronix.de,
    willy@infradead.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, muchun.song@linux.dev
From: Qi Zheng
References: <20220825101037.96517-1-zhengqi.arch@bytedance.com>

On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted
>> patch sets for the following two solutions:
>> - atomic refcount version:
>>   https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>> - percpu refcount version:
>>   https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>>
>> Both patch sets have the following behavior:
>> a. Protect the page table walker by hooking pte_offset_map{_lock}()
>>    and pte_unmap{_unlock}().
>> b. Automatically reclaim PTE page table pages on the non-reclaiming
>>    path.
>>
>> Behavior a has the following disadvantages, as David Hildenbrand
>> pointed out:
>> - It introduces a lot of complexity. It's not something easy to get
>>   in, and most probably not easy to get out again.
>> - It is inconvenient to extend to other architectures. For example,
>>   for the contiguous PTEs of arm64, the pointer to the PTE entry is
>>   obtained directly through pte_offset_kernel() instead of
>>   pte_offset_map{_lock}().
>> - pte_unmap() turned out to be missing in some places that only
>>   execute on 64-bit systems, which is a disaster for pte_refcount.
>>
>> As for behavior b, it may not be necessary to actively reclaim PTE
>> pages, especially when memory pressure is not high; deferring to the
>> reclaim path may be a better choice.
>>
>> In addition, the above two solutions only cover empty PTE pages (PTE
>> pages where all entries are empty) and do not deal with the zero PTE
>> page (a PTE page where all entries map the shared zeropage) that
>> David Hildenbrand described:
>>
>> "Especially the shared zeropage is nasty, because there are
>> sane use cases that can trigger it. Assume you have a VM
>> (e.g., QEMU) that inflated the balloon to return free memory
>> to the hypervisor.
>>
>> Simply migrating that VM will populate the shared zeropage to
>> all inflated pages, because migration code ends up reading all
>> VM memory. Similarly, the guest can just read that memory as
>> well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch is to continue the discussion and fix
>> the above issues. The following is the solution to be discussed.
>
> Thanks for providing an alternative! It's certainly easier to digest :)

Hi David,

Nice to see your reply.

>
>>
>> In order to quickly identify the above two types of PTE pages, we
>> still introduce a pte_refcount for each PTE page. We put the mapped
>> and zero PTE entry counters into the pte_refcount of the PTE page.
>> The bitmask has the following meaning:
>>
>> - bits 0-9 are the mapped PTE entry count
>> - bits 10-19 are the zero PTE entry count
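For illustration only -- this is a sketch, not code from the series, and
all helper names here are hypothetical -- the counter encoding described
above could look roughly like this:

/*
 * Sketch of the pte_refcount encoding described above: two 10-bit
 * counters packed into one unsigned long. Both counters are only
 * updated under the pte lock, so non-atomic accesses suffice.
 */
#define PTE_MAPPED_SHIFT	0
#define PTE_MAPPED_MASK		(0x3ffUL << PTE_MAPPED_SHIFT)	/* bits 0-9 */
#define PTE_ZERO_SHIFT		10
#define PTE_ZERO_MASK		(0x3ffUL << PTE_ZERO_SHIFT)	/* bits 10-19 */

static inline unsigned long pte_mapped_count(unsigned long pte_refcount)
{
	return (pte_refcount & PTE_MAPPED_MASK) >> PTE_MAPPED_SHIFT;
}

static inline unsigned long pte_zero_count(unsigned long pte_refcount)
{
	return (pte_refcount & PTE_ZERO_MASK) >> PTE_ZERO_SHIFT;
}

/* An empty PTE page: none of its PTE entries are mapped. */
static inline bool pte_page_empty(unsigned long pte_refcount)
{
	return pte_mapped_count(pte_refcount) == 0;
}

/* A zero PTE page: every one of its entries maps the shared zeropage. */
static inline bool pte_page_all_zero(unsigned long pte_refcount)
{
	return pte_zero_count(pte_refcount) == PTRS_PER_PTE;
}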
>
> I guess we could factor the zero PTE change out, to have an even
> simpler first version.

OK, we can deal with the empty PTE page case first.

> The issue is that some features (userfaultfd) don't expect page faults
> when something was already mapped previously.
>
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have
> to maintain.

I see. A pte marker entry is a non-present entry, not an empty entry
(pte_none()), so this situation is already handled; that is also what's
done in [RFC PATCH 1/7].

>>
>> In this way, when the mapped PTE entry count is 0, we know that the
>> current PTE page is an empty PTE page, and when the zero PTE entry
>> count is PTRS_PER_PTE, we know that the current PTE page is a zero
>> PTE page.
>>
>> We only update the pte_refcount when setting and clearing PTE entries,
>> and since both operations are protected by the pte lock, pte_refcount
>> can be a non-atomic variable with little performance overhead.
>>
>> For page table walkers, we achieve mutual exclusion by holding the
>> write lock of mmap_lock when doing pmd_clear() (in the newly added
>> path that reclaims PTE pages).
>
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the
> rmap lock(s), to prevent RMAP walkers from still using the page table.

Oh, I forgot this. We should also hold the rmap lock(s), like
move_normal_pmd() does.

>
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.

Maybe we can iterate over the vma list and just process the 2M-aligned
part?

>
> We might want/need another mechanism to synchronize against page table
> walkers.

This is a tricky problem, equivalent to narrowing the protection scope
of mmap_lock. Any preliminary ideas?

Thanks,
Qi
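To make the locking scheme discussed above concrete, here is a rough
sketch of what the reclaim path could look like. This is hypothetical
code, not from the series: it assumes pte_refcount is a field added to
the page table's struct page, handles only the file-backed rmap lock for
brevity, and reuses the hypothetical pte_page_empty() helper sketched
earlier:

/*
 * Hypothetical sketch, not from the series: tear down an empty PTE
 * page under the write lock of mmap_lock plus the rmap lock, similar
 * to what move_normal_pmd() relies on, so that neither mmap_lock-based
 * page table walkers nor RMAP walkers can still use the page table.
 */
static void try_to_free_user_pte(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 pmd_t *pmd, unsigned long addr)
{
	unsigned long start = addr & PMD_MASK;
	spinlock_t *ptl;
	pgtable_t page;

	/* Excludes page table walkers that take mmap_lock. */
	mmap_assert_write_locked(mm);

	/* Excludes RMAP walkers (file-backed case only, for brevity). */
	i_mmap_lock_write(vma->vm_file->f_mapping);

	ptl = pmd_lock(mm, pmd);
	if (!pmd_none(*pmd) && !pmd_trans_huge(*pmd)) {
		page = pmd_pgtable(*pmd);
		/* pte_refcount is an assumed field, as in the series. */
		if (pte_page_empty(page->pte_refcount)) {
			pmd_clear(pmd);
			flush_tlb_range(vma, start, start + PMD_SIZE);
			mm_dec_nr_ptes(mm);
			pte_free(mm, page);
		}
	}
	spin_unlock(ptl);

	i_mmap_unlock_write(vma->vm_file->f_mapping);
}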