From: Yu Zhao
Date: Thu, 11 Aug 2022 16:59:28 -0600
Subject: Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
To: Yang Shi
Cc: "Alex Zhu (Kernel)", Rik van Riel, Kernel Team, "linux-mm@kvack.org", "willy@infradead.org", "linux-kernel@vger.kernel.org", "akpm@linux-foundation.org", Ning Zhang, Miaohe Lin
References: <20220805184016.2926168-1-alexlzhu@fb.com> <0b16dbac6444bfcdfbeb4df4280354839bfe1a8f.camel@fb.com> <1F8B9D85-A735-4832-AD58-CA4BD474248D@fb.com> <868F0874-70E8-4416-B39B-DA74C9D76A40@fb.com> <3195C304-2140-4E5D-890D-AC55653193E5@fb.com>
X-Mailing-List:
linux-kernel@vger.kernel.org

On Thu, Aug 11, 2022 at 4:12 PM Yang Shi wrote:
>
> On Thu, Aug 11, 2022 at 2:55 PM Yu Zhao wrote:
> >
> > On Thu, Aug 11, 2022 at 1:20 PM Alex Zhu (Kernel) wrote:
> > >
> > > Hi Yu,
> > >
> > > I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later?
> >
> > Yes, but it won't change the landscape any time soon (see below). So
> > please feel free to continue along your current direction.
> >
> > > If not, I would like to incorporate this into what I have, and then resubmit. Will reference the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by itself it should show some performance improvement when used by the existing deferred_split_huge_page().
> >
> > SGTM. Thanks!
> >
> > A side note:
> >
> > I'm working on a new mode: THP=auto, meaning the kernel will detect
> > internal fragmentation of 2MB compound pages to decide whether to map
> > them by PMDs or split them under memory pressure. The general workflow
> > of this new mode is as follows.
>
> I tend to agree that avoiding allocating THP in the first place is the
> preferred way to avoid internal fragmentation. But I got some
> questions about your design/implementation:
>
> >
> > In the page fault path:
> > 1. Compound pages are allocated as usual.
> > 2. Each is mapped by 512 consecutive PTEs rather than a PMD.
> > 3. There will be more TLB misses but the same number of page faults.
> > 4. TLB coalescing can mitigate the performance degradation.
>
> Why not just allocate base pages in the first place? Khugepaged has
> max_pte_none tunable to detect internal fragmentation. If you worry
> about zero page, you could add max_pte_zero tunable.
>
> Or did you investigate whether the new MADV_COLLAPSE may be helpful or
> not? It leaves the decision to the userspace.

There are two problems we have to work around:
1. External fragmentation that prevents later compound page allocations.
2. The cost of taking mmap_lock for write.

IIRC, the first reference I listed describes the first problem. (It
uses a similar reservation technique.) From a very high level, smaller
page allocations add more entropy than larger ones and accelerate the
system toward equilibrium, in which state the system can't allocate
more THPs without exerting additional force (compaction). Reserving
compound pages delays the equilibrium, whereas MADV_COLLAPSE tries to
reverse it. The latter has a higher cost. In addition, it needs to
take mmap_lock for write.

> > In khugepaged:
> > 1. Check the dirty bit in the PTEs mapping a compound page, to
> > determine its utilization.
> > 2. Remap compound pages that meet a certain utilization threshold by
> > PMDs in place, i.e., no migrations.
> >
> > In the reclaim path, e.g., MGLRU page table scanning:
> > 1. Decide whether compound pages mapped by PTEs should be split based
> > on their utilizations and memory pressure, e.g., reclaim priority.
> > 2. Clean subpages should be freed directly after split, rather than swapped out.
> >
> > N.B.
> > 1. This workflow relies on the dirty bit rather examining the content of a page.
> > 2. Sampling can be done by periodically switching between a PMD and
> > 512 consecutive PTEs.
> > 3. It only needs to hold mmap_lock for read because this special mode
> > (512 consecutive PTEs) is not considered the split mode.
> > 4. Don't hold your breath :)
> >
> > Other references:
> > 1. https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
> > 2. https://www.usenix.org/system/files/osdi21-hunter.pdf
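
For illustration only, and not taken from any posted patch: the
khugepaged and reclaim steps quoted above can be modelled in a few
lines of userspace Python. This is a toy sketch under stated
assumptions. The names (utilization, decide, clean_subpages, Action)
and both threshold values are hypothetical, and a list of booleans
stands in for the dirty bits of the 512 PTEs mapping one 2MB compound
page.

```python
from enum import Enum

# 2MB compound page / 4KB base pages = 512 PTEs (x86-64 sizes assumed).
PTES_PER_PMD = 512


class Action(Enum):
    REMAP_BY_PMD = "remap_by_pmd"  # khugepaged: remap in place, no migration
    SPLIT = "split"                # reclaim: split and free clean subpages
    KEEP_PTES = "keep_ptes"        # leave the 512-PTE mapping as is


def utilization(dirty_bits):
    """Return the fraction of subpages whose PTE dirty bit is set."""
    if len(dirty_bits) != PTES_PER_PMD:
        raise ValueError("expected one dirty bit per PTE (512 in total)")
    return sum(1 for dirty in dirty_bits if dirty) / PTES_PER_PMD


def decide(dirty_bits, under_pressure,
           remap_threshold=0.9, split_threshold=0.5):
    """Pick an action for one PTE-mapped compound page.

    Both thresholds are made-up values for this sketch; the thread does
    not specify them.
    """
    used = utilization(dirty_bits)
    if used >= remap_threshold:
        return Action.REMAP_BY_PMD
    if under_pressure and used < split_threshold:
        return Action.SPLIT
    return Action.KEEP_PTES


def clean_subpages(dirty_bits):
    """Indices of subpages that could be freed directly after a split."""
    return [i for i, dirty in enumerate(dirty_bits) if not dirty]
```

A page with only its first 64 subpages dirtied has a utilization of
0.125: with the thresholds assumed here it would be split under memory
pressure, leaving 448 clean subpages to free, and left PTE-mapped
otherwise.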