From: Lance Yang
Date: Sat, 3 Feb 2024 12:17:45 +0800
Subject: Re: [PATCH 1/1] mm/khugepaged: skip copying lazyfree pages on collapse
To: Yang Shi, Michal Hocko, David Hildenbrand
Cc: akpm@linux-foundation.org, zokeefe@google.com, songmuchun@bytedance.com, peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240201125226.28372-1-ioworker0@gmail.com>

Hey Michal, David, Yang,

I sincerely appreciate your time!

I still have two questions that are perplexing me.

First question:
Given that khugepaged doesn't treat MADV_FREE pages as pte_none,
why does it skip the 2M block when all the pages within the range
are old and unreferenced, but not when only part of the range is
MADV_FREE, even if those pages are not redirtied? Why make this
distinction? Would it not be simpler to maintain if such ranges
were either always skipped or never skipped?

Second question:
Does copying lazyfree pages (not redirtied) to the new huge page
during khugepaged collapse undermine the semantics of MADV_FREE?
Users mark pages as lazyfree with MADV_FREE, expecting these pages
to be eventually reclaimed. After the collapse, however, these
pages will no longer be reclaimed, even without subsequent writes
and even under memory pressure.

BR,
Lance

On Sat, Feb 3, 2024 at 1:42 AM Yang Shi wrote:
>
> On Fri, Feb 2, 2024 at 6:53 AM Lance Yang wrote:
> >
> > How about blocking khugepaged from
> > collapsing lazyfree pages? This way,
> > is it not better to keep the semantics
> > of MADV_FREE?
> >
> > What do you think?
>
> First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
> IIUC. The khugepaged does skip the 2M block if all the pages are old
> and unreferenced pages in the range in hpage_collapse_scan_pmd(), then
> repeat the check in collapse_huge_page() again.
>
> And MADV_FREE pages are just old and unreferenced. This is actually
> what your first test case does. The whole 2M range is MADV_FREE range,
> so they are skipped by khugepaged.
>
> But if the partial range is MADV_FREE, khugepaged won't skip them.
> This is what your second test case does.
>
> Secondly, I think it depends on the semantics of MADV_FREE,
> particularly how to treat the redirtied pages. TBH I'm always confused
> by the semantics. For example, the page contained "abcd", then it was
> MADV_FREE'ed, then it was written again with "1234" after "abcd".
> So the user should expect to see "abcd1234" or "00001234"?
>
> I suppose it should be "abcd1234" since MADV_FREE pages are still
> valid and available, if I'm wrong please feel free to correct me. If
> so we should always copy MADV_FREE pages in khugepaged regardless of
> whether it is redirtied or not otherwise it may incur data corruption.
> If we don't copy, then the follow up redirty after collapse to the
> hugepage may return "00001234", right?
>
> The current behavior is copying the page.
>
> >
> > Thanks,
> > Lance
> >
> > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko wrote:
> > >
> > > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > > Here is a part from the man page explaining
> > > > the MADV_FREE semantics:
> > > >
> > > > The kernel can thus free these pages, but the
> > > > freeing could be delayed until memory pressure
> > > > occurs. For each of the pages that has been
> > > > marked to be freed but has not yet been freed,
> > > > the free operation will be canceled if the caller
> > > > writes into the page. If there is no subsequent
> > > > write, the kernel can free the pages at any time.
> > > >
> > > > IIUC, if there is no subsequent write, lazyfree
> > > > pages will eventually be reclaimed.
> > >
> > > If there is no memory pressure then this might not
> > > ever happen. User cannot make any assumption about
> > > their content once madvise call has been done. The
> > > content has to be considered lost. Sure the userspace
> > > might have means to tell those pages from zero pages
> > > and recheck after the write but that is about it.
> > >
> > > > khugepaged
> > > > treats lazyfree pages the same as pte_none,
> > > > avoiding copying them to the new huge page
> > > > during collapse. It seems that lazyfree pages
> > > > are reclaimed before khugepaged collapses them.
> > > > This aligns with user expectations.
> > > >
> > > > However, IMO, if the content of MADV_FREE pages
> > > > remains valid during collapse, then khugepaged
> > > > treating lazyfree pages the same as pte_none
> > > > might not be suitable.
> > >
> > > Why?
> > >
> > > Unless I am missing something (which is possible of
> > > course) I do not really see why dropping the content
> > > of those pages and replacing them with a THP is any
> > > different from reclaiming those pages and then faulting
> > > in a non-THP zero page.
> > >
> > > Now, if khugepaged reused the original content of MADV_FREE
> > > pages that would be a slightly different story. I can
> > > see why users would expect zero pages to back madvised
> > > area.
> > > --
> > > Michal Hocko
> > > SUSE Labs
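
To make the "abcd1234" vs "00001234" question concrete, here is a minimal
userspace sketch of the cases discussed in this thread. It is a toy model
only, not kernel code: struct toy_page, enum toy_event and every toy_*
function are hypothetical names invented for illustration, the 8-byte
"page" is an assumption made for readability, and the idea that either
collapse variant ends the lazyfree state is a simplification of what the
thread describes. Zero bytes are rendered as '0' to match the notation
used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SZ 8 /* tiny "page" so the contents are easy to read */

/* Toy model of one anonymous page. Hypothetical; not a kernel structure. */
struct toy_page {
    char data[TOY_PAGE_SZ];
    bool lazyfree; /* marked by MADV_FREE and not written since */
};

/* What happens between MADV_FREE and the next write. */
enum toy_event {
    TOY_NOTHING,       /* no memory pressure, no collapse */
    TOY_RECLAIM,       /* the lazyfree page is freed */
    TOY_COLLAPSE_COPY, /* collapse copies the old content */
    TOY_COLLAPSE_SKIP, /* collapse skips the copy, leaving zeros */
};

static void toy_write(struct toy_page *p, size_t off, const char *s, size_t len)
{
    assert(off <= TOY_PAGE_SZ && len <= TOY_PAGE_SZ - off);
    memcpy(p->data + off, s, len);
    p->lazyfree = false; /* a write cancels the pending free */
}

static void toy_madv_free(struct toy_page *p)
{
    p->lazyfree = true;
}

static void toy_apply(struct toy_page *p, enum toy_event ev)
{
    if (!p->lazyfree || ev == TOY_NOTHING)
        return;
    /* Reclaim and skip-on-collapse both drop the old content. */
    if (ev == TOY_RECLAIM || ev == TOY_COLLAPSE_SKIP)
        memset(p->data, 0, TOY_PAGE_SZ);
    /* Assumption of this model: the page is no longer lazyfree afterwards. */
    p->lazyfree = false;
}

/* Write "abcd", MADV_FREE, apply the event, then write "1234" after it. */
static void toy_scenario(enum toy_event ev, char out[TOY_PAGE_SZ + 1])
{
    struct toy_page p = { .data = {0}, .lazyfree = false };

    toy_write(&p, 0, "abcd", 4);
    toy_madv_free(&p);
    toy_apply(&p, ev);
    toy_write(&p, 4, "1234", 4);

    /* Render zero bytes as '0' so the result is a printable string. */
    for (size_t i = 0; i < TOY_PAGE_SZ; i++)
        out[i] = p.data[i] ? p.data[i] : '0';
    out[TOY_PAGE_SZ] = '\0';
}
```

In this model, toy_scenario(TOY_NOTHING, buf) and
toy_scenario(TOY_COLLAPSE_COPY, buf) both leave "abcd1234" in buf, while
TOY_RECLAIM and TOY_COLLAPSE_SKIP both leave "00001234". That is the
equivalence Michal describes: within the model, skipping the copy is
indistinguishable from a reclaim followed by a zero-page fault.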