Date: Thu, 19 Jan 2023 14:45:10 -0800
From: James Houghton
To: Peter Xu
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
    Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
    Naoya Horiguchi, Dr. David Alan Gilbert, Matthew Wilcox (Oracle),
    Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 12:53 PM Peter Xu wrote:
>
> On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote:
> > - We avoid problems related to compound PTEs (the problem being: two
> > threads racing to populate a contiguous and a non-contiguous PTE that
> > take up the same space could lead to user-detectable incorrect
> > behavior. This isn't hard to fix; it will be fixed when I send the
> > arm64 patches up.)
>
> Could you elaborate this one a bit more?

In hugetlb_mcopy_atomic_pte(), we check that the PTE we're about to
overwrite is pte_none() before overwriting it. For contiguous PTEs, this
checks only the first PTE in the group. If someone came along and
populated one of the PTEs that lay in the middle of a potentially
contiguous group of PTEs, we could end up overwriting that PTE if we
later UFFDIO_CONTINUEd in such a way as to create a contiguous PTE. We
would expect to get EEXIST here, but the operation would instead
succeed.

To fix this, we can check that ALL the PTEs in the contiguous group have
the value we expect, not just the first one.

hugetlb_no_page() has the same problem, but it's not immediately clear
to me how it would result in incorrect behavior.
> > This might seem kind of contrived, but let's say you have a VM with
> > 1T of memory, and you find 100 memory errors, all in different 1G
> > pages, over the life of this VM (years, potentially). Having 10% of
> > your memory be 4K-mapped is definitely worse than having 10% be
> > 2M-mapped (lost performance and increased memory overhead). There
> > might be other cases in the future where being able to have
> > intermediate mapping sizes could be helpful.
>
> This is not the norm, or is it? How might bad pages be distributed
> across hosts over the years? This can definitely affect how we should
> target the intermediate-level mappings.

I can't really speak for norms in general, but I can try to speak for
Google Cloud. Google Cloud hasn't had memory error virtualization for
very long (only about a year), but we've seen cases where VMs pick up
several memory errors over a few days or weeks. IMO, 100 errors in
separate 1G pages over a few years isn't completely nonsensical,
especially if the memory you're using isn't so reliable or was damaged
in shipping (like if it was flown over the poles or something!).

Now there is the question of how an application would handle it. In a
VMM's case, we can virtualize the error for the guest. In the guest,
it's possible that a good chunk of the errors lie in unused pages and so
can easily be marked as poisoned. It's also possible that recovery is
much more difficult. Still, it's not unreasonable for an application to
recover from a lot of memory errors.

- James