Received: by 2002:a05:6a10:2785:0:0:0:0 with SMTP id ia5csp3013002pxb; Tue, 12 Jan 2021 04:19:36 -0800 (PST) X-Google-Smtp-Source: ABdhPJwt7FpbNpsKAPFHQDSUtFzocb7y5n8q0C15DK+oayni9P4cZIr809VuGHQlBv7dbccY0QTj X-Received: by 2002:a17:906:f85:: with SMTP id q5mr3049164ejj.105.1610453975843; Tue, 12 Jan 2021 04:19:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1610453975; cv=none; d=google.com; s=arc-20160816; b=jaj2Epl86UERgmzC7eOzw99KGmNISCdQuwhV/J1+U/M5Y0scX+H1lg4OLEQmau3qbN ZaHax+ARe9qA8J661cNZL20OuLwIhmFWm6ueFW94wd31QR2bBZZaQMU7OmERjt6Y5aS3 bm7k9yfz1UIh++zD88ANrBg6usqX+s10cD7zBUWh9h9EADJ2bNQP6j10eCLbwSK5k0ln wb/Fv7Lkyj8iBDbAApkKFWyXb1wnbJIxOKj8OVGZOmKQMwrKRgi6fIE11mMkpkUIfgRK nnayB3WTFCmvmZ1JdvduiR1Laltwi6yFu3kv00HWOVQUuLj3oAyRHh2Zl0Ohkgzy/MEb Q32A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=0RSCBUADFHXC7ujFfJT5dXXGGcwPYqsp6Fb9kyT+trQ=; b=V2HVkD8Ukbyg49cOALZnCnqRRARGPoRXmSZIM9+Y9vL0AY3oj5hhaUXAHvI21BuDnz /EdfM+0F7oxBb94rQMohIW/0zNFBhUQ/9S9XEAPP/0raR6HGmnydBZYeUVvIJKLO5TjT eJqmJyzkE+jvPkw+Gb7d9Sh1VFSgg2v4Oo5NWun9XPax7IgLnJz85ySKO3MeQEbYMqon y29kFxeGg8DnyTYkf8gY6ZYwHqmfugaHqXY9ByEzK8zlzGX81oeqAy8wWvqj1MbTiuOM aQpsUebOop5Zz1avRnnhj5IIHmKBDycf40VJ7YzJaORLm8+0W0LFGEuWC3XVZmgMkliF uFrw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel-com.20150623.gappssmtp.com header.s=20150623 header.b=aFkV8dYZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id t14si1236197edw.119.2021.01.12.04.19.11; Tue, 12 Jan 2021 04:19:35 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@intel-com.20150623.gappssmtp.com header.s=20150623 header.b=aFkV8dYZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389894AbhALJQE (ORCPT + 99 others); Tue, 12 Jan 2021 04:16:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54052 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729821AbhALJQE (ORCPT ); Tue, 12 Jan 2021 04:16:04 -0500 Received: from mail-ej1-x629.google.com (mail-ej1-x629.google.com [IPv6:2a00:1450:4864:20::629]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8F4E1C061575 for ; Tue, 12 Jan 2021 01:15:23 -0800 (PST) Received: by mail-ej1-x629.google.com with SMTP id w1so2419198ejf.11 for ; Tue, 12 Jan 2021 01:15:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=0RSCBUADFHXC7ujFfJT5dXXGGcwPYqsp6Fb9kyT+trQ=; b=aFkV8dYZxRlzaXtnSF1vMe/nuMAjesIm9kXVgiquMIls/7DGln3YaMuyCNL7TQmOBZ 3MSp4t4Ivvd8E7gIj18rCSbZMIS9TwiKlMxBbc9wIYaeQwHv92lw82fr5IEMucJEPdBA gfvHlPQShpw7Hmv8FL/XN0nxLRT6QF2SNCThvmxuT+DODjdWq0UpRk4Ih4/rqwmAX5no xZ92cSUOQdo6YD/J6j6Arek27Eg6qw3EHCNWqVMU5xd2ag2A+vbu4ousQ7entX9dTtsP g4DC5MJfrXfEV25Fb43Pt3Ew6Unwlsv+Bz20ymrys9JzvBHQ1jQpOViB21fHAmHfx3NS oTcw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=0RSCBUADFHXC7ujFfJT5dXXGGcwPYqsp6Fb9kyT+trQ=; b=QqtyjXoEPqS9J5C1J66ikbcJGGIXNkb1vbwqicZ/R0ModWxkITi/Y9/62sDUiDRflW MXehR5T73SJa1bLCTMaTPGRz22FXWvq55cZIyHfZC7bshUqJofhwTcp9A9DF9to9KZCG Xqw70R0sXZH2N4aRIh6haUh3fDqdQOcJf0jSzARa0F0+ES/8mcfLE5/Fd49zTc8KGjy0 0oDdW0Wnwlblkilpad1KjZsqNrBOm91/se3cVezaL92Guv+TxQn4K/ZRH2K8BHV1mozR PIsps5flCC+Ev1gniHse5r5tntrq+3VPH0gu+sp/+Jf39el9C4son/cQQlfB/jPSy+6t M0dA== X-Gm-Message-State: AOAM53250GyFxzs7s9zMjb75OGzCKOXqag98QQn9KmQX/9eDVok7hjHA b5Uc8Wm+66UPAHZi3MqigApFLGO6u/cyrO00zS5Ik19U+R+YTg== X-Received: by 2002:a17:906:a29a:: with SMTP id i26mr2514293ejz.45.1610442922252; Tue, 12 Jan 2021 01:15:22 -0800 (PST) MIME-Version: 1.0 References: <160990599013.2430134.11556277600719835946.stgit@dwillia2-desk3.amr.corp.intel.com> <20210106095520.GJ13207@dhcp22.suse.cz> In-Reply-To: <20210106095520.GJ13207@dhcp22.suse.cz> From: Dan Williams Date: Tue, 12 Jan 2021 01:15:12 -0800 Message-ID: Subject: Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions To: Michal Hocko Cc: Linux MM , Andrew Morton , David Hildenbrand , Linux Kernel Mailing List Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 6, 2021 at 1:55 AM Michal Hocko wrote: > > On Tue 05-01-21 20:07:18, Dan Williams wrote: > > While pfn_to_online_page() is able to determine pfn_valid() at > > subsection granularity it is not able to reliably determine if a given > > pfn is also online if the section is mixed with ZONE_DEVICE pfns. > > I would call out the problem more explicitly. E.g. something like > " > This means that pfn_to_online_page can lead to false positives and allow > to return a struct page which is not fully initialized because it > belongs to ZONE_DEVICE (PUT AN EXAMPLE OF SUCH AN UNITIALIZED PAGE > HERE). That can lead to either crash on PagePoisoned assertion or a > silently broken page state with debugging disabled. > " Apologies for the long pause in this conversation, I went off and wrote a test to trigger this condition so I could quote it directly. It turns out soft_offline_page(), even before fixing pfn_to_online_page(), is broken as it leaks a page reference. > > I would also appreciate a more specific note about how the existing HW > can trigger this. You have mentioned 64MB subsection examples in the > other email. It would be great to mention it here as well. Sure. > > > Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a > > section that mixes ZONE_DEVICE pfns with other online pfns. With > > SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall > > back to a slow-path check for ZONE_DEVICE pfns in an online section. > > > > With this implementation of pfn_to_online_page() pfn-walkers mostly only > > need to check section metadata to determine pfn validity. In the > > rare case of mixed-zone sections the pfn-walker will skip offline > > ZONE_DEVICE pfns as expected. > > The above paragraph is slightly confusing. You do not require > pfn-walkers to check anything right? Section metadata is something that > is and should be hidden from them. They are asking for an online page > and get it or NULL. Nothing more nothing less. > Yeah, I'll drop it. I was describing what service pfn_to_online_page() performs for a pfn-walker, but it's awkwardly worded. > > > Other notes: > > > > Because the collision case is rare, and for simplicity, the > > SECTION_TAINT_ZONE_DEVICE flag is never cleared once set. > > I do not see a problem with that. > > > pfn_to_online_page() was already borderline too large to be a macro / > > inline function, but the additional logic certainly pushed it over that > > threshold, and so it is moved to an out-of-line helper. > > Worth a separate patch. > > The approach is sensible. Thanks! > > I was worried that we do not have sufficient space for a new flag but > the comment explains we have 6 bits available. I haven't double checked > that for the current state of the code. The comment is quite recent and > I do not remember any substantial changes in this area. Still something > that is rather fragile because an unexpected alignment would be a > runtime failure which is good to stop random corruptions but not ideal. > > Is there any way to explicitly test for this? E.g. enforce a shared > section by qemu and then trigger a pfn walk? > > > Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > > Cc: Andrew Morton > > Reported-by: Michal Hocko > > Reported-by: David Hildenbrand > > Signed-off-by: Dan Williams > > [...] > > > +static int zone_id(const struct zone *zone) > > +{ > > + struct pglist_data *pgdat = zone->zone_pgdat; > > + > > + return zone - pgdat->node_zones; > > +} > > We already have zone_idx() Noted. > > > + > > +static void section_taint_zone_device(struct zone *zone, unsigned long pfn) > > +{ > > + struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn)); > > + > > + if (zone_id(zone) != ZONE_DEVICE) > > + return; > > + > > + if (IS_ALIGNED(pfn, PAGES_PER_SECTION)) > > + return; > > + > > + ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE; > > +} > > + > > /* > > * Associate the pfn range with the given zone, initializing the memmaps > > * and resizing the pgdat/zone data to span the added pages. After this > > @@ -707,6 +769,15 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, > > resize_pgdat_range(pgdat, start_pfn, nr_pages); > > pgdat_resize_unlock(pgdat, &flags); > > > > + /* > > + * Subsection population requires care in pfn_to_online_page(). > > + * Set the taint to enable the slow path detection of > > + * ZONE_DEVICE pages in an otherwise ZONE_{NORMAL,MOVABLE} > > + * section. > > + */ > > + section_taint_zone_device(zone, start_pfn); > > + section_taint_zone_device(zone, start_pfn + nr_pages); > > I think it would be better to add the checks here and only set the flag > in the called function. SECTION_TAINT_ZONE_DEVICE should go to where we > define helpers for ther flags. > Done.