Date: Tue, 8 Jun 2021 08:41:26 +0200
From: Michal Hocko
To: Yang Shi
Cc: Zi Yan, nao.horiguchi@gmail.com, "Kirill A.
 Shutemov", Hugh Dickins, Andrew Morton, Linux MM,
 Linux Kernel Mailing List
Subject: Re: [PATCH] mm: mempolicy: don't have to split pmd for huge zero page
References: <20210604203513.240709-1-shy828301@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 07-06-21 15:02:39, Yang Shi wrote:
> On Mon, Jun 7, 2021 at 11:55 AM Michal Hocko wrote:
> >
> > On Mon 07-06-21 10:00:01, Yang Shi wrote:
> > > On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko wrote:
> > > >
> > > > On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > > > > When trying to migrate pages to obey mempolicy, the huge zero page is
> > > > > split then the page table walk at PTE level just skips zero page. So it
> > > > > seems pointless to split huge zero page, it could be just skipped like
> > > > > base zero page.
> > > >
> > > > My THP knowledge is not the best but this is incorrect AFAICS. Huge zero
> > > > page is not split. We do split the pmd which is mapping the said page. I
> > > > suspect you refer to vm_normal_page when talking about a zero page but
> > > > please be aware that huge zero page is not a normal zero page. It is
> > > > allocated dynamically (see get_huge_zero_page).
> > >
> > > For a normal huge page, yes, split_huge_pmd() just splits pmd. But
> > > actually the base zero pfn will be inserted to PTEs when splitting
> > > huge zero pmd. Please check __split_huge_zero_page_pmd() out.
> >
> > My bad. I didn't have a look all the way down there. The naming
> > suggested that this is purely page table operations and I suspected
> > that ptes just point to the offset of the THP.
> >
> > But I am obviously wrong here. Sorry about that.
> >
> > > I should make this point clearer in the commit log. Sorry for the confusion.
> > >
> > > >
> > > > So in the end your patch disables mbind of zero pages to a target node
> > > > and that is a regression.
> > >
> > > Do we really migrate zero page? IIUC zero page is just skipped by
> > > vm_normal_page() check in queue_pages_pte_range(), isn't it?
> >
> > Yeah, normal zero pages are skipped indeed. I haven't studied why this
> > is the case yet. It surely sounds a bit suspicious because this is an
> > explicit request to migrate memory and if the zero page is misplaced it
> > should be moved. On the other hand this would increase RSS so maybe this
> > is the point.
>
> The zero page is a global shared page, I don't think "misplace"
> applies to it. It doesn't make too much sense to migrate a shared
> page. Actually there is page mapcount check in migrate_page_add() to
> skip shared normal pages as well.

I didn't really mean to migrate the zero page itself. What I meant was
to instantiate a new page when the global one is on a different NUMA
node than the one the mbind() request asks for. This can be done either
by having a per NUMA node zero page or by simply allocating a new page
for the exclusive mapping.

> > > > Have you tested the patch?
> > >
> > > No, just build test. I thought this change was straightforward.
> > >
> > > >
> > > > > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > > > > this case.
> > > >
> > > > Btw. this changelog is missing a problem statement. I suspect there is
> > > > no actual problem that it should fix and it is likely driven by reading
> > > > the code. Right?
> > >
> > > The actual problem is it is pointless to split a huge zero pmd. Yes,
> > > it is driven by visual inspection.
> >
> > Is there any actual workload that cares? This is quite a subtle area so
> > I would be careful to do changes just because...
>
> I'm not sure whether there is measurable improvement for actual
> workloads, but I believe this change does eliminate some unnecessary
> work.

I can see why being consistent here is a good argument.
On the other hand, it would IMHO be better to look for reasons why zero
pages are left misplaced before making the code consistent. From a very
quick git archaeology it seems that vm_normal_page has been used since
MPOL_MF_MOVE was introduced. At that time (dc9aa5b9d65fd) vm_normal_page
didn't skip the zero page AFAICS. I do not remember all the details
about zero page (wrt. pte special) handling though, so it might be
hidden in some other place. In any case the existing code doesn't really
work properly. The question is whether anybody actually cares, but this
is definitely something worth looking into IMHO.

> I think the test shown in the previous email gives us some confidence
> that the change doesn't have regression.

Yes, this is true.
-- 
Michal Hocko
SUSE Labs
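
To make the two behaviours discussed in this thread easier to compare, here is a toy user-space model in Python. It is not kernel code and makes no claim about how mm/mempolicy.c is actually structured. All names (Kind, Mapping, skip_zero, instantiate_zero, rss) are hypothetical and exist only for this sketch. It contrasts skipping zero pages during a move with the alternative of backing a misplaced zero-page mapping by a fresh private page on the target node, which raises RSS as noted above. The per NUMA node zero page variant is not modelled.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    NORMAL = auto()  # private page, counted in RSS in this toy model
    ZERO = auto()    # shared zero page (base or huge), not counted


@dataclass
class Mapping:
    kind: Kind
    node: int        # NUMA node the backing page currently lives on


def skip_zero(mappings, target):
    """Model of 'zero pages are skipped': only normal pages move."""
    return [
        Mapping(m.kind, target) if m.kind is Kind.NORMAL else m
        for m in mappings
    ]


def instantiate_zero(mappings, target):
    """Model of the alternative: a zero page on the wrong node is
    replaced by a new private page allocated on the target node."""
    out = []
    for m in mappings:
        if m.kind is Kind.ZERO and m.node != target:
            out.append(Mapping(Kind.NORMAL, target))  # new page, more RSS
        else:
            out.append(Mapping(m.kind, target))
    return out


def rss(mappings):
    """Number of private pages in the toy address space."""
    return sum(1 for m in mappings if m.kind is Kind.NORMAL)


if __name__ == "__main__":
    maps = [Mapping(Kind.NORMAL, 0), Mapping(Kind.ZERO, 0), Mapping(Kind.ZERO, 1)]
    for policy in (skip_zero, instantiate_zero):
        result = policy(maps, target=1)
        print(policy.__name__, [(m.kind.name, m.node) for m in result], "rss =", rss(result))
```

In this model the first policy leaves the zero page on node 0 untouched, while the second trades one extra private page for having every mapping on the requested node.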