Date: Mon, 18 Oct 2021 15:25:59 -0300
From: Jason Gunthorpe
To: Dan Williams
Cc: Matthew Wilcox, Alex Sierra, Andrew Morton, "Kuehling, Felix",
    Linux MM, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list,
    Maling list - DRI developers, Christoph Hellwig, Jérôme Glisse,
    Alistair Popple, Vishal Verma, Dave Jiang, Linux NVDIMM,
    David Hildenbrand, Joao Martins
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
Message-ID: <20211018182559.GC3686969@ziepe.ca>
References: <20211014153928.16805-1-alex.sierra@amd.com>
    <20211014153928.16805-3-alex.sierra@amd.com>
    <20211014170634.GV2744544@nvidia.com>
    <20211014230606.GZ2744544@nvidia.com>
    <20211016154450.GJ2744544@nvidia.com>
X-Mailing-List: linux-ext4@vger.kernel.org

On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams
wrote:

> > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> > THP would make using normal refcounting much simpler. I looked at
> > teaching the mm core to deal with page arrays - it is certainly
> > doable, but it is quite inefficient and ugly mm code.
>
> THP does not support PUD, and neither does FSDAX, so it's only PMDs we
> need to worry about.

device-dax uses PUD, along with TTM; they are the only places. I'm not
sure TTM is a real place though.

> > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
> >
> > Joao has a series that does this to device-dax:
> >
> > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/
>
> That assumes there's never any need to fracture a huge page which
> FSDAX could not support unless the filesystem was built with 2MB block
> size.

As I understand things, something like FSDAX post-folio should
generate maximal compound pages for extents in the page cache that are
physically contiguous.

A high order folio can be placed in any lower order in the page
tables, so we never have to fracture it, unless the underlying pages
are moved around - which requires an unmap_mapping_range() cycle..

> > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > deleting the PUD/PMD support until it can be done with compound pages?
>
> There are end users that would notice the PMD regression, and I think
> FSDAX PMDs with proper compound page metadata is on the same order of
> work as fixing the refcount.

Hmm, I don't know.. I sketched out the refcount stuff and the code is
OK but ugly and will add a conditional to some THP cases.

On the other hand, making THP unmap cases a bit slower is probably a
net win compared to making put_page a bit slower.. Considering unmap
is already quite heavy.
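
To make the "maximal compound pages for physically contiguous extents"
point above concrete, here is a rough arithmetic sketch only. It assumes
4 KiB base pages and a 2 MiB PMD (order 9); the helper name and the cap
are made up for illustration and are not kernel APIs. It picks the
largest naturally aligned power-of-two block that starts at a given pfn
and fits inside the extent:

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, so a 2 MiB PMD mapping covers 2^9 pages. */
#define SKETCH_PMD_ORDER 9u

/*
 * Hypothetical helper, not a kernel API: return the largest order such
 * that a block of (1 << order) pages starting at pfn is naturally
 * aligned and fits within nr_pages, capped at SKETCH_PMD_ORDER.
 */
static unsigned int sketch_max_order(unsigned long pfn, unsigned long nr_pages)
{
	unsigned int order = 0;

	assert(nr_pages > 0);	/* an empty extent has no valid order */

	while (order < SKETCH_PMD_ORDER) {
		unsigned long next = 1ul << (order + 1);

		/* Stop once the next size is misaligned or too large. */
		if ((pfn & (next - 1)) != 0 || next > nr_pages)
			break;
		order++;
	}
	return order;
}
```

A block found this way can still be mapped at any lower order in the
page tables, since each PTE would just point at a sub-page of the same
compound page.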

> > 4) Ask what the pgmap owner wants to do:
> >
> >    if (head->pgmap->deny_foll_longterm)
> >          return FAIL
>
> The pgmap itself does not know, but the "holder" could specify this
> policy.

Here I imagine the thing that creates the pgmap would specify the
policy it wants. In most cases the policy is tightly coupled to what
the free function in the provided dev_pagemap_ops does..

> Which is in line with the 'dax_holder_ops' concept being introduced
> for reverse mapping support. I.e. when the FS claims the dax-device
> it can specify at that point that it wants to forbid longterm.

Which is a reasonable refinement if we think there are cases where two
nvdimm users would want different things.

Anyhow, I'm wondering about a way forward. There are many balls in the
air, all linked:
 - Joao's compound page support for device_dax and more
 - Alex's DEVICE_COHERENT
 - The refcount normalization
 - Removing the pgmap test from GUP
 - Removing the need for the PUD/PMD/PTE special bit
 - Removing the need for the PUD/PMD/PTE devmap bit
 - Removing PUD/PMD vma_is_special
 - folios for fsdax
 - shootdown for fsdax

Frankly I'm leery of seeing more ZONE_DEVICE users crop up that depend on
the current semantics as that will only make it even harder to fix..

I think it would be good to see Joao's compound page support move
ahead..

So.. Does anyone want to work on finishing this patch series?? I can
give some guidance on how I think it should work at least.

Jason
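
For reference, the "ask the pgmap owner" check quoted above could be
sketched roughly as below. This is a standalone illustration, not kernel
code: the struct and function names are hypothetical stand-ins, and
deny_foll_longterm is only the field name from the quoted pseudocode,
not something claimed to exist in struct dev_pagemap. The assumption is
that whoever creates the pgmap sets the flag once, and the long-term pin
path only reads it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pgmap; field name taken from the quoted
 * pseudocode, not from an existing kernel structure. */
struct sketch_pgmap {
	bool deny_foll_longterm;	/* policy set by the pgmap creator */
};

/* Hypothetical stand-in for a head page. */
struct sketch_page {
	struct sketch_pgmap *pgmap;	/* NULL for ordinary memory */
};

/* Return true if a long-term pin of this page would be allowed. */
static bool sketch_may_pin_longterm(const struct sketch_page *head)
{
	assert(head != NULL);

	/* Only pages backed by a pgmap can carry a deny policy. */
	if (head->pgmap && head->pgmap->deny_foll_longterm)
		return false;
	return true;
}
```

The 'dax_holder_ops' refinement would move where the flag is set (at
claim time by the FS) but the check itself would look much the same.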