From: Daniel Vetter
Date: Thu, 25 Mar 2021 10:41:43 +0100
Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
To: Christian König
Cc: Jason Gunthorpe, Thomas Hellström (Intel), David Airlie, Linux MM,
    Andrew Morton, Linux Kernel Mailing List, dri-devel

On Thu, Mar 25, 2021 at 8:50 AM Christian König wrote:
>
> On 25.03.21 at 00:14, Jason Gunthorpe wrote:
> > On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/24/21 7:31 PM, Christian König wrote:
> >>>
> >>> On 24.03.21 at 17:38, Jason Gunthorpe wrote:
> >>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> >>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
> >>>>>> (Intel) wrote:
> >>>>>>
> >>>>>>>> In an ideal world the creation/destruction of page table levels
> >>>>>>>> would be dynamic at this point, like THP.
> >>>>>>> Hmm, but I'm not sure what problem we're trying to solve by
> >>>>>>> changing the interface in this way?
> >>>>>> We are trying to make a sensible driver API to deal with huge pages.
> >>>>>>> Currently if the core vm requests a huge pud, we give it one, and
> >>>>>>> if we can't or don't want to (because of dirty-tracking, for
> >>>>>>> example, which is always done on 4K page-level) we just return
> >>>>>>> VM_FAULT_FALLBACK, and the fault is retried at a lower level.
> >>>>>> Well, my thought would be to move the pte related stuff into
> >>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> >>>>>>
> >>>>>> I don't know if the locking works out, but it feels cleaner that the
> >>>>>> driver tells the vmf how big a page it can stuff in, not the vm
> >>>>>> telling the driver to stuff in a certain size page which it might
> >>>>>> not want to do.
> >>>>>>
> >>>>>> Some devices want to work on an in-between page size like 64k so
> >>>>>> they can't form 2M pages but they can stuff 64k of 4K pages in a
> >>>>>> batch on every fault.
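For context, the interface Thomas is describing is the per-level
huge_fault hook, which looks roughly like this (a sketch from the
~v5.12 era; the helpers in the driver body are hypothetical and exact
signatures vary between kernel versions):

	/* include/linux/mm.h (~v5.12) */
	enum page_entry_size {
		PE_SIZE_PTE = 0,
		PE_SIZE_PMD,
		PE_SIZE_PUD,
	};

	/* The core mm asks the driver for a specific entry size; the
	 * driver may refuse with VM_FAULT_FALLBACK, and the fault is
	 * then retried one level down. */
	static vm_fault_t my_huge_fault(struct vm_fault *vmf,
					enum page_entry_size pe_size)
	{
		switch (pe_size) {
		case PE_SIZE_PUD:
		case PE_SIZE_PMD:
			/* e.g. dirty tracking works at 4K granularity */
			if (my_needs_4k_tracking(vmf->vma)) /* hypothetical */
				return VM_FAULT_FALLBACK;
			return my_insert_huge(vmf, pe_size); /* hypothetical */
		default:
			return VM_FAULT_FALLBACK;
		}
	}

The vmf_insert_range() Jason mentions does not exist at this point; the
proposal is to invert control so the driver states how much contiguous
memory backs the fault and the core picks the entry sizes. A purely
illustrative signature:

	vm_fault_t vmf_insert_range(struct vm_fault *vmf, pfn_t first_pfn,
				    unsigned long npages, pgprot_t prot);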
> >>>>> Hmm, yes, but we would in that case be limited anyway to insert
> >>>>> ranges smaller than and equal to the fault size to avoid extensive
> >>>>> and possibly unnecessary checks for contiguous memory.
> >>>> Why? The insert function is walking the page tables, it just updates
> >>>> things as they are. It learns the arrangement for free while doing
> >>>> the walk.
> >>>>
> >>>> The device has to always provide consistent data; if it overlaps
> >>>> into pages that are already populated that is fine, so long as it
> >>>> isn't changing their addresses.
> >>>>
> >>>>> And then if we can't support the full fault size, we'd need to
> >>>>> either presume a size and alignment of the next level or search for
> >>>>> contiguous memory in both directions around the fault address,
> >>>>> perhaps unnecessarily as well.
> >>>> You don't really need to care about levels; the device should be
> >>>> faulting in the largest memory regions it can within its efficiency
> >>>> constraints.
> >>>>
> >>>> If it works on 4M pages then it should be faulting 4M pages. The
> >>>> page size of the underlying CPU doesn't really matter much, other
> >>>> than some tuning to impact how the device's allocator works.
> >> Yes, but then we'd be adding a lot of complexity into this function
> >> that is already provided by the current interface for DAX, for little
> >> or no gain, at least in the drm/ttm setting. Please think of the
> >> following situation: You get a fault, you do an extensive
> >> time-consuming scan of your VRAM buffer object into which the fault
> >> goes and determine you can fault 1GB. Now you hand it to
> >> vmf_insert_range() and because the user-space address is misaligned,
> >> or already partly populated because of a previous eviction, you can
> >> only fault single pages, and you end up faulting a full GB of single
> >> pages, perhaps for a one-time small update.
> > Why would "you can only fault single pages" ever be true? If you have
> > 1GB of pages then vmf_insert_range should allocate enough page table
> > entries to consume it, regardless of alignment.
>
> Completely agree with Jason. Filling in the CPU page tables is
> relatively cheap if you fill in a large continuous range.
>
> In other words, filling in 1GiB as one linear range is *much* less
> overhead than filling it in as 1<<18 individual 4KiB faults.
>
> I would say that this is always preferable, even if the CPU only wants
> to update a single byte.
>
> > And why shouldn't DAX switch to this kind of interface anyhow? It is
> > basically exactly the same problem. The underlying filesystem block
> > size is *not* necessarily aligned to the CPU page table sizes, and DAX
> > would benefit from better handling of this mismatch.
> >
> >> On top of this, unless we want to do the walk trying increasingly
> >> smaller sizes of vmf_insert_xxx(), we'd have to use
> >> apply_to_page_range() and teach it about transhuge page table entries,
> >> because pagewalk.c can't be used (it can't populate page tables). That
> >> also means apply_to_page_range() needs to be complicated with page
> >> table locks, since transhuge pages aren't stable and can be zapped and
> >> refaulted under us while we do the walk.
> > I didn't say it would be simple :) But we also need to stop hacking
> > around the sides of all this huge page stuff and come up with sensible
> > APIs that drivers can actually implement correctly. Exposing drivers
> > to specific kinds of page levels really feels like the wrong level of
> > abstraction.
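For reference, apply_to_page_range() as it exists operates strictly at
PTE granularity, which is the complication Thomas points at; roughly
(~v5.12, modulo version drift):

	/* mm/memory.c: allocates intermediate page tables as needed and
	 * invokes fn once per 4K PTE slot in [addr, addr + size). There
	 * is no pmd/pud-level callback, so teaching it about transhuge
	 * entries (and their locking) would be new ground. */
	typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);

	int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
				unsigned long size, pte_fn_t fn, void *data);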
> > Once we start doing this we should do it everywhere; the io_remap_pfn
> > stuff should be able to create huge special IO pages as well, for
> > instance.
>
> Oh, yes please!
>
> We easily have 16GiB of VRAM which is linearly mapped into kernel
> space for each GPU instance.
>
> Doing that with 1GiB mappings instead of 4KiB would be quite a win.

io_remap_pfn is for userspace mmaps. Kernel mappings should already be
as big as possible, I think, for everything.
-Daniel

> Regards,
> Christian.
>
> >> On top of this, the user-space address allocator needs to know how
> >> large gpu pages are aligned in buffer objects to have a reasonable
> >> chance of aligning with CPU huge page boundaries, which is a
> >> requirement for being able to insert a huge CPU page table entry, so
> >> the driver would basically need the drm helper that can do this
> >> alignment anyway.
> > Don't you have this problem anyhow?
> >
> > Jason

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
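For reference, the io_remap_pfn path discussed above is the classic way
a driver's userspace mmap handler maps device memory, and as of this
thread it populates 4KiB special PTEs only; huge special IO pages there
would be the extension Jason is asking for. A minimal sketch (bar_base
is a hypothetical device BAR address):

	static int my_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long pfn = (bar_base >> PAGE_SHIFT) + vma->vm_pgoff;

		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
		return io_remap_pfn_range(vma, vma->vm_start, pfn,
					  vma->vm_end - vma->vm_start,
					  vma->vm_page_prot);
	}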